Apache Spark Survey Reveals Increased Growth in Users and New Workloads Including Exploratory Data Science and Machine Learning

To better understand Apache Spark’s growing role in big data, Taneja Group conducted a major market research project, surveying approximately 7,000 people. The sample was made up of people in technical and managerial job roles from around the world who are directly involved in big data. The survey, which received an overwhelming response, explored experiences with and intentions for Spark adoption and deployment, current perceptions, favored vendors, and the future of Spark itself. Cloudera, the provider of a fast, easy-to-use, and secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, sponsored the market research project and announced the findings of the study.

An integrated part of CDH and supported with Cloudera Enterprise, Spark is the open standard for flexible in-memory data processing that enables batch, real-time, and advanced analytics on the Apache Hadoop platform.

“Apache Spark has grown rapidly into one of the leading big data open source projects,” said Mike Matchett, senior analyst and consultant at Taneja Group. “We found that across the broad range of industries, company sizes, and big data maturity levels represented, over one-half of respondents are already actively using Spark. It is proving invaluable as 64% of those currently using Spark plan to notably increase their usage within the next 12 months. With an increasing number of workloads requiring real-time data streaming for analytics, the emergence of machine learning applications and data science use cases, Spark is clearly here to stay.”

Cloudera’s Leadership in Spark

Cloudera became the first Hadoop vendor to ship and support Spark in early 2014, when Spark was quickly becoming the framework of choice for faster batch processing. Cloudera invested in its development early. Today, many Cloudera users have transitioned data processing workloads from MapReduce to Spark in their production systems, drastically reducing their data processing windows. According to the survey, this trend is accelerating.

Cloudera’s customers require Spark to be delivered at enterprise scale, backed by experts who have been involved from the start in making it the de facto data processing engine for Hadoop. Cloudera continues to innovate via the One Platform Initiative, aimed at enhancing Spark’s capabilities around management, security, scale, streaming, and cloud. Through the initiative, Cloudera is committed to helping the ecosystem adopt Spark as the default data execution engine for analytic workloads.

Cloudera works with partners to certify new solutions built on Spark and provides the resources and support needed to bring these differentiated solutions to market more quickly, ensuring customers can solve new and challenging use cases.

Survey Results

Key findings of the Apache Spark Market Research Study include a high level of growth and momentum for Spark usage beyond expected data processing/engineering ETL workloads and a future transition to cloud deployments. Other noteworthy findings include:

  • More than one-half of all respondents, 54 percent, are already actively using Spark. Of those presently using Spark, 64 percent say it’s proving invaluable and they intend to increase their usage of Spark within the next 12 months.

  • New Spark user adoption is also growing, with 4 out of 10 people familiar with the big data project saying that they plan to deploy Spark in the very near term.

  • 57 percent rely on Spark, as provided by Cloudera, for their most important use cases, more than twice the share of the next three Apache Hadoop vendors combined. Customers that chose Cloudera over other solutions cited as key factors its regulatory-ready security and governance model, its stability and performance, its cloud portability, and its integration with a complete suite of data processing, query, analytic, and machine learning services.

  • Aside from the expected data processing/engineering/ETL workloads, which make up 55 percent of reported Spark use today, the top active Spark initiatives include real-time stream processing, exploratory data science, and the emergence of Spark for machine learning. These are all areas where Cloudera continues to invest.

  • Barriers to adoption and challenges remain the same, however, and are largely attributed to the big data skills gap and the ability to consume relevant training in a variety of formats (online, in-person, conference, or tradeshow). Cloudera trains more Apache Spark professionals than any other Hadoop vendor and supports them through professional services, value consulting, and a wide breadth of partners.

“Our focus is on enterprise leadership at Cloudera and we provide the critical security, data governance and compliance that our customers need,” said Mike Olson, founder and chief strategy officer at Cloudera. “The results of the survey validate the importance placed on being fully enterprise-ready today and also well prepared to support future Spark use cases. It is the key reason that customers overwhelmingly choose Spark from Cloudera over other commercial vendors.”

The survey also details the elevated role of the public cloud and Spark: “Interestingly, while on-premises Spark deployments dominate today there is a strong interest in transitioning many of those to cloud deployments going forward,” said Matchett. “Overall Spark deployment in public/private cloud (IaaS or PaaS) is projected to increase significantly from 23% today to 36% in the future.”




  1. The MapR Converged Data Platform supports the full Spark stack. Additionally, MapR provides free, complete online Spark On Demand Training (ODT) courses via MapR Academy, as well as Spark Certification.