In this special guest feature, Khushbu Shah, Asst. Manager – Technical Content, at DeZyre highlights how Hadoop and Spark enterprise adoption is gaining prominence to drive business value by getting close to real time data analysis. DeZyre helps you stay updated in your job-skills. DeZyre’s online platform helps professionals learn job-skills by working on live online projects and get certified by leading employers like IBM.
To stay competitive in the big data world today, it is imperative to design strategies around data analysis. The faster a business can ask questions of its data, the faster it can get answers. Organizations cannot afford to wait for data to be transported to a central processing location before analysing it, because that is too time consuming. The main goal for any organization today is to get as close to real-time data analysis as possible in order to drive revenue. Big data technologies like Hadoop and Spark from Apache are making this easier to achieve. The strength of Hadoop and Spark is that they promote economical scaling and flexible development patterns, and can be easily adapted to business requirements.
A new survey on the state of Hadoop’s maturity shows that Hadoop has moved past the hype cycle and is delivering tangible business value to half of the enterprises that have deployed it in production, on clusters of 10 to more than 500 nodes.
According to Gartner’s 2015 Hadoop adoption study, released in May 2015, 26% of organizations are deploying, piloting and adopting Hadoop. Another 11% plan to invest in Hadoop within the next year, and a further 7% plan to invest within the next two years.
Increasing demand for big data has powered an exponential rise in the adoption of technologies like Hadoop and Spark. Because of the volume and veracity of big data, organizations need to react to events and make decisions in real time. Adopting tools like Hadoop, Spark and NoSQL in production helps enterprises handle big data effectively.
Organizations in many domains are using Hadoop in large-scale big data projects to capture a single view of the customer, to carry out deep data discovery and to help data scientists perform predictive analytics. Hadoop adoption helps companies predict shifting market dynamics, relate consumer behaviour to current needs and test business hypotheses, all of which helps them gain a competitive edge.
The 2015 Hadoop Maturity Survey by AtScale (the industry’s largest study on Hadoop maturity) found that among organizations that already use Hadoop, Tableau is the most extensively used BI tool. The survey also found that among organizations that plan to deploy Hadoop, Microsoft Excel is the leading BI tool.
Because Hadoop is an open source framework, one might think that cost is a major factor behind increased Hadoop adoption in the enterprise, but that is not true. Most companies cite scale-out needs as a major reason for adopting Hadoop. Hadoop adoption varies considerably across industries. Industry surveys on Hadoop maturity have found that online companies building their businesses on unstructured data are the most mature adopters of Hadoop. Apache Hadoop is widely adopted in financial services, telecommunications and consulting, and is gaining traction in healthcare and agriculture.
“What’s exciting about [Hadoop] isn’t the opportunity it’s giving to vendors like us. It’s the value that Hadoop is unlocking in big data for the industry and for society at large,” said Cloudera co-founder Mike Olson.
Hadoop has opened the door to big data applications that need to process huge volumes of data across clusters of thousands of nodes to generate business insights that drive revenue. The beauty of Hadoop is that it lets applications store all kinds of data, whether unstructured, semi-structured or structured, from multiple sources such as social media channels, IoT sensors and devices. Since the inception of Hadoop in 2006, the Apache Software Foundation has developed various other components and programming models to be used along with it, for resource management (YARN) and parallel processing (MapReduce), as well as related projects such as Spark (in-memory data processing), HBase, Pig and Hive.
Hadoop is a platform of choice for two main types of workloads:
- ETL processes that require gathering data from various sources, transforming it based on business logic and uploading it to the data warehouse or database. 74% of the companies use Hadoop for ETL.
- Data Science that requires querying the data to find out valuable business insights. 62% of the companies use Hadoop for data science while 65% use Hadoop for Business Intelligence.
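The ETL pattern described above can be sketched in a few lines. This is a minimal illustration in plain Python, not a Hadoop job: the sample order data, the table layout and the function names `extract`, `transform`, `load` and `run_etl` are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from files, logs or APIs.
RAW_ORDERS = """order_id,customer,amount_usd
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(text):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert dollars to integer cents."""
    cleaned = []
    for row in rows:
        if not row["amount_usd"]:
            continue  # assumed business rule: skip incomplete orders
        cents = round(float(row["amount_usd"]) * 100)
        cleaned.append((int(row["order_id"]), row["customer"], cents))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order."""
    load(transform(extract(text)), conn)


# SQLite in memory stands in for the warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
run_etl(RAW_ORDERS, conn)

# A BI-style query against the loaded data: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount_cents) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

On a Hadoop cluster the same three stages would typically be spread across many machines, but the division of work into extract, transform and load steps stays the same.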
The major roadblocks to Hadoop adoption among enterprises are:
- Lack of Hadoop expertise. 57% of organizations say that a lack of Hadoop skills is a major inhibitor to adoption.
- 49% of organizations cite understanding how to derive business value from Hadoop as a major challenge in its adoption.
Hadoop is Hot but Spark is Hotter in the Big Data World
Hadoop MapReduce is not suited to analyzing small data sets rapidly; that is, it does not support effective event-stream processing. The MapReduce architecture and programming model are being set aside in favor of Spark, which meets the fast-data needs of mid-sized organizations while still supporting large-scale batch processing.
Spark is seeing exponential growth in enterprise adoption, with a lot of excitement in the big data community. Since its inception in 2009, Spark has inspired contributions from 400 developers across the world, thanks to its:
- Ability to process event streams
- Enhanced processing performance over Hadoop MapReduce
According to a Databricks survey of 1,417 users from 842 organizations, 40% were using Spark on Hadoop YARN, 48% were using only Spark and 11% were using Spark on Mesos. The survey report also reveals that there are 56% more Spark users now than in 2014.
“The continued growth of Spark has been highly encouraging, as companies are going into production to obtain real business value, and they are doing so in a wide range of environments beyond Hadoop clusters,” said Matei Zaharia, creator of Apache Spark.
Spark is a remarkable Apache project that helps developers run programs up to 100 times faster than Hadoop MapReduce, through an execution engine built around a directed acyclic graph (DAG) of operations. Spark supports in-memory processing and cyclic data flows. Enterprises can use Spark from Java, Python, R and Scala, with 80 high-level operators that make it easy for developers to build parallel applications. Used with SQL, Spark is compatible with many tools, making it a prominent choice for running analytics against multiple data sources. Apache Spark is widely adopted by enterprises for streaming, graph analysis and machine learning.
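To give a feel for what a graph-based execution engine does, here is a toy sketch in plain Python. It is not Spark’s API: the class `LazyDataset` and its methods are hypothetical names chosen for illustration, and the “graph” is only a linear chain of steps, the simplest possible DAG. The idea it tries to show is that transformations are merely recorded, and the whole chain runs in one pass over in-memory data only when a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # the original in-memory data
        self._steps = tuple(steps)   # recorded transformations, in order

    def map(self, fn):
        # Record the step and return a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record the step and return a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def plan(self):
        """Return the names of the recorded steps, in execution order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Action: run every recorded step in a single pass over the data."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):   # a filter step rejected this item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []  # records when the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


ds = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before = list(calls)  # still empty: building the plan did no work
result = ds.collect()       # the action triggers the whole chain
print(ds.plan(), calls_before, result)
```

Because the full plan is known before any work starts, an engine of this kind can combine steps and avoid writing intermediate results to disk between them, which is one commonly cited reason for Spark’s speed advantage over MapReduce.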
According to the 2015 Spark User Survey, production use of Spark for advanced analytics, such as GraphX for graph processing and MLlib for machine learning, rose by 4 percentage points (from 11% in 2014 to 15% in 2015). 75% of Spark users make use of two or more Spark components in production deployments.
Spark adoption is growing quickly among organizations, as users find Spark reliably fast, easy to use and deploy, and well matched to the expected growth in advanced analytics. However, some major roadblocks are still holding back the adoption of Spark in the enterprise:
- Lack of experience in using Spark. Users say that there is a need for more detailed documentation, particularly for developing advanced application scenarios and for performance tuning.
- Lack of commercial support is also a concern for enterprises in adopting Spark in production.
Hadoop and Spark Complement Each Other
“You’re not locked into either ecosystem, and because of that Spark can be both complementary to and competitive to Hadoop,” said Gartner Research Director Nick Heudecker.
It might sound as though Spark is in competition with Hadoop, but that is not true; it is more accurate to say that Spark complements Hadoop. Spark can compete with any of the newer big data technologies, yet it is unlike them. Spark runs on Hadoop but is not limited to it. Spark just needs a resource manager, and any tool that provides one, such as Mesos or YARN, can be used with Spark.
Adopting any new technology takes time, whether it is Hadoop or Spark. This does not mean that they are unproductive tools. Enterprises are still figuring out how best to use Hadoop and Spark to solve specific business problems. Enterprises that adopt Hadoop and Spark may always be winners in big data, but the vendors that make it easy for these enterprises to use complex big data technologies like Hadoop and Spark will win big in the big data revolution.