In this special guest feature, Ian Lumb of Bright Computing examines the incredible ascent of the Apache Spark distributed computing platform and provides what he sees as the top 8 reasons why Spark is so hot right now. Ian Lumb is Product Marketing Manager at Bright Computing, with a focus on Bright Cluster Manager for Apache Hadoop.
By the final quarter of 2014, the verdict was in: Interest in “Apache Spark” surpassed “Apache Hadoop”. About a quarter later, the gap is even wider. While interest in Hadoop is increasing steadily, Spark continues to enjoy a nonlinear ramp of the hockey-stick variety. Is there substance behind the interest Spark has ignited? Here are 8 data points suggesting there is.
- Spark replaces MapReduce. Although MapReduce retains status as the canonical programming model for Big Data, its inefficient handling of iterative algorithms as well as interactive data mining tools served as the impetus for developing alternatives. Spark excels at programming models (pdf) involving iterations, interactivity (including streaming) and more.
- Spark can use HDFS. “Can” is the operative word here, as Spark can make use of the Hadoop file system (HDFS) provided by the Apache Foundation (vanilla HDFS), Cloudera (CDH), Hortonworks (HDP) and others. Use of CDH or HDP illustrates one of two integration points between these data platforms and Spark. Spark, however, does not require HDFS. For example, integrations with OpenStack Swift and Amazon S3 already exist. And of particular interest for those having an existing investment in Lustre, Intel has plans to make its enterprise-grade version of Lustre amenable to use by Spark. Given the momentum behind Spark, and its inherent flexibility in making use of parallel, distributed file systems, you can already find, or can likely expect, numerous open source and commercial HDFS alternatives in the not-too-distant future. This is about a lot more than just customer choice. It’s about inertia – on business and technical fronts. For organizations having existing investments in solutions not based on HDFS, the ability to leverage their incumbent solution significantly lessens the effort required in investigating and ultimately adopting Spark. The Cypress supercomputer at Tulane University provides a recent case in point. Cypress makes use of Intel Enterprise Edition for Lustre (IEEL) to deliver a single file system for both the University’s High Performance Computing (HPC) and Big Data Analytics (BDA) needs. Bright Cluster Manager deploys, monitors and manages the entire HPC and BDA stacks on Lustre – and this even includes CDH on Lustre, not HDFS. As soon as IEEL incorporates support for Spark, Bright will be able to deploy, monitor and manage Hadoop and Spark on Lustre (not HDFS) as well as HPC. Alongside Lustre, possibilities appear to exist for GPFS, as IBM has already developed a GPFS Hadoop Connector. In a fashion analogous to the Hadoop-on-Lustre deployment in place at Tulane, those with an existing investment in GPFS could deploy Hadoop on GPFS. Investigations involving Spark and GPFS appear to be underway. A 2013 survey conducted by the 451 Group concluded that “… adoption of alternatives to HDFS is limited at this stage …” They added that enterprise adoption of Hadoop, however, is likely to fuel interest in HDFS alternatives. When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives.
- Spark can use YARN. “Can”, again, is the operative word here, as Spark can make use of Hadoop’s YARN. Notably, this is the second of two integration points between Spark and the Hadoop distributions provided by Cloudera and Hortonworks. This is also a point of integration for IBM Platform Symphony and YARN; in other words, Spark workloads can make use of Symphony scheduling policies and execute via YARN. Viable and interesting alternatives to YARN exist, however, ranging from a standalone setup that does not even involve a workload manager (WLM) to Apache Mesos. Ideally suited to low-latency parallel workloads, emerging decentralized alternatives like Sparrow (pdf) also merit consideration. HPC has a storied legacy in the area of WLM. Given that some of the developed-for-HPC WLMs are already serving duty in the Hadoop arena (e.g., PBS Professional at Clemson University), it won’t be long before they are all adapted for use with Spark. IBM Platform Symphony with an Application Service Controller, for example, can deploy Spark clusters via Docker containers.
- Spark can be deployed, but not fully monitored or managed. Regardless of your choices relating to file systems and workload managers, in most cases deploying Spark remains a manual process. In the worst-case scenario, Spark needs to be installed and configured manually – and this, of course, assumes you don’t need to build it from source, and that you have an existing cluster. Cloudera Manager has improved on this situation somewhat through use of its software-management construct known as parcels; it also assumes an existing cluster, however. Using a SaaS-based approach, Spark creators Databricks intend to ease the deployment challenge by offering Spark via the cloud. Apache Ambari, the management toolkit used by Hortonworks, does not address this need. Bright Cluster Manager deploys Spark on bare metal – in other words, Bright does not require an existing cluster. A Bright management construct known as a role is being developed for Spark; along with Spark-specific metrics, Bright will soon have a solution for deploying, monitoring and managing Spark. Other than the capabilities provided by the project itself, monitoring Spark remains an outstanding need.
- Spark enables analytics workflows. From its library for machine learning (MLlib) and API for graph analytics (GraphX), to support for SQL-based queries and streaming applications, Spark delivers a converged analytics platform. Convergence means, for example, that you can write your own code in Java, Scala or Python that makes use of one or more of these components in crafting an analytics workflow. Workflows can be executed in batch mode, or in real time using the built-in interactive shell support available for Scala and Python. Because the notable stats package R is already one of the supplemental projects, Spark’s analytics stack is quite comprehensive. Spark can access any Hadoop data source – from HDFS (and other file systems) to databases like Apache HBase and Apache Cassandra. Thus data originating from Hadoop can be incorporated into Spark applications and workflows.
- Spark uses memory differently and efficiently. In benchmarking studies involving in-memory storage (i.e., a diskless HDFS instance) of binary data, Spark outperformed Hadoop by a factor of 20. Efficiencies aside, Spark must be doing something differently – very differently. Spark owes its lightning-fast reputation to Resilient Distributed Datasets (RDDs). RDDs are a relatively new abstraction for in-memory computing resulting from a research project in the AMPLab at UC Berkeley. As the name implies, RDDs are fault-tolerant, parallel data structures ideally suited to in-memory cluster computing. Consistent with the Hadoop paradigm, RDDs can persist and be partitioned across a Big Data infrastructure, ensuring that data is optimally placed. And, of course, RDDs can be manipulated using a rich set of operators. Ultimately, RDDs comprise the primary justification for the exponentially escalating interest (a.k.a. hype) in Spark – this is a sound example of tech transfer R&D at its finest.
- Spark uptake is significant. Spark 1.2.0 was released in mid-December 2014. Over 1,000 commits were made by the 172 developers contributing to this release – that’s more than 3x the number of developers that contributed to the previous release, Spark 1.1.1. Although Spark’s ability to engage the community is critical to ultimate success, also impressive is the ever-growing list of on-the-record organizations making use of Spark. There are almost certainly many other organizations that have not gone on the record … for a variety of reasons. Spark cannot be ignored and, in part, this is forcing some of the well-established players in the Hadoop ecosystem to cooperatively compete. Whereas integrations involving data sources, workload managers, and even the management of Spark are relatively uncontentious, synergies involving analytics apps for graphs, machine learning and streaming will remain problematic. As enterprise-software juggernauts HP, IBM, Microsoft (e.g., Microsoft Cosmos), Oracle, SAP (e.g., SAP HANA) and others continue to recontextualize themselves for Big Data Analytics, the disruptive impact of Spark is likely to become increasingly influential.
- Spark’s results are jaw-droppingly impressive. Spark bests Hadoop by a factor of 20 in a best-case scenario for Hadoop – involving use of binary data and an in-memory HDFS instance. Spark even beats Hadoop by a factor of 10 when memory is unavailable and it has to use disks. In this latter case, it’s again the RDD abstraction for in-memory computing along with an efficient execution engine that accounts for the difference. With results like these, Spark is being put through its paces by numerous people at numerous organizations all the time. You can easily Google for the latest results. The inescapable conclusion is that Spark really is lightning fast. And isn’t this the most-demonstrable substance to back up the interest in Spark? Because it is boasting performance comparable to Spark’s, Apache Flink deserves mention. Further research is required.
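To make the RDD idea from the list above more concrete, here is a minimal sketch in plain Python. It is not Spark's API: `ToyRDD` and all of its methods are hypothetical names invented for illustration, and everything runs in a single process. It only tries to show three properties attributed to RDDs above: data split into partitions, transformations recorded lazily as lineage, and a persisted in-memory copy that can be rebuilt from that lineage if it is lost.

```python
# Toy, single-process model of the RDD idea. NOT Spark's API: ToyRDD and
# its method names are hypothetical and exist only for this illustration.

class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # list of lists; set only on a root dataset
        self._parent = parent       # lineage: the dataset this one derives from
        self._fn = fn               # per-partition function applied to the parent
        self._persist = False       # keep computed partitions in memory?
        self._cache = None          # materialized partitions, once persisted
        self.computations = 0       # times this dataset was actually computed

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the data into at most num_partitions contiguous chunks.
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return cls(partitions=parts)

    def map(self, f):
        # Lazy: only records how to derive the new dataset from this one.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def persist(self):
        self._persist = True
        return self

    def drop_cache(self):
        # Simulates losing the in-memory copy (for example, a failed node).
        self._cache = None

    def _compute(self):
        if self._cache is not None:
            return self._cache
        if self._parent is None:
            parts = self._source
        else:
            # Recompute from lineage, one partition at a time.
            parts = [self._fn(p) for p in self._parent._compute()]
        self.computations += 1
        if self._persist:
            self._cache = parts
        return parts

    def collect(self):
        return [x for part in self._compute() for x in part]

    def sum(self):
        return sum(sum(part) for part in self._compute())


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares = numbers.map(lambda x: x * x).persist()
even_squares = squares.filter(lambda x: x % 2 == 0)

# An iterative job reuses the persisted squares instead of recomputing them.
totals = [even_squares.sum() for _ in range(3)]
```

In this sketch the three passes over `even_squares` compute `squares` only once, which is the kind of reuse that iterative algorithms benefit from. Calling `squares.drop_cache()` and running the job again rebuilds the lost partitions from the recorded lineage rather than from a replica. Real Spark distributes partitions across a cluster and schedules the work for you; none of that is modeled here.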
From the comp sci R&D lab to the highly coveted implementation known as Spark, this converged platform for Big Data Analytics workflows delivers compelling results. In the case of Spark, there is substance to support the rapidly escalating levels of interest. Spark may cause Google Trends to introduce a logarithmic scale.