Hadoop for HPC—It Just Makes Sense


“Sponsored Post”

Big Data is getting bigger as the Internet of Things (IoT) grows toward an expected 25 billion connected devices by 2020, according to Gartner[i]. Between now and 2020, EMC estimates that data will double every two years, amounting to about 5,200 gigabytes for every man, woman, and child on the planet in 2020 alone[ii]. On top of the various types of data enterprises already have access to, the IoT will add an unending stream from wearables, smart vehicles, smart appliances, and smart everything else, a stream that can potentially reveal entirely new insights through analytics. These are the types and sizes of data sets ripe for Hadoop workloads.

An increasing number of companies that already use High Performance Computing (HPC) clusters running a Lustre file system for simulations see the value of their existing and future data. They are interested in what that data might reveal when they run Hadoop analytics on it. But building out a Hadoop cluster with massive amounts of local storage and replicating their data on the Hadoop Distributed File System (HDFS) is an extensive and expensive undertaking, especially when the data already resides in a POSIX-compliant Lustre file system.

Today, these companies can adopt analytics written for Hadoop and run them on their HPC clusters. With the work Intel has done in their Intel® Enterprise Edition for Lustre Software, Hadoop makes sense for these organizations. They can try Hadoop and then add new analytics without scaling out Hadoop clusters.

“Everyone associates Big Data with Hadoop, which is an important component, and the one we addressed first,” says Brent Gorda, General Manager of Intel’s High Performance Data Division.  “However, there is much more to Big Data and we are busy working on that much more.”

With Intel’s version of Lustre, Hadoop jobs look like any other HPC job, thanks to a specially written interface to the widely used HPC job scheduler, Slurm. The Intel connector is called HAM, which stands for HPC Adaptor for MapReduce. Additionally, Intel developers have written a file system interface to Hadoop called HAL (Hadoop Adapter for Lustre) that replaces HDFS with Lustre. With HAL, Hadoop behaves as if it’s reading from and writing to HDFS. HAL removes the need for replicated local storage and for the work of copying data from Lustre into HDFS. Thus, MapReduce can run on an HPC cluster using data pulled directly from Lustre, potentially giving enterprises with spare HPC cycles new possibilities for their existing data.

Gorda points out that when you look deeper into the architecture, Hadoop is really a workload manager for MapReduce, which spreads the work out over a cluster, maps the data, shuffles it horizontally across the nodes, reduces it down, and digests the results. The shuffle phase, which can move hundreds of terabytes of data, is time consuming, even over a 10 Gigabit Ethernet link. With HAM, HAL, and Lustre, Hadoop writes all the data to a globally accessible data store at up to 2 TB/sec, removing the need to communicate sideways to share the results. The shuffle part of MapReduce simply disappears. For the reduce phase, MapReduce just reads the results back. That delivers significant savings in time to solution. Running evaluations in their Swindon, UK BigData Lab, Intel has seen a 3X speedup on MapReduce running on Lustre over HDFS[iii], partly due to eliminating the shuffle.
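To make the idea concrete, here is a minimal Python sketch of a word count in which map tasks write their partitioned output to a shared directory and reduce tasks simply read it back, with no node-to-node transfer step. It is a toy illustration only, not Intel's HAM or HAL code: the function names (map_task, reduce_task, word_count), the file naming scheme, and the use of a temporary directory as a stand-in for a globally visible Lustre mount are all assumptions made for the example.

```python
import os
import tempfile
from collections import Counter


def map_task(task_id, lines, shared_dir, num_reducers):
    """Count words in one input split; write one file per reducer partition."""
    partitions = [Counter() for _ in range(num_reducers)]
    for line in lines:
        for word in line.split():
            # Deterministic partitioning: character-code sum modulo reducer count.
            partitions[sum(map(ord, word)) % num_reducers][word] += 1
    for r, counts in enumerate(partitions):
        # Hypothetical naming scheme: map<task>-part<reducer>.txt
        path = os.path.join(shared_dir, f"map{task_id}-part{r}.txt")
        with open(path, "w") as f:
            for word, n in counts.items():
                f.write(f"{word}\t{n}\n")


def reduce_task(reducer_id, shared_dir):
    """Read every map output for this partition straight from the shared store."""
    total = Counter()
    suffix = f"-part{reducer_id}.txt"
    for name in sorted(os.listdir(shared_dir)):
        if name.endswith(suffix):
            with open(os.path.join(shared_dir, name)) as f:
                for line in f:
                    word, n = line.rstrip("\n").split("\t")
                    total[word] += int(n)
    return total


def word_count(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, over a shared directory."""
    # The temporary directory stands in for a globally accessible file system.
    with tempfile.TemporaryDirectory() as shared_dir:
        for i, lines in enumerate(splits):
            map_task(i, lines, shared_dir, num_reducers)
        result = Counter()
        for r in range(num_reducers):
            result.update(reduce_task(r, shared_dir))
        return dict(result)
```

Calling word_count([["a b a"], ["b c"]]) runs two map tasks and two reduce tasks. In a real cluster the map and reduce tasks would run on different nodes, and the only thing they would share is the file system path.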

Speedup aside, there is another reason that Hadoop on HPC makes sense. With Hadoop and HDFS, data is fed into the file system and then copied two more times for redundancy. Compute is done on all three copies. Because only about one-third of the raw storage holds unique data, this is both inefficient and expensive. In addition to its performance, Lustre is an efficient, resilient file system. There’s no need to keep data in triplicate, which saves a large hardware investment when Hadoop workloads are big.
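The storage arithmetic behind that point can be sketched in a few lines of Python. The 500 TB data set is a made-up figure for illustration, and treating the Lustre case as a single copy follows the article's simplification; it ignores any RAID or parity overhead on the underlying storage targets. The function names are hypothetical.

```python
def raw_storage_tb(dataset_tb, copies):
    """Raw capacity needed to hold a data set stored `copies` times."""
    return dataset_tb * copies


def usable_fraction(copies):
    """Share of raw capacity that holds unique data."""
    return 1 / copies


dataset_tb = 500  # hypothetical data set size, in terabytes

# Three copies in total: the original plus two replicas, as described above.
hdfs_raw = raw_storage_tb(dataset_tb, copies=3)

# Simplifying assumption: one copy on the shared file system.
single_copy_raw = raw_storage_tb(dataset_tb, copies=1)

print(f"Three copies: {hdfs_raw} TB raw, {usable_fraction(3):.0%} unique data")
print(f"One copy:     {single_copy_raw} TB raw, {usable_fraction(1):.0%} unique data")
```

The gap between the two raw figures grows linearly with the data set, which is why the savings matter most when the workloads are big.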

When Intel’s HPC customers began requesting a way to run Hadoop problems using their existing cluster cycles, Intel developed their HAM and HAL adapters and best practices for enabling Hadoop on Lustre. For these and other companies, Hadoop on Lustre just makes sense.

[i] http://www.gartner.com/newsroom/id/2905717

[ii] http://www.emc.com/leadership/digital-universe/2012iview/executive-summary-a-universe-of.htm

[iii] http://cdn.opensfs.org/wp-content/uploads/2014/10/8-Hadoop_on_lustre-CLUG2014.pdf
