Efficiency: Big Data Meets HPC in Financial Services

According to Wikipedia, “Efficiency is the (often measurable) ability to avoid wasting materials, energy, efforts, money, and time in doing something or in producing a desired result. In a more general sense, it is the ability to do things well, successfully, and without waste.”

Converging High Performance Computing (HPC) and Lustre* parallel file systems with Hadoop’s MapReduce for Big Data analytics fits this definition well: it eliminates the need for a separate Hadoop storage infrastructure and speeds up the entire analysis. Convergence is a solution of particular interest for companies that already have HPC in their infrastructure, such as those in the Financial Services Industry (FSI).

“The insight FSI companies can receive from outside, non-structured sources can have an important impact on their businesses,” says Brent Gorda, General Manager of Intel’s High Performance Data Division. “They need to be able to correlate this content with their SQL data.” Now, they can, as a recent study by Tata Consulting Services (TCS) illustrates.

“We provide IT services on a large scale across a wide range of industries,” says Rekha Singhal, Senior Scientist with TCS. “We wanted to see how real financial services applications, not benchmarks, and real data would perform in a Hadoop framework on top of an HPC architecture.”

Ms. Singhal and her colleagues were dealing with two problems in their analysis. Data sets for financial and insurance applications can be massive—up to four terabytes and larger—so moving that much data between a Hadoop cluster and a Lustre file system is inefficient. And creating a new Hadoop cluster with local storage just to run MapReduce jobs would be expensive for customers.

“Generally, companies that want to do Big Data analysis are adopting the Hadoop platform,” adds Ms. Singhal. “But if they have HPC, they have to move data from Lustre to HDFS, do the analysis using MapReduce processing, and then read the data back to Lustre and the HPC applications that control the financial simulations. Our objective was to come up with a platform for Hadoop data analysis using an HPC cluster that would give us good performance.”
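The round trip Ms. Singhal describes can be illustrated with a minimal Python sketch. The function and file names here are hypothetical stand-ins (local directories play the roles of Lustre and HDFS, and a record count stands in for the MapReduce job); the point is only to show the two copies that a converged setup removes.

```python
import os
import shutil
import tempfile

def analyze(path):
    """Stand-in for the MapReduce analysis: count records in the input."""
    with open(path) as f:
        return str(sum(1 for _ in f))

def staged_analysis(lustre_path):
    """The round trip convergence removes: copy from Lustre into
    Hadoop-local storage (HDFS), analyze, then write results back
    to Lustre for the HPC simulations."""
    stage_dir = tempfile.mkdtemp(prefix="hdfs_stage_")  # stand-in for HDFS
    staged = shutil.copy(lustre_path, stage_dir)        # copy 1: Lustre -> HDFS
    result = analyze(staged)                            # MapReduce job
    with open(lustre_path + ".out", "w") as f:          # copy 2: HDFS -> Lustre
        f.write(result)
    shutil.rmtree(stage_dir)
    return result

def in_place_analysis(lustre_path):
    """With Hadoop reading Lustre directly, both copies disappear."""
    return analyze(lustre_path)

# Toy input standing in for a financial data set on Lustre.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("trade1\ntrade2\ntrade3\n")
    lustre_file = f.name

staged_result = staged_analysis(lustre_file)
in_place_result = in_place_analysis(lustre_file)
os.remove(lustre_file)
os.remove(lustre_file + ".out")
```

Both paths produce the same analysis result; the staged version simply pays for two extra copies of the data, which at the multi-terabyte sizes TCS describes becomes the dominant cost.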

TCS ran their workloads on an HPC cluster at Intel’s HPC Lab in Swindon, England using Intel® Enterprise Edition for Lustre software with its Hadoop Adapter for Lustre (HAL) and HPC Adapter for MapReduce (HAM) components. Intel and TCS worked together to optimize their financial services and insurance applications for Lustre on HPC.

“We used the adapters in Intel® Enterprise Edition for Lustre software to connect Hadoop to the Lustre file system and run MapReduce in an HPC environment,” comments Singhal. “We ran two very complex queries using real applications, some with join operations, Java code, and SQL, on each of the financial and insurance data sets. To exercise the test fully, we used different data sizes and levels of concurrency. When we did the evaluations with Hadoop on Lustre as well as Hadoop on HDFS, we found the solution with Lustre ran three times faster.”

With data sets as small as 200 gigabytes, HDFS does not noticeably impact overall performance, even where Hadoop output must feed the HPC cluster for post-analysis simulations. But today, companies have access to petabytes of social media data. When the data scales to terabytes, Lustre is faster, according to Gabriele Paciucci, a Solutions Architect with Intel’s High Performance Data Division.

Part of the performance gain is because the two adapters create an efficiency not achievable in a traditional Hadoop framework: they eliminate MapReduce’s shuffle phase. With HAM and HAL, Hadoop writes all the map output to a globally accessible Lustre data store at up to 2 TB/sec, removing the need for nodes to communicate sideways across the cluster to share results. MapReduce’s shuffle simply disappears; the reducers read the map results back from Lustre. “Lustre, with the adapters, is the only file system that can scale with those kinds of data volumes and serve it up efficiently for Hadoop Big Data analytics,” adds Gorda.
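The idea behind the disappearing shuffle can be sketched in a few lines of Python. This is an illustrative toy, not HAL or HAM code: a local temporary directory stands in for the shared Lustre store, and a word count stands in for the MapReduce job. Each “mapper” partitions its output by reducer exactly as a shuffle would, but writes the partitions straight to the globally visible file system, so each “reducer” just reads its partition files back instead of receiving them over the network.

```python
import os
import tempfile
from collections import defaultdict

def map_phase(shared_dir, chunks, num_reducers):
    """Each 'mapper' counts words in its chunk and writes per-reducer
    partition files directly to the shared store (standing in for Lustre),
    instead of holding output for a network shuffle."""
    for mapper_id, chunk in enumerate(chunks):
        counts = defaultdict(int)
        for word in chunk.split():
            counts[word] += 1
        for word, n in counts.items():
            part = hash(word) % num_reducers  # same partitioning a shuffle uses
            path = os.path.join(shared_dir, f"map{mapper_id}_part{part}.txt")
            with open(path, "a") as f:
                f.write(f"{word}\t{n}\n")

def reduce_phase(shared_dir, num_reducers):
    """Each 'reducer' reads its partition files straight from the shared
    store -- no sideways mapper-to-reducer transfer."""
    totals = defaultdict(int)
    for part in range(num_reducers):
        for name in sorted(os.listdir(shared_dir)):
            if name.endswith(f"_part{part}.txt"):
                with open(os.path.join(shared_dir, name)) as f:
                    for line in f:
                        word, n = line.rsplit("\t", 1)
                        totals[word] += int(n)
    return dict(totals)

chunks = ["big data meets hpc", "hpc meets big data analytics"]
with tempfile.TemporaryDirectory() as shared_dir:
    map_phase(shared_dir, chunks, num_reducers=2)
    result = reduce_phase(shared_dir, num_reducers=2)
```

In a real deployment the shared store is a parallel file system fast enough to absorb all map output at once, which is why the adapters can trade the shuffle's all-to-all network traffic for Lustre reads and writes.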

“Intel and Tata have shown how actual FSI applications can run fast and efficiently on top of an HPC architecture,” says Ms. Singhal. “We are sharing this information with our customers.”

Learn more about Intel® Solutions for Lustre software here.
