Interview: Replacing HDFS with Lustre for Maximum Performance


Is it time to turn the Hadoop Elephant upside down? In this video from the 2014 Lustre Administrators and Developers Workshop, Gabriele Paciucci from Intel describes how the company has enabled Hadoop users to maximize their performance using the Lustre File System.

“When organizations operate both Lustre and Apache Hadoop within a shared HPC infrastructure, there is a compelling use case for using Lustre as the file system for Hadoop analytics, as well as HPC storage. Intel Enterprise Edition for Lustre includes an Intel-developed adapter which allows users to run MapReduce applications directly on Lustre. This optimizes the performance of MapReduce operations while delivering faster, more scalable, and easier-to-manage storage.”

Full Transcript

insideBIGDATA: I wanted to ask you about this new connector you have for Hadoop. How did this come about?

Gabriele Paciucci: We have released two software connectors for Hadoop. One is the Hadoop adapter for Lustre, which gives you the possibility to run Hadoop jobs on top of Lustre without any HDFS framework. The other connector is the HPC adapter for MapReduce, which gives you the possibility to integrate MapReduce jobs with any existing scheduler.

insideBIGDATA: Let’s start with the first one. Why would I want to connect Lustre to Hadoop? Why not just use HDFS?

Gabriele Paciucci: We did this at the request of a customer who wanted to integrate Hadoop into an existing Lustre cluster. After that, we saw several enterprise customers who want to put their Hadoop projects into production, and the return on investment with three-times replication is very bad, because you have only 25% of the raw capacity of your multi-petabyte system with three-times replication. With Lustre you can achieve a better return on investment, and the full speed of an InfiniBand system. So there are several reasons why enterprise customers and HPC labs are looking at this software with interest.
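The capacity argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: three-way replication alone leaves about one third of raw capacity for user data, and the 25% figure quoted in the interview is reproduced here only under the assumption that roughly a quarter of the disk space is also set aside for intermediate job data. That reserve, the function name, and the 4 PB cluster size are illustrative assumptions, not details given by the speaker.

```python
def usable_fraction(replication_factor: int, scratch_reserve: float = 0.0) -> float:
    """Return the fraction of raw capacity available for unique user data.

    replication_factor: number of copies kept of every block (3 for
        three-times replication).
    scratch_reserve: fraction of raw space assumed to be set aside for
        intermediate job data (an assumption, not from the interview).
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if not 0.0 <= scratch_reserve < 1.0:
        raise ValueError("scratch_reserve must be in [0, 1)")
    return (1.0 - scratch_reserve) / replication_factor


if __name__ == "__main__":
    raw_pb = 4.0  # hypothetical raw capacity in petabytes

    # Replication only: one third of raw space holds unique data.
    replicated_only = usable_fraction(3)

    # Replication plus an assumed 25% reserve for intermediate data.
    with_reserve = usable_fraction(3, scratch_reserve=0.25)

    print(f"3x replication only:       {replicated_only:.0%} "
          f"({raw_pb * replicated_only:.2f} PB of {raw_pb} PB)")
    print(f"3x replication + reserve:  {with_reserve:.0%} "
          f"({raw_pb * with_reserve:.2f} PB of {raw_pb} PB)")
```

A file system that relies on RAID-style parity instead of full copies gives up a much smaller share of raw capacity, which is the return-on-investment point being made; the exact share depends on the storage layout and is not stated in the interview.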

insideBIGDATA: So, what kind of performance difference could I expect, or what have you seen out there – the difference between Lustre and HDFS on a deployment?

Gabriele Paciucci: It’s difficult to say. It’s something that you have to evaluate case by case. But with the full bandwidth that you can get from InfiniBand, you can gain a lot from Lustre compared to HDFS, which is designed for the cloud, not for an environment with a low-latency, high-speed network.

insideBIGDATA: What about the other Big Data software packages, such as Spark, and YARN, and all these other things? Do they come into play here?

Gabriele Paciucci: Yes. The first version of the connector supports MapReduce jobs. Now, in the next version, which we will unveil during the next Supercomputing conference, we will support Pig. And we have plans to also support Spark. But as you know, Intel has made a good investment in Cloudera, so we are aligned with the Cloudera roadmap in order to support Spark. Hadoop is a huge ecosystem.

insideBIGDATA: Coming from Italy, you’ve dealt with a lot of enterprise customers, whereas a lot of us HPC guys have stayed in the research area. Is this exciting for you to see these two worlds coming together?

Gabriele Paciucci: Yes, it’s very exciting, because I’ve seen that many enterprise customers are looking at HPC-style workloads. Something is bringing these two environments together. But it’s also a challenge for us, because we have to push enterprise-ready functionality into Lustre. Intel is doing that: we are implementing several enterprise features in our distribution of Lustre.

See more talks in the LAD’14 Video Gallery.
