HPC Storage Performance in the Cloud


When it comes to feeding high-performance computing (HPC) and enterprise technical computing clusters with data, Lustre, the open source parallel file system, provides the performance and scalability to meet the demands of workloads on these systems. Lustre is used in well over half of the top 100 supercomputers in the world and nine out of the top ten, and its popularity in the enterprise is growing, not only for technical computing but also for converged technical and office infrastructures. Now, with the growing adoption of HPC in the cloud, Lustre performance is readily available to HPC users on Amazon Web Services (AWS) Marketplace and Microsoft Azure with Intel® Cloud Edition for Lustre* software (Intel® CE for Lustre* software).

“The core benefits of Lustre are its amazing performance and massive scalability,” said Micah Bhakti, Product Manager for Intel CE for Lustre software. “Lustre is essentially a software-defined storage system that’s incredibly efficient. You get maximum performance out of the hardware. This is true on-premise and in the cloud.”

“With Lustre on Amazon,” added Robert Read, one of the software developers behind Intel CE for Lustre software, “we can saturate the network on a compute cluster on Amazon. So, a C4.8xlarge instance on AWS can support up to 500 megabytes per second to Elastic Block Store. We can do that with Lustre. Additionally, Intel CE for Lustre software integrates other features, such as support for using IPsec for securing the file system data over the network.”
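To put the quoted per-instance figure in context, here is a rough sizing sketch in Python. It assumes, purely for illustration, that aggregate throughput scales linearly with the number of storage server instances, which real deployments will not match exactly. The helper names `servers_needed` and `aggregate_ceiling` are hypothetical and are not part of any Intel or AWS tool. Only the 500 MB/s figure comes from the quote above.

```python
import math

# Per-instance EBS throughput quoted in the article for a C4.8xlarge (MB/s).
EBS_MB_PER_S_PER_INSTANCE = 500


def servers_needed(target_mb_per_s, per_server_mb_per_s=EBS_MB_PER_S_PER_INSTANCE):
    """Estimate how many storage servers a target throughput would need.

    Assumption: throughput adds up linearly across servers. This is an
    idealized upper bound, not a measured result.
    """
    if target_mb_per_s <= 0 or per_server_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / per_server_mb_per_s)


def aggregate_ceiling(servers, per_server_mb_per_s=EBS_MB_PER_S_PER_INSTANCE):
    """Idealized aggregate throughput (MB/s) for a given server count."""
    return servers * per_server_mb_per_s


if __name__ == "__main__":
    target = 2000  # hypothetical workload target in MB/s
    n = servers_needed(target)
    print(f"{n} servers give an idealized ceiling of {aggregate_ceiling(n)} MB/s")
```

In practice, network limits, striping configuration, and client count all affect how close a cluster gets to this ceiling, so the result is best treated as a starting point for benchmarking.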

“Intel CE for Lustre is essentially the Intel® Foundation Edition for Lustre* software with the appropriate tools to adapt to the cloud environment,” stated Michael MacDonald, a colleague of Read’s and fellow developer on Intel CE for Lustre software. “We take the software services that make up the cluster file system for Lustre and put them on a virtualized stack rather than on physical hardware,” added Bhakti. “Our extra tools orchestrate the cluster being built and installed on the cloud service.”

HPC with Lustre in the cloud enables important computing work to get done without the costs and logistics associated with building out a large on-premise cluster. Scientists and engineers needing to complete their research or simulations have easy access to massive resources for their projects. It also lets them scale to exactly the level of resources they need at any given moment to achieve the throughput that workloads such as whole genome sequencing (WGS) require. A user can stand up a compute cluster, complete with a Lustre storage cluster capable of handling terabytes of data, in a matter of minutes, instead of spending months or years on procurement, installation, and validation of an on-premise solution. “I’ve stood up 50- to 60-node Lustre clusters on Amazon in less than ten minutes,” said Read.

“Running workloads in the cloud on Lustre also takes away the challenges of maintaining an on-premise cluster,” said Bhakti. “There are a lot of things that Amazon and Microsoft are doing for you, like encryption, backup, and server maintenance for the Lustre deployment. Taking that work out of your support team and giving it to an organization that is used to doing it at scale is a real benefit. The company’s resources can focus on other important initiatives that need attention.”

According to Intel’s Reference Architecture white paper on building Lustre on AWS, Intel CE for Lustre supports several advanced AWS capabilities, including Amazon Virtual Private Cloud (VPC), which lets users “provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define.” Upon launching Lustre, AWS also automatically sets up Amazon EC2 AutoScaling for high availability. If a Lustre instance starts to fail, AutoScaling detects the condition, starts a new instance, and reattaches the orphaned target’s resource, keeping data accessible for the compute cluster.

“HPC customers on the cloud need performant storage,” said Bhakti. “Amazon and Intel worked closely together to develop and launch the first offering of Lustre on AWS Marketplace to support their HPC services.” Since then, Microsoft and Intel have worked together to add Intel CE for Lustre to Azure for their HPC customers.

Intel points out that Lustre is ideal for running I/O-demanding workloads during the compute stage and is not recommended for storing data long term. The company suggests using a cloud service provider’s (CSP’s) storage service, such as S3 on Amazon, for long-term data retention.

SAS offers software services on AWS using its SAS Grid Manager to run clusters for high-performance data analytics (HPDA). This allows SAS customers to move to the cloud. SAS, which was looking to Lustre to meet the I/O demands of SAS Grid, completed some significant performance testing on AWS with Intel CE for Lustre software. The testing is detailed in a SAS white paper. SAS noted that, for performance, they focused on throughput, not capacity, when configuring the storage architecture.

As a comparison, in an on-premise, single-instance SAS (non-SAS Grid) deployment, users can see very high throughput from their workloads. In the testing SAS performed, ramping from 900 MB/s to 2.0 GB/s and peaking at 4+ GB/s would be a ‘good average’ for an on-premise, single-instance deployment using the workloads they ran on AWS with Lustre. This is typical throughput across all three file systems that SAS software uses: SASDATA (permanent storage), SASWORK (working file space), and UTILLOC (utility file space).

For testing on AWS, SAS engineers configured a SAS Grid with Lustre and used two different storage configurations: 1) they deployed a SAS Grid cluster that tapped fast, local SSDs on the compute nodes for SASWORK and UTILLOC, while using Lustre on AWS for SASDATA; 2) they ran all data from the shared Lustre file system.

According to SAS, “Testing results showed good processing efficiency and that the workload was not I/O bound. This means that the SAS workload was able to use the compute power fully for the 64 cores (4 i2.8xlarge instances x 16 cores per instance) under test.” They achieved 1.5 Gbps of throughput from their Lustre cluster, which offered all the performance needed for their workloads. In other testing, they saw up to 3 Gbps of uncached performance from Lustre on Amazon. Their work is summarized in a post on Amazon.
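The core count in the SAS quote, and the share of throughput available to each core, can be checked with a few lines of Python. This is only a sketch: the function name `per_core_share` is hypothetical, the throughput figure is taken at face value in the units the article reports, and an even split across cores is an assumption made for illustration.

```python
def per_core_share(total_throughput, instances, cores_per_instance):
    """Return (total cores, throughput per core).

    The throughput value keeps whatever unit the caller passes in;
    dividing it evenly across cores is a simplifying assumption.
    """
    if instances <= 0 or cores_per_instance <= 0:
        raise ValueError("instance and core counts must be positive")
    cores = instances * cores_per_instance
    return cores, total_throughput / cores


if __name__ == "__main__":
    # Figures from the article: 4 i2.8xlarge instances x 16 cores, 1.5 Gbps.
    cores, share = per_core_share(1.5, instances=4, cores_per_instance=16)
    print(f"{cores} cores, {share:.4f} per core (same unit as the input)")
```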

Intel offers its Lustre cloud edition software in various value-added offerings on AWS Marketplace. Supported offerings come with help from Intel experts. “There’s some learning curve in getting familiar with Lustre and how to launch and maintain it,” said Bhakti. “So, it’s offered with support and without support to accommodate the experience of the users.” Intel also provides several AWS CloudFormation templates for launching a Lustre cluster. The SAS authors of the Amazon post expressed their amazement at being able to get Lustre running and available in 10 to 15 minutes using the template.

Contributed by: Ken Strandberg, a technical storyteller. He writes articles, white papers, seminars, web-based training, video and animation scripts, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, HPC, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom.

 

