Alluxio Abstracts Underlying Storage from Big Data Applications for Petabyte Scale Computing at In-Memory Speeds


Alluxio (formerly known as Tachyon), the first virtual distributed storage for Big Data, announced its open source version 1.0 release and shared its vision for how Alluxio aims to become the storage abstraction layer for Big Data in the same manner that Apache Spark became the computation layer. This memory-centric architecture breakthrough allows developers to interact with a single storage layer API without worrying about the configurations and complexities of the underlying storage and file systems. Co-created by Haoyuan Li, a founding committer of Spark, Alluxio is an open source software project that was born at UC Berkeley’s AMPLab.

The vision of the Alluxio project is a virtual distributed storage layer between big data computation frameworks and underlying storage systems that delivers data at memory speed to any target framework from any storage system regardless of its location. Historically, in-memory has been viewed as “cache-only”, but Alluxio’s technology breakthrough is its separation of the function layer from the persistent storage layer. Organizations can run any big data framework (Apache Spark, Apache MapReduce, Apache Flink, Impala, etc.) with any storage system or filesystem underneath (Alibaba OSS, Amazon S3, EMC, NetApp, OpenStack Swift, Red Hat GlusterFS, and more), running on any storage media (DRAM, SSD, HDD, etc.).
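To make the idea of a storage abstraction layer concrete, here is a minimal toy sketch in Python. It is not Alluxio's API or implementation: the names `UnderStore`, `DictStore`, and `VirtualStorage` are hypothetical, the backend is a plain dictionary standing in for a system such as S3, and the write-through policy is an assumption made only for illustration. It shows one namespace that routes paths to mounted backends and serves repeated reads from memory.

```python
from abc import ABC, abstractmethod


class UnderStore(ABC):
    """Hypothetical interface for a persistent storage backend."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class DictStore(UnderStore):
    """Stand-in for a real backend; keeps bytes in a dict and counts reads."""

    def __init__(self):
        self.blobs = {}
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.blobs[path]

    def write(self, path, data):
        self.blobs[path] = data


class VirtualStorage:
    """Toy unified namespace with a memory tier in front of mounted backends."""

    def __init__(self):
        self.mounts = {}   # mount prefix -> backend
        self.memory = {}   # full path -> bytes held in memory

    def mount(self, prefix, store):
        self.mounts[prefix] = store

    def _resolve(self, path):
        # Pick the longest mount prefix that matches the path.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self.mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Assumed policy: write through to the backend and keep a copy in memory.
        store, rel = self._resolve(path)
        store.write(rel, data)
        self.memory[path] = data

    def read(self, path):
        # Serve from memory when possible; otherwise fetch once from the backend.
        if path in self.memory:
            return self.memory[path]
        store, rel = self._resolve(path)
        data = store.read(rel)
        self.memory[path] = data
        return data
```

In this sketch, an application calls only `VirtualStorage.read` and `VirtualStorage.write`; swapping the backend mounted under a prefix does not change application code, which is the property the article describes.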

Only three years in existence, Alluxio has gained broad industry support as an open source project. With more than 200 contributors, more than 12,000 commits, and over 50 commercial organizations contributing, Alluxio runs in production at some of the largest cloud providers for Petabyte scale workloads, in financial services to meet government regulations, for research by leading universities, and at technology vendors globally.

In financial services, Alluxio brings many advantages. It helps banks make faster and better trading decisions through dramatic performance improvements and also helps satisfy regulatory requirements. Barclays, the global financial services firm with 48 million customers and clients, recently published a report entitled “Making the Impossible Possible with Tachyon: Accelerate In-Memory Processing with Spark from Hours to Seconds,” about how it uses Alluxio to boost big data analytics performance without duplicating confidential customer information to disk.

IBM Research recently published a blog about using Tachyon for “ultra-fast Big Data processing” to overcome “critical bottlenecks for system workloads.” Intel recently published its findings on the diverse range of Big Data storage challenges that Alluxio can address.

For some of the world’s cloud computing giants, Alluxio is allowing analysts to discover insights interactively by analyzing petabytes of data in near real-time to improve customer experience.

“As one of the largest Internet companies in the world, Baidu constantly faces the challenges of managing data at multi-petabyte scale. By adopting innovative technologies like Alluxio we are able to help our users extract meaningful and useful data almost instantly,” said James Peng, Chief Architect at Baidu. “Our deployment of Tachyon cluster has already reached 1,000 workers, which is one of the largest Alluxio clusters in the world. The tiered storage of Tachyon has provided us great flexibility in managing data in large-scale. We are seeing an average 10-fold, and up to 30-fold performance improvement in supporting interactive query system and other types of workloads. This greatly improved the speed in making important business decisions.”

Background

As a PhD candidate at UC Berkeley, Haoyuan Li saw Spark adoption driving the requirements for more developer-friendly methods for how big data frameworks access persistent data at in-memory speeds. Formerly known as Tachyon, the Alluxio system quickly gained prominence in use cases that required in-memory storage speeds for Spark computation and received early backing from enterprise software and storage leaders, including EMC and Pivotal. Where storage and file systems have historically required high customization and tuning, Alluxio brings a unified interface that’s intuitive for developers, easy for operators, and delivers unprecedented speeds for data access to support the broadest range of Big Data use cases such as machine learning, real-time analytics and streaming data.

“Enterprise storage has been long overdue for the next-generation storage interface that simplifies the interaction between today’s Big Data applications and frameworks with storage systems,” said Haoyuan Li, co-creator of Alluxio and founding CEO of Alluxio, Inc. “Alluxio has enabled this innovation in storage by separating the function layer from the persistent storage layer. Our community has leveraged the power of memory-centric architecture to enable any framework to access any data, from any storage.”

To protect the project from potential trademark litigation and to preserve the intellectual property of the open source software community contributions internationally, the community elected to change the project name from Tachyon to Alluxio. A newly-created Alluxio Open Foundation will be the steward of the project.

In 2015, Andreessen Horowitz invested $7.5M in Alluxio, which has since assembled a team of some of the world’s leading distributed computing experts from Carnegie Mellon University, Google, Palantir, UC Berkeley AMPLab and VMware to foster the adoption of Alluxio and support large-scale production enterprise users.

“AMPLab has created some of the most important open source technologies in the new Big Data stack, including Apache Spark,” said Michael Franklin, Professor of Computer Science and Director of the AMPLab at UC Berkeley. “Alluxio is the next project with roots in the AMPLab to have major impact. We see it playing a huge disruptive role in the evolution of the storage layer to handle the expanding range of Big Data use cases.”

With the release of open source version 1.0, Alluxio added many new key features to simplify developing new distributed applications for Big Data that can bring in-memory performance speeds to any file or storage system. To learn more, see the Alluxio 1.0 release notes.

 
