
Reflecting on Ten Years of Hadoop

In this special guest feature, Ashish Thusoo, co-founder & CEO of Qubole, discusses how he’s seen Hadoop evolve over the past decade, what his experience was with it when it first hit the scene, where he thinks it fits in the data ecosystem today, and what he believes the future holds for Hadoop. Before co-founding Qubole, Ashish ran Facebook’s Data Infrastructure team; under his leadership the team built one of the largest data processing and analytics platforms in the world. The platform not only achieved the bold aim of making data accessible to analysts, engineers and data scientists, but also drove the “big data” revolution. In the process of scaling Facebook’s big data infrastructure, he helped drive the creation of a host of tools, technologies and templates that are used industry-wide today.

This year marks the 10th anniversary of Hadoop, a technology that has come to represent a major transformation in the enterprise computing industry. I began working with Hadoop around its inception, and have seen it become a central platform in big data analytics today. As we celebrate the anniversary of a technology that is so fundamental to so many, I want to shed some light on my own experience with the development and growth of Hadoop.

My earliest experience with Hadoop began in 2007, less than a year after the open-source technology’s release, during my time on the original data service team at Facebook. Before then, we had been using a mix of home-grown software and a traditional legacy data warehouse. However, neither approach could meet the company’s data processing demands, and we soon reached a point where processing a full day’s worth of data took longer than 24 hours. We had an urgent need for infrastructure that could scale along with our data, and it was then that we began exploring Hadoop. The fact that it was an open-source project already being used at petabyte scale, and that it provided scalability on commodity hardware, made it a very compelling proposition for us. Moreover, the same jobs that had taken more than a day to complete now finished within a few hours on the platform.

Our first implementation of the open-source platform focused almost exclusively on batch-processing tasks. At the same time, early Hadoop was often difficult for end users on the data service team, especially those unfamiliar with writing MapReduce programs. It lacked the expressiveness of popular query languages like SQL, and many of us spent hours writing programs for even simple analytical tasks.

It was clear that in order to effectively analyze Facebook’s growing trove of data, we needed to improve Hadoop’s query capabilities. That was what inspired us to layer SQL on top of Hadoop to create Hive. While the platform still carried a heavy batch-processing workload, for the first time our analysts were able to run ad-hoc analyses over data in HDFS. Much of the next few years was spent expanding and refining this infrastructure to meet growing usage, as the tools made big data accessible to ever-larger groups of Facebook employees, and much of the big data architecture we built with Hadoop is still in place at the company today.
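The contrast that motivated Hive can be sketched in a few lines. The following is a minimal illustration in plain Python, not actual Hadoop or Hive code: the same aggregation written first as explicit map, shuffle and reduce phases (the style analysts had to hand-code), then as a single declarative SQL query, shown here against an in-memory SQLite table standing in for a Hive table. The `events` records are invented sample data.

```python
from collections import defaultdict
import sqlite3

# Invented sample "log" records: (user, action) pairs.
records = [("alice", "click"), ("bob", "view"), ("alice", "view"),
           ("alice", "click"), ("bob", "click")]

# --- MapReduce style: explicit map, shuffle, and reduce phases ---
# Map: emit a (key, 1) pair for each record's action.
mapped = [(action, 1) for _, action in records]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the values for each key.
mr_counts = {key: sum(values) for key, values in groups.items()}

# --- Hive style: the same aggregation as one declarative SQL query ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", records)
sql_counts = dict(conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action"))

print(mr_counts)   # {'click': 3, 'view': 2}
print(sql_counts)  # {'click': 3, 'view': 2}
```

Both paths produce the same counts, but the SQL version says *what* to compute rather than *how*, which is precisely what made Hive accessible to analysts who were not MapReduce programmers.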

I attribute much of Hadoop’s early success to its ability to fill a gap left by the parallel data-processing systems available at the time. Most of those systems had limited scalability: they scaled up rather than out, making each computation node in a cluster more and more powerful so that data teams needed fewer of them. As the nodes became less of a commodity, they also became more expensive, driving up the cost of computation. The Hadoop architecture, on the other hand, was built to scale out across commodity nodes, and it brought down the cost of large-scale data processing by an order of magnitude.

Perhaps an equally important factor in Hadoop’s success was open source itself. While there were open-source options for the operational data systems powering applications (such as MySQL and PostgreSQL), the ecosystem for data analysis, data warehousing and data processing was very much dominated by a few proprietary vendors. Hadoop was the first such system created by a community of web-scale companies, and as a result it became a platform open to new insights and innovations from leading industry practitioners. These two factors, a scalable architecture that commoditized large-scale data processing and open-source development, were the key ingredients in Hadoop’s success.

Hadoop has played a crucial role in the development of enterprise computing over the last 10 years, maturing from its early days as an open-source batch-processing platform to the complex analytics architecture we see today. And while it is true that no single architecture can handle the full spectrum of analytics use cases companies now require, the ecosystem of projects under the Hadoop umbrella will continue to provide key data-processing capabilities for a number of different engines. That is the real success of Hadoop: it has galvanized the open-source community to create many new and powerful solutions for analytics infrastructure.

Hadoop also remains the linchpin for many ETL processes and production-ready workloads. At the same time, however, the technology must strike a balance with new projects finding their way into the enterprise-computing space. Hive is being used for complex SQL, Spark has emerged as a great engine for data science and machine learning, and Presto has found its niche in rapid ad-hoc SQL analysis. Newer technologies such as Flink and Heron are emerging on the real-time analysis side, while Quark looks to take advantage of the unique capabilities of various query engines by building SQL federation on top of them. None of these technologies would be possible without Hadoop, and they all build on the core Hadoop platform to enable their capabilities. Needless to say, we have much to look forward to in the next 10 years of Hadoop.

 

Sign up for the free insideBIGDATA newsletter.
