In observance of Hadoop’s 10-year anniversary, I recently caught up with Doug Cutting, Chief Architect of Cloudera and creator of Hadoop, to reflect on the past decade of Hadoop and to look to the future. Doug is the founder of several successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera in 2009, after previously working at Yahoo!, Excite, Apple, and Xerox PARC. Doug holds a Bachelor’s degree from Stanford University and is the former Chairman of the Board of the Apache Software Foundation.
Daniel – Managing Editor, insideBIGDATA
insideBIGDATA: In the early days of Hadoop when you were working together with Michael Cafarella, did you foresee today when Hadoop takes a commanding role in the big data industry?
Doug Cutting: Honestly, no. I thought that if we could get Hadoop to be somewhat stable and reliable then engineers and researchers would want to use it, and that would make it a successful open source project. What I didn’t predict was that mainstream enterprises would be willing to adopt open source software. The people who founded Cloudera saw that possibility and began the steps to make Hadoop acceptable to enterprises. I joined Cloudera a year later as I began to see the magnitude of this possibility. Watching Hadoop evolve from one project to an entire ecosystem is not something I could have ever imagined back in those early days.
insideBIGDATA: You must be energized when you hear of specific Hadoop use cases where it is being used for good, for helping people in meaningful ways. Can you talk a little about your favorites?
Doug Cutting: There are so many it’s hard to pick just a few examples. This ten-year milestone has really made me think about all the companies that are powered by Cloudera and Hadoop. You have health care, where we’re improving the lives of children; automotive manufacturing, which is transforming dramatically and where self-driving cars are a reality; retail, where the customer experience is being shaped in real time; and even agriculture. Who would have imagined sensors on tractors to improve crop production!
insideBIGDATA: Can you say a few words about the genesis for placing Hadoop in the open source distribution model? How do you feel it’s turned out?
Doug Cutting: What I wanted was to build an open source project that would survive. My goal was to work on software that would get used and keep being used and to create something not just for myself or the current company I was working for. I learned through the Apache Lucene project that open source was a great way to do this. It gives you an advantage toward adoption. People would readily adopt it and use it because they didn’t have to pay anything, and they could even help fix it when they encountered problems.
To me the surprising part is the cultural shift that happened. There was a culture of enterprise database software, and most people would only trust things that came from very established companies, such as IBM, Oracle, and Microsoft. And I had always worked on software that didn’t fit within that traditional circle. For the things I worked on, we didn’t use relational database software, nor did we expect people in the enterprise to ever use the software we worked on.
The biggest surprise is that, to a large degree, these two communities have merged. Big banks, insurance companies, railways, and retailers now accept that open source is a valuable source for technology and they are willing to adopt it. And the open source community now respects big enterprises as a valuable destination for the technology, and the two are collaborating and helping deliver products that meet enterprise needs, for example by taking security and reliability much more seriously. That’s the change that I didn’t see coming: these two communities of software development have come to accept one another and build from one another and to some extent have merged.
insideBIGDATA: Now that Hadoop has reached this 10 year milestone, what stage in its lifespan do you feel we’re in right now? More years ahead?
Doug Cutting: My thought on software is, as long as it’s evolving it’s alive, and as soon as software stops evolving it’s dead. Hadoop is clearly still evolving and the ecosystem around it is thriving. We’ve now got Spark, and other open source projects, that are driving many new use cases. We are going to continue to see this trend for a long time. This is the new world of enterprise software: a loose confederation of open source projects, with some standards (either APIs or file formats or so on) which developers can integrate, so that you can replace components or add new ones to the stack and the stack itself can evolve.
In the past, you had these big companies that controlled core platforms and sold them, and they had little incentive to even try to develop anything different or to evolve that platform fundamentally. Now the platform is controlled by the community of users. They are the innovators who develop new products that are essentially hypotheses about what the market might adopt. Then we see what gets adopted, and that becomes the new standard part of the platform. Any particular component may not last forever, some might last a long time, some might not, but this new style, this Darwinian style, of creating software for enterprises and industries, I think, is here to stay.
insideBIGDATA: How do you see Hadoop evolving in the future alongside new execution engines like Spark?
Doug Cutting: It’s hard to know what’s going to strike the right nerve and what people are going to find useful at any given time. Most of the systems we have today are designed around the performance characteristics of existing hard drives and existing DRAM. When you change something that fundamental, the software systems need to evolve. I’m sure we’ll see some entirely new database systems built that really take advantage of this. So that’s a big area to watch. Spark is a great execution engine, and that’s where we see most Spark adoption, as an execution engine on top of HDFS.
Spark is definitely interesting but there are a lot of things Spark isn’t. For instance, it isn’t a full-text search engine; Solr assumes that role in the Hadoop world. You can run SQL queries against Spark, but it isn’t designed to be an interactive query engine; for that, there’s Impala. If all you need is streaming programming or batch programming, and you need an execution engine for that, Spark is great. But people typically want to do more than that: they want to do interactive SQL, they want to do search, they want to do various sorts of real-time processing involving systems like Kafka. I think anyone who says Spark is the whole stack is doing a limited number of things.
I also think Kudu is very exciting and still evolving. It’s a new storage engine that offers a lot of low-latency, random-access capabilities that HDFS doesn’t, while still permitting the fast analytics that you can do on the flat files in HDFS. I think that’s going to be incredibly popular and it’s being embraced by a lot of different open source projects already.