
Interview: Justin Kestelyn, Technical Evangelism & Developer Relations at Cloudera

While at Hadoop Summit 2016, I had the opportunity to catch up with Justin Kestelyn, head of technical evangelism and developer relations at Cloudera, to discuss the progress his company has made in the past year and what’s in store for the future. Prior to joining Cloudera in 2012, Justin was responsible for developer programs at Oracle, covering Java, Linux, and the Oracle platform.

Daniel D. Gutierrez – Managing Editor, insideBIGDATA

insideBIGDATA: Please give me a quick update for what’s been happening at Cloudera in the past year.

Justin Kestelyn: In terms of what we’ve been working on for the last year, if I had to characterize it, I would say one of the most important things has been a very, very strong focus on hardening, quality assurance, stability, and reliability – continuing to push enterprise-grade quality into the product. As you know, open source is now almost the default choice in the enterprise; it’s not the exclusive choice, but it’s the default one. I think that until recently people had the perception that open source software was somehow of poorer quality than enterprise software.

The fact of the matter is that customers have exactly the same requirements and expectations for open source software as they have for proprietary software. We have been super focused on making our customers happy, and keeping them happy, by delivering a superb enterprise-grade product.

In the last year we brought on a new VP of Engineering, Daniel Sturman, Ph.D. from Google, where he led development of cloud products including Google Compute Engine, Google App Engine, and Kubernetes as well as the internal cluster management systems that manage all computation across Google’s fleet of servers. That’s been one of his primary directives. I think it has been super effective.

Aside from that, in terms of product focus, it’s much the same stuff we’ve been focusing on for the last couple of years. One is performance – ensuring that, for example, Apache Impala is as performant as possible on whatever data source you’re using, whether it’s HDFS, or Amazon S3, or what have you. We’re also very, very focused on security. Cloudera’s platform is the only PCI-compliant platform available, and as you might expect, both of those things are enterprise-level requirements. So we’re super focused on that.

insideBIGDATA: How are you addressing all the noise in the industry today about things like data lakes, the Internet of Things, and streaming analytics?

Justin Kestelyn: That’s big. It’s only going to become more widespread, more mainstream, more significant, and as I’m sure you know, it crosses multiple industries – whether it’s the consumer device industry, or manufacturing, or logistics, or the medical device industry, or what have you. I mean, it’s really becoming mainstream. Plus, a variety of technologies in the stack address these things in different combinations. Some examples would be Spark Streaming, Apache Flume, Apache HBase, and Apache Kafka – these are all parts of our platform that can be used in various combinations to meet some of those goals. And then of course we work with partners like StreamSets, Informatica, and various companies that are more on the data-ingest side of things, to make sure that that end of the funnel is as performant and as reliable as possible for our customers.

The other part of it is once you have that data, what’s the best way to get value from it? And that’s where the discipline of data science or advanced analytics is going to become more and more mainstream. Machine learning applications – which will help customers do some of this analysis in more of an automated way – will become more prevalent as well. And I think that’s one of the reasons why you see Apache Spark being so popular. It’s because people are recognizing that it is really the best option right now for building those kinds of applications.

insideBIGDATA: What can you tell me about Apache Kudu?

Justin Kestelyn: Yes, Kudu, which you probably hadn’t heard about a year ago. Kudu, which is now incubating at the Apache Software Foundation (ASF), is the first major new Hadoop storage engine since HBase appeared four or five years ago. Basically, it’s a columnar data store designed to facilitate applications that involve what we call “fast data” – doing fast analytics on fast data. In other words, the ability to do analytics on data that’s constantly changing, that’s streaming in from the sources you mentioned. Historically, HDFS is not a great storage medium for that, because you can’t update – you can only append. HBase is the other option, but HBase offers poor performance for scans and analytics, although it does give you the random reads and writes that you need. So Kudu fits right in between those two things. We noticed that a lot of customers were building complicated architectures that combined the two in a hard-to-manage, complex way. What Kudu does is replace those complex architectures with something that’s much simpler, easier to manage, and as performant as either one of them.
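To make the tradeoff concrete, here is a purely illustrative Python sketch – a toy model, not the actual HDFS, HBase, or Kudu API – contrasting an append-only store, where “updating” a value means appending a new record and scanning back to find it, with an updatable columnar store that supports in-place updates and cheap column scans:

```python
# Toy models only: illustrating the storage tradeoff Kudu targets.
# These classes are hypothetical sketches, not real HDFS/HBase/Kudu APIs.

class AppendOnlyStore:
    """HDFS-like: records can only be appended, never updated in place."""
    def __init__(self):
        self.records = []

    def append(self, key, value):
        self.records.append((key, value))

    def latest(self, key):
        # Finding the current value requires scanning for the last append.
        for k, v in reversed(self.records):
            if k == key:
                return v
        return None


class ColumnarStore:
    """Kudu-like toy: columnar layout with in-place updates by key."""
    def __init__(self):
        self.keys = []      # one "column" of keys
        self.values = []    # one "column" of values

    def upsert(self, key, value):
        if key in self.keys:
            self.values[self.keys.index(key)] = value  # update in place
        else:
            self.keys.append(key)
            self.values.append(value)

    def scan_values(self):
        # Analytics-friendly: a scan reads one contiguous column.
        return list(self.values)


append_store = AppendOnlyStore()
append_store.append("sensor1", 10)
append_store.append("sensor1", 42)     # "update" = another append
print(append_store.latest("sensor1"))  # 42, found by scanning backwards

columnar = ColumnarStore()
columnar.upsert("sensor1", 10)
columnar.upsert("sensor1", 42)         # true in-place update
print(columnar.scan_values())          # [42]
```

The append-only store keeps both records around and pays a scan cost per lookup; the columnar store keeps one current value per key, which is what makes analytics on constantly changing data practical.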

Kudu is in beta. It’s an Apache incubator project, and a beta has been available for five or six months now. Eventually it will be shipped and supported in our platform. We don’t know when that will be yet, but we’re making progress.

insideBIGDATA: Have you heard some compelling new use cases that have come by in the last year that Cloudera is particularly proud of?

Justin Kestelyn: Yes, in the healthcare sector we have a customer called Children’s Healthcare of Atlanta. We worked with them to create an application that allows them to monitor preemies – prematurely born babies – in their NICU, where instrumentation monitors their heart rate, breathing rate, blood oxygen saturation, and so on. They’re using Hadoop, and Spark specifically, to identify patterns that help them predict when a preemie will become endangered, so they can take proactive action before those things happen. This is also a great use case in the Internet of Things area, in terms of gathering sensor data and making sense of it.

Another interesting use case is CERN in Switzerland. They’re using Hadoop because they have more data than anybody on the planet. They’re generating massive amounts of data that they make available to researchers around the world via various data centers. They want to be able to predict, based on usage patterns, which data sets are the most popular so they can optimize for those, and they’re using Hadoop and Apache Spark to help them do that. It’s almost a website-optimization analog, in that they’re analyzing traffic to understand future patterns so they can optimize for those. That’s super cool.

But it’s not really any one particular use case. I think it’s the fact that we’re seeing more and more use cases in really mainstream industries. Hadoop basically grew up in the online advertising industry, and now you’re seeing it in healthcare, manufacturing, retail, shipping and logistics, governments. It’s really everywhere.

insideBIGDATA: What do you say about these doom-and-gloom articles people write about Hadoop failures – where there are more failures than successes?

Justin Kestelyn: We’ve seen this pattern before. I remember reading articles about failed data warehouse projects back in the early ’90s, failed BI projects – insert term here. Frankly, I think that’s just something that’s easy to write about, and I don’t think it’s particularly specific to Hadoop. Yes, Hadoop is a complex platform, and the knowledge level is relatively low compared to traditional technologies. But I can pretty much guarantee you that 90% of those so-called failures have nothing to do with Hadoop’s technology and everything to do with corporate culture. Do you know why you want to use Hadoop? The questions to ask are: do you understand your own use case? Did you have business objectives? What was your plan for educating users? It’s all the same “IT project 101” stuff. We still have to push forward in terms of making the platform more consumable and easier to adopt – there’s definitely work to be done there by the whole industry. But I think those failure reports are frankly kind of overblown.

insideBIGDATA: How about the future? This year, Hadoop turns ten years old. What does Cloudera see in the future for the product and the platform?

Justin Kestelyn: Well again, I think we need to push forward in terms of making the product as consumable as possible. We need to push forward on those enterprise requirements – speed and performance, security, usability, user experience – those are all super important things we continue to work on. And you’re going to see a combination of familiar use cases rippling across more and more industries, combined with new use cases enabled by things like Kudu and Apache Spark, and by components we probably don’t even know about yet that are waiting in the wings – components the users will tell us are going to become standards. That’s really the unique thing about Hadoop as an ecosystem: it’s moved forward by users. It’s not run by vendors – there’s no single point of control, no governing body, no consortium deciding directions.

This is a very good thing. Vendors don’t decide what Hadoop is – the users and customers decide what Hadoop is. If they decide that Spark is a better option than MapReduce for data processing, well then, that’s what will happen, and we’ll get behind it. That’s what makes the ecosystem we’re talking about so interesting: it’s vibrant and it’s changing all the time. You literally don’t know what’s around the corner. Now, that can be a little frustrating for customers and users, because they want to be able to pin things down. But I think the foundations of Hadoop are, at this point, very secure in terms of the APIs and the components you would use for the majority of use cases out there. HDFS is very mature. Sqoop and Flume are very, very mature components at this point. So the foundation is there, and it’s really about taking things to the next level.

insideBIGDATA: I think corporate decision makers are no longer turning down Hadoop just because it is Hadoop – they’re probably doing the reverse, based on what they’re hearing and reading. That’s the kind of thing I like to write about: acquainting thought leaders with the technology and what it really is and isn’t. I think proposing Hadoop projects isn’t as much of a challenge as it was, say, five years ago.

Justin Kestelyn: Yeah, definitely. But still, it’s just like anything else – do the basics. Understand what your problem is, understand how Hadoop is going to solve that problem, and approach it just like you would approach any other complex IT solution. It’s not magic.

 
