I recently caught up with Max Herrmann, CMO at Cask, to discuss his company’s CDAP platform and Hydrator and Tracker applications. Max has more than 15 years of enterprise software experience across a range of product marketing, corporate marketing and business development functions, and has been instrumental in positioning several highly innovative companies for market success. Prior to Cask, he served as EVP Marketing for GridGain Systems. He also spent six years at Microsoft in senior product marketing roles for the company’s datacenter and cloud solutions. Max holds a master’s degree in Aerospace Engineering from the University of Stuttgart, and an MBA from the Technical University of Munich. Russ Savage, Application Engineer at Cask, also contributed to this interview.
Daniel D. Gutierrez – Managing Editor, insideBIGDATA
insideBIGDATA: Please give me a rundown on Cask and where the company fits in with the big data ecosystem.
Max Herrmann: I’d be happy to. If you think about the traditional three-tier or app server world, you have tools like unit testing and CI testing, and you have libraries available that you can pull in; you don’t have to code everything yourself. These things don’t typically exist in a Hadoop environment. As a developer who wants to create application logic, you start coding directly against low-level infrastructure APIs, when what you really want is to focus on the app logic. Wouldn’t it be nice if you could just assume the infrastructure is available to you, and focus on the application itself? That’s essentially what CDAP does as an application integration platform: it creates abstractions on top of the low-level APIs.
For example, for data sources – different data types, different data sources – we have an abstraction called “data sets.” For us, everything is a data set, which means it can also be moved between applications easily. We have another abstraction called “programs.” A program could be literally anything – a stream, a MapReduce job, a Spark job – and it is pluggable between different applications. On top of these, we have built Hydrator and Tracker, which are purpose-built applications that take advantage of the capabilities in the CDAP platform, along with a nice drag-and-drop UI on top.
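To make the idea concrete, here is a minimal sketch in Python of what “everything is a data set, every program is pluggable” can look like. CDAP’s real abstractions are Java APIs; the `Dataset` and `Program` names below are purely illustrative.

```python
# Illustrative sketch: data sets as a uniform data abstraction,
# programs as pluggable units of work that read and write data sets.

class Dataset:
    """A named collection of records; every program consumes and produces these."""
    def __init__(self, name, records=None):
        self.name = name
        self.records = list(records or [])

class Program:
    """Base class for any pluggable unit of work (stream, batch job, etc.)."""
    def run(self, dataset):
        raise NotImplementedError

class UppercaseJob(Program):
    """A toy 'MapReduce-style' job: transform every record."""
    def run(self, dataset):
        return Dataset(dataset.name + "_upper",
                       [r.upper() for r in dataset.records])

class CountJob(Program):
    """A second program, swapped in against the same data set abstraction."""
    def run(self, dataset):
        return Dataset(dataset.name + "_count", [len(dataset.records)])

# Because both programs share the same interface, they can be chained or
# exchanged freely; the data set moves between them in the same form.
events = Dataset("events", ["click", "view"])
upper = UppercaseJob().run(events)
count = CountJob().run(upper)
```

The point of the sketch is only the shape: once everything is a data set, any program can be plugged in front of or behind any other.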
Let’s consider Hydrator first. A lot of people, when they’re trying to get into production, first have to figure out how to get the data in the door, so instead of coding against Flume or Kafka and ending up with a tangle of different streams, they can use Hydrator, a data ingestion tool that makes it very easy to build a data pipeline. Think of it as “ETL plus”: it’s not just ETL, but we also allow you to do aggregations, machine learning, and modeling, and we provide you with a set of standard sources, transformations, and sinks. And because everything – the platform and the extensions – is 100% open source, if you don’t find the source you want to read from, or the sink you want to write to, you can build your own plug-ins. It’s all open source.
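The source → transform → sink model can be pictured roughly like this. This is an illustrative Python sketch with made-up plug-in names; real Hydrator plug-ins are Java classes wired together through the drag-and-drop UI.

```python
# Illustrative sketch of an ETL-style pipeline: a source feeds records
# through a chain of transforms into a sink, and each stage is a plug-in
# that users can replace with their own.

def csv_source(lines):
    """Source plug-in: parse raw comma-separated lines into records."""
    for line in lines:
        name, amount = line.split(",")
        yield {"name": name, "amount": int(amount)}

def add_tax(records, rate=0.1):
    """Transform plug-in: an example of 'ETL plus' logic beyond plain copying."""
    for r in records:
        r["total"] = round(r["amount"] * (1 + rate), 2)
        yield r

def list_sink(records, out):
    """Sink plug-in: write to an in-memory list (stand-in for HDFS, a DB, ...)."""
    for r in records:
        out.append(r)

def run_pipeline(lines, transforms, out):
    records = csv_source(lines)
    for t in transforms:          # plug in any chain of transforms here
        records = t(records)
    list_sink(records, out)

out = []
run_pipeline(["alice,100", "bob,50"], [add_tax], out)
```

Swapping the sink for one that writes to a different system, or inserting another transform, changes only the list of plug-ins, not the pipeline itself.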
Next is Tracker. The idea with Tracker is that people, particularly in the more compliance-sensitive industries like financial services and healthcare, are trying to figure out what happens with their data once it enters Hadoop. Once it enters a data lake, for example, they want to know the origin of the data as it gets transformed, and who has access to it. So we provide audit trails, we provide metadata tracking for all the metadata in the system, and we provide data lineage. Once data enters CDAP, it doesn’t matter where it goes from there; we always have a record of where the data originated and where it went. That’s it in a nutshell. It’s a platform that runs on top of the distros, and we’re certified with all the leading distros. Cloudera is actually an investor in Cask, so we work with them very closely in the market. We’re also certified with Hortonworks and MapR, and we support the cloud distros as well. That’s one of the nice things about building an application on CDAP: it gives you real portability benefits.
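The lineage idea itself is simple to sketch: every operation that touches a data set appends an entry recording who did what, from which input to which output, and the origin of any data set can then be reconstructed by walking that log backwards. This is an illustrative Python sketch; Tracker records these entries automatically inside CDAP.

```python
# Illustrative sketch of data lineage: an append-only log of
# (user, operation, input, output) entries, walked backwards to find
# where a data set originally came from.

lineage_log = []

def record(user, op, src, dst):
    lineage_log.append({"user": user, "op": op, "src": src, "dst": dst})

def origin(dataset):
    """Walk the log backwards from `dataset` to its original source."""
    current = dataset
    while True:
        parents = [e for e in lineage_log if e["dst"] == current]
        if not parents:
            return current
        current = parents[-1]["src"]

# A toy history: ingest, then two transformations by different users.
record("etl-service", "ingest",    "s3://raw/events", "raw_events")
record("analyst",     "transform", "raw_events",      "clean_events")
record("analyst",     "aggregate", "clean_events",    "daily_totals")
```

The same log doubles as an audit trail, since each entry names the user who performed the operation.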
Let’s say you start out with one of the on-premise distros, and at some point you decide you want to run that same application in the cloud. With CDAP that becomes very easy, because we are already abstracted against the different versions and components the packages have, so you can very easily move your on-premise application to the cloud. A similar idea applies to some of our customers who are trying to support multiple Hadoop platforms or distros for their own customers. They’re writing applications and services for customer A, who is using MapR, and for customer B, who is using Cloudera, so they find themselves having to support all these different environments. We make that easy, because to them the application looks the same whether it ultimately runs on MapR or Cloudera. That’s another benefit of having an abstraction layer between the distros and the actual application.
One of the things I keep hearing more and more is that with open source, people feel like they’re not locking themselves in with the application or the code, which is obviously a big benefit. But what’s happened in the last 18 months or so is that more and more of those distros are forking, creating components that are not necessarily reflected in other distros, in terms of their own IP. One distro may be using Spark 1.5 and the next might be using Spark 1.6. With these differences in versions, it becomes very complicated for people to figure out what to do if, in the future, they want to run in a different type of environment. We take care of all of this. We certify against all these distros, so that when a new version of Cloudera or MapR comes out, or a new version of CDAP comes out, we are certified and run in all these environments. So we have an increasingly large support matrix in terms of the different versions and components that we support in each of the distros.
insideBIGDATA: Who is your solution really positioned for? Is it for a small Big Data engineering team, or can it be scaled to a very large deployment?
Max Herrmann: Yes, it can be scaled, and I will say that 90%+ of our pipeline, both in terms of existing customers and prospects, are large enterprises. The typical scenario I’m finding is the customer who has been kicking the tires of Hadoop for the last seven years. They have made investments in Hadoop infrastructure, but they have not been successful getting into production because of the skill sets you need: people who know the low-level APIs and can code in Sqoop and Pig and Hive, etc. We take that requirement away. I shouldn’t say take it away entirely, but we reduce it, because we give you a set of Java APIs that any Java developer can code against; they don’t have to worry about the low-level infrastructure pieces anymore.
So the prospect is typically a large enterprise, and in terms of who’s using it or buying it, we appeal to multiple audiences. The people who are writing the checks still tend to be largely IT, because IT is ultimately responsible for supporting that environment; they’re also typically the ones buying the infrastructure, the distros, and the support, so they end up becoming our buyers. However, the developer is a very important constituency, and we also appeal to data scientists. So we typically have multiple groups that get involved, in particular when it is a strategic decision. We have some very large customers who are looking to deploy us very broadly as their platform for doing Hadoop, and you get multiple groups involved in those decisions.
insideBIGDATA: Can you drill down into some CDAP functionality to show how it improves productivity?
Russ Savage: CDAP itself is a big data application platform whose sole purpose is to help speed up application development on Hadoop. We run on all the major distributions, and we expose endpoints and APIs to help developers develop applications faster. You can actually build and develop big data applications on your own laptop without access to a cluster. We have a couple of different versions of our platform: one is distributed, which will run on a thousand-node cluster; another is standalone, which will run directly on your laptop; and the third is embedded, which you can actually embed into your testing frameworks.
So, for example, for continuous integration, or any sort of JUnit testing of your applications prior to launching, you can build that into your workflow. We speed up that entire process, because you can build, test, and develop on your laptop and work out a lot of the kinks before you actually put an application on a cluster and start running those jobs. That’s one way. The other way is that we have a series of REST endpoints on our platform for application management, so you can start, stop, deploy, and create applications via REST. That lets you easily fit this technology into your existing workflow: whether you’re using a scheduler or your own custom scripts, as long as they can make REST calls, you can work CDAP into your existing process.
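The lifecycle API can be pictured like this. The endpoint paths in this sketch are hypothetical stand-ins rather than documented routes, and the client only formats the requests instead of sending them; consult the platform docs for the real API.

```python
# Illustrative sketch: a tiny client that formats REST calls for
# application lifecycle management (deploy/start/stop), the kind of
# thing a scheduler or custom script would invoke. Paths are made up
# for illustration.

class AppClient:
    def __init__(self, host, namespace):
        self.base = f"http://{host}/v3/namespaces/{namespace}"

    def deploy(self, app):
        return ("PUT", f"{self.base}/apps/{app}")

    def start(self, app, program):
        return ("POST", f"{self.base}/apps/{app}/programs/{program}/start")

    def stop(self, app, program):
        return ("POST", f"{self.base}/apps/{app}/programs/{program}/stop")

client = AppClient("localhost:11015", "default")
method, url = client.start("PurchaseApp", "ingest")
```

Any tool that can issue HTTP requests, from cron scripts to enterprise schedulers, can drive the platform this way.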
Because we have this platform, these APIs, and this RESTful interface, we developed internal applications to solve common problems that we see out there in the market. The first thing we saw is that people were having trouble getting data from outside their cluster into their cluster; they were having to write a lot of custom code and a lot of boilerplate code to do that. So we developed something we call Cask Hydrator. Cask Hydrator is really a data pipelining tool. I’m sure you’ve seen a lot of different examples of this, but the whole idea is to create a logical view of how your data is moving from outside your cluster into your cluster. Through a couple of configurations, you can actually have something up and running without writing any code.
So if you’ve got a custom transform that your team leverages frequently, you can load it in here and build it into a new pipeline. We also provide sinks for a lot of popular systems, but again, you can always add your own sink to the platform. You basically just connect these up, configure them, and then actually deploy your application. For example, a time-partitioned file set has a database name, and you can set the time zone, the partition format, etc. It’s not meant to capture every use case; the whole idea is to be flexible and extensible for anything.
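The partition-format idea boils down to mapping an event’s timestamp onto a directory path. A sketch, with a made-up path layout:

```python
# Illustrative sketch: turn a record's timestamp into a time-partitioned
# output path, the way a time-partitioned file set lays out its data.

from datetime import datetime, timezone

def partition_path(base, ts, fmt="%Y-%m-%d/%H"):
    """Map an epoch timestamp (seconds, UTC) onto a partition directory."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{base}/{t.strftime(fmt)}"

# 2016-05-01 13:30:00 UTC lands in the 13:00 hourly partition.
path = partition_path("/data/events", 1462109400)
```

Changing the format string (or the time zone used for the conversion) changes the partitioning scheme without touching the rest of the pipeline, which is the configurability being described above.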
This is your standard ETL process, but we also have tools to build aggregation and machine learning modeling into the system. In this case, you can aggregate and deduplicate data coming in. That’s going to operate on an entire set of data coming through, and it will create all the temporary tables and storage mechanisms needed to do that. We’re starting to build out more of these compute and machine learning models. For instance, with a naive Bayes classifier, you can build a process that trains a model and then incorporates that model into a decision.
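A deduplicate-then-aggregate stage can be pictured as follows. This is illustrative Python; in Hydrator these stages are configured rather than hand-coded.

```python
# Illustrative sketch: deduplicate records by key, then aggregate amounts
# per group, the way an aggregation stage operates over a whole batch.

from collections import defaultdict

def dedupe(records, key):
    """Keep the first record seen for each value of `key`."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def aggregate(records, group_by, value):
    """Sum `value` per distinct `group_by` value."""
    totals = defaultdict(int)
    for r in records:
        totals[r[group_by]] += r[value]
    return dict(totals)

events = [
    {"id": 1, "user": "alice", "amount": 10},
    {"id": 1, "user": "alice", "amount": 10},  # duplicate delivery
    {"id": 2, "user": "bob",   "amount": 5},
    {"id": 3, "user": "alice", "amount": 7},
]
clean = dedupe(events, "id")
totals = aggregate(clean, "user", "amount")
```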
Next, you can preview a pipeline that has already been configured. When you run it, you’ll actually see how many records are coming out, and you can see where they were processed. So if you had a filter that’s filtering out invalid records (say, invalid zip codes), you’ll see that a thousand records came in and that it filtered out a handful, for example. You can also take an existing pipeline and make a copy of it that you can then edit and change. And you can look at any individual plug-in and see its input and output, so you know that all of it is working before you actually deploy.
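The preview counters amount to counting records entering and leaving each stage. A sketch using the invalid-zip-code example (illustrative, not CDAP code):

```python
# Illustrative sketch: wrap a filter stage so it reports how many records
# came in and how many passed through, like a pipeline preview does.

import re

ZIP = re.compile(r"^\d{5}$")   # valid US zip: exactly five digits

def preview_filter(records, predicate):
    """Apply `predicate` and return (kept records, in/out/dropped stats)."""
    kept = [r for r in records if predicate(r)]
    stats = {"in": len(records), "out": len(kept),
             "dropped": len(records) - len(kept)}
    return kept, stats

records = [{"zip": "94105"}, {"zip": "1234"}, {"zip": "10001"}, {"zip": "abcde"}]
valid, stats = preview_filter(records, lambda r: bool(ZIP.match(r["zip"])))
```

Attaching such counters to every stage is what lets the preview show where in the pipeline records are being dropped or transformed.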
insideBIGDATA: Could you also insert a step that does some sort of impute, so if it wasn’t like a zip code – let’s say it was something that could be calculated based on other data – and then you could actually add that as a step?
insideBIGDATA: Can you use a language that data scientists use a lot, like Python?
Russ Savage: For Python, since all of this is translated into Java behind the scenes, I believe we use the Jython interpreter for some of that; we’re not actually executing standard Python. We use a lot of interpreters behind the scenes, so there’s some sleight of hand there, and there are some nuances between Jython and standard Python to keep in mind. If you were thinking about using this in your team or in your production environment, then for any transformations you’re doing in Python, if you really want full value and full speed, you may think about converting them into their own custom plug-ins. That way you can ensure they’re working the way they should.
insideBIGDATA: How about machine learning algorithms like naive Bayes classifiers? Where does that functionality come from?
Russ Savage: Yeah, this is actually handled behind the scenes as Java code in our framework. Not to get too deep into the details, but since this is all open source, you can download it and take a look yourself. If you have Apache Spark MLlib and you want to use different models from there, you can import them; most of what you’re doing is just wrapping them in some code to use them as a plug-in. A lot of the effort here is around configuring the pipeline and validating the schema, which is why the actual code that executes the model is quite small.
We just have the naive Bayes algorithm as an example. The code is available for you to take and modify, and if you want to switch out the models, you can do that. You can load your plug-ins directly in the UI and have them available for the rest of your team.
insideBIGDATA: One of the things I would put in a data pipeline like this – and I can’t remember if MLlib has this – is a PCA (principal component analysis) algorithm to do feature engineering: where you have upward of 500 features, PCA determines which components carry the variation in the data and which don’t – say 12 of them are all you need to maintain the variation in the data.
Russ Savage: Yes, definitely. Basically, all the magic happens in the run function, which is called by our platform, and that’s where you would put all of this. It receives the context it’s executing in and the input data that it’s going to run the training on. So that’s Hydrator, the pipeline engine. We have a couple of examples that we’ve put together, including a spam classification example, which actually leverages that naive Bayes classifier to make decisions on the data – whether something is spam or not spam.
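A toy version of that plug-in shape: a `run` function that receives a context and input records, trains a small naive Bayes model on labeled messages, and classifies the rest. This is an illustrative Python sketch; the real plug-in is Java code wrapping a library model such as Spark MLlib’s.

```python
# Illustrative sketch: a plug-in style run(context, records) that trains a
# tiny multinomial naive Bayes spam classifier and labels new messages.

import math
from collections import Counter, defaultdict

def train(labeled):
    """labeled: list of (text, label). Returns per-label word counts,
    label counts, and the vocabulary."""
    counts = defaultdict(Counter)
    labels = Counter()
    vocab = set()
    for text, label in labeled:
        labels[label] += 1
        for w in text.split():
            counts[label][w] += 1
            vocab.add(w)
    return counts, labels, vocab

def classify(model, text):
    counts, labels, vocab = model
    total = sum(labels.values())
    best, best_score = None, -math.inf
    for label in labels:
        # log prior + log likelihoods with Laplace (add-one) smoothing
        score = math.log(labels[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

def run(context, records):
    """Plug-in entry point: the context carries the training set."""
    model = train(context["training"])
    return [(text, classify(model, text)) for text in records]

training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team monday", "ham"),
]
results = run({"training": training},
              ["claim your free money", "team meeting monday"])
```

In the pipeline setting, the platform supplies the context and streams the input records in; the plug-in author only writes the body of `run`.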
Remember, the whole idea here is separating the logical flow of how your data moves through the system from the engine it’s actually executing on. We can go behind the curtain a little bit and see how CDAP is interpreting your pipeline and how it’s running. In this case, it has taken that visual pipeline and converted it into three MapReduce jobs and a Spark job. In instances where Spark makes sense, we choose Spark, but that doesn’t stop you from turning everything into Spark jobs.
Because you’re defining that logical view rather than the actual code, you can quickly migrate to new engines as they become available. We store the whole pipeline in a configuration file, which is exportable: you can put it in version control and move it to different places. So when another processing engine comes along and becomes popular, our engineering team is dedicated to making sure this platform supports it. You take your existing job, clone it, change it to the new engine, and now you’ve generated a new job that you can even run side-by-side with the original to compare performance. This way you can move those data ingestion jobs really quickly and focus any migration effort on the really custom applications you’ve developed.
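The export-clone-swap workflow looks roughly like this. The configuration shape below is invented for illustration, not the actual exported format.

```python
# Illustrative sketch: a pipeline stored as plain configuration, cloned
# with a different execution engine so the two variants can be deployed
# side-by-side and compared.

import copy
import json

pipeline = {
    "name": "purchase-ingest",
    "engine": "mapreduce",
    "stages": [
        {"name": "source", "plugin": "kafka"},
        {"name": "clean",  "plugin": "filter"},
        {"name": "sink",   "plugin": "hdfs"},
    ],
}

def clone_with_engine(config, engine):
    """Copy a pipeline config, renaming it and swapping the engine;
    the logical stages are untouched."""
    new = copy.deepcopy(config)
    new["name"] = f'{config["name"]}-{engine}'
    new["engine"] = engine
    return new

spark_variant = clone_with_engine(pipeline, "spark")
exported = json.dumps(spark_variant)   # exportable, version-controllable
```

Because only the `engine` field changes, the logical view of the data flow survives the migration intact.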
insideBIGDATA: What can you tell me about Cask Tracker?
Russ Savage: Yes. Another thing our customers came to us and said is: now that we have all this data in there, we really want to track where it came from – the metadata and information about it, the lineage, the audit logs, and all of that. That’s where our second internal application, Cask Tracker, comes in.
Say you have a schema and some additional metadata. We can add our own metadata to this information. So if you’re working on a team that keeps data dictionaries, for example, you can add something like a description of your data set, or specific information about the fields, or anything like that. You can store that right along with your data, so that anyone looking at it knows exactly what they’re looking at. All of this comes with lineage.
The other important thing is that all of this is time-bound. Some of our medical and healthcare customers want to be able to come in at the end of the year, when they’re being audited, and see how their data sets were manipulated and which applications were running, and that’s where this time-bound feature really helps out.
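Time-bound metadata can be pictured as an append-only store queried “as of” a moment in time. An illustrative sketch (Tracker records these entries automatically; the names here are made up):

```python
# Illustrative sketch: metadata entries carry timestamps, so an auditor
# can ask what a data set's metadata looked like at any point in time.

audit = []   # append-only: (timestamp, dataset, key, value)

def annotate(ts, dataset, key, value):
    audit.append((ts, dataset, key, value))

def metadata_as_of(dataset, ts):
    """Latest value per key for `dataset` at or before time `ts`."""
    state = {}
    for t, d, k, v in sorted(audit):
        if d == dataset and t <= ts:
            state[k] = v
    return state

# A toy history: the description is later revised, but both versions
# remain queryable by time.
annotate(100, "patients", "description", "raw intake records")
annotate(200, "patients", "owner", "ingest-team")
annotate(300, "patients", "description", "cleaned intake records")
```

Because nothing is ever overwritten, a year-end audit can reconstruct the metadata exactly as it stood on any given date.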
insideBIGDATA: How would you differentiate the Cask platform?
Russ Savage: What I like to tell people who ask how we’re differentiated is that you’re not investing in an application, you’re investing in an entire platform. A lot of companies, as they grow more mature in their usage of Hadoop, tend to quickly outgrow applications, but it’s really difficult to outgrow a platform. We expand with the Hadoop ecosystem, which expands with your knowledge and understanding of Hadoop as well. And we’re an open source company: you can look inside the platform. We try to be as transparent as possible because we want this to be more of a partnership. We want your feedback to go into the platform, and we want to help you build and grow as Hadoop builds and grows.