A recent Gartner survey on Hadoop identified the two biggest challenges in working with it: “Skills gaps continue to be a major adoption inhibitor for 57% of respondents, while deciding how to get value from Hadoop was cited by 49% of respondents.” A recent study by the consulting firm PwC supported the second point: “There is a clear business need to unlock the value tied up in information.”
Despite these challenges, enterprises increasingly need to ingest high volumes of data from a wide variety of structured and unstructured sources in high-velocity environments. When we talk with enterprise data project managers, they say, “Our biggest challenge is getting the data from the source into an application.” It doesn’t matter whether they are getting data from a batch process or a streaming source. We hear it over and over again: “I have a source (or sources) of data and I need to get it into a data lake so we can get to the value.”
When organizations decide to roll out Hadoop, they face additional challenges in building solutions on top of it. In a typical scenario, an enterprise must engage costly Hadoop engineers to build highly complex data pipelines. The multiple technologies used in these projects add another layer of complexity, which typically equates to increased costs, longer time-to-production, and a high cost of maintenance. As a result, these pipelines can be difficult to test, often require a great deal of maintenance, and demand significant tuning to achieve performance goals, all of which lead to a high total cost of ownership (TCO).
Ideally, a solution would sit on top of Hadoop that would allow users to quickly drag and drop the appropriate components to build and test data pipelines. Such a solution would translate complex, loosely connected technologies, programming, and scripts into data pipelines on Hadoop that are easy to build and maintain. It would address both the skills gap and the difficulty of extracting business value, and expensive, rare engineering resources would no longer be necessary to build data pipelines on Hadoop. Such a solution would also reduce time-to-production from months to weeks, thereby cutting costs and delivering business value faster.
Such a solution exists: Cask Hydrator, an extension to the open source Cask Data Application Platform (CDAP). Cask Hydrator simplifies the process of developing, running, automating, and operating data pipelines on Hadoop. As a result, it allows users to rapidly build and run streamlined data refineries to support many use cases.
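To make the idea concrete, a Hydrator pipeline built in the drag-and-drop studio is ultimately represented as a declarative configuration of stages (sources, transforms, sinks) and the connections between them. The sketch below is illustrative only: the plugin names, property keys, and stage layout are simplified assumptions meant to convey the shape of such a pipeline, not an exact CDAP artifact.

```json
{
  "name": "example-ingest-pipeline",
  "description": "Illustrative sketch: read files, transform, write to a data lake",
  "config": {
    "stages": [
      {
        "name": "rawFiles",
        "plugin": {
          "name": "File",
          "type": "batchsource",
          "properties": { "path": "/data/incoming" }
        }
      },
      {
        "name": "cleanup",
        "plugin": {
          "name": "Transform",
          "type": "transform",
          "properties": { "operation": "parse-and-validate" }
        }
      },
      {
        "name": "dataLake",
        "plugin": {
          "name": "HDFSSink",
          "type": "batchsink",
          "properties": { "path": "/data/lake/curated" }
        }
      }
    ],
    "connections": [
      { "from": "rawFiles", "to": "cleanup" },
      { "from": "cleanup", "to": "dataLake" }
    ]
  }
}
```

The point of the declarative form is that a pipeline assembled visually can be versioned, tested, and promoted between environments without hand-written Hadoop code, which is what drives down both time-to-production and maintenance cost.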