ETL: The Silent Killer of Big Data Projects


In this special guest feature, Monte Zweben of Splice Machine outlines how traditional ETL processes are bursting at the seams with the volume of Big Data and ways to streamline the ETL pipeline. Monte Zweben is co-founder and CEO of Splice Machine. A technology industry veteran, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean's Advisory Board for Carnegie Mellon's School of Computer Science.

Moving data between databases for processing doesn't exactly make headlines, but the ETL (extract, transform, load) process is a hidden burden within most IT departments. Anecdotally, it accounts for 70% of the effort in data warehouse setup and maintenance, and now Big Data is pushing those systems to their breaking point. The irony is that Big Data was supposed to make organizations smarter, but as ETL pipelines become overwhelmed by its volume, it is making them slower.

Three Letters You Can’t Ignore

ETL is a prerequisite for data analysis. It’s a general rule of thumb that data preparation takes up 70% or more of the analytics process. Data must be cleaned up, organized, and moved to the right places before it can be analyzed. If not done properly, users become frustrated with the speed and accuracy of their reports, dashboards, machine learning models, and decision-making systems.

Typically, companies approach the ETL process with a set-it-and-forget-it mentality. Prior to the era of Big Data this worked, as data was pulled from a few, similarly structured repositories, transformed, and moved. Now, organizations are discovering that this approach needs to change, because the data pipeline is becoming more complex.

Data no longer comes from a few sources, but from numerous heterogeneous sources, and all these new sources drive increases in both the volume and velocity of the data coming in. Meanwhile, destination points are also proliferating. Companies no longer have just the classic data warehouse. They also have data marts and, increasingly, Hadoop. All of this complicates the ETL process. Companies are finding that they have a multi-spoke, multi-mode ETL process. ETL scripts need to be reworked as data changes, and the process itself breaks frequently.

Streamlining the ETL Pipeline

Rapidly growing data volume and variety demand a more fluid ETL process. ETL cannot be bypassed, but it can be streamlined to deliver faster time to value for analytics. By transforming the ETL process, organizations can improve data quality, data recency, and data availability, thereby increasing analysts’ productivity and leading to more data-driven business decisions.

Companies have tried ETL on Hadoop to alleviate Big Data bottlenecks with affordable scalability, but have found it to be brittle. Without the ability to update or delete records in Hadoop, any hiccup can force the entire ETL job to restart from scratch, adding extra hours to the daily process.

Now, Hadoop RDBMSs provide transactions that eliminate that ETL brittleness. ETL errors can be rolled back in seconds without restarting the whole pipeline. With the ability to update even a single record, incremental ETL can reduce data lag from days and hours to minutes and seconds. This translates to improved productivity and faster data-driven decisions.
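To make this concrete, here is a minimal sketch of what an incremental, transactional ETL step might look like. It assumes a generic Python DB-API (PEP 249) connection to a transactional SQL-on-Hadoop database; the table, columns, and upsert logic are hypothetical placeholders, not Splice Machine's actual API.

    def load_increment(conn, rows):
        """Upsert a small batch of changed records inside one transaction."""
        cur = conn.cursor()
        try:
            for customer_id, total in rows:
                # Update the record if it already exists...
                cur.execute(
                    "UPDATE sales_summary SET total = ? WHERE customer_id = ?",
                    (total, customer_id),
                )
                # ...and insert it otherwise.
                if cur.rowcount == 0:
                    cur.execute(
                        "INSERT INTO sales_summary (customer_id, total) VALUES (?, ?)",
                        (customer_id, total),
                    )
            conn.commit()    # the increment becomes visible atomically
        except Exception:
            conn.rollback()  # a bad batch is undone in seconds; no full pipeline restart
            raise

If a batch fails partway through, the rollback undoes only that increment rather than forcing the entire daily load to rerun.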

As a leading Hadoop RDBMS, Splice Machine provides the ANSI SQL, transactions, and joins needed to accelerate ETL pipelines and power real-time applications, allowing organizations to scale 5-10 times faster at 75-80% less cost than traditional RDBMSs such as Oracle or IBM DB2.

Conclusion

For some organizations, ETL isn't a problem yet. But for the vast majority, sooner or later, it will be. ETL will take longer and longer until it impacts the organization's ability to generate reports and make timely, informed business decisions. Moving to a scale-out architecture like Hadoop is the logical next step, but without transactional support the ETL process will become fragile in the face of errors and data quality issues.

Using a Hadoop RDBMS like Splice Machine, companies can get the best of both worlds: the cost-effective scalability of Hadoop and the robustness of transactional updates to the ETL pipeline. Transactions can recover from errors in seconds and, with incremental ETL updates, drive data lag down to minutes. Now, you can be both faster and smarter with Big Data.

Comments

  1. Splice Machine and many other similar companies are bringing tremendous innovation to the data platform space. But I’m not sure I understand the strategy of contrasting this innovation with this red herring of “traditional ETL.” I don’t know of many organizations who are performing data integration in this supposedly monolithic and brittle way, and the ones who are performing data management functions in Hadoop are definitely not doing it this way. Modern and agile big data fabrics, like those from Informatica, have facilitated data proliferation and multi-persona data consumption for quite a while. Most organizations who use Informatica’s big data fabric do so with agile, iterative, and collaborative processes. This is the reality of modern and agile data integration. New data platforms like Splice Machine contribute even greater innovation to the modern data architecture. But I don’t think the value of Splice Machine is best represented by contrasting against the red herring of a data management world from 20 years ago. Modern data management is powered by big data fabrics that embrace the growing quantity/variety of data and growing demand for trusted information.