The Key to Future-Proofing any Enterprise Data Architecture: Create Optionality


In this special guest feature, Justin Borgman, Co-founder and CEO at Starburst Data, outlines how IT can future-proof the enterprise for the next big trend: by creating optionality. Specifically, he offers three strategic steps toward future-proofing the cloud data lake. Justin has spent the better part of a decade in senior executive roles building new businesses in the data warehousing and analytics space. Prior to co-founding Starburst Data, Justin was vice president and general manager at Teradata, where he was responsible for the company’s portfolio of Hadoop products. Before joining Teradata, Justin was co-founder and CEO of Hadapt, which was acquired by Teradata in 2014. Justin earned a BS in computer science from the University of Massachusetts at Amherst and an MBA from the Yale School of Management.

If you work in a large organization that’s been around a while, I’d bet I can briefly summarize the history of your IT department’s experience with data architectures.

In the 1980s, when personal computers were still new, a database company sold IT on the promise of relational database management systems (RDBMS) for organizing and analyzing all of your data. No more data trapped in paper records! A little later that decade, IT deployed its first enterprise data warehouse, which consolidated all those individual departmental databases in one place.

Then, in the late 1990s, IT deployed open-source databases to support web applications – after all, it’s hard to argue with free. But when the mid-2000s arrived, the immense volume of weblog data became difficult to analyze on a single machine. Enter a brand-new class of database systems driven by massively parallel processing (MPP). For the first time, the IT shop started using the phrase “big data.”

Unfortunately, by the time the 2010s arrived, costs were out of control, and IT lacked the flexibility required for modern analytics and data science. Hadoop came to the rescue, along with the hundreds of servers that now form the foundation of IT’s data lake.

Today, IT is moving that data lake to the cloud, where the separation of storage and compute provides tremendous flexibility for controlling both costs and performance. But there’s a big problem. Remember, IT still has database systems from every era of database history. Moving all that disparate data into your new cloud data lake is going to be difficult to accomplish without serious disruption. Plus, IT management is tired of this unending cycle of being locked into platforms that quickly become obsolete. How can IT future-proof the enterprise for the next big trend?

The answer is to create optionality. Here’s how:

Three Steps to Future-proofing the Cloud Data Lake

There are three steps IT should take to ensure the company doesn’t get locked in, yet again, to a proprietary solution while also providing stakeholders with fast access to the data they need:

  1. Embrace storage and compute separation
  2. Use open-data formats
  3. Future-proof the architecture with abstraction

Let’s take these steps one by one.

Storage and Compute Separation

The old “shared-nothing” model did have its advantages, particularly when it came to performance, but it also meant IT had to buy enough hardware to handle peak demand, leaving the enterprise with far more hardware than it needs more than 99 percent of the time. By contrast, the cloud model separates storage and compute, creating a new level of cost and performance efficiency and enabling IT to:

  • Pay only for what’s actually used
  • Gain complete control over cost/performance
  • Minimize data duplication
  • Eliminate data loading
  • Access the same data via multiple platforms
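The benefits above can be sketched in miniature. In this toy example (my own illustration, not any vendor's architecture), a local directory stands in for cloud object storage, and two independent functions stand in for separately scaled compute engines reading the same data in place:

```python
import json
import tempfile
from pathlib import Path

# A local directory stands in for cloud object storage (e.g. an S3 bucket);
# the data is written once and never copied or loaded anywhere else.
storage = Path(tempfile.mkdtemp())
(storage / "events.json").write_text(json.dumps(
    [{"user": "a", "amount": 10}, {"user": "b", "amount": 25}]
))

# Two independent "compute engines" read the same files directly.
# Each can be sized, started, and stopped on its own: scaling compute
# never requires moving or duplicating the stored data.
def sum_amounts(storage_path: Path) -> int:
    rows = json.loads((storage_path / "events.json").read_text())
    return sum(r["amount"] for r in rows)

def count_users(storage_path: Path) -> int:
    rows = json.loads((storage_path / "events.json").read_text())
    return len({r["user"] for r in rows})

print(sum_amounts(storage))  # 35
print(count_users(storage))  # 2
```

The design point is that either "engine" could be retired or replaced without touching the stored files, which is exactly the optionality the cloud model buys.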

Open-Data Formats

Traditional RDBMSs use proprietary storage formats, and those formats are the primary culprit complicating an organization’s cloud transformation strategy. Open-data formats such as ORCFile, Parquet, and Avro enable IT to store data in any file or object storage system while still delivering fast analytic performance via SQL.

Future-Proof Your Architecture with Abstraction

If the history that I recounted above sounds familiar, then IT is probably spending a lot of late nights figuring out how to move all this disparate data to the new data lake in the cloud. But the truth is, any organization’s user community doesn’t care where the data lives. In fact, they’d be happier if they didn’t even have to know.

The solution is to build a bridge from the current state to the desired state with an abstraction layer between the users and the data. There are a variety of terms for this: query federation, data virtualization, semantic layer, query fabric, or my favorite, the consumption layer.

This abstraction layer takes a SQL query as input and executes it as fast as possible. The layer should be highly scalable and MPP in design, able to push down predicates and column projections so that it brings into memory only what’s absolutely required.
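To make the idea concrete, here is a deliberately tiny consumption-layer sketch of my own (the `FederatedCatalog` name and its methods are hypothetical, not a real product API). It routes each table name to the backend that owns it and pushes the column projection and predicate down to that backend, so callers never learn where the data lives:

```python
import sqlite3

class FederatedCatalog:
    """Routes a table name to the backend that owns it and pushes the
    column projection and predicate down to that backend."""
    def __init__(self):
        self._backends = {}

    def register(self, table: str, conn: sqlite3.Connection) -> None:
        self._backends[table] = conn

    def query(self, table, columns, predicate=None, params=()):
        conn = self._backends[table]  # locate the owning system
        sql = f"SELECT {', '.join(columns)} FROM {table}"  # projection pushdown
        if predicate:
            sql += f" WHERE {predicate}"                   # predicate pushdown
        return conn.execute(sql, params).fetchall()

# Two independent "sources": a legacy warehouse and a data lake,
# both simulated here with in-memory SQLite databases.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)",
                      [(1, "Acme"), (2, "Globex")])

lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE clicks (customer_id INTEGER, url TEXT)")
lake.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [(1, "/home"), (1, "/pricing"), (2, "/docs")])

catalog = FederatedCatalog()
catalog.register("customers", warehouse)
catalog.register("clicks", lake)

# Analysts query by table name; the routing is invisible to them.
print(catalog.query("customers", ["name"], "id = ?", (1,)))       # [('Acme',)]
print(catalog.query("clicks", ["url"], "customer_id = ?", (1,)))  # [('/home',), ('/pricing',)]
```

A production consumption layer adds distributed MPP execution, cost-based planning, and security on top, but the decoupling shown here is the core: either backend could be migrated to the cloud data lake tomorrow without changing a single analyst query.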

With the addition of an abstraction layer, analysts can access data from anywhere, without worrying about ETL (extraction, transformation, and loading) or data movement. IT now has the freedom to migrate data from on-prem to cloud, or from a proprietary database to the data lake, at its own pace, because the user experience is decoupled from where the data lives.

By giving IT options, the organization provides itself with the time and flexibility to move data to the cloud without preventing analysts and applications from working with that data. Or IT could simply choose to leave legacy data where it is and depend on the abstraction layer to access it, while new data is sent to the data lake.

In any case, it’s always good to have options. And that’s exactly what these three steps ensure.

Sign up for the free insideBIGDATA newsletter.
