Why Time-Value of Data Matters


In this special guest feature, Steve Wilkes, Founder and Chief Technology Officer of Striim, discusses the “Time-Value of Data” – the notion that the worth of information to the business drops quickly after it is created – and why the move to real-time analysis, although a growing trend, misses many nuances that are essential to planning an overall data strategy. Steve is a life-long technologist, architect, and hands-on development executive. Prior to founding WebAction, Steve was the senior director of the Advanced Technology Group at GoldenGate Software, where he focused on data integration. He continued this role following the acquisition by Oracle, where he also took the lead for Oracle’s cloud data integration strategy. Steve has handled every role in the software lifecycle and most roles in a technology company at some point during his career. He still codes in multiple languages, often at the same time. Steve holds a Master of Engineering degree in microelectronics and software engineering from the University of Newcastle-upon-Tyne in the UK.

Much has been written about the “Time-Value of Data” – the notion that the worth of information to the business drops quickly after it is created. Implicit in this statement is the idea that, if capturing, analyzing and acting on that information can be made faster, its value to the business increases.

While this is often true, and the move to real-time analysis is a growing trend, this high-level view misses many nuances that are essential to planning an overall data strategy.

A single piece of data may provide invaluable insight in the first few seconds of its life, indicating that it should be processed rapidly, in a streaming fashion. However, that same data, when stored and aggregated over time alongside millions of other data points, can also feed essential models and enable historical analysis. Even more subtly, in certain cases, the raw streaming data has little value without historical or reference context. In those cases, the real-time data is worthless unless the older data is available alongside it.

There are also cases where the data value effectively drops to zero over a very short period of time. For these ‘perishable insights,’ if you don’t act upon them immediately, you have lost the opportunity. The most dramatic examples are detecting faults in, say, power plants or airplanes before they explode or crash. However, many modern use cases such as fraud prevention, real-time offers, real-time resource allocation, geo-tracking and many others are also dependent on up-to-the-second data.


Historically, the cost to the business to move to real-time analytics has been prohibitive, so only the truly extreme cases (such as preventing explosions) were handled in this way. However, the recent introduction of streaming analytics platforms has made such processing more accessible.

Data variety and completeness also play a big part in this landscape. In order to have a truly complete view of your enterprise, you need to be able to analyze data from all sources, at different timescales, in a single place. Data Warehouses were the traditional repository of all database information for long-term analytics, and Data Lakes (powered by Hadoop) have grown up to perform a similar function for semi-structured log and device data. If you wanted to analyze the same data in real-time, you needed additional systems, since both the Warehouse and the Lake are typically batch-fed with latencies measured in hours or days.

The ideal solution would collect data from all the sources (including the databases), move it into a Data Lake for historical analysis and modeling, and also provide the capabilities for real-time analysis of the data as it is moving. This would maximize the time-value of the data from both immediate and historical perspectives.

But adding database data to a Data Lake is not a trivial matter. The notion of running queries against a production database for extraction purposes is often severely frowned upon. Expensive read-only replicas can be used for this purpose, but a better solution is to use Change Data Capture (CDC). This effectively turns what is happening in a database into a stream of changes that can be fed continually into the Lake. CDC works against the transaction log of the database in a non-intrusive, low-impact manner.
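As a rough sketch of the idea (not Striim's or any particular CDC tool's API – the event shape and names here are hypothetical), the changes a CDC reader extracts from a database's transaction log can be modeled as a stream of insert/update/delete records that a downstream consumer replays or forwards into the Lake:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical change-event shape. Real CDC tools, reading the
# transaction log non-intrusively, emit something similar: an
# operation type, the affected table, a key, and the changed values.
@dataclass
class ChangeEvent:
    op: str                          # "insert", "update", or "delete"
    table: str
    key: Any
    values: dict = field(default_factory=dict)

def apply_changes(events):
    """Replay a stream of change events into an in-memory 'lake' table."""
    lake = {}
    for ev in events:
        if ev.op in ("insert", "update"):
            lake.setdefault(ev.key, {}).update(ev.values)
        elif ev.op == "delete":
            lake.pop(ev.key, None)
    return lake

# A short simulated transaction log for an "orders" table.
log = [
    ChangeEvent("insert", "orders", 1, {"amount": 100, "status": "new"}),
    ChangeEvent("update", "orders", 1, {"status": "shipped"}),
    ChangeEvent("insert", "orders", 2, {"amount": 250, "status": "new"}),
    ChangeEvent("delete", "orders", 2),
]
print(apply_changes(log))  # {1: {'amount': 100, 'status': 'shipped'}}
```

Because the consumer only ever sees the change stream, the production database is never queried for extraction – which is exactly the low-impact property that makes CDC preferable to polling or read replicas.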

There is also the question of how to make your Lake useful. Just dumping raw data into a Lake makes it very difficult to fish insights back out, and is the reason for recent reports of Lakes failing. If the raw data contains only a customer ID, but you need to query based on attributes of your customers (such as location, preferences, demographics, etc.), you will need expensive joins for every query.

A better approach is to be able to transform, filter and enrich the data in real time, as it is streaming, adding context and perhaps reference information (such as propensities) from historical models, before landing it in the Lake. This not only ensures the data in the Lake has the maximal time-value, it actually increases it by combining additional information needed to make sense of it.
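A minimal sketch of this in-flight enrichment, assuming a hypothetical event shape and an in-memory reference cache of customer attributes (in practice this context would come from a historical store or model):

```python
# Hypothetical reference data: customer attributes from a cache or
# historical store -- the context the raw streaming events lack.
customers = {
    "c1": {"region": "EMEA", "segment": "enterprise"},
    "c2": {"region": "NA", "segment": "smb"},
}

def enrich(events, reference):
    """Merge reference attributes into each raw event as it streams,
    so records land in the Lake already queryable by those fields."""
    for ev in events:
        context = reference.get(ev["customer_id"], {})
        yield {**ev, **context}

raw = [
    {"customer_id": "c1", "amount": 42},
    {"customer_id": "c2", "amount": 7},
]
for record in enrich(raw, customers):
    print(record)
```

The join happens once per event, in flight, instead of once per query against the Lake – which is where the time-value gain comes from.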

Once you have pipelines in place feeding your Lake from enterprise sources in a streaming fashion, you can move to real-time analysis, and even predictive analytics, seamlessly for very little marginal cost. By adopting a streaming architecture as part of your overall data strategy, you can solve an immediate need to fix Data Lake problems, while also moving your organization closer to real-time insights. In this way you have maximized your data’s time-value from past, present and future perspectives.
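Once events are flowing, real-time analysis on top of the pipeline can be as simple as a continuous windowed computation. As an illustrative sketch (the class and threshold here are invented, not any platform's API), a rolling average with a threshold alert captures the flavor of such a continuous query:

```python
from collections import deque

class WindowedAverage:
    """Maintain a rolling average over the last `size` readings --
    the kind of continuous query a streaming platform keeps running
    over a pipeline, rather than a batch job run after the fact."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old readings fall off automatically

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = WindowedAverage(size=3)
for reading in [10, 12, 11, 30]:
    current = avg.add(reading)
    if current > 15:  # simple threshold alert on the moving view
        print(f"alert: rolling average {current:.1f} exceeds threshold")
```

The same window could just as easily feed a predictive model – the marginal cost of adding analysis to an existing streaming pipeline is small, which is the point of the paragraph above.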

