High Volume and Multi-source Analytics


In the 1990s, most companies’ databases contained rows representing their customers, sales transactions, and inventories. This transactional data was usually stored in on-premises OLTP SQL databases and typically had, at most, millions of rows.

In the 2000s, the Web 2.0 revolution and the rise of e-commerce and ad tech challenged companies to store records not only of their transactions but also of the interactions they had with their customers on websites, by phone, or in person. Some of these interactions, such as display ads and marketing emails, were passive and were recorded even if they weren’t clicked on. The addition of this interaction data quickly pushed row counts into the billions, a scale that started taxing the largest companies’ traditional SQL databases.

In the later 2000s, this led to the rise of Hadoop, NoSQL, search, graph, and other new big data technologies, all attempting to do things that the old-guard database systems couldn’t do well, or at all.

Some of these approaches, projects, and products have turned out to be successful and some haven’t, but the net effect is that data is now stored in more distinct systems, and in more places, than ever before. Entropy has kicked in.

The next shift came in the 2010s, when IoT started to take off. In addition to the billions of transaction and interaction rows, billions and sometimes trillions of rows from sensors, machines, applications, logs, and IoT devices started to stream in. This observational data again increased the demands on data architectures by several orders of magnitude beyond what interaction and transaction data had required. And because this data is often semi-structured, it further increases data entropy.

The last piece of the puzzle was the stampede to the cloud, and in 2018 few companies are seriously resisting full-scale migration.

In today’s computing and storage landscape, all of that data, in all of those sources, is either in the cloud or on its way there. New hyper-scalable, cloud-native SQL databases can handle this level of data scale and are increasingly able to provide both analytical and transactional concurrency.

But that alone doesn’t solve the problem of data being siloed in different storage and data engines. So far we have only considered a company’s own data. What about data from a company’s partners, suppliers, and customers? Or data you might want to buy from data brokers and syndicators? Data from third parties brings yet more disparate silos that don’t just magically co-locate with, line up with, or join to a company’s own data, introducing yet more data entropy.

It’s a big challenge. Data lakes attempted to solve it, but they add a lot of complexity and come with challenges of their own. You can use ETL or data federation to bring disparate data together, but there are two faster approaches you can try today:

Solution 1: Data Sharing

The Snowflake cloud database has a native capability called Snowflake Data Sharing, a.k.a. the Data Sharehouse, which lets companies that each use Snowflake share views of data with each other without actually copying or moving any data. Data Sharing also allows one company to join its data with data from another company, which helps logically coalesce second- and third-party data. This virtually collapses what would otherwise be separate data silos into each other, reducing dataset entropy without expending a lot of energy.
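To make the provider-and-consumer flow concrete, here is a minimal sketch using Snowflake’s CREATE SHARE and GRANT … TO SHARE statements, executed through the snowflake-connector-python driver. The account identifiers, database, table, and column names are placeholders for illustration, not details from the article.

```python
# Minimal sketch of Snowflake Data Sharing: a provider publishes a table to a
# share, and a consumer mounts that share as a read-only database and joins it
# to its own data. No rows are copied or moved. All names are placeholders.
import snowflake.connector

# Provider side: create a share, expose one table, and grant it to a partner account.
provider = snowflake.connector.connect(
    user="PROVIDER_USER", password="...", account="provider_account"
)
cur = provider.cursor()
cur.execute("CREATE SHARE sales_share")
cur.execute("GRANT USAGE ON DATABASE sales_db TO SHARE sales_share")
cur.execute("GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share")
cur.execute("GRANT SELECT ON TABLE sales_db.public.daily_sales TO SHARE sales_share")
cur.execute("ALTER SHARE sales_share ADD ACCOUNTS = partner_account")
provider.close()

# Consumer side: mount the share as a database and join the shared rows to local data.
consumer = snowflake.connector.connect(
    user="PARTNER_USER", password="...", account="partner_account"
)
cur = consumer.cursor()
cur.execute("CREATE DATABASE supplier_sales FROM SHARE provider_account.sales_share")
cur.execute(
    """
    SELECT c.region, SUM(s.revenue) AS shared_revenue
    FROM supplier_sales.public.daily_sales s
    JOIN my_db.public.customers c ON c.customer_id = s.customer_id
    GROUP BY c.region
    """
)
print(cur.fetchall())
consumer.close()
```

In practice, providers often expose secure views rather than base tables so they can control which rows and columns each consumer sees.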

Solution 2: Multi-Source BI Analysis

Another approach goes up a level from the data tier to the end-user tooling, such as BI tools. Some BI tools have multi-source analytical capabilities that allow data blending across disparate back-end data sets, storage approaches, and query engines. This approach also lets more advanced end users do some of this blending on their own, with less setup and data engineering required.

What’s needed is advanced data blending that users can enable directly from their dashboards. That includes creating a cohort directly from a report: taking the set of customer IDs returned by an analysis on one source and using it to filter an analysis on another source. Users can then visualize the two results side by side, as sketched below.
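As a rough illustration of that cohort workflow, here is a minimal sketch in Python, with a SQLite database standing in for a transactional source and a Parquet file standing in for a separate clickstream store. The file, table, and column names are hypothetical; a BI tool with multi-source blending performs the equivalent steps behind the dashboard.

```python
# Sketch of cross-source cohort blending: build a cohort of customer IDs from
# one source, then use it to filter and compare an analysis on another source.
# All file, table, and column names are hypothetical placeholders.
import sqlite3
import pandas as pd

# Source 1 (transactional): find high-value customers (> $1,000 in purchases).
warehouse = sqlite3.connect("sales.db")
cohort = pd.read_sql(
    "SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(amount) > 1000",
    warehouse,
)
cohort_ids = set(cohort["customer_id"])

# Source 2 (behavioral): clickstream events loaded from a different store.
clicks = pd.read_parquet("clickstream.parquet")

# Apply the cohort from source 1 as a filter on source 2.
in_cohort = clicks["customer_id"].isin(cohort_ids)

# Summarize both groups side by side, as a dashboard would visualize them.
summary = pd.DataFrame({
    "high_value_cohort": clicks[in_cohort].groupby("page")["customer_id"].nunique(),
    "everyone_else": clicks[~in_cohort].groupby("page")["customer_id"].nunique(),
}).fillna(0)
print(summary)
```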

By using these two approaches together with modern cloud-native data storage, data engines, and BI tooling, relatively simple yet hyper-scalable architectures can start to tame data entropy, keep up with the exponentially growing size, speed, and number of data sources, and still deliver solid value and insights to end users.

About the Author

Justin Langseth is Co-Founder & Chairman of Zoomdata. Justin co-founded Zoomdata in 2012. Previously, he co-founded and was CTO of Claraview (sold to Teradata in 2008) and then Clarabridge (spun off from Claraview). Prior to Claraview, he was co-founder and CTO of Strategy.com, a former real-time data monetization and insights subsidiary of MicroStrategy. He is the lead inventor on 16 granted technology patents related to data monetization, data personalization, and real-time, unstructured, and big data. He graduated from the Massachusetts Institute of Technology, where he received an SB in Management of Information Technology from the MIT Sloan School of Management.

 
