Data Catalogs are Half the Battle–We Need Answers to Real Questions

In this special guest feature, Kaycee Lai, Founder and CEO of Promethium, discusses the next steps required to bring unified data access to organizations. Kaycee has nearly 20 years of experience in the technology industry, and has led global operations & product management for both startups and Fortune 500 companies. A self-proclaimed “data geek”, Kaycee began his career as a business analyst working with data, databases, and business intelligence solutions at companies such as EMC, Microsoft and The Federal Reserve.

Data catalogs are a necessary step in the journey toward maximizing benefit from data in today’s distributed systems. However, when it comes to answering business questions with data, the ability to identify data sources that may be dispersed across diverse systems like Hadoop, SQL Server and Snowflake can only take you so far.

While data catalogs play a critical role in governance and also lay a solid foundation for data discovery, it’s important to note that they weren’t necessarily built to answer business questions. As Harvard Business School Professor Theodore Levitt said, “People don’t want to buy a quarter-inch drill. They want a quarter-inch hole!” In the world of data, we must remember that the business doesn’t really want data; it wants answers to questions like “what are the demographic characteristics of our target market in Latin America?” or “how have our sales to females in California in the 18-35 age range been impacted by COVID-19?” Consider that if you ask a question of Google Search, you get a real answer. Asking a business question of a data catalog, however, may not be any more fruitful than posing a question to a circa-1985 card catalog at a local library.

Unless you already know the name of, or can identify the contents in, the data tables you’re looking for, it’ll take significant time to locate, assemble and prep the right ones for analysis–up to 50 percent of an analyst’s time, according to 451 Group’s Voice of the Enterprise report. Is it possible that data catalogs are giving us the Dewey Decimal System at a time when the speed of business requires Google Search?

As I mentioned earlier, the data catalog is a critical step in the movement toward data-driven business. But it’s also a step that needs to be bolstered by a few other capabilities before it can answer business questions. Let’s walk through a few of the roadblocks analysts encounter, to identify the functionality that needs to be added to data catalogs.

Dealing with multiple versions of the same data

Despite earnest attempts by data warehousing solutions to provide a ‘single version of the truth’, most data architectures are riddled with multiple versions of the same data. For an analyst or data scientist this creates a time drain, as it means they must investigate each table and manually gauge its viability. Sometimes, even though the tables provide the same data, it’s organized into different schemas, and such incoherence creates additional manual data prep work. For instance, the analyst might need to join one table that has “salary” and “department” data for employees with another table that has “hire date” and “department” information. If they have multiple tables for each, but some of them have different formatting schemas for their “department” column, trying to assemble the right tables can be a painstaking process of trial and error.
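The department-formatting mismatch above can be sketched in Python. Everything here (the alias map, the table contents and the column names) is invented for illustration; real normalization logic would depend on the actual schemas involved:

```python
# Hypothetical sketch: joining two employee tables whose "department"
# columns use different formatting (e.g. "HR" vs. "Human Resources").
# The alias map and all rows below are illustrative assumptions.

DEPT_ALIASES = {
    "hr": "human resources",
    "human resources": "human resources",
    "eng": "engineering",
    "engineering": "engineering",
}

def normalize_dept(name):
    """Map a raw department string onto one canonical form."""
    return DEPT_ALIASES.get(name.strip().lower(), name.strip().lower())

salaries = [
    {"employee": "Ana", "department": "HR", "salary": 60000},
    {"employee": "Raj", "department": "Eng", "salary": 85000},
]
hire_dates = [
    {"employee": "Ana", "department": "Human Resources", "hire_date": "2018-03-01"},
    {"employee": "Raj", "department": "Engineering", "hire_date": "2020-07-15"},
]

def join_on_employee_and_dept(left, right):
    """Inner-join the two tables on (employee, normalized department)."""
    index = {(r["employee"], normalize_dept(r["department"])): r for r in right}
    joined = []
    for row in left:
        key = (row["employee"], normalize_dept(row["department"]))
        if key in index:
            merged = dict(row)
            merged["hire_date"] = index[key]["hire_date"]
            joined.append(merged)
    return joined

rows = join_on_employee_and_dept(salaries, hire_dates)
```

Without the normalization step, “HR” and “Human Resources” would never match and the join would silently drop rows, which is exactly the trial-and-error problem described above.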

To avoid this, companies need to add capabilities that allow the data catalog to render details that not only include the description of the data (information that’s typically tagged manually by users), but also automated insight into the completeness of the dataset. Such information could then be used to compare the differences between datasets that contain similar or overlapping information. For instance, it might enable the analyst to quickly identify that, of two tables with data on employee retirement benefits, one has more rows with “NULL” values and is thus less useful.
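A minimal sketch of that completeness check, using invented tables: count the missing (None/NULL) values in each version of a dataset and surface the more complete one:

```python
# Illustrative sketch: scoring two versions of the same table by how many
# NULL (None) values they contain, so the more complete version can be
# surfaced automatically. The sample rows are invented for the example.

def null_count(table):
    """Count missing values across all rows and columns."""
    return sum(1 for row in table for v in row.values() if v is None)

benefits_v1 = [
    {"employee": "Ana", "plan": "401k", "match_pct": None},
    {"employee": "Raj", "plan": None, "match_pct": None},
]
benefits_v2 = [
    {"employee": "Ana", "plan": "401k", "match_pct": 4.0},
    {"employee": "Raj", "plan": "401k", "match_pct": None},
]

# Prefer the version with the fewest missing values.
tables = {"benefits_v1": benefits_v1, "benefits_v2": benefits_v2}
most_complete = min(tables, key=lambda name: null_count(tables[name]))
```

In a real catalog this score would be computed at profiling time and shown alongside the manual description tags, rather than recalculated per query.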

Datasets need to be assembled from tables residing in different systems

Many business questions require assembling datasets out of tables that reside in different systems, and often in very different physical locations. Needing to move data to a warehouse, mart or other location to make it available for analysis greatly slows down the process, and on top of that may be risky.

The answer, therefore, is not in moving the data, but rather in using virtualization to create a fabric over your data architecture that abstracts the process so that no data needs to be actually moved before it is queried. The open source project ‘Presto’ has already laid the foundation for this, with software that enables SQL queries to be executed across data tables that may be distributed across radically different environments.
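Presto itself executes federated SQL across connectors for systems like Hadoop, SQL Server and Snowflake. As a loose stand-in for that idea, the sketch below uses SQLite’s ATTACH to run a single query joining tables that live in two separate database files, with no upfront data movement; the paths and schemas are invented, and this is not Presto syntax:

```python
# Analogy sketch, not Presto: one SQL statement spanning two independent
# SQLite database files, standing in for a query federated across systems.
import os
import sqlite3
import tempfile

tmpdir = tempfile.mkdtemp()
sales_path = os.path.join(tmpdir, "sales.db")
crm_path = os.path.join(tmpdir, "crm.db")

# Populate two independent "systems" with invented data.
with sqlite3.connect(sales_path) as db:
    db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 120.0), (2, 75.5), (1, 30.0)])

with sqlite3.connect(crm_path) as db:
    db.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "California"), (2, "Texas")])

# One query spanning both databases; neither table was copied first.
conn = sqlite3.connect(sales_path)
conn.execute("ATTACH DATABASE ? AS crm", (crm_path,))
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
conn.close()
```

Presto generalizes this same pattern, with the query engine, not the analyst, handling where each table physically resides.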

Mapping business questions to the data sets that most likely provide the answers

A data catalog can’t tell you if someone has already created a query for the same data you’re looking for. In most instances the analyst’s question is similar to something that has been asked in the past, but without this history it is inevitable that the data team will waste time recreating it. Manually creating SQL queries that span data tables residing in different systems is, to say the least, no easy task.

To solve this issue, questions–and their associated SQL queries–need to be automatically chronicled and mapped back to the data sets that had previously been retrieved to answer them. This will enable analysts to skip several time-consuming steps. Additionally, in instances where the questions are similar but not exactly in line with each other, it will allow analysts to fine-tune queries to pull the exact information they need. For instance, they may need to add an additional column from one table (such as a feature describing “salary”), or select only rows that meet certain criteria (like “female” or “salary greater than $50000”).
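One way to picture such a chronicle, with an invented log entry and a deliberately naive word-overlap matcher (a real system would use much more robust similarity measures):

```python
# Hypothetical sketch: chronicle questions alongside the SQL that answered
# them, so similar questions can reuse and fine-tune past queries.
# The log entry and the matching rule are invented for illustration.

query_log = [
    {
        "question": "average salary by department",
        "sql": "SELECT department, AVG(salary) FROM employees GROUP BY department",
        "datasets": ["employees"],
    },
]

def log_query(question, sql, datasets):
    """Chronicle a new question with its query and source datasets."""
    query_log.append({"question": question, "sql": sql, "datasets": datasets})

def find_similar(question):
    """Naive word-overlap match against previously asked questions."""
    words = set(question.lower().split())
    best, best_overlap = None, 0
    for entry in query_log:
        overlap = len(words & set(entry["question"].lower().split()))
        if overlap > best_overlap:
            best, best_overlap = entry, overlap
    return best

hit = find_similar("what is the average salary per department?")
# The analyst can now fine-tune hit["sql"], e.g. add a WHERE clause,
# instead of rebuilding the query from scratch.
```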

In many cases, however, analysts will be asking entirely new questions that don’t relate to any previous query of the data. Current data catalog querying tools usually rely on a form of keyword search. But without any understanding of intent, the data catalog will in most instances give the business user more information than they actually need. It’ll retrieve every possible dataset that has some kind of linguistic connection to the question, rather than identifying a few datasets that actually answer the question.
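A toy example of the over-retrieval problem, with an invented catalog: a bare keyword match pulls back every dataset sharing any word with the question, relevant or not:

```python
# Sketch of why plain keyword search over-retrieves: any dataset whose
# description shares a single word with the question comes back.
# The dataset names and descriptions below are invented for illustration.

catalog = {
    "ca_sales_2020": "sales by customer age and gender in California",
    "ca_wildfires": "acres burned by California wildfires",
    "employee_sales_training": "sales training completion by employee",
}

def keyword_search(question):
    """Return every dataset with any word in common with the question."""
    words = set(question.lower().split())
    return sorted(name for name, desc in catalog.items()
                  if words & set(desc.lower().split()))

hits = keyword_search("sales to females in California")
```

Here the wildfire dataset comes back purely because it mentions “California”, even though it cannot answer the question; that is the noise an intent-aware system needs to filter out.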

This is where machine learning comes into play. NLP algorithms can intelligently assemble data from across multiple sources and suggest relationships between tables that would be very difficult to discover manually. NLP has already broken tremendous ground in areas like chatbots and virtual assistants, which are now helping to guide customers through company bureaucracies to get the help they need. It’s really just a matter of applying these same principles to data cataloging.

Visualization takes too long

On a final note, one of the first things an analyst does when exploring a dataset is take a cursory look at the general characteristics of the data it contains. This typically involves gathering descriptive statistics such as the mean, standard deviation, etc., and often entails creating an initial visualization, such as a simple scatterplot, histogram or pie chart, to determine the shape of the data. This helps the analyst determine what prep might be required, and more generally whether the data is useful. Currently, analysts have to export the data into a BI tool to do this. As data catalogs are bolstered with the ability to provide quick and easy visualization, analysts will see significant time savings: data only needs to be exported into a BI tool once it’s truly ready for analysis.
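That first look can be sketched with Python’s standard library, using invented salary values: compute the basic descriptive statistics and bucket the values into rough histogram bins:

```python
# Quick first-look profiling an analyst typically does: descriptive
# statistics plus a rough histogram of the value distribution.
# The sample salary values are invented for the example.
import statistics

salaries = [42000, 51000, 55000, 58000, 61000, 64000, 72000, 98000]

mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)

# Rough histogram: bucket the values into $20k-wide bins keyed by
# the bin's lower bound.
bins = {}
for s in salaries:
    low = (s // 20000) * 20000
    bins[low] = bins.get(low, 0) + 1
```

Surfacing exactly this kind of summary inside the catalog, before any export, is the time savings the section above describes.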

Sign up for the free insideBIGDATA newsletter.
