In this special guest feature, Paul Barth, PhD, CEO and founder, Podium Data, discusses how to create a “virtuous cycle” in which the quality of a data lake continually improves the longer it is used, and how data lake management platforms make this cycle possible. Paul has spent decades developing advanced data and analytics solutions for Fortune 100 companies, and is a recognized thought leader on business-driven data strategies and best practices. Prior to founding Podium Data, Paul co-founded NewVantage Partners, a boutique consultancy advising C-level executives at leading banking, investment, and insurance firms. In his roles at Schlumberger, Thinking Machines, Epsilon, Tessera, and iXL, Dr. Barth led the discovery and development of parallel processing and machine learning technologies to dramatically accelerate and simplify data management and analytics. Paul holds a PhD in computer science from MIT, and an MS from Yale University.
Making Better Data Available More Quickly by Empowering Users to Create and Change Metadata
Data lakes transform enterprise decision making by putting far more data in the hands of far more users far more quickly than ever before. Just as important as this faster access is the way data lakes can improve the quality of the data they make available to analysts. Implemented properly, data lakes allow business users to continually cleanse bad data, and to share their insights about the data through tags, metadata and the creation of new data sets.
This article will discuss how to create a “virtuous cycle” in which the quality of a data lake continually improves the longer it is used, and how data lake management platforms make this cycle possible.
Not Just Quantity (Speed), But Quality (Better Data)
By eliminating the requirement that data be structured before it is loaded (as is the case with data warehouses), data lakes eliminate the time and effort associated with defining a schema and mapping data into that schema using conventional extract, transform and load (ETL) tools. By pre-positioning large sets of data within reach of data analysts and business users, data lakes also eliminate the weeks or months of waiting for IT to provide data. And, because they are built on Hadoop, data lakes provide the performance advantages of massively parallel processing at an attractive price.
What’s often overlooked, however, is the ability of a data lake to speed not just the delivery of data to analysts, but also improvements in the quality of that data. This faster “time to quality” is driven by three capabilities:
- A properly implemented data lake empowers the business users who understand the data best to find and fix problems with it while it is still early on the path from “raw” to “ready.” This early access allows them to find, suggest and even make corrections to the data more quickly and efficiently. The result is faster and more accurate analytics, and the ability to make better business decisions more quickly. We have found that users can make the best use of this early access when the data has been statistically profiled. Ideally, this profiling is conducted against all the data in a given data source (every record and every field) to give business users an accurate picture of the character, conformity, and completeness of each new data source arriving in the lake. By showing users what each piece of data actually contains, and how that compares with what it is supposed to represent, this profiling allows them to use their familiarity with everyday business processes to quickly find and identify errors in that data.
- Implemented properly, enterprise data lakes ensure that any improvements to the data are available to everyone. That includes data cleaning, data preparation, the creation of new data sets and the addition of business metadata. Sharing these improvements increases the consistency of data across the enterprise and eliminates duplicate data cleansing, transformation or profiling efforts. Users also share metadata, which can capture indicators of the quality of data and its appropriate use. By making it easy and automatic for users across the enterprise to tap into a shared data set, data lakes can ease collaboration and ensure accuracy, because all users work from the same data, generated by a common, consistent set of data cleansing, profiling and transformation procedures.
- Finally, data lakes accelerate improvements in enterprise data by creating a “virtuous cycle” of data improvement, in which the availability of more data in the lake attracts more users who improve that data, thereby attracting more users, more improvement and more data over time. The more data is stored in the lake, and the longer it is used, the more opportunity users have to prepare and enhance it with user-ready data sets, as well as with crowd-sourced metadata and tagging. By comparison, data wrangling tools don’t provide this benefit because they only manage data sets for a department or small group of users. Nor are these benefits available from “roll your own,” custom-developed data lakes, because they don’t make it easy enough for non-technical users to add new data to the lake and enhance it with new data sets or metadata.
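As a rough illustration of the statistical profiling described in the first point above, the following Python sketch profiles every record and field of a small in-memory data source. It is only a sketch under stated assumptions, not how any particular data lake platform works: the function names (`profile_field`, `profile_records`), the sample records, and the use of a regular expression as the conformity rule are all hypothetical.

```python
import re
from collections import Counter


def profile_field(values, pattern=None):
    """Profile one field across every record (no sampling)."""
    total = len(values)
    # Treat None and the empty string as missing values.
    present = [v for v in values if v not in (None, "")]
    profile = {
        "count": total,
        # Completeness: share of records that carry a value at all.
        "completeness": len(present) / total if total else 0.0,
        # Character: how varied the values are and which ones dominate.
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(3),
    }
    if pattern is not None:
        # Conformity: share of present values matching the expected format.
        regex = re.compile(pattern)
        matches = sum(1 for v in present if regex.fullmatch(str(v)))
        profile["conformity"] = matches / len(present) if present else 0.0
    return profile


def profile_records(records, patterns=None):
    """Profile every field that appears in any record."""
    patterns = patterns or {}
    fields = sorted({name for record in records for name in record})
    return {
        name: profile_field([r.get(name) for r in records], patterns.get(name))
        for name in fields
    }


# Hypothetical sample data: one malformed and one missing ZIP code.
records = [
    {"customer_id": "C001", "zip": "02139"},
    {"customer_id": "C002", "zip": "2139"},
    {"customer_id": "C003", "zip": ""},
    {"customer_id": "C004", "zip": "10001"},
]
report = profile_records(records, patterns={"zip": r"\d{5}"})
print(report["zip"]["completeness"], report["zip"]["conformity"])
```

A business user who knows that every customer should have a five-digit ZIP code can read a report like this and see at once that one value is missing and another does not conform, without inspecting individual records.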
One of the benefits of data lakes is their ability to preserve multiple copies of a particular data set as it moves along the continuum from “raw” to “ready.” By retaining data at every stage in this process and giving users tools to search, select and retrieve the data that is just right for their unique business needs and skill sets, data lakes can meet a diverse set of business requirements.
What to Look For
To improve the quality of their enterprise data faster, customers should look for data lakes with the following capabilities.
- A graphical user interface and a metadata layer with enough easily understandable detail to allow users without deep technical skills to find the data they need.
- The ability for business users to directly clean, enhance and prepare data. These activities could include, for example, the creation of new data sets, aggregates or derived measures. As users add these new data sets and views, the data lake should always preserve a copy of the original, preserve lineage across data sets and preserve records of the data all the way from raw to ready.
- Tools that allow business users to create and maintain business metadata, including the use of crowdsourcing to add meaning and context through comments and tags.
- The governance and security required of an enterprise data hub that empowers all users (not just the limited number using a data wrangling approach) to share their insights and their improvements to the data. This means support for the easy and transparent creation and monitoring of enterprise-grade governance and security processes so the data lake can serve as a central, rather than only a local, source of analytic data.
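The lineage and crowd-sourced metadata capabilities in this list can be pictured with a small Python sketch. Everything here is a hypothetical illustration rather than the interface of any real data lake product: the `Catalog` and `DataSet` classes, the stage names and the sample data set names are assumptions made for the example, and governance and security are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSet:
    name: str
    stage: str                      # e.g. "raw", "cleansed", "ready"
    parent: Optional[str] = None    # data set this one was derived from
    tags: set = field(default_factory=set)
    comments: list = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory metadata catalog for a data lake."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, stage, parent=None):
        # Never overwrite an existing entry, so the original is preserved.
        if name in self._datasets:
            raise ValueError(f"{name} is already registered")
        if parent is not None and parent not in self._datasets:
            raise KeyError(f"unknown parent data set: {parent}")
        self._datasets[name] = DataSet(name, stage, parent)
        return self._datasets[name]

    def tag(self, name, tag, comment=None):
        # Any user can add business context through tags and comments.
        dataset = self._datasets[name]
        dataset.tags.add(tag)
        if comment:
            dataset.comments.append(comment)

    def lineage(self, name):
        # Walk parent links from the given data set back to its raw source.
        chain = []
        while name is not None:
            chain.append(name)
            name = self._datasets[name].parent
        return chain

    def find(self, tag):
        # Let users discover data sets by shared business metadata.
        return sorted(n for n, d in self._datasets.items() if tag in d.tags)


catalog = Catalog()
catalog.register("orders_raw", "raw")
catalog.register("orders_clean", "cleansed", parent="orders_raw")
catalog.register("orders_monthly", "ready", parent="orders_clean")
catalog.tag("orders_clean", "deduplicated", comment="Removed repeated order IDs")

lineage = catalog.lineage("orders_monthly")
print(lineage)
print(catalog.find("deduplicated"))
```

Because each derived data set is registered as a new entry that points back to its parent, the raw original is never replaced, and any user can trace a ready-to-use data set back through each stage to its source.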
Needed: Speed and Quality
Proper data transformation and management are essential if users are to easily find and tap multiple data sources, improve them over time and keep them secure. Together, these capabilities enable seamless self-service analytics and collaboration. In the end, the payoff comes in faster, more accurate and more insightful business decisions.