Data Lineage – The Key To Understanding Your Data Landscape


“The biggest problem organizations face around data management today actually comes from within,” according to Thomas Schutz, SVP, General Manager of Experian Data Quality. (Carmody, 2016)*

According to a recent analysis, 90% of the world’s data has been generated in the last two years alone. This data explosion is due to the growing number of systems and increasing automation at all levels and in organizations of all sizes. While this data has made it easier to obtain detailed information in the working world, it has also created new sets of problems.

Most organizations face the complication of data residing on a confusing assortment of servers from various vendors that may support different platforms. These diverse big data ecosystems have been made to work together, but the connections between systems are often poorly documented. Most organizations would probably be hard-pressed to say exactly where their data resides and how it interacts with upstream and downstream applications.

What’s Really Happening With Your Data?

Understanding your landscape’s data lineage and data relationships is the key to getting a grip on what’s really happening with your data. Data lineage is similar to a data life cycle: it tracks data from its origin to its destination and describes the data flows along with their dependencies. Information captured from data lineage makes it possible to trace data back to its origins and to explain how the data was used along the way, a process that would be very time-consuming without an automated data lineage solution. Simply put, data lineage answers questions like “Where did this data come from?” or “How did you arrive at this reported number?”
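To make the idea of tracing data back to its origins concrete, here is a minimal sketch in Python. It assumes lineage has already been captured as a simple map from each dataset to the datasets it is derived from. The dataset names and the `trace_origins` helper are hypothetical and only illustrate the idea; they are not taken from any particular lineage tool.

```python
from collections import deque

# Hypothetical lineage map: each dataset -> the datasets it is derived from.
LINEAGE = {
    "risk_report": ["risk_metrics"],
    "risk_metrics": ["positions", "market_prices"],
    "positions": ["trades_raw"],
    "market_prices": [],
    "trades_raw": [],
}


def trace_origins(dataset, lineage):
    """Return the original sources (datasets with no upstream inputs) feeding `dataset`."""
    origins = set()
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        parents = lineage.get(current, [])
        if not parents:
            # Nothing upstream: this is where the data originates.
            origins.add(current)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return origins


print(sorted(trace_origins("risk_report", LINEAGE)))
```

Asking for the origins of `risk_report` walks upstream through every intermediate dataset, which is the "Where did this data come from?" question in its simplest form.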

Knowledge of data relationships plays a key role in evaluating the impact of changes on other systems. This knowledge supports better data governance, improved data quality and integrity processes, management of “hidden” data, and overall metadata management.
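Impact analysis is the same walk in the opposite direction. The sketch below assumes a hypothetical map from each system to the systems that consume its data, and collects everything that could be affected by a change. The system names and the `impacted_by` helper are illustrative assumptions.

```python
# Hypothetical feeds: each system -> the systems that consume its data.
DOWNSTREAM = {
    "crm": ["warehouse"],
    "billing": ["warehouse", "ledger"],
    "warehouse": ["sales_dashboard", "churn_model"],
    "ledger": ["finance_report"],
}


def impacted_by(system, downstream):
    """Return every system reachable downstream of `system`."""
    impacted = set()
    stack = [system]
    while stack:
        current = stack.pop()
        for consumer in downstream.get(current, []):
            if consumer not in impacted:
                impacted.add(consumer)
                stack.append(consumer)
    return impacted


print(sorted(impacted_by("billing", DOWNSTREAM)))
```

A change to `billing` here reaches both its direct consumers and everything fed by them, which is the list a data manager would want to review before making the change.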

Mapping Data to Establish a Baseline

One of the most essential benefits of mapping data flow and data lineage is that it establishes a baseline. Mapping data graphically makes it easier to visualize the various data elements and their relationships. These techniques help identify potential hidden vulnerabilities at different stages and let data managers take corrective action proactively.

Data lineage can help provide a more holistic view of data, which supports better data compliance and easier diagnosis of business rule discrepancies. The starting point for capturing and representing your complete data lineage is accessing your metadata. Most databases already record this information, so this is the easy part. The real work begins with discovering and learning the ‘hidden’, undocumented data in your data environment.
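As an illustration of the "easy part", the sketch below reads declared relationships straight from a database's own metadata. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in for an enterprise database; the `customers` and `orders` tables and the `declared_relationships` helper are hypothetical.

```python
import sqlite3


def declared_relationships(conn):
    """List (child_table, child_column, parent_table, parent_column) for declared foreign keys."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    relationships = []
    for table in tables:
        # Each row is: id, seq, table, from, to, on_update, on_delete, match
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            relationships.append((table, fk[3], fk[2], fk[4]))
    return relationships


# Hypothetical two-table schema with one documented relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
""")
print(declared_relationships(conn))
```

Only relationships that were explicitly declared show up this way. Anything that was never declared as a constraint stays invisible to this kind of query.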

Challenges of “Hidden” Data

‘Hidden data’ is quite common in older legacy and siloed systems, where complete documentation is often missing. Discovering and tracking all the data elements and data relationships is a huge problem if a business is doing its data management and analysis using only the 20% of its data that is visible (‘known’) at the original database metadata level, and cannot make effective use of the other 80%, its ‘hidden’ data assets. Much effort is required to address this, which delays time-to-market or leads to deployment with a substandard product or incorrect information. This puts the business at a significant competitive disadvantage relative to more data-savvy companies.
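One common heuristic for surfacing undocumented relationships is value containment: if every value in one column also appears in a unique column of another table, the pair is a candidate relationship worth reviewing. The sketch below is a simplified illustration of that heuristic under assumed sample data, not the method of any specific product. The column samples and the `candidate_relationships` helper are hypothetical.

```python
# Hypothetical column samples: (table, column) -> observed values.
COLUMNS = {
    ("customers", "id"): [1, 2, 3, 4],
    ("orders", "cust_ref"): [1, 1, 3, 4],
    ("orders", "amount"): [250, 90, 1200, 40],
}


def candidate_relationships(columns, threshold=1.0):
    """Return (child, parent, coverage) where the child's values are contained in a unique parent column."""
    results = []
    for parent, parent_values in columns.items():
        parent_set = set(parent_values)
        if len(parent_set) != len(parent_values):
            continue  # The parent side must look like a key (no duplicates).
        for child, child_values in columns.items():
            if child[0] == parent[0] or not child_values:
                continue  # Skip columns of the same table and empty samples.
            child_set = set(child_values)
            coverage = len(child_set & parent_set) / len(child_set)
            if coverage >= threshold:
                results.append((child, parent, coverage))
    return results


for child, parent, coverage in candidate_relationships(COLUMNS):
    print(child, "->", parent, f"coverage={coverage:.0%}")
```

Matches found this way are only candidates. Small or coincidentally overlapping value ranges produce false positives, so results still need review by someone who knows the data.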

Data Lineage Through Data Transparency

To create a good data lineage solution, data transparency is a must. As a simple case study in the financial sector, regulators want to fully understand how banks arrive at their risk assessment numbers, such as a capital liquidity ratio. Financial institutions must therefore be able to explain to regulators how they arrived at their reported numbers, including all of the source data used to calculate them. At a technical level, this requires banks to search their enterprise databases to identify data items and trace data relationships between and within databases. Banks have to answer their auditors’ requests about how the numbers were derived, and from which source data, in a timely fashion, usually within 5 business days at most. The problem is that this is often a highly manual, tedious, time-consuming process.

Desired Solution

Many business initiatives require that you know your data landscape. Unless you know your current data assets, it is hard to determine what you need to access or change to meet new business requirements.   Lack of knowledge of the firm’s data assets or inability to understand relations and data flow can lead to wasted efforts and incorrect conclusions. Database baselining is therefore a foundational activity that helps CDOs, CTOs, application architects, and data architects to:

  • Understand and leverage the organization’s data and limit data debt.
  • Control IT costs and enable M&A due diligence and regulatory compliance.

Data baselining without the right tools is frustrating, laborious and error-prone. What is needed is an easy-to-use tool that automates the discovery of your hidden, undocumented data, saving time and removing silos by giving people a unified view of the data assets across technologies. The resulting insights offer a chance to streamline systems, eliminate redundancies and discover new opportunities, making even the most complex data environment comprehensible and giving users actionable information so they can realize the full value of their data.

(*) Carmody, B. (2016, Feb 1). Biggest Problem with Big Data Management in 2016. Retrieved from

Contributed by: Stuart Tarmy, VP Sales and Marketing at ROKITT. Stuart brings over 20 years of experience as a GM and head of sales, marketing and product management for leading global financial service technology, ecommerce, data management and predictive analytics (Big Data) companies. He has held senior executive roles with Fiserv, Albridge Solutions (acquired by Pershing/BNY Mellon), MasterCard, and McKinsey & Company. Stuart earned an MBA from the Yale School of Management, an MS in Electrical Engineering from Duke University, and an Sc.B. with Honors in Electrical Engineering from Brown University.

Aarthi Sivasankaran is a Business Analyst at ROKITT with a keen interest in big data, data discovery and data management. She has a master’s degree in International Business from Hult International Business School and a bachelor’s degree with Honors in Electrical and Electronics Engineering from Anna University. Aarthi discusses big data, emerging technologies, and how Rokitt Astra can help in keeping pace with changes in your big data landscape.

