Make Your Lazy Data Not So Lazy

Ever since information technology became the operational heart of organizations around the beginning of the 21st century, there has been an inclination to adopt the latest, greatest – and most-hyped – technologies and processes available, not necessarily because there is a pressing need or stated goal, but because organizations think they should.

Just look at all the excitement over artificial intelligence (AI) and machine learning (ML). Don’t get me wrong; the hype is real, and AI/ML are revolutionizing many industries. However, many healthcare organizations today are willing to invest in new technologies with no plan for what they will do with them once acquired.

One of the most pervasive examples of this phenomenon in the big data era is often referred to as “lazy data.” This is data gathered for the sake of gathering data, with no real use case behind it.

Organizations that implement data analytics projects or invest in aggregation and normalization of data without a clear end goal often run into significant organizational and data risks and challenges.

One of the most significant issues with embarking on data analytics projects without a clearly defined end-game is the waste of time, effort and expense – three commodities in short supply in most healthcare IT departments – spent on data projects, especially when you consider the ongoing storage, maintenance and governance required to manage those data assets. Data that just sits in a warehouse does not provide real business value.

Embarking on data projects without a long-term view raises another issue: what happens if you acquire a new analytics technology (AI, anyone?) that requires the data to be in a different format? Or what if you suddenly need to see your data as it existed at a specific point in time in the past?

Prematurely transforming data during the ingestion process, rather than maintaining it in its raw form until its use case is known, results in the loss of the original details and schema of the raw format. This can limit future uses and value of the data. It would be like Ray Kinsella deciding the voices in his cornfield were telling him to build a soccer field instead of a baseball diamond. Shoeless Joe would have been disappointed when he stepped onto the field with his bat and mitt.

The best approach to avoiding lazy data is simple but not necessarily easy. Put your data to work! Ensure your data has a purpose and supports your organization’s strategic objectives. Once those requirements are met and the data is going to be used, establish best practices to govern the way you treat your data, including:

Ingest source data into your warehouse or lake with limited initial transformation
Centralize your data management team, tools and governance
Adopt an ELT (extract, load, transform) focused approach – persist data in its raw state
Improve raw data at the time it is needed/queried
Store copies of data throughout the transformation process
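
As a rough illustration of the ELT-focused practices listed above, the following Python sketch loads records into a store exactly as received and applies cleanup only when a question is asked of the data. It is a minimal sketch under stated assumptions, not a production pattern: the `raw_events` table, the `claims` source name, the `payer` and `amount` fields, and both function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse or lake.

```python
import json
import sqlite3
from datetime import datetime, timezone


def load_raw(conn, source, records):
    """Load step: persist each record exactly as received, with no reshaping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events ("
        "id INTEGER PRIMARY KEY, source TEXT, loaded_at TEXT, payload TEXT)"
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO raw_events (source, loaded_at, payload) VALUES (?, ?, ?)",
        [(source, loaded_at, json.dumps(r)) for r in records],
    )
    conn.commit()


def claims_total_by_payer(conn):
    """Transform step: clean and aggregate the raw payloads at query time."""
    totals = {}
    rows = conn.execute("SELECT payload FROM raw_events WHERE source = 'claims'")
    for (payload,) in rows:
        record = json.loads(payload)
        payer = record["payer"].strip().upper()  # normalization happens here, not at load
        totals[payer] = totals.get(payer, 0.0) + float(record["amount"])
    return totals


# Hypothetical sample data with the kind of inconsistencies raw feeds tend to have.
conn = sqlite3.connect(":memory:")
load_raw(conn, "claims", [
    {"payer": "acme ", "amount": "120.50"},
    {"payer": "ACME", "amount": 79.5},
    {"payer": "Beta", "amount": "10"},
])
print(claims_total_by_payer(conn))  # {'ACME': 200.0, 'BETA': 10.0}
```

Because the stored payloads are untouched, a different normalization rule or an entirely new report can be written later against the same rows without re-ingesting anything.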

These tactics offer several advantages. One of the most obvious is eliminating wasted effort on projects, storage, maintenance, and management of non-critical data and use cases.

Another advantage is the flexibility afforded as things change in the future. Requirements for value-based contracts and quality reporting often change from payer to payer and from year to year. If the data is transformed without proper planning, it may need to be reworked as things change. If it is stored in its raw form, it can be transformed as needed, in whatever form is required, to enable reporting for revenue and reimbursement, provider compensation, Star and HEDIS ratings, or any other needs that surface.

Waiting to transform the data until it’s needed helps future-proof the organization. If your data is managed with consideration for tomorrow’s needs, you do not limit your organization’s future use of the data.

Another critical issue is version control. If data isn’t backed up throughout the transformation process, any changes to business logic or metrics applied to your data can wipe out historical business rules. In other words, it is more challenging to do longitudinal or point-in-time analyses of your data because the rules applied to the data at that point in history will be gone. Transforming data too early can also lead to skew: each time the data is used, it drifts further from the original source until there is no remaining source of truth. Keeping the data in its raw form and then transforming it when needed ensures the organization can always go back to the original data to settle disparities.
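
One way to picture the version-control point is to keep every version of a business rule alongside the date it took effect, and to apply whichever version was in force to the untouched raw records. The Python sketch below is only an illustration under assumptions: the `score` field, the thresholds, the effective dates, and the function names are all hypothetical and do not represent any real quality measure.

```python
from datetime import date

# Hypothetical rule history: (effective_from, rule). Thresholds are made up.
RULE_VERSIONS = [
    (date(2019, 1, 1), lambda record: record["score"] < 9.0),
    (date(2020, 1, 1), lambda record: record["score"] < 8.0),
]


def rule_as_of(as_of):
    """Return the rule version that was in effect on the given date."""
    applicable = [(start, rule) for start, rule in RULE_VERSIONS if start <= as_of]
    if not applicable:
        raise ValueError(f"no rule version in effect on {as_of}")
    # Pick the most recent version by effective date.
    return max(applicable, key=lambda pair: pair[0])[1]


def pass_rate(raw_records, as_of):
    """Apply the rule in force on `as_of` to the unmodified raw records."""
    rule = rule_as_of(as_of)
    passed = sum(1 for record in raw_records if rule(record))
    return passed / len(raw_records)


# Hypothetical raw records, stored once and never overwritten.
raw = [{"score": 7.5}, {"score": 8.5}, {"score": 9.5}, {"score": 6.9}]

rate_2019 = pass_rate(raw, date(2019, 6, 30))  # 3 of 4 records pass -> 0.75
rate_2020 = pass_rate(raw, date(2020, 6, 30))  # 2 of 4 records pass -> 0.5
```

The same raw records yield both historical and current results because neither the data nor the older rule was overwritten when the logic changed.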

Take, for example, an organization I worked with in the past that stored different but overlapping clinical and financial datasets in different warehouses managed by different teams throughout the organization. They stored some of their raw data and sent the rest out to be manipulated by third-party analytics companies before storing it. It was then up to their BI teams to tie all the data together in complex queries that fed their analytics. It was a complex undertaking that was repeated on a monthly cycle. This web ultimately worked, but it was inefficient and expensive, and it led to constant fire drills when payers inevitably made changes to the data.

Through thoughtful planning and prioritization of data sources, diligent process improvement, and a lot of hard work, their data management processes were streamlined. This started with organization and automation of file transfers, which enabled us to programmatically load and store all raw data in a common data model. We established cross-source indexes (including a unique patient index with a 97+% match rate), and today, queries and dashboards run faster and with much less up-front work or maintenance. This use case served as a model the client is now replicating across their data teams and serves as a guideline for future success.
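
To give a sense of what a cross-source index does, the Python sketch below assigns a shared patient number to records from different systems when a normalized key matches. This is a deliberately simplified assumption, not the method used in the engagement described above: real patient matching typically involves more fields and fuzzier logic, and the field names, the key recipe (last name, first initial, date of birth), and the sample records are all hypothetical.

```python
import hashlib


def match_key(record):
    """Build a deterministic key from normalized last name, first initial, and DOB."""
    last = record["last_name"].strip().lower()
    first_initial = record["first_name"].strip().lower()[:1]
    dob = record["dob"].strip()
    return hashlib.sha256(f"{last}|{first_initial}|{dob}".encode("utf-8")).hexdigest()


def build_patient_index(sources):
    """Map each (source, local_id) pair to a patient number shared across sources."""
    key_to_patient = {}
    index = {}
    for source_name, records in sources.items():
        for record in records:
            key = match_key(record)
            # Reuse the existing number for a known key, otherwise assign the next one.
            patient_no = key_to_patient.setdefault(key, len(key_to_patient) + 1)
            index[(source_name, record["local_id"])] = patient_no
    return index


# Hypothetical records from two separately managed systems.
sources = {
    "clinical": [
        {"local_id": "C1", "first_name": "Ann", "last_name": "Lee ", "dob": "1980-02-01"},
        {"local_id": "C2", "first_name": "Bob", "last_name": "Ray", "dob": "1975-07-04"},
    ],
    "financial": [
        {"local_id": "F9", "first_name": "ANNA", "last_name": "lee", "dob": "1980-02-01"},
    ],
}

patient_index = build_patient_index(sources)
# C1 and F9 resolve to the same patient number; C2 gets its own.
```

With an index like this stored next to the raw data, a query can join clinical and financial records on the shared patient number instead of re-deriving the match every reporting cycle.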

As our client demonstrates, modernizing your approach to data management and governance does not have to be “all or nothing.” If your organization has robust infrastructure and tools, you can gradually introduce these methodologies and perhaps even run a hybrid approach until you are ready to make a full commitment.

About the Author

Kyle McAllister, Practice Director, Data Analytics for Pivot Point Consulting, is a healthcare IT, analytics and population health strategist with experience leading organizations and teams through complex data, analytics, and other IT projects from visioning to execution to optimization.


Comments

Dick Hacking says

February 18, 2021 at 10:24 am

This seems to ignore the new constraints around collection, use, and sale of data brought on by GDPR, CCPA, and others. Perhaps healthcare is exempted from some of those constraints, but overall it is incumbent on each organization only to collect data that it has a business use for, only to retain it for as long as the original use-case was disclosed to be, and not to process it or sell it in any way that was not provided for by the original use disclosures. Maryland and other states have prohibitions against retention unreasonably beyond the original retention disclosure. IMHO lazy data is going to become a thing of the past as more and more jurisdictions clamp down on privacy rights of the data subjects.
