The Problem with ‘Dirty Data’ — How Data Quality Can Impact Life Science AI Adoption


Where AI models are concerned, you get out what you put in. You can’t expect to input poor-quality data and generate high-quality results. But all too often, that’s exactly what’s happening in life science. Promising AI models fail to deliver their full potential because the data they’re trained on isn’t of sufficient quality. The challenge to effective AI adoption in life science doesn’t lie with AI itself but with life science datasets.

Life science data: unclean, unstructured, and highly regulated

Life science companies sit on vast quantities of data. The ‘data deluge’ has swamped all industries, but none more so than life science – where data floods in from patients, payers, and healthcare professionals via countless streams. For example, the patient’s voice has been increasingly amplified in recent years. While this is undoubtedly an excellent thing for patients, life science teams face a challenge keeping pace with the number of online channels where opinions are shared and information can be mined. “There is a lot of data to be harnessed, and top life sciences companies have noticed,” reports NTT Data. “With rapid reductions in costs of genome sequencing, the amount of genomic data has skyrocketed to over 40 exabytes over the past decade.”

Quantity does not always equate to quality, and rarely is an enterprise’s entire data lake necessary to build an effective AI model. Instead, companies need to adopt a data-centric approach, shifting from sheer volume toward smaller, higher-quality datasets for training.
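The data-centric shift described above can be sketched as a simple quality gate: rather than training on everything in the lake, keep only records that pass explicit completeness and consistency checks. The record fields and quality rules below are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical sketch: filter a large raw dataset down to a smaller,
# higher-quality subset for training. Field names and rules are
# illustrative assumptions.

def is_high_quality(record):
    """Keep records that are complete and use consistent units."""
    required = ("patient_id", "measurement", "unit", "date")
    if any(record.get(field) in (None, "") for field in required):
        return False  # incomplete record
    if record["unit"] not in ("mg/dL", "mmol/L"):
        return False  # inconsistent unit of measure
    return True

raw_lake = [
    {"patient_id": "p1", "measurement": 5.4, "unit": "mmol/L", "date": "2023-01-02"},
    {"patient_id": "p2", "measurement": 97,  "unit": "mg/dL",  "date": "2023-01-03"},
    {"patient_id": "p3", "measurement": None, "unit": "mg/dL", "date": "2023-01-04"},
    {"patient_id": "p4", "measurement": 88,  "unit": "grams",  "date": "2023-01-05"},
]

training_set = [r for r in raw_lake if is_high_quality(r)]
print(len(training_set))  # 2 of 4 records survive the quality gate
```

In a real pipeline, the rejected records would typically be logged and routed for remediation rather than silently discarded.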

Data access and compliance

Data quantity is only one potential roadblock preventing the construction of high-quality life science datasets. Many industry data sources are subject to regulations such as the European GDPR or CCPA, among other regional laws, and may not be shared with other vendors or used to train AI models. Data access can be a real issue within highly regulated industries such as life science, where regulatory requirements can change from region to region. “While most companies are embracing new technologies to deliver enhanced patient outcomes,” notes Deloitte, “the ambiguity of regulations related to converging and emerging technologies results in a myriad of compliance challenges.”

When building life science AI models, it’s not uncommon to find that potentially valuable datasets are ringfenced by compliance issues, leading to models built on incomplete data.

Dirty data

Life science companies have access to a lot of data – in some cases, too much – and much of the most useful information is subject to strict regulatory processes and is effectively beyond reach. To make matters worse, a significant proportion of life science data is ‘dirty’ – inaccurate, incomplete, or inconsistent – and not immediately usable.

Life science data is often unstructured, coming in the form of typed MSL reports and field team observations that can vary drastically in length, format, and even language. Many healthcare organizations have fully migrated to electronic medical records (EMRs), but some have only partially migrated, while others are yet to begin the transition. These disparate and often-inconsistent data streams mean that life science data sets must often be cleaned before they are used to train effective AI models.
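The cleaning step mentioned above can be illustrated with a minimal sketch: normalizing whitespace and harmonizing the mixed date formats that typically appear when field reports come from different tools and regions. The input record shape and the two date formats are assumptions for illustration.

```python
import re
from datetime import datetime

# Hypothetical sketch of one cleaning pass over a free-text field report.
# Input formats are illustrative assumptions.

def clean_report(raw):
    """Normalize whitespace and harmonize the date format in one record."""
    text = re.sub(r"\s+", " ", raw["notes"]).strip()
    # Try two date formats commonly seen in mixed exports.
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            date = datetime.strptime(raw["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        date = None  # flag for manual review rather than guessing
    return {"notes": text, "date": date}

record = {"notes": "  Patient   reported\nmild   fatigue ", "date": "03/11/2022"}
print(clean_report(record))
# {'notes': 'Patient reported mild fatigue', 'date': '2022-11-03'}
```

Note the deliberate choice to return `None` for an unparseable date: flagging a record for review is usually safer than silently guessing, which would reintroduce the inconsistency the cleaning step exists to remove.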

Dealing with data bias

The appeal of data-based decision-making is rooted in objectivity – that data tells the truth, and choices based on data will be correct. But bias can still play a role. Machine learning models are influenced by both the diversity of datasets and the way the model is trained. Therefore, if the datasets contain biased data, the model may exhibit the same bias in its decision-making. “AI can help identify and reduce the impact of human biases,” reports HBR. “But it can also make the problem worse by baking in and deploying biases at scale in sensitive application areas.” 

How can machine learning models overcome biased data? Last year, a group of researchers at MIT discovered that how a model is trained can influence whether it is able to overcome a biased dataset. The authors of the study noted that it is possible to overcome dataset bias by taking care of dataset design. “We need to stop thinking that if you just collect a ton of raw data, that is going to get you somewhere,” said research scientist and study author Xavier Boix.
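In the spirit of the careful dataset design the MIT researchers advocate, a first practical step is simply to audit subgroup balance before training and, if a skew is found, to weight samples so under-represented groups contribute equally. The field names and weighting scheme below are illustrative assumptions, not the method from the study.

```python
from collections import Counter

# Hypothetical sketch: audit subgroup balance in a training set, then
# derive inverse-frequency sample weights. Field names are illustrative.

def subgroup_counts(records, key):
    """Count how many records fall into each subgroup."""
    return Counter(r[key] for r in records)

dataset = [
    {"region": "EU", "outcome": "responder"},
    {"region": "EU", "outcome": "responder"},
    {"region": "EU", "outcome": "non-responder"},
    {"region": "US", "outcome": "responder"},
]

counts = subgroup_counts(dataset, "region")
print(counts)  # a 3:1 regional skew worth addressing before training

# Simple mitigation: weight each sample inversely to its subgroup's
# frequency so all regions contribute equally during training.
weights = {region: len(dataset) / (len(counts) * n) for region, n in counts.items()}
print(weights)  # US samples weighted up, EU samples weighted down
```

An audit like this does not remove bias from the underlying data, but it makes the skew visible and quantifiable, which is the precondition for any dataset-design fix.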

Effective AI adoption in life science

Thus far, AI adoption in life science has been a mixed bag. In many cases, projects have gone awry not because the technology is immature but because the data it’s based on is unclean, unstructured, or ringfenced by regulations. According to research from Deloitte, “As AI moves from a ‘nice to have’ to a ‘must have,’ companies and their leaders should build a vision and strategy to leverage AI, then put in place the building blocks needed to scale its use.”

Attempting to implement an AI model before the data is ready wastes time and resources. Data challenges leading to poor or biased models can impact the industry’s confidence in the potential of AI to deliver business value. To succeed in training and deploying AI models, life science companies need to develop a clear data strategy and spend sufficient time cleaning and harmonizing their data.

About the Author

Jason Smith is the Chief Technology Officer, AI & Analytics at Within3. He uses AI to understand the value of data and deliver products that empower our customers to make impactful decisions. Jason began his career at IBM and ATI Research while studying computer science at Harvard University, US. He is a leading-edge technologist and executive with over 20 years of industry experience.
