
Data Roadblocks for AI – Most common challenges and how to avoid them

In this special guest feature, Dr. Sigal Shaked, Co-founder and CTO at Datomize, discusses the best approach to overcoming data challenges and achieving strong data governance that keeps you in line with regulations – especially important for the banking and healthcare industries. Sigal has more than 15 years of experience working with data across fields and use cases, both as a researcher and as an implementer, with a deep understanding of the underlying issues in working with data. She aspires to build the best solutions for existing needs, combining the power of machine learning with the human creativity no machine can match.

Artificial intelligence could add $15.7 trillion to the world economy by 2030, equivalent to the combined output of China and India. However, without a steady stream of complete and reliable data, machine learning models can't provide valuable, trusted insights. Preparing data is a huge challenge: data scientists often spend 80% of their time cleaning and managing data rather than training models. Here is a drill-down of the most common data challenges facing data scientists.

AI/ML models are starved for data 

According to a McKinsey survey of 100 organizations that have piloted AI in at least one of their functions, 24 cited the lack of usable, relevant data as the largest barrier to AI implementation. Linear algorithms need hundreds of examples per class, while more complex algorithms need tens of thousands to millions of examples. When a model is trained on insufficient data, there is a high risk that it won't perform effectively on new data.

Even in cases where a large quantity of data is available, there is still a chance that the data will not be usable due to personal privacy laws. Many regulations restrict the use of sensitive data to feed machine learning models, including the General Data Protection Regulation (GDPR), the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), the Federal Information Security Management Act of 2002 (FISMA), the Family Educational Rights and Privacy Act (FERPA), and the Gramm–Leach–Bliley Act (GLBA). It is especially challenging to find safe data to feed AI/ML models in the medical and finance industries, where the data is so sensitive.

Data can also be biased. A machine learning model makes assumptions based on whatever data it reads; if that data tells a skewed or incomplete story, the rules it learns will be fundamentally unsound. Training data needs to accurately represent the population, including examples from every category. For example, even though Black women are 42 percent more likely to die from breast cancer, machine learning models were fed mammography images overwhelmingly from white women, making their conclusions inaccurate for all women.
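One simple guard against this kind of skew is to audit group representation in the training set before fitting a model. The sketch below is a minimal illustration in plain Python – the group labels and function name are hypothetical, not part of any specific tool:

```python
from collections import Counter

def group_shares(labels):
    """Return each group's share of the training set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# Hypothetical training labels: 90 samples from group "A", 10 from group "B"
labels = ["A"] * 90 + ["B"] * 10
shares = group_shares(labels)
print(shares)  # {'A': 0.9, 'B': 0.1}
# Compare each share against the group's known population share and
# flag any group that falls well below it before training.
```

Catching a 90/10 split like this before training is far cheaper than discovering, after deployment, that the model performs poorly on the under-represented group.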

There is also the issue of low data quality. Even if the data is safe and representative of every segment of the population, it can still be unusable because it is incomplete, irrelevant, or out of date. Many enterprises have an inconsistent data vocabulary because data resides in silos across different regions, business units, and geographies.

Steps to Overcoming Data Challenges

To collect the data that's needed, it's possible to build systems that harvest data from different sources. If you know the tasks a machine learning algorithm will perform, you can create a data-gathering mechanism in advance to collect the data internally.

However, this data may be unusable due to regulations, so it will have to be anonymized. At the most basic level, there are techniques like data generalization, pseudonymization, and data masking, but none of these methods is secure enough to deter hackers. Data swapping reassigns data points from one person to another, making the data more secure but less useful for gleaning insights about real-life scenarios. Similarly, perturbation and differential privacy add random noise to obscure details that need to be kept confidential, making the data difficult to analyze.
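To make two of these techniques concrete, here is a minimal sketch in plain Python: basic data masking of an identifier, and perturbation via Laplace noise (the noise mechanism behind many differential-privacy schemes). The function names and parameters are illustrative, not any specific library's API:

```python
import math
import random

def mask_value(value: str, keep_last: int = 4) -> str:
    """Data masking: hide all but the last few characters of an identifier."""
    return "*" * max(len(value) - keep_last, 0) + value[-keep_last:]

def perturb(values, scale=1.0, seed=0):
    """Perturbation: add Laplace-distributed noise to numeric values."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        u = rng.random() - 0.5  # uniform in (-0.5, 0.5)
        # Inverse-CDF sampling of Laplace(0, scale)
        noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        noisy.append(v + noise)
    return noisy

print(mask_value("4111111111111111"))  # ************1111
print(perturb([100.0, 200.0, 300.0], scale=5.0))
```

The `scale` parameter captures the trade-off described above: a larger scale hides individual values more effectively but makes aggregate analysis noisier.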

Enterprises, government agencies, and academic institutions provide open-source data, but intellectual property issues can restrict its use to research rather than commercial purposes. This data also tends to be useful only for industries that are not highly regulated, so it can be hard to find open data for the medical and financial industries.

Another option is for enterprises to generate their own synthetic data sets that share the same schema and statistical properties as their “real” counterparts. Enterprises then have more control over data quality and can tweak the data set's characteristics based on AI/ML objectives. Generating synthetic data can also provide the scale global organizations need to create the quantity, variety, and granularity of data required so that the resulting models are unbiased, accurate, and complete.
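As an illustration, a naive synthetic generator might fit per-column statistics on the real table and sample fresh rows from them. The sketch below preserves only each numeric column's mean and standard deviation – production synthetic-data tools also model cross-column correlations and categorical columns – and all names are hypothetical:

```python
import random
import statistics

def fit_columns(rows):
    """Capture each numeric column's mean and stdev (the 'statistical properties')."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def synthesize(column_params, n_rows, seed=0):
    """Sample synthetic rows from per-column Gaussians fitted to the real data."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in column_params]
            for _ in range(n_rows)]

real = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 13.0]]
synthetic = synthesize(fit_columns(real), n_rows=1000)
```

Because the generator never replays a real record, the synthetic table is easier to share, and `n_rows` can be raised to whatever scale the model needs.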

Even after all the necessary data is collected, a data governance system needs to be implemented to keep the data pipeline working. Without a firm data strategy and an enterprise-wide supply of data on demand, machine learning models quickly lose relevance. Data governance costs need to be built into projects to ensure machine learning models stay on track.

In our data-driven world, machine learning models are becoming a given for reaching insights that streamline operations, identify new revenue streams, and provide engaging customer experiences. But until enterprises have a healthy pipeline of quality data, the quality and wisdom of the insights gleaned from AI/ML models are at risk.

Sign up for the free insideBIGDATA newsletter.
