Synthetic Data: The Cure for Data Drift?

Recent advancements in AI and computer vision capabilities have massively increased the scale of, and demand for, training data. While real-world data continues to dominate AI training, it can become out of date in as little as six months. This is a concern because constantly evolving trends and the need for businesses to stay agile leave little to no room for error in decision making.

It is more critical than ever that organisations have reliable, accurate training data available. Yet we recently found that almost two-thirds of organisations suffer from data drift in their training data.

Data drift is a discrepancy between the actual data processed by the deployed system and the data used to train, validate and test the AI model that processes that real-world input. It can arise from various factors, including seasonal variations, climate change and even changes in fashion. Regularly monitoring the performance of a computer vision model is essential to successful deployment. If data drift is not identified in time, it can seriously degrade model performance, leading to incorrect business decisions.

Data drift is manageable if dealt with appropriately. The usual remedy is retraining the model on new data, but the effort needed varies with the extent of the issue. Retraining can be disruptive and costly, and can cause ongoing problems for organisations. Detecting data drift should therefore be a key part of the machine learning lifecycle. Ideally, this should be an automated process supported by careful action.
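One way to automate the detection step is to compare a simple per-image statistic from the training set with the same statistic from recent production data. The sketch below is an illustration only, not a method described in this article. It assumes mean image brightness as the monitored feature and uses a two-sample Kolmogorov-Smirnov statistic as the drift measure. The function names, the simulated values and the 0.2 alert threshold are all hypothetical choices.

```python
import bisect
import random


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical cumulative
    distributions of the two samples (0.0 = identical, 1.0 = disjoint).
    """
    ref = sorted(reference)
    cur = sorted(live)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def drift_detected(reference, live, threshold=0.2):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, live) > threshold


# Simulated per-image mean brightness values (assumed feature, made-up data).
rng = random.Random(0)
training_brightness = [rng.gauss(0.5, 0.1) for _ in range(500)]
recent_similar = [rng.gauss(0.5, 0.1) for _ in range(500)]
recent_darker = [rng.gauss(0.3, 0.1) for _ in range(500)]  # e.g. a seasonal shift

print("similar data drifted:", drift_detected(training_brightness, recent_similar))
print("darker data drifted:", drift_detected(training_brightness, recent_darker))
```

In practice the threshold would need tuning per feature, and a flagged drift would prompt a review of model performance before any retraining is triggered.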

What actions can be taken?

Methods of dealing with data drift are often not mutually exclusive, so multiple strategies can, and may need to, be employed. An effective way of minimising potential data drift has emerged in the form of synthetic training data. It is artificially generated by computer systems and makes it possible to produce greater volumes of accurate training data more quickly and cost-effectively than acquiring real-world training data. Beyond this, it can enhance the robustness of AI models by delivering training data for edge cases that may be difficult or dangerous to repeat in the real world.

Systems that create synthetic training data allow users to generate training data on demand as opposed to waiting for real-world occurrences, enabling greater control over the training process and providing an opportunity to act before data becomes obsolete. 85% of organisations are already making use of synthetic data to train computer vision systems and, of those that don’t, almost a third (29%) anticipate their organisation will start using it in 2023.

How can synthetic data ensure data drift is a thing of the past?

Synthetic data offers a plethora of advantages. It’s fast to create, easy to update and cost-effective when compared to acquiring real-world training data. In particular, annotation of real-world training data is labour-intensive, time-consuming, expensive and less accurate than annotation of synthetic data, which is an automated and pixel-accurate process. Synthetic training data can also be intelligently created in greater volumes, which is particularly beneficial in building more robust AI models. By filling in gaps and supplementing real-world data, the use of synthetic training data can alleviate the fundamental issues leading to data drift.
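The idea of filling in gaps can be made concrete with a small planning step: count how many real samples exist per class and work out how many synthetic samples each under-represented class would need. This is a minimal sketch under stated assumptions. The class names, the target of 500 samples per class and the function name `synthetic_top_up` are hypothetical and do not refer to any particular tool.

```python
from collections import Counter


def synthetic_top_up(real_labels, classes, target_per_class):
    """Return {class: number of synthetic samples needed} for every
    class whose real-world count falls short of the target."""
    counts = Counter(real_labels)  # missing classes count as 0
    plan = {}
    for cls in classes:
        shortfall = target_per_class - counts[cls]
        if shortfall > 0:
            plan[cls] = shortfall
    return plan


# Hypothetical dataset: plenty of summer scenes, very few winter scenes.
real_labels = ["summer_scene"] * 900 + ["winter_scene"] * 40
plan = synthetic_top_up(
    real_labels,
    classes=["summer_scene", "winter_scene"],
    target_per_class=500,
)
print(plan)  # {'winter_scene': 460}
```

The resulting plan could then be passed to whichever synthetic data generator is in use, so that new data is requested only where the real-world dataset is thin.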

Another key advantage of synthetic data is the opportunity to optimise training efficiency. Large volumes of synthetic data can be generated much more rapidly than the alternative of collating real-world data. Users are therefore able to quickly gather training data for cases where new data is needed immediately.

For example, at the height of the pandemic, mandatory face masks and social distancing meant that some AI systems became outdated and needed to be retrained to recognise someone wearing a face covering. Another example is the deployment of electric scooters, which has harnessed machine vision for harm detection to help prevent accidents. In addition to updating datasets to prevent data drift, data that is no longer relevant should be removed. This can be done efficiently with the help of synthetic training data.

Training datasets containing private data present a risk of violating privacy regulations when used to train models. Synthetic data avoids this risk as it does not contain information traceable to individuals. Ensuring privacy compliance is essential to protecting individuals and businesses from legal and financial consequences, as well as aiding in building trust in AI. 

Overall, synthetic data provides robust and versatile datasets for AI training purposes. It does not rely on manual effort and so is quicker, more comprehensive and more cost-effective to gather. With technological advancement and innovation, synthetic data is becoming richer, more diverse and more closely aligned to real-world data. It can help to maintain user privacy and keep enterprises compliant, all of which furthers its ability to overcome the potential for data drift.

About the Author

Steve Harris, CEO of Mindtech, has over 30 years of experience in the technology market sector and holds a master’s in Microprocessor Engineering from Manchester University. He has previously been instrumental in creating several European start-up organisations, with a proven track record of success in building strategic relationships and strong revenue streams with tier one companies worldwide. Prior to his current role, he worked in a number of senior sales and business development positions at leading technology companies such as Imagination Technologies, Gemstar, Liberate and Sun Microsystems, allowing him to bring a wealth of insight and expertise to Mindtech.
