How We Use Synthetic Data to Improve Performance and Break Away from Dataset Constraints


In this contributed article, Jan Lunter, CEO & CTO of Innovatrics, highlights how synthetic data is an efficient technology to supplement datasets with types of data that are underrepresented. Jan graduated from Télécom ParisTech in France. He is the co-founder and CEO of Innovatrics, which has been developing and providing fingerprint recognition solutions since 2004. He is the author of an algorithm for fingerprint analysis and recognition that regularly ranks among the top performers in prestigious comparison tests (NIST PFT II, NIST MINEX). In recent years, he has also been working on image processing and the use of neural networks for face recognition.

Recent advances in generative adversarial networks (GANs) make it practical to generate synthetic data for a wide range of machine learning (ML) applications. Several years ago, we started training neural networks for optical character recognition (OCR) tasks using synthetic data. We generated synthetic IDs to teach neural networks to read them reliably even in suboptimal conditions, such as documents with scratches or glare.
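To make the idea of "suboptimal conditions" concrete, here is a minimal sketch of how glare and scratches might be added to a synthetic ID image. It is an illustration only, not the pipeline used at Innovatrics: the function names (`add_glare`, `add_scratches`, `degrade`), the Gaussian glare model, and all parameter values are assumptions, and the image is assumed to be a grayscale NumPy array.

```python
import numpy as np

def add_glare(image, center, radius, strength=0.8):
    """Brighten a soft circular region to mimic glare (assumed Gaussian falloff)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    mask = np.exp(-dist2 / (2.0 * radius ** 2))
    out = image.astype(np.float32) + strength * 255.0 * mask
    return np.clip(out, 0, 255).astype(np.uint8)

def add_scratches(image, rng, count=3, value=255):
    """Draw a few thin straight lines between random points to mimic scratches."""
    out = image.copy()
    h, w = out.shape
    for _ in range(count):
        y0, y1 = (int(v) for v in rng.integers(0, h, size=2))
        x0, x1 = (int(v) for v in rng.integers(0, w, size=2))
        n = max(abs(y1 - y0), abs(x1 - x0)) + 1  # one sample per pixel step
        ys = np.linspace(y0, y1, n).round().astype(int)
        xs = np.linspace(x0, x1, n).round().astype(int)
        out[ys, xs] = value
    return out

def degrade(image, seed=0):
    """Apply glare, then scratches, to a grayscale uint8 image (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    center = (int(rng.integers(0, h)), int(rng.integers(0, w)))
    glared = add_glare(image, center, radius=min(h, w) / 4)
    return add_scratches(glared, rng)

# Usage: a flat gray placeholder stands in for a rendered synthetic ID.
clean_id = np.full((64, 96), 100, dtype=np.uint8)
noisy_id = degrade(clean_id, seed=1)
```

Because the degradation is seeded, the same clean image can be paired with many reproducible variants, and the known text on the synthetic ID serves as the OCR label for each of them.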

In the real world, we would never be able to gather a dataset as large as the technology requires. Small countries in particular do not have enough citizens to provide the robust real-world dataset the model needs. That is why synthetic IDs fit the bill perfectly.

We also recently started research and development projects that generate synthetic fingerprints to improve the algorithms that identify fingerprint fragments, known as latent fingerprints. Latent fingerprints are usually found at crime scenes, so their analysis can support law enforcement agencies.

However, as with face recognition models, obtaining a dataset to train latent fingerprint algorithms is extremely difficult: ML training requires data that is high in quality, collected with consent, and large in volume. Now that we can artificially generate fingerprint fragments that meet the required standards, we can expect the algorithms to improve identification performance, even for fragmented or low-quality fingerprints.

Last but not least, synthetic data is an efficient technology to supplement datasets with types of data that are underrepresented. This is especially true for facial recognition. Generating high-fidelity faces for ML has several advantages besides decreasing bias: synthetic faces do not infringe on personal rights, do not require consent, and can be customized to meet specific needs and goals. For example, for age verification applications, we can generate faces that are right at the edge of the critical ages, from 18 to 21, without breaching ethical standards or the legal rights that minors have.
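As a rough illustration of what "supplementing underrepresented data" means in practice, the sketch below computes how many synthetic samples each group would need to match the largest group. The helper name `synthetic_top_up`, the age buckets, and the counts are all hypothetical and chosen only for the example; real balancing targets depend on the application.

```python
def synthetic_top_up(counts, target=None):
    """Return how many synthetic samples each group needs to reach `target`.

    `counts` maps a group label to its number of real samples. If `target`
    is omitted, the size of the largest group is used (an assumption).
    """
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    return {group: max(0, target - n) for group, n in counts.items()}

# Hypothetical real-sample counts per age bucket near the critical ages.
real_counts = {"18": 120, "19": 400, "20": 250, "21": 400}
needed = synthetic_top_up(real_counts)
```

In this made-up example, the "18" bucket would need 280 generated faces and the "20" bucket 150, while the two largest buckets need none.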
