The Decade of Synthetic Data is Underway


In the future, when we look back at the development of Artificial Intelligence, the 2020s will be remembered as The Decade of Data. After nearly a decade of model-centric thinking dominating the fields of machine learning and AI, we're finally seeing a paradigm shift toward a data-centric approach. Rather than spending hundreds of hours fine-tuning an AI algorithm or model, researchers have realized that they can boost AI performance far more effectively by improving their training data. In a relatively short period of time, this reversal has gained widespread acceptance across the research and enterprise communities. And it isn't because scientists love to be wrong. It's because the evidence for data-centrism is undeniable: it works, and it works far better than model tweaking in practically every application imaginable.

At the same time, the development frontier has shifted to ever more sophisticated and complex AI applications. Machines are being trained to perform more tasks across a wider variety of environments and applications, and as a result, the field's data needs have skyrocketed. Even a standard 2D image classification application can require a dataset of tens of thousands of images. And that only addresses quantity, not quality: in addition to sourcing all those images, each must be accurately and consistently labeled. The capabilities being developed today for AR/VR, robotics, logistics, and more are pushing computer vision into the realm of dynamic, 3-dimensional imagery. That's a much taller order than identifying images of cars for a CAPTCHA.

This confluence of factors has created a perfect storm of a bottleneck that has slowed AI development to a relative crawl: we've hit the Manual Data Acquisition Barrier. Thankfully, parallel to that barrier is an open trail already being blazed: synthetic data. Synthetic data is data that hasn't been collected by direct observation. Instead, various strategies are employed to create models, simulations, and systems that can generate realistic data points for AI training purposes. In the aforementioned example of object identification and computer vision, synthetic data comes in the form of computer-generated images, labeled with training identifiers appropriate for one's application.
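To make the idea concrete, here is a minimal toy sketch of the core property the article describes: because the generator places the object, the label and ground truth come for free, with no manual annotation. (The shapes, names, and parameters here are purely illustrative assumptions, not any vendor's actual pipeline.)

```python
import numpy as np

def synth_image(label, size=32, rng=None):
    """Generate one toy synthetic image: a bright square ('box') or a
    bright disk ('ball') on a noisy background, returned with its label.
    The label is known by construction, so annotation is automatic."""
    rng = rng or np.random.default_rng()
    img = rng.normal(0.1, 0.05, (size, size))        # noisy background
    cy, cx = rng.integers(8, size - 8, size=2)       # random object center
    r = 5
    yy, xx = np.mgrid[:size, :size]
    if label == "box":
        mask = (np.abs(yy - cy) <= r) & (np.abs(xx - cx) <= r)
    else:  # "ball"
        mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    img[mask] = 1.0
    return img.clip(0, 1), label

# A perfectly labeled, perfectly balanced dataset of arbitrary size.
rng = np.random.default_rng(0)
data = [synth_image(lbl, rng=rng) for lbl in ["box", "ball"] * 500]
```

Real synthetic data platforms of course render photorealistic 3D scenes rather than toy shapes, but the economics are the same: dataset size and class balance become parameters you choose, not constraints you inherit.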

In actuality, AI's "Decade of Data" will be a "Decade of Synthetic Data," although that doesn't roll off the tongue quite as nicely. Poeticism aside, synthetic data is the trap door that will allow scientists and researchers to get past the brick wall of manual data acquisition. At the moment, however, it's still a very young field. In fact, over 71% of the 21 companies in G2's Synthetic Data software category were founded in just the past four years. But technology moves quickly, and synthetic data is no exception. In the field of computer vision, for example, we've already developed the algorithms needed to support the move to 3D, but the limitations of manual data collection have bottlenecked the development process. Thankfully, synthetic data capabilities are quickly catching up, and platforms can already generate fully 3D, hyper-realistic image data, in context, complete with 3-dimensional ground truth.

Although the way is already being paved for synthetic data, it would be misleading to suggest it doesn't have obstacles of its own. Chief among those barriers are the three key resources: time, money, and talent. Creating a synthetic data organization is a major undertaking, requiring not only a large amount of capital, but also a large, multidisciplinary team with expertise in areas that have only just moved beyond the experimental confines of academia, along with solid integrations between those areas. Multiple Fortune 500 tech companies have tried to develop such an outfit internally and failed.

They instead turned to the small cluster of synthetic data startups already in operation. Perhaps most notably, Facebook acquired the synthetic data startup AI Reverie and substantially expanded its team.

Whether through acquisition, investment, or traditional growth, it's clear that, one way or another, the synthetic data decade will be led by this small batch of lean startups with the requisite expertise. In the coming years, we can expect development in the sector to take place around a handful of focal points, including analytics tools, cloud capabilities, and blended media. Right now, the only reliable way to measure a dataset's efficacy is to see how well the AI ultimately performs after being trained with it. This approach will soon be strengthened by analytics tools designed specifically for training datasets. These tools will enable faster, more targeted assessment of AI performance, highlight specific weak spots in datasets, and recommend what data is underrepresented or absent. Combined with large-scale, high-performance computing (HPC) cloud architecture, they will make iteration faster, more targeted, and more efficient.
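The simplest version of the dataset analysis described above is a class-balance audit: count the labels and flag any class whose share falls below a chosen floor. The sketch below assumes a flat list of string labels and an illustrative 10% threshold; real tools would also examine attributes like pose, lighting, and occlusion.

```python
from collections import Counter

def underrepresented(labels, min_share=0.1):
    """Return the classes whose share of the dataset falls below
    min_share. (min_share is an illustrative threshold, not a
    standard value.)"""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_share)

# A dataset that looks large but is badly skewed toward one class.
labels = ["car"] * 800 + ["bus"] * 150 + ["bike"] * 50
print(underrepresented(labels))  # → ['bike']
```

With synthetic data, acting on such a report is trivial: the generator simply produces more of the missing class until the distribution matches the target.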

We are also beginning to see powerful new methods for generating synthetic data take shape, such as the blending of real and synthetic image data to create unique, hybrid inputs, and even processes for transforming 2D images into 3D. At the same time, increasingly sophisticated software solutions are giving data engineers more and more control over more and more parameters. These rapidly evolving capabilities, among others, will make data synthesis an even more attractive solution for data scientists looking to accelerate the AI development process. With the ability to make fast, informed changes of practically any kind to one's dataset, synthetic data will soon take the commanding role across practically every stage of the research and development lifecycle, and make the 2020s AI's undisputed Decade of Synthetic Data.
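The article doesn't specify how real and synthetic data are combined, but the simplest form of such hybrid inputs is a pixel-wise alpha blend, sketched below under the assumption that both images are arrays scaled to [0, 1] with matching shapes; production systems would more likely composite rendered objects into real backgrounds.

```python
import numpy as np

def blend(real, synthetic, alpha=0.5):
    """Pixel-wise alpha blend of a real and a synthetic image.
    alpha controls how much of the synthetic image shows through."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    assert real.shape == synthetic.shape, "images must be the same size"
    return (1 - alpha) * real + alpha * synthetic

# A hybrid input is a weighted mix of the two sources:
real = np.full((4, 4), 0.2)       # stand-in for a real photo
synth = np.full((4, 4), 0.8)      # stand-in for a rendered image
hybrid = blend(real, synth, alpha=0.25)  # each pixel ≈ 0.35
```

Varying alpha per sample is one cheap way to expand a small pool of real images into a much larger, more diverse training set.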

About the Author

Gil Elbaz is CTO and Co-founder of Datagen. Datagen is a pioneer in the new field of Simulated Data aimed at photo-realistically recreating the world around us, with a focus on humans and human-environment interaction. The company works with innovative companies across a wide range of industries and is supported by some of the most respected investors in the AI field. Gil received his B.Sc and M.Sc from the Technion. His thesis research was focused on 3D Computer Vision and has been published at CVPR, the top computer vision research conference in the world.

Sign up for the free insideBIGDATA newsletter.

