Why Synthetic Data is Beating the Real Thing

With real-world training data expensive to capture and labor-intensive to manage, synthetic data is rapidly taking its place. A long list of advantages is driving adoption and will ultimately see synthetic data form the majority of the data used in algorithm training over the coming years. 

It’s no longer just an intriguing, futuristic strategy — creating your own high-quality, scalable stockpile of AI training data is rapidly becoming the norm. In fact, Gartner predicts that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.

The ease and cost-effectiveness of using synthetic data in machine learning projects will also help drive the adoption of AI more broadly. A recent PwC study found that 52% of the companies surveyed had accelerated their plans to leverage synthetic data because of restrictions imposed by COVID.

Eliminates arduous tasks 

Having experienced the benefits of synthesizing their own data, these companies have no reason to ever go back to collecting and tagging real-world data. Why? Because synthetic data solves a critical bottleneck that often prevents organizations from moving forward with AI projects: the arduous task of physically collecting, analyzing, and manually annotating massive quantities of images – only to find, all too often, that the data is incomplete.

More specifically, the real-world data they collect often cannot cover rare or truly unique edge cases – situations that may never have been captured in imagery, but that must still be detectable.

And that’s where synthetic data leaps ahead. Since it’s artificially generated, it comes perfectly labeled as part of the process. And that process is incredibly cost-effective and time-efficient: as it’s computer-generated, developers can produce a great many images per second. As put succinctly by Lux Research, “[Synthetic data] can fill the gap between the supply and demand of big data while also reducing the complexities of sharing data.”

A further benefit is the ability to customize and focus one’s data to address precisely the types of scenarios the AI will be looking for.  “Real-world data is really just a snapshot of the situation,” commented Danny Lange, senior VP of AI and machine learning at Unity, during Transform 2021. “What you can do with the synthetic data is augment that real world with special use cases, special situations, special events. You can improve the diversity of your data by adding synthetic data to your data set.” 

And there is another benefit — ensuring your data is unbiased. “You have to have the real world as a baseline, but you can eliminate bias,” said Lange. “You can do your data analytics and ensure that your data represents the real world in a very even way, better than the real world does.”

Dramatically improving accuracy 

In dozens of sectors, the key players adopting AI to drive business decisions are quickly learning that unexpected and unusual cases are almost as important as the standard, predictable ones. The more data you have to train with – often images with limitless potential for variation and nuance – the more accurate your predictive models become when they are developed to trigger action.

Speedier performance does accelerate go-to-market plans, but just as importantly, synthetic data cuts costs. Data created virtually eliminates the need to install and maintain cameras, run expensive drone and satellite operations, or manage fleets of IoT devices – along with all the other equipment and personnel required to record the data.

It also eliminates the cost of error-prone human annotators (often expensive subject-matter experts) who manually tag each image, sometimes with hundreds or thousands of notes. Because the data is produced by a computer, the same production process automatically, consistently, and accurately tags each image as it’s created. The amount of (accurate!) data that can be produced is limited only by the speed and capabilities of the computers producing it.
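To make that point concrete, here is a minimal, hypothetical sketch in plain Python – toy rectangles standing in for rendered imagery – of why synthetic data arrives pre-labeled: the generator knows exactly what it drew, so the annotation is a free by-product of generation rather than a separate manual step. The function name and label format are illustrative, not any particular product’s API.

```python
import random

def generate_labeled_image(width=64, height=64):
    """Render a toy 'image' (a 2D grid of 0s and 1s) containing one
    rectangle, and return it together with its ground-truth label.
    Because we placed the object ourselves, its bounding box is known
    exactly -- no human annotator is ever involved."""
    # Randomly place a rectangle; its coordinates ARE the annotation.
    x0 = random.randint(0, width - 10)
    y0 = random.randint(0, height - 10)
    x1 = x0 + random.randint(5, 9)
    y1 = y0 + random.randint(5, 9)
    pixels = [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0
               for x in range(width)] for y in range(height)]
    label = {"class": "rectangle", "bbox": (x0, y0, x1, y1)}
    return pixels, label

# A "dataset" of perfectly annotated samples, limited only by compute.
dataset = [generate_labeled_image() for _ in range(100)]
```

A real pipeline would swap the toy grid for a rendering engine, but the principle is the same: the scene description used to generate each image doubles as its perfectly consistent annotation.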

The importance of accurate tagging cannot be overstated. No matter how skilled the developer, AI algorithms trained on “dirty data” simply cannot deliver the business benefits expected of a commercial product. Any errors introduced during training not only create extra work, but can prove disastrous for a client making concrete business decisions based on the model’s output.

Key sectors rapidly embracing synthetic data 

With all these benefits, what industries are migrating most quickly to synthetic data?

Infrastructure: Gas and oil pipelines, train tracks, bridges, and other systems covering massive expanses of land are incredibly difficult to monitor. Experts flying to locations across the globe to manually inspect these assets are quickly being replaced by more efficient drones and satellites taking high-resolution images. The challenge, of course, is that each potential flaw must be compared against the AI’s reference collection of known defects, and flight angles, time of day, shadows, color variations, and more can alter the appearance of damage to the point where it becomes undetectable. Creating millions of computer-crafted variations, automatically and inexpensively, provides an exponentially larger catalog against which to compare each image.

These same benefits apply to insurers looking to assess risk in properties they cover, simulating the repercussions of a particular weakness. Each property is different, as are the problems they might encounter. Only a comprehensive and expansive library built on synthetic data can cover these unprecedented cases.

Urban Planning: The cities of tomorrow will be designed (or re-designed) to better incorporate traffic, utilities, zoning, and green areas, all carefully balanced against air quality and noise pollution. To simulate the myriad combinations and potential synergies or conflicts, the AI needs to “imagine” data points that may not yet exist. Synthetic data can drive that analysis – especially for futuristic components that literally do not exist today.

Defense: In this massive, global market, military planners grapple with threat mitigation, risk assessment, and predicting every possible outcome of a battlefield scenario. Almost as unpredictable and chaotic as the weather, these scenarios can only be designed and simulated using AI. They are, of course, military situations that have never occurred, so there is no real-world data to feed into the AI engine. Once again, synthetic data takes center stage: every weapon, vehicle, and human participant can be modified and tweaked to simulate more combinations than a history book could ever describe.

When it comes to food, art, and friendships, we prefer the authentic, the natural, the “real.” Not so when training AI models. As strange as it sounds, artificial, manufactured versions of simulated reality offer significant benefits over real-world data collection. In short, there are too many scenarios, too many variations, too many unexpected deviations from the norm to believe that any organization could ever record and tag enough of them to create a truly effective analytical AI engine. And there is a concrete, measurable dollar value in getting an AI-driven product to market before the competition does.