Mixing It Up: The Benefits of Blending Synthetic and Real-World Data
To optimize an algorithm’s accuracy, data sets are evolving to feature a mix of synthetic and real data. But what should this mix look like, why is it the optimal solution and what are the benefits?
Within two years, 60% of the data used to develop AI and machine learning projects will be synthetic, according to Gartner. Its key role will be to expand and fill out data sets that have gaps or issues such as inconsistent annotation, privacy constraints, and bias. Even though algorithms can be trained on synthetic data alone, real imagery will still play an important part in fine-tuning the synthetic element of the dataset. Why? Because even a small addition of real images has been shown to improve algorithm results.
The Ideal – But Unattainable – Scenario
In a perfect world, we’d train machine learning models on massive, comprehensive libraries of real-world training data. This imagery would be painstakingly (but perfectly) annotated by subject-matter experts, so it could quickly and effectively teach the AI precisely what to look for. And while we are imagining this ideal, let’s have the collection cover every possible scenario: every object, camera angle, lighting situation, and any permutation or variable the final AI implementation might encounter.
However, there is one problem with this ‘perfect’ scenario. Anyone working with machine learning can tell you — it is never going to happen.
As we discussed in a previous blog post, the fundamental flaws in relying on real-world data alone present too many disadvantages for it to be a viable option, due to the following factors:
1. Cost – Data collection and annotation are expensive: large volumes of data must be bought or produced in-house, then manually annotated by large teams of human annotators.
2. The Human Factor – Once collected, real-world data must be manually annotated by subject-matter experts who are expensive to engage, take substantial time to work through massive amounts of content, and produce results that contain errors and inconsistencies, especially once fatigue sets in after many hours on the same task. These inconsistencies can feed the algorithm conflicting guidance and undermine training.
3. Limits on scope – Even if we overcame the previous challenges with massive budgets and armies of perfectionist annotators, real-world data rarely includes all possible scenarios. Edge cases that occur rarely (or never at a moment when they can be captured) still need to be represented in the training set so the model can handle them when they do occur.
4. Privacy – In many cases, complex regulatory hurdles and ethical quagmires prevent the photographic capture of people and locations without explicit permission. For some AI applications, this makes training on real imagery practically impossible.
The Optimal Solution
Synthetic training data simultaneously solves all four problems. It eliminates the cost of physically capturing massive amounts of real-world data. It eliminates the flawed, expensive, time-consuming human annotation because annotation is automatically embedded in synthetic data. Without the limitation of locating real-world examples, developers can create almost infinite permutations with every foreground and background element imaginable, from every angle, in every combination, lighting scenario, and more.
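To make the point about embedded annotation concrete, here is a minimal, purely illustrative sketch of a toy "renderer": the vehicle class, image size, and colours are hypothetical, but it shows how the label falls out of the generation process for free.

```python
# Illustrative sketch only: a toy renderer showing why synthetic data
# arrives pre-annotated. The "vehicle" class and sizes are hypothetical.
import json
import random
from PIL import Image, ImageDraw

def render_synthetic_sample(width=256, height=256):
    """Draw a random rectangle (a stand-in "vehicle") on a plain background
    and return the image together with its automatically known bounding box."""
    image = Image.new("RGB", (width, height), color=(200, 200, 200))
    draw = ImageDraw.Draw(image)

    # Because we place the object ourselves, its label costs nothing to produce.
    w, h = random.randint(20, 60), random.randint(20, 60)
    x, y = random.randint(0, width - w), random.randint(0, height - h)
    draw.rectangle([x, y, x + w, y + h], fill=(30, 30, 120))

    annotation = {"class": "vehicle", "bbox": [x, y, x + w, y + h]}
    return image, annotation

if __name__ == "__main__":
    img, ann = render_synthetic_sample()
    img.save("sample_0.png")
    print(json.dumps(ann))  # the annotation is a by-product of generation
```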
There is an additional benefit to synthetic data. The collection and annotation of real-world training data can create a frustrating bottleneck and delay AI projects; even the most potentially world-changing algorithm can’t be trained without sufficient high-quality data. Synthetic data boosts not only the results but the process itself. Starting a project from a cold start, with no data and no annotations, means collecting and annotating a first batch before any training can begin, and then repeating that cycle several times. With synthetic data you can give yourself a warm start instead: training can begin immediately, the process accelerates, and time to deployment shrinks. In other words, synthetic data gets you to the same level of results, but much more quickly.
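As a rough sketch of that warm start, the example below runs the same training loop twice: first on abundant synthetic data, then fine-tuning on a smaller real set. It assumes PyTorch and two DataLoaders, synthetic_loader and real_loader, that are built elsewhere; the epoch counts and learning rates are placeholders rather than recommendations.

```python
from torch import nn, optim

def train(model, loader, epochs, lr):
    """One generic supervised training loop, reused for both phases."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Phase 1: warm start on abundant synthetic data, before any real data exists.
# model = train(model, synthetic_loader, epochs=20, lr=1e-3)
# Phase 2: fine-tune on the smaller real dataset once it has been collected.
# model = train(model, real_loader, epochs=5, lr=1e-4)
```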
The Perfect Mix
The advantage of combining real-world and synthetic data was recently illustrated by OneView and Airbus Defence and Space (Intelligence) in a proof-of-concept study. Using Airbus’s satellite imagery, OneView created one dataset of real data, a second of synthetic data, and a third that blended the two.
The result? While the real-only dataset produced an accuracy score of 82%, the mix of synthetic and real data achieved a score of ~90%, roughly eight percentage points higher than real data alone.
The performance gain from adding synthetic data comes from the ability to include edge cases, the variation that can be introduced, and the virtually endless range of scenarios that can be generated.
The implications are important. Even if just 5-10% of your data is real world – a high-quality, verified core collection – you can boost performance by creating synthetic images that expand the scope of your dataset and inject edge cases. This blending of the two types harnesses the best of each, eliminates their weaknesses, and yields optimum results.
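For a concrete picture of such a mix, here is a minimal sketch, assuming PyTorch Dataset objects real_ds and synthetic_ds built elsewhere; the batch size is arbitrary.

```python
from torch.utils.data import ConcatDataset, DataLoader

def build_mixed_loader(real_ds, synthetic_ds, batch_size=32):
    """Combine a small real dataset with a large synthetic one and report
    what fraction of the resulting training set is real imagery."""
    mixed = ConcatDataset([real_ds, synthetic_ds])
    real_fraction = len(real_ds) / len(mixed)
    print(f"Real fraction of training set: {real_fraction:.1%}")
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)

# Example: ~5,000 synthetic images around a verified core of ~500 real ones
# would give a real fraction of roughly 9%.
```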
A Focus on Quality
Until now, the focus has been on developing new architectures that run faster and train better algorithms, so the main goal has been to feed these algorithms as much data as possible in the hope of squeezing out another small percentage of performance.
However, there is a growing understanding that this is not optimal. According to Andrew Ng, Founder and CEO of Landing AI, the focus should not be on throwing lots and lots of data into the training set, but on throwing in good data. He believes we should understand what data is missing and add it. His message is simple: if you want better results, focus on optimizing your data. Synthetic data helps you achieve this by filling the gaps in your datasets, once you know where those gaps are. You now have the ability to quickly improve the quality of your dataset, and your algorithmic performance, using the perfect mix of synthetic and real data.
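As a simple illustration of "find the gaps, then fill them", the sketch below counts class frequencies and flags under-represented classes as candidates for targeted synthetic generation. The threshold and the generate_synthetic_samples helper are hypothetical.

```python
from collections import Counter

def find_underrepresented_classes(labels, min_count=100):
    """Return classes that appear fewer than min_count times: these are the
    gaps that synthetic generation should target first."""
    counts = Counter(labels)
    return sorted(cls for cls, n in counts.items() if n < min_count)

# Hypothetical usage:
# gaps = find_underrepresented_classes(training_labels)
# for cls in gaps:
#     generate_synthetic_samples(cls)  # hypothetical synthetic-data generator
```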