Can Synthetic Data Really Improve Algorithm Accuracy?

OneView and Airbus Defence and Space (Intelligence) set out to demonstrate that training an aircraft-detection machine learning algorithm primarily on synthetic images not only increases its accuracy but also allows for more sophisticated detection algorithms.

The limited availability of a wide variety of labeled data has always been the major bottleneck holding back the accuracy of AI models developed for geospatial analytics. Collecting real-world, annotated images that capture every scenario relevant to detecting objects of interest is incredibly time consuming, costly, error prone and, in many cases, unrealistic.

So, while numerous object-detection architectures exist, accurate results usually remain out of reach because there is simply not enough reliable training data with which to optimize them.

Synthetic data can mimic any use case and stand in for real images, removing the need to manually collect thousands or even hundreds of thousands of them. Synthetic data also comes automatically annotated, making it ready for use in training machine learning algorithms. This not only saves vast amounts of time and resources, but also improves annotation accuracy and consistency.

Testing the Synthetic Data Hypothesis 

To put the use of synthetic images to the test, OneView joined forces with Airbus Defence and Space (Intelligence), one of the leading providers of geospatial imagery, with over 30 years of in-orbit operation and experience with high-resolution satellite imagery. The aim was simple: to assess whether synthetic data can truly supplement or replace real images and increase the accuracy of machine learning object-detection models for geospatial intelligence.

To achieve this, we trained a super-category “Airplane” detection algorithm on three different training datasets. The first consisted of only real data, the second was composed exclusively of synthetic data, and the third used a combination of real and synthetic data (referred to as “mix”).
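
The post does not describe the training stack itself, so the sketch below is only a rough illustration of how the three runs could be organized. The dataset paths, configuration fields and the stubbed train_detector function are hypothetical; any off-the-shelf detector could be plugged in their place.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExperimentConfig:
    """One training run of the same airplane detector on a different data mix."""
    name: str
    image_dirs: list[Path]  # hypothetical locations of annotated image sets


# Hypothetical dataset locations; the data used in the study is not public.
REAL = Path("data/real_pleiades")
SYNTH = Path("data/oneview_synthetic")

EXPERIMENTS = [
    ExperimentConfig(name="real_only", image_dirs=[REAL]),
    ExperimentConfig(name="synthetic_only", image_dirs=[SYNTH]),
    ExperimentConfig(name="mix", image_dirs=[REAL, SYNTH]),
]


def train_detector(config: ExperimentConfig) -> None:
    # Placeholder: plug in any off-the-shelf detector (e.g. a Faster R-CNN or
    # YOLO implementation) and train it on the images under config.image_dirs.
    print(f"[{config.name}] training on: {[str(d) for d in config.image_dirs]}")


if __name__ == "__main__":
    for experiment in EXPERIMENTS:
        train_detector(experiment)
```

The sketch assumes the detector architecture and hyperparameters stay fixed across the three runs, so that any difference in accuracy can be attributed to the data mix alone.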

Real Pléiades satellite images with aircraft annotations per class: yellow – light aircraft, purple – fighter, green – jets/commercial, maroon – bomber, orange – other military

In addition, we set up a more complex task and compared the performance of a second algorithm, meant to classify the airplanes into different sub-categories.

As part of this test, OneView recreated full 3D replicas of airports, both commercial and military (to learn more, follow this blog describing our creation process). These contained the features, structures and variables of actual airports, such as terminals, aprons, runways, taxiways, parking lots, buildings, etc. Within each model, objects of interest were positioned alongside other objects, such as plane parts and related airport equipment, that act as distractors and challenge detection accuracy.

Images of different airport scenes – Top: Beijing Capital International Airport, Bottom: Paris Charles de Gaulle Airport
Left to right: real Pléiades imagery, OSM map, OneView’s automatically built 3D scene
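
The post doesn't detail how airport layouts are extracted from OpenStreetMap, so the snippet below is only a hypothetical illustration: it queries the public Overpass API for aeroway features (runways, taxiways, aprons, terminals) around a coordinate, which is the kind of vector data a 3D scene could be generated from. The endpoint, tags and coordinates are standard OSM conventions and example values, not OneView's actual pipeline.

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def fetch_airport_features(lat: float, lon: float, radius_m: int = 4000) -> dict:
    """Fetch aeroway geometries (runways, taxiways, aprons, terminals) around a point."""
    query = f"""
    [out:json][timeout:60];
    (
      way["aeroway"~"runway|taxiway|apron|terminal"](around:{radius_m},{lat},{lon});
    );
    out geom;
    """
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=90)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Approximate coordinates of Beijing Capital International Airport.
    features = fetch_airport_features(40.08, 116.60)
    print(f"Fetched {len(features['elements'])} aeroway elements")
```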

Our large-scale 3D model catalogue covered five aircraft categories (light aircraft, fighter, jets/commercial, bomber and other military) and contained over 140 unique models. Each model used multiple materials, textures, color combinations and model configurations (e.g. folding wings). To allow maximum variability, different weather conditions, viewing angles and time-of-day parameters were configured on OneView’s platform, generating a wide range of possibilities and scenarios. The randomized combination of these factors produced a very large and varied database of synthetic images, providing a solid training dataset.
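
OneView's generation platform itself is not public; as a loose sketch of the randomization idea described above, the snippet below draws a random combination of scene factors (aircraft category and model, weather, time of day, viewing geometry) for each synthetic image. All parameter names and value ranges are illustrative assumptions, not the platform's actual controls.

```python
import random

# Illustrative parameter spaces; the real platform exposes far richer controls.
AIRCRAFT_CATEGORIES = [
    "light_aircraft", "fighter", "jet_commercial", "bomber", "other_military",
]
WEATHER = ["clear", "overcast", "haze", "light_rain"]


def sample_scene_parameters(rng: random.Random) -> dict:
    """Randomly combine scene factors for one synthetic image."""
    return {
        "aircraft_category": rng.choice(AIRCRAFT_CATEGORIES),
        "model_id": rng.randrange(140),          # index into the ~140-model catalogue
        "weather": rng.choice(WEATHER),
        "time_of_day_hours": rng.uniform(6.0, 18.0),
        "off_nadir_angle_deg": rng.uniform(0.0, 30.0),
        "sun_azimuth_deg": rng.uniform(0.0, 360.0),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(sample_scene_parameters(rng))
```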

Transition from OneView’s automatic segmentation masks to Airbus Defence and Space (Intelligence) diamond-shape annotation
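
The post doesn't specify how the conversion from segmentation masks to diamond-shape annotations is performed. One simple, hypothetical way to derive a four-point diamond from a mask is to take its extreme pixels, as sketched below; the actual Airbus Defence and Space (Intelligence) annotation convention may well differ.

```python
import numpy as np


def mask_to_diamond(mask: np.ndarray) -> np.ndarray:
    """
    Convert a boolean segmentation mask to a four-point 'diamond' polygon
    using the mask's extreme pixels (top, right, bottom, left).
    Returns an array of (row, col) points.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("Empty mask")
    top = (rows.min(), cols[rows.argmin()])
    bottom = (rows.max(), cols[rows.argmax()])
    left = (rows[cols.argmin()], cols.min())
    right = (rows[cols.argmax()], cols.max())
    return np.array([top, right, bottom, left])


if __name__ == "__main__":
    # Toy mask standing in for an aircraft segmentation.
    mask = np.zeros((10, 10), dtype=bool)
    mask[3:7, 2:9] = True
    mask[5, 1] = True
    print(mask_to_diamond(mask))
```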

A Higher Degree of Accuracy 

In super-category “airplane” detection, OneView’s synthetic-only dataset achieved a higher degree of accuracy than the best real-only results, scoring 88% versus 82%. The mix dataset scored ~90%, an improvement of roughly 8 percentage points over real-only data.

A more challenging problem than straightforward “airplane” detection is the classification of the different airplane categories. Here, the synthetic dataset achieved results on par with the real dataset, while the “mix” results were superior to the real results across every class.

Test results of the Real/Syn/Mix datasets
Left: at the super-category level; right: at the five sub-category level

These results provide evidence that synthetic data is a viable option for creating new models where real-world data doesn’t yet exist, and that where real-world data does exist, synthetic data can enhance its use.

According to the results of the test, an AI model trained with both real and synthetic geospatial imagery will outperform a model trained exclusively on real geospatial imagery, both in terms of the time needed for data collection and in terms of model accuracy.

The results provided a proof of concept that impressed even a company of the stature of Airbus Defence and Space (Intelligence).

“The case study performed with OneView has exceeded our initial expectations,” says Jeff Faudi, Airbus Defence and Space (Intelligence), Toulouse, France. “On this well-known AI detection topic, it has been possible to achieve high performances that we did not expect to be reachable with synthetic data. It confirms the value of using synthetic data as a complementary approach to traditional methods based upon real data labelling. When faced with the need to improve a model on rare objects, synthetic data can make it much easier to retrain than with real imagery, where collection and annotation of additional relevant rare objects is long and sometimes not possible.”

Detection examples from our test set, after training with OneView’s synthetic data