Combining Synthetic Data with Real Data to Improve Detection Results in Satellite Imagery: Case Study

There’s a lot to unpack in the title above, so let’s first do that quickly to make sure we’re all on the same page.

Synthetic data is data that is created; real data is data that is collected. To dive deeper into the what and why of synthetic data, check our blog post explaining the merits and benefits of synthetic data.

When we say “detection” we refer to Object Detection – the act of identifying different objects in satellite images for intelligence gathering. When object detection is needed at scale, like going over thousands to millions of images, ML algorithms are called in.

Data-driven algorithms need training. Their training consists of ‘going over’ massive amounts of annotated data, meaning data that was prepared for training; in our case, images in which the objects to be detected were marked. This is how machines learn to generalize – this is a car, this is a car, this is a car from a different angle, this is a car at night, when it rains, in blue, green, yellow… eventually, ‘they’ get it.

With the title unpacked, let’s ask the obvious question:

Why use synthetic data for object detection in satellite imagery?

Simply put, because the collection and preparation of real data is bottlenecked, slowing down the entire process of ML-based imagery analytics. The main hurdle is the manual annotation of the dataset, which is understandably time-consuming, error-prone and entails a lot of back and forth to get the annotations just right. This is especially true for satellite imagery, where the objects are hard to spot, relatively small and of low resolution.

Additionally, the diversity of images needed for effective ML model training isn’t always available. That can include elevation angles, sensor variety, different weather conditions, time of day and so on. 

Synthetic data on the other hand offers limitless customization. Since synthetic data is created from scratch, any specification can be accommodated, leaping over the diversity hurdle.

More importantly, it can release the collection and preparation bottleneck as synthetic data is created swiftly, and the annotation is ‘built-in’ into the creation process; the imagery dataset comes out already annotated, immediately ready for model training.

This is not to say that synthetic data can completely replace real data for ML model training; not yet at least, but synthetic data can push forward machine learning model training. As we’ll show, combining real and synthetic data delivers the best results.

The Case Study: Proving the validity of training detection models based on synthetic data

Our goal for the case study was to compare the detection results of models trained on real data with those of models trained on OneView’s synthetic data. To reduce the expertise required for a satellite-oriented use case, we chose a familiar object in an urban environment: buses in Manhattan.

While a variety of satellites with corresponding sensor parameters exist, we focused on the best commercial satellite currently available: the MAXAR WorldView 3 (WV3), with a ground sampling distance of 0.31 m.

It is worth noting that although a bus is relatively large from a human point of view, from a satellite perspective a bus occupies, at best, an area of about 500 pixels, making it a small object and thus harder to detect.
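To get a feel for these numbers, here is a minimal sketch that converts an object's physical size into a pixel footprint at a given ground sampling distance. The bus dimensions are our own illustrative assumptions, not measurements from the case study.

```python
import math
from typing import Tuple


def footprint_pixels(length_m: float, width_m: float, gsd_m: float) -> Tuple[int, int, int]:
    """Return (length_px, width_px, area_px) of a rectangular object.

    Each side is rounded up, since a partially covered pixel still
    contains part of the object.
    """
    length_px = math.ceil(length_m / gsd_m)
    width_px = math.ceil(width_m / gsd_m)
    return length_px, width_px, length_px * width_px


# Assumed dimensions (hypothetical): a 12 m city bus and an 18 m articulated bus,
# both 2.55 m wide, seen at the 0.31 m GSD mentioned in the text.
standard_bus = footprint_pixels(12.0, 2.55, 0.31)
articulated_bus = footprint_pixels(18.0, 2.55, 0.31)
print(standard_bus, articulated_bus)
```

Under these assumptions a standard bus covers a few hundred pixels and only a long articulated bus reaches the 500-pixel range.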

Real Data Collection and Preparation

The real training set, which serves as our baseline for comparison, consists of chips cropped from WV3 imagery covering the entire area of Manhattan in summertime.

To generate the validation and test sets, we used the same imagery but different areas of Manhattan.

Now, if you are not familiar with electro-optical satellites, rest assured: unless mentioned otherwise, you can assume that there are no special satellite-related artifacts in the image and that we simply have a regular camera positioned at a very high altitude.

For completeness, here’s some technical stuff about the acquisition parameters of the imagery: it was non-ortho-rectified, projected to 8-bit RGB bands using the standard process of MAXAR (Pansharpened, DRA and Acomp on) – specific details for satellite images.

The acquired images were sent to a qualified annotation company to be annotated at the bounding-box level. After several iterations, we found the annotations to be reasonable. However, due to the nature of the problem – the small size of the vehicles – errors were still visible. Therefore, our algorithm team performed an additional annotation pass.

Samples of the marked bounding boxes.

Fortunately, the bus category, specifically for WV3 imagery, also comes fully annotated in a publicly available object-detection dataset named XVIEW. While XVIEW contains a large number of instances, it does not contain imagery from Manhattan or even from North America. This allowed us to address an additional question: how good is a training set whose imagery was taken from a different geographical location?

Synthetic Data Generation

We built several 3D models of city blocks resembling Manhattan, each roughly 500 square meters in size, containing roads, downtown buildings, crossroads, parking lots, etc.

At first, we decided to build our city blocks manually, as it is faster to start with. However, to allow scale, and later the adaptation to different environments, we moved to a procedural approach combining OpenStreetMap (OSM) data and the Houdini Engine, an approach we will touch on in a later blog post.

We use the Unity engine to generate the synthetic data, adding our in-house-developed randomization procedures to almost every element of the 3D environment:

  • Materials of the roads and buildings
  • Shifting and changing road marks such as arrows and stop signs
  • Crosswalks and bus lanes
  • Imperfections in the form of cracks and on-road oil spills

We also randomized the position of related objects in the scene, such as trees, benches, cones, etc. In addition, every object in the scene comes with a bank of materials and controllable parameters, allowing large variety in its appearance.

On top of the above, we have randomized the time of day, and the intensity and color of the sun.
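The randomization described above can be sketched as a per-scene parameter sampler. This is only an illustration in Python, not our Unity implementation: the material names, parameter ranges, and the `SceneParams` structure are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical banks of options; in practice these live with the 3D assets.
ROAD_MATERIALS = ["fresh_asphalt", "worn_asphalt", "concrete"]
BUILDING_MATERIALS = ["brick", "glass", "stone", "painted_concrete"]
ROAD_MARKS = ["arrow", "stop_sign", "crosswalk", "bus_lane"]


@dataclass
class SceneParams:
    road_material: str
    building_material: str
    road_marks: List[str]
    crack_density: float     # 0 = pristine road, 1 = heavily cracked
    oil_spill_count: int
    hour_of_day: float       # local time in hours
    sun_intensity: float     # relative multiplier
    sun_color_kelvin: float  # color temperature of the sunlight


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one random configuration for a rendered scene."""
    return SceneParams(
        road_material=rng.choice(ROAD_MATERIALS),
        building_material=rng.choice(BUILDING_MATERIALS),
        road_marks=rng.sample(ROAD_MARKS, k=rng.randint(1, len(ROAD_MARKS))),
        crack_density=rng.uniform(0.0, 1.0),
        oil_spill_count=rng.randint(0, 5),
        hour_of_day=rng.uniform(7.0, 18.0),
        sun_intensity=rng.uniform(0.6, 1.4),
        sun_color_kelvin=rng.uniform(4500.0, 6500.0),
    )


# One configuration per rendered image; a fixed seed keeps runs reproducible.
rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(3)]
```

Sampling every parameter independently for each image is the simplest choice; a real pipeline might correlate some of them, for example sun color with time of day.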

Here is a peek at what such variety means:

All of the above, as we will see, allowed a model trained on this synthetic dataset to generalize to real images. All of our images were rendered at the ground sampling distance (GSD) of WV3, not only at nadir but at every off-nadir angle.

Post Processing

After the images came out of the simulator, we augmented them with different blurring kernels, saturation levels, and noise. Although our synthetic data is varied, some sensor attributes are unique and must be matched explicitly to allow generalization to real data.

For electro-optical satellites, and specifically the WV3, the most distinctive and visible attribute is PAN-sharpening. To save weight, cost and bandwidth, satellites acquire only a panchromatic image (PAN) at full resolution, while the other multispectral bands, including the color bands (RGB), are acquired at a lower resolution and digitally upsampled in a procedure called PAN-sharpening.
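As an illustration of this kind of augmentation, here is a minimal NumPy sketch that applies a random Gaussian blur, a saturation change, and additive noise to an RGB image with values in [0, 1]. The parameter ranges and function names are assumptions chosen for the example, not the values used in the case study.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an HxWx3 image, with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, out)


def adjust_saturation(img: np.ndarray, factor: float) -> np.ndarray:
    """Blend each pixel with its gray value; factor 1 leaves the image unchanged."""
    gray = img @ np.array([0.299, 0.587, 0.114])
    return gray[..., None] + factor * (img - gray[..., None])


def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply blur, saturation and noise with randomly drawn strengths."""
    out = blur(img, sigma=rng.uniform(0.5, 1.5))            # assumed range
    out = adjust_saturation(out, rng.uniform(0.7, 1.3))     # assumed range
    out = out + rng.normal(0.0, rng.uniform(0.0, 0.02), size=out.shape)
    return np.clip(out, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a rendered chip
augmented = augment(image, rng)
```

Because these operations do not move any pixels, the bounding-box annotations stay valid after augmentation.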

Unfortunately, this upsampling causes blurriness in the color of the output image, as can be seen in the figures below:

To match this process, we followed the same procedure: we reduced our simulated image to a panchromatic image at full resolution and downsampled our RGB image. Then, we applied our PAN-sharpening algorithm, which is similar to [1].

Our final dataset for buses contained 950 images of size 1024 × 1024, with an average of 300 instances in every image. And of course, all of our images come with pixel-perfect annotations, as you can see below.
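The steps above can be sketched in NumPy as follows. Note that this is not our algorithm nor the one in [1]: it swaps in a simple Brovey-style ratio sharpening, approximates the panchromatic band by the mean of the RGB bands, and assumes a resolution ratio of 4 between the panchromatic and color bands.

```python
import numpy as np


def to_pan(rgb: np.ndarray) -> np.ndarray:
    """Approximate a panchromatic band as the mean of R, G and B (assumption)."""
    return rgb.mean(axis=2)


def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Block-average downsampling; height and width must be divisible by factor."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def upsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling back to full resolution."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def simulate_pansharpen(rgb: np.ndarray, factor: int = 4, eps: float = 1e-6) -> np.ndarray:
    """Full-resolution PAN + low-resolution color -> pan-sharpened RGB."""
    pan = to_pan(rgb)                              # sharp intensity
    low_rgb = downsample(rgb, factor)              # what the color sensor would see
    up_rgb = upsample(low_rgb, factor)             # blurry color at full size
    ratio = pan / (up_rgb.mean(axis=2) + eps)      # Brovey-style gain per pixel
    return np.clip(up_rgb * ratio[..., None], 0.0, 1.0)


rng = np.random.default_rng(0)
rendered = 0.3 + 0.2 * rng.random((64, 64, 3))  # stand-in for a simulator output
sharpened = simulate_pansharpen(rendered)
```

The output keeps the sharp intensity of the panchromatic band, while the color information only varies at the coarser resolution, which reproduces the characteristic color blur.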

Training and Network Architecture

We used a Faster R-CNN ResNet-50 [2] architecture with a Feature Pyramid Network [3], pretrained on ImageNet.

The anchor size was set proportionally to the size of the target object, with different aspect ratios. Due to the small size of the objects, every image was upsampled by a factor of 4, and the output bounding box was rescaled back to the original image size. 
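A minimal sketch of the bookkeeping this implies: anchors of a fixed area with several aspect ratios, and a helper that maps boxes predicted on the 4× upsampled image back to original coordinates. The base anchor size and aspect ratios below are hypothetical values, not the ones we trained with.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

UPSAMPLE_FACTOR = 4  # from the text: every image is upsampled by a factor of 4


def anchor_shapes(base_size: float, aspect_ratios: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (width, height) pairs with area base_size**2 and height/width = ratio."""
    shapes = []
    for ratio in aspect_ratios:
        width = base_size / math.sqrt(ratio)
        height = base_size * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


def rescale_boxes(boxes: Sequence[Box], factor: float) -> List[Box]:
    """Map boxes from the upsampled image back to the original image."""
    return [(x1 / factor, y1 / factor, x2 / factor, y2 / factor)
            for x1, y1, x2, y2 in boxes]


# Hypothetical: a bus of roughly 20 px at native resolution becomes ~80 px after upsampling.
anchors = anchor_shapes(base_size=20 * UPSAMPLE_FACTOR, aspect_ratios=[0.25, 1.0, 4.0])

# A detection made on the upsampled image, expressed in original pixel coordinates.
original_boxes = rescale_boxes([(400.0, 800.0, 560.0, 840.0)], UPSAMPLE_FACTOR)
```

Elongated aspect ratios matter here because a bus seen from above is several times longer than it is wide.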

We have trained every network for 150K iterations, selecting the best model according to the validation set and reporting the results on the test set.

Case study detection results: Combination of real and synthetic data wins

The results can be seen in the figure below. First, we trained a baseline using only real data samples, shown in blue. We experimented with using only a portion of the training data and, as expected, as the number of training samples increases, so does the performance of the algorithm on the test set. However, at some point the performance saturates.

Our simulated data (shown in red, and stretched horizontally for visualization, as no real annotations were used) did not merely match the performance of the algorithm trained on real data, it surpassed it! Our synthetic data, not limited by real-world constraints, achieves superior detection results. This is even more evident when compared with an algorithm trained on the XVIEW dataset (marked with a purple triangle).

The results encouraged us to investigate whether we could achieve further improvement by combining the real and synthetic datasets. At training time, we balanced the number of real and synthetic instances, so every batch in the training process was built with an equal number of synthetic and real images.
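One simple way to build such balanced batches is to draw half of each batch from each source, sampling with replacement so that the smaller real set is reused as often as needed. The sketch below is an assumption about how this could be done, not our training code; the function name and the string identifiers are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def balanced_batches(
    real: Sequence[str],
    synthetic: Sequence[str],
    batch_size: int,
    num_batches: int,
    rng: random.Random,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches containing equal numbers of real and synthetic images.

    Each item is a (source, image_id) pair. Sampling is done with
    replacement, so a small real set is simply oversampled.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it equally")
    half = batch_size // 2
    for _ in range(num_batches):
        batch = [("real", rng.choice(real)) for _ in range(half)]
        batch += [("synthetic", rng.choice(synthetic)) for _ in range(half)]
        rng.shuffle(batch)  # avoid a fixed real/synthetic ordering inside the batch
        yield batch


# Example: a handful of real chips combined with a much larger synthetic set.
real_ids = ["real_%03d" % i for i in range(5)]
synthetic_ids = ["synth_%03d" % i for i in range(950)]
batches = list(balanced_batches(real_ids, synthetic_ids, 8, 10, random.Random(0)))
```

With this scheme the ratio inside every batch stays 50/50 regardless of how many real images are available.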

This approach proves to be very effective, as seen in green: more than 14% improvement in AP50 compared with training on synthetic data alone, and more than 20% improvement compared with training on real data alone.

In addition, even a small addition of real images improves the results by a large margin, showing that a small number of real images combined with a large amount of synthetic data results in a high-quality algorithm.
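For readers less familiar with the metric: AP50 is the average precision when a detection counts as correct only if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below is a simplified single-class version with greedy matching to the best unmatched ground-truth box; it is meant to illustrate the idea, not to reproduce the exact evaluation code behind the numbers above.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap50(detections: Sequence[Tuple[float, Box]], ground_truth: Sequence[Box],
         iou_thr: float = 0.5) -> float:
    """Average precision at IoU >= iou_thr; detections are (score, box) pairs."""
    if not ground_truth:
        return 0.0
    matched = set()
    hits: List[int] = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        hits.append(1 if best_j >= 0 else 0)
        if best_j >= 0:
            matched.add(best_j)
    # Precision and recall after each detection, in descending score order.
    precisions, recalls, tp = [], [], 0
    for i, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / i)
        recalls.append(tp / len(ground_truth))
    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
score = ap50(preds, truth)
```

In the example, the false positive ranked between the two correct detections lowers the precision for the second half of the recall range, so the score falls below 1.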

To Sum Up

In the above case study of object detection in satellite imagery, we’ve demonstrated that when the sample size is small, synthetic data performs much better than real data, resulting in a high-quality algorithm – achieved in a shorter time, with less effort and at a lower cost.

Additionally, we’ve demonstrated that combining real and synthetic data for model training achieves the best results. The perfect ‘mix’ consists of a small number of real images with a large amount of synthetic data.


[1] Padwick, Chris, et al. “WorldView-2 pan-sharpening.” Proceedings of the ASPRS 2010 Annual Conference, San Diego, CA, USA. Vol. 2630. 2010.

[2] Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in Neural Information Processing Systems. 2015.

[3] Lin, Tsung-Yi, et al. “Feature pyramid networks for object detection.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.