There are multiple scenarios in which you may want to use computer vision to detect small objects or symbols within a given image. Whether it’s detecting company logos on grocery store shelves to manage inventory, detecting informative symbols on documents, or evaluating survey or quiz documents that contain checkmarks or shaded circles, the size ratio of objects of interest to the overall image size is often very small. This presents a common challenge for machine learning (ML) models, because the downsampling a model performs on an image can cause it to miss smaller objects while it learns the features of larger objects in the image.
In this post, we demonstrate a preprocessing method to increase the size ratio of small objects with respect to an image and optimize the object detection capabilities of Amazon Rekognition Custom Labels, effectively solving the small object detection challenge.
Amazon Rekognition is a fully managed service that provides computer vision (CV) capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise. Amazon Rekognition Custom Labels, an automated ML feature of Amazon Rekognition, lets you quickly train custom CV models specific to your business needs simply by bringing labeled images.
The preprocessing method that we implement is an image tiling technique. We discuss how this works in detail in a later section. To demonstrate the performance boost our preprocessing method presents, we train two Amazon Rekognition Custom Labels models—one baseline model and one tiling model—evaluate them on the same test dataset, and compare model results.
For this post, we created a sample dataset to replicate many use cases containing small objects that have significant meaning or value within the document. In the following sample document, we’re interested in four object classes: empty triangles, solid triangles, up arrows, and down arrows. In the creation of this dataset, we converted each page of the PDF document into an image. The four object classes listed were labeled using Amazon SageMaker Ground Truth.
After you have a labeled dataset, you’re ready to train your Amazon Rekognition Custom Labels model. First we train our baseline model that doesn’t include the preprocessing technique. The first step is to set up our Amazon Rekognition Custom Labels project.
On the Amazon Rekognition console, choose Use Custom Labels.
Under Datasets, choose Create new dataset.
For Dataset name, enter cl-blog-original-train.
We repeat this process to create our Amazon SageMaker Ground Truth labeled test set called cl-blog-original-test.
Under Projects, choose Create new project.
For Project name, enter cl-blog-original.
On the cl-blog-original project details page, choose Train new model.
Reference the train and test sets you created; Amazon Rekognition uses them to train and evaluate the model.
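If you prefer to script this step instead of using the console, training can also be started through the Rekognition CreateProjectVersion API. The following sketch only assembles the request; the project ARN, bucket, and manifest keys shown in the commented usage are placeholders you would replace with your own values.

```python
def build_training_request(project_arn, version_name, bucket,
                           train_manifest_key, test_manifest_key):
    """Assemble kwargs for Rekognition's CreateProjectVersion API,
    referencing the Ground Truth train and test manifests in S3."""
    def asset(key):
        # Each dataset is referenced as a Ground Truth manifest in S3
        return {"GroundTruthManifest": {"S3Object": {"Bucket": bucket, "Name": key}}}

    return {
        "ProjectArn": project_arn,
        "VersionName": version_name,
        "OutputConfig": {"S3Bucket": bucket, "S3KeyPrefix": "training-output/"},
        "TrainingData": {"Assets": [asset(train_manifest_key)]},
        "TestingData": {"Assets": [asset(test_manifest_key)]},
    }

# Usage (requires boto3 and AWS credentials; values below are placeholders):
# import boto3
# rekognition = boto3.client("rekognition")
# rekognition.create_project_version(**build_training_request(
#     "arn:aws:rekognition:us-east-1:123456789012:project/cl-blog-original/1",
#     "v1", "my-bucket", "train/output.manifest", "test/output.manifest"))
```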
The training time varies based on the size and complexity of the dataset. For reference, our model took about 1.5 hours to train.
After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. The main metric we’re interested in for this example is the F1 score, but precision and recall are also provided. The F1 score measures a model’s overall accuracy, calculated as the harmonic mean of precision and recall. Precision measures the fraction of the model’s predictions that are actually objects of interest, and recall measures the fraction of the objects of interest that the model successfully detected.
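As a quick illustration of how the F1 score combines the two metrics (using hypothetical values, not the exact numbers from our model dashboard):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A high precision is dragged down by poor recall:
print(round(f1_score(0.90, 0.27), 2))  # → 0.42
```

The harmonic mean punishes imbalance, which is why strong per-label precision can still yield a low F1 when recall is weak.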
Our baseline model yielded an F1 score of about 0.40. The following results show that the model performs fairly well with regard to precision for the different labels, but fails to recall the labels across the board. The recall scores for down_arrow, empty_triangle, solid_triangle, and up_arrow are 0.27, 0.33, 0.33, and 0.17, respectively. Because precision and recall both contribute to the F1 score, the high precision scores for each label are brought down by the poor recall scores.
To improve our model performance, we explore a tiling approach in the next section.
Tiling approach model
Now that we’ve trained our baseline model and analyzed its performance, let’s see if tiling our original images into smaller images results in a performance increase. The idea here is that tiling our image into smaller images allows for the small symbol to appear with a greater size ratio when compared to the same small symbol in the original image. The first step before creating a labeled dataset in Ground Truth is to tile the images in our original dataset into smaller images. We then generate our labeled dataset with Ground Truth using this new tiled dataset, and train a new Amazon Rekognition Custom Labels tiled model.
Open up your preferred IDE and create two directories:
original_images for storing the original document images
tiled_images for storing the resulting tiled images
Upload your original images into the original_images directory.
For this post, we use Python and import a package called image_slicer. This package includes functions to slice our original images into a specified number of tiles and save the resulting tiled images into a specified folder.
In the following code, we iterate over each image in our original image folder, slice the image into quarters, and save the resulting tiled images into our tiled images folder.
You can experiment with the number of tiles, because smaller symbols in more complex images may need a higher number of tiles, but four worked well for our set of images.
The following image shows an example output from the tiling script.
After our image dataset is tiled, we can repeat the same steps from training our baseline model.
Create a labeled Ground Truth cl-blog-tiled-train and cl-blog-tiled-test set.
We make sure to use the same images in our baseline-model-test set as our tiled-model-test set to ensure we’re maintaining an unbiased evaluation when comparing both models.
On the Amazon Rekognition console, on the cl-blog-tiled project details page, choose Train new model.
Reference the tiled train and test sets you just created; Amazon Rekognition uses them to train and evaluate the model.
After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. Our model predicts with an F1 score of about 0.68. The following results show that the recall improved for all labels at the expense of the precision of down_arrow and up_arrow. However, the improved recall scores were enough to boost our F1 score.
When we compare the results of the baseline model and the tiled model, we can observe that the tiling approach gave the new model a significant boost in F1 score, from 0.40 to 0.68 (an absolute increase of 0.28). Although the precision for the down_arrow label dropped from 0.89 to 0.28 and the precision for the up_arrow label dropped from 0.71 to 0.68, the recall for all the labels significantly increased, which drove the overall improvement in accuracy on small objects.
In this post, we demonstrated how to use tiling as a preprocessing technique to maximize the performance of Amazon Rekognition Custom Labels when detecting small objects. This method is a quick addition to improve the predictive accuracy of an Amazon Rekognition Custom Labels model by increasing the size ratio of small objects relative to the overall image.
For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?