Designing a scalable face detection system in 2020

March 22, 2021


Face detection is one of the core problems in computer vision, and it emerged long before neural networks became widely used. You might think it was solved a long time ago: GitHub is full of toy repositories and even almost-production-ready frameworks on the subject.

But reality turned out to be much more complex when we started working on the mask detection problem ourselves at the beginning of March 2020, prompted by COVID-19.

Why mask detection is important

There arose an urgent need for social responsibility around wearing face masks. People not wearing masks in public places – be it malls, convenience stores, or even streets – are putting others at risk on a daily basis.

This is why a system that detects such cases would be of great help during quarantine. Such a system can also be integrated with various messengers and configured to send mask alert notifications.

For example, a store owner could avoid penalties by monitoring employees and raising public safety awareness among them. With a mask detection system in the store, the owner receives a notification whenever a staff member is detected not wearing a mask.

The general solution

Mask detection is a downstream task of face detection. An intuitive way to solve it is to use an occlusion-aware face detection system, and then apply a classifier to its results that can tell whether a person is wearing a mask or not. 
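To make the two-stage idea concrete, here is a minimal Python sketch. The detector and the classifier are hypothetical stand-ins (plain functions over a toy grayscale image), not the models described later in this article; only the control flow, detect first and then classify every crop, mirrors the approach.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class MaskPrediction:
    box: Box
    has_mask: bool


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by the box out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def detect_masks(
    image: Image,
    detect_faces: Callable[[Image], List[Box]],
    classify_mask: Callable[[Image], bool],
) -> List[MaskPrediction]:
    """Stage 1: find faces. Stage 2: classify each face crop."""
    return [
        MaskPrediction(box, classify_mask(crop(image, box)))
        for box in detect_faces(image)
    ]


# Hypothetical stand-ins for real models, only here to make the sketch runnable.
def fake_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def fake_classifier(face: Image) -> bool:
    # Pretend that bright crops are masked faces.
    pixels = [p for row in face for p in row]
    return sum(pixels) / len(pixels) > 127


image = [
    [200, 200, 0, 0],
    [200, 200, 0, 0],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
predictions = detect_masks(image, fake_detector, fake_classifier)
```

Keeping the two stages behind separate callables means either model can be retrained or swapped without touching the other.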

In this short article, we recap the pitfalls we hit and the places where we failed while developing a mask detection system, along with the small technical decisions that finally led us to success.

We will also benchmark existing open-source and commercial face and face-mask detection systems on accuracy and speed.

The problems and the state of other solutions

We want to start by describing the problems we faced. As you may guess, the main problems of many machine learning tasks also apply to mask detection. Here are the most crucial ones we encountered.

Lack of diverse detection datasets

First of all, there are no large datasets with diverse collections of faces from real-world security cameras. 

Of course, WIDER FACE is a good starting point with its roughly 400,000 faces. However, it turns out that adding 20,000 new faces from custom security camera sets increases the average precision of face detection from 0.5 to 0.8 on a diverse set of real-world security cameras (not only on those included in the training set). Specifically, we included data with streaming artifacts, rotation, and other commonly found variations and distortions. This helped our model generalize better to real-life data.

Custom datasets often provide data that’s better suited to your specific needs than open-source ones, which generally contributes to a higher quality of the final solution.
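For readers who want to see how an average precision figure like the ones above can be computed, here is a minimal sketch. The article does not state which AP variant was used, so this assumes one common choice: the area under the precision-recall curve with all-point interpolation. It also assumes every detection has already been matched against ground truth and marked as a true or false positive.

```python
from typing import List, Tuple


def average_precision(
    scored_matches: List[Tuple[float, bool]], num_ground_truth: int
) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    scored_matches: one (confidence, is_true_positive) pair per detection.
    num_ground_truth: total number of annotated faces.
    """
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)

    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Make precision non-increasing as recall grows (interpolation step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the interpolated curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy example: three detections, two annotated faces.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

A missed face never shows up as a detection, so it lowers the score only through `num_ground_truth`, which caps the reachable recall.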

Lack of diverse classification datasets

There is also a huge demand for classification datasets for faces with and without masks. 

It seems that people didn’t need such sets before COVID-19 started. Some datasets were created after the pandemic began, but their quality is mostly poor. 

One example is MOXA3k, a dataset originally created for detection. Crops from a detection dataset can often be used as a classification dataset, but MOXA3k tends to provide studio-quality street images and indoor photos, so this didn’t work for us.

Also, many datasets (almost every one found on Kaggle) consist mostly of polished, high-quality images of doctors with masked faces, some of which are obviously staged. This is far from real-world data. The common approach of using synthetic data didn’t help us achieve good results on crops of security camera photos either, even after we tried some domain adaptation techniques.

The task of classifying whether a person is or is not wearing a mask can get even trickier in some edge cases. Here are some examples:

A person can be technically wearing a mask, but in a way that’s not really conducive to safety in public spaces.
People can also make some very creative decisions at times.

Subpar open-source solutions

The next issue is the generally low quality of available open-source solutions.

Specifically, they don’t detect all the people in the frame, and their predictions are not accurate enough.

We wanted to create a solution that makes more accurate predictions and tends to find and classify more faces (and we’re happy to say we’ve succeeded). 

Example 1

Left to right: Face-Mask-Detection by chandrikadeb7, face-mask-detection by Cindyalifia, our solution.

Example 2

Left to right: Face-Mask-Detection by chandrikadeb7, face-mask-detection by Cindyalifia, our solution.

Low quality of real images

Finally, all the issues mentioned above are exacerbated by the often very low quality of images coming from security cameras.

Here’s one extreme case of barely comprehensible real-life data:

Yes, it is actually a face! 

Our solution

Our eventual goal was to develop a system that scales with both the number of cameras and the number of GPU workers, so making frame-by-frame predictions on live streams was not the best option. Therefore, our pipeline is based on analyzing only a few shots from each camera every X seconds.
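The sampling idea can be sketched in a few lines of Python. This is an illustration only: the camera names, the timestamps, and the 10-second interval are hypothetical, and a real grabber would read from RTSP streams rather than from a dictionary.

```python
from typing import Dict, List


def cameras_due(
    last_grab: Dict[str, float], now: float, interval_s: float
) -> List[str]:
    """Return the cameras whose last snapshot is at least interval_s old."""
    return sorted(
        camera
        for camera, grabbed_at in last_grab.items()
        if now - grabbed_at >= interval_s
    )


# Hypothetical state: seconds (on some shared clock) of the last snapshot.
last_grab = {"cam-a": 0.0, "cam-b": 7.0, "cam-c": 12.0}
due = cameras_due(last_grab, now=15.0, interval_s=10.0)
```

Because the work per camera is bounded by the interval rather than by the stream frame rate, adding a camera adds a fixed, small number of snapshots per interval.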

The model

A two-stage approach is used for detecting people with and without masks. 

We chose the version of the YOLOv4 architecture implemented in the Darknet framework due to its high performance, both in terms of quality and inference time. It was trained on the WIDER FACE dataset and then fine-tuned on a custom dataset from a security camera. Faces cropped by the detector are used as input for a classification network. At this stage, we used EfficientNet-B0 after training it on a mix of different datasets from the web.

While designing the pipeline, we were not aiming for a high benchmark score, but rather for good reproducibility on production data and low inference time with high-resolution input images. Our pipeline achieves 11 FPS on 1024×1024 frames without batch processing and 26 FPS with it.

We have also conducted some experiments using domain adaptation (Ganin & Lempitsky, Unsupervised Domain Adaptation by Backpropagation, 2015) with synthetic data as the source domain and security camera shots as the target domain. This approach gave us results similar to those of a simple model; the only significant difference was a longer training time.
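The hand-off between the two networks comes down to turning detector boxes into crops. The sketch below assumes the detector reports boxes in the normalized center format used by Darknet label files (center x, center y, width, height, all relative to the image size). The optional margin is our own illustrative parameter, not something the pipeline is known to use.

```python
from typing import Tuple


def yolo_to_pixel_box(
    cx: float,
    cy: float,
    w: float,
    h: float,
    img_w: int,
    img_h: int,
    margin: float = 0.0,
) -> Tuple[int, int, int, int]:
    """Convert a normalized center-format box to clamped pixel corners.

    margin enlarges the box by that fraction (hypothetical parameter).
    """
    half_w = w * img_w * (1.0 + margin) / 2.0
    half_h = h * img_h * (1.0 + margin) / 2.0
    x1 = max(0, int(round(cx * img_w - half_w)))
    y1 = max(0, int(round(cy * img_h - half_h)))
    x2 = min(img_w, int(round(cx * img_w + half_w)))
    y2 = min(img_h, int(round(cy * img_h + half_h)))
    return x1, y1, x2, y2


center_box = yolo_to_pixel_box(0.5, 0.5, 0.25, 0.25, 1024, 1024)
corner_box = yolo_to_pixel_box(0.05, 0.05, 0.2, 0.2, 1000, 1000)  # clamped at 0
```

Clamping matters for faces at the edge of the frame, where the raw box would otherwise reach outside the image and produce an invalid crop.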
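These throughput numbers translate directly into camera capacity. As a rough back-of-the-envelope sketch, assume (hypothetically) that every camera contributes 3 shots every 10 seconds; neither value comes from our production setup.

```python
import math


def cameras_per_worker(
    fps: float, interval_s: float, shots_per_interval: int
) -> int:
    """How many cameras one worker can serve at a given processing rate."""
    frames_per_interval = fps * interval_s
    return math.floor(frames_per_interval / shots_per_interval)


# Hypothetical sampling policy: 3 shots per camera every 10 seconds.
unbatched = cameras_per_worker(11, interval_s=10, shots_per_interval=3)  # 36
batched = cameras_per_worker(26, interval_s=10, shots_per_interval=3)    # 86
```

Under these assumptions, batching more than doubles the number of cameras a single GPU worker can cover.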


Our solution is hosted on the platform. This is an MLOps tool that allows for convenient and scalable model development and deployment in all conventional cloud services, e.g., AWS, GCP, Azure.

The platform consists of two parts:

  • Core is a resource orchestrator. It can be installed in a cloud or on-premise and combines computation capabilities, storage, and environments (Docker images) in one system with single sign-on authentication (SSO) and an advanced permission management system.
  • Toolbox is a toolset integrator. It contains integrations with various open-source and commercial tools required for modern ML/AI development.

This is how we’ve set up the whole process:

  1. UI registers the RTSP camera streams.
  2. Grabber workers take snapshots from these streams and save them to cloud storage.
  3. Processor workers facilitate interaction between the storage, the model’s API, and the analyzer. They get predictions for all grabbed photos and pass them on to the analyzer.
  4. The rest of the system triggers required events based on the model’s predictions (for example, notifies if there is a certain number of people without masks) and collects the corresponding statistics.
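The four steps above can be sketched as a tiny in-memory simulation. Everything here is a stand-in: a `deque` plays the role of the task queue, a dictionary plays the role of cloud storage, and `fake_predict` replaces the model’s API. The names and the alert threshold are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def grab(cameras: List[str], storage: Dict[str, bytes], tasks: Deque[str]) -> None:
    """Grabber: store one snapshot per camera and enqueue its storage key."""
    for camera in cameras:
        key = f"{camera}/snapshot.jpg"
        storage[key] = camera.encode()  # placeholder for real JPEG bytes
        tasks.append(key)


def process(
    storage: Dict[str, bytes],
    tasks: Deque[str],
    predict: Callable[[bytes], int],
) -> List[dict]:
    """Processor: fetch each queued snapshot and ask the model about it."""
    results = []
    while tasks:
        key = tasks.popleft()
        results.append({"key": key, "unmasked": predict(storage[key])})
    return results


def analyze(results: List[dict], threshold: int) -> List[str]:
    """Analyzer: report snapshots with at least `threshold` unmasked people."""
    return [r["key"] for r in results if r["unmasked"] >= threshold]


def fake_predict(snapshot: bytes) -> int:
    # Hypothetical model API: number of people without masks in the snapshot.
    return 2 if snapshot == b"cam-entrance" else 0


storage: Dict[str, bytes] = {}
tasks: Deque[str] = deque()
grab(["cam-entrance", "cam-backroom"], storage, tasks)
alerts = analyze(process(storage, tasks, fake_predict), threshold=1)
```

Since grabbers and processors only share the queue and the storage, each group can be scaled independently by starting more workers.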

We use RabbitMQ for load balancing. Here’s a diagram that explains this in more detail:

The platform provides highly tweakable presets that let you run jobs on even a fraction of a cloud CPU. In our case, this dramatically reduces the uptime cost of the services. In particular, we run all Grabber and Processor workers, which collect the data and deliver it to the detection API, on granular presets that use 20% of a single CPU unit.

Each worker is started with a single command, and the required number of workers can be launched with a simple script like this:

for i in $(seq 1 "$NUM_WORKERS"); do
  # --preset: desired preset, e.g., with 0.2 CPU
  # --name: unique worker name
  # image:mask-detector-worker-image: Docker image of the worker
  # ./app/: worker's entrypoint
  neuro run \
    --preset cpu-nano \
    --name "worker-$i" \
    image:mask-detector-worker-image \
    ./app/
done


This article wouldn’t be complete without a comparison of different approaches.

Classification model


Detection model


We have achieved 93% classification accuracy on data from street security cameras. Our face detection model achieves 88% AP, while the original SotA RetinaFace reaches only 65% on the same data. Overall, the full pipeline reaches 73% mAP.

Here’s a good example of what our solution is capable of:


Getting into the issue of mask detection was quite a sobering and valuable experience. 

We were convinced that face detection in general is an extensively explored field, so finding a solution for a seemingly simple subset of face detection tasks would not require digging too deep. However, we could not find a satisfactory solution even for reliable face detection alone, and we realized that more work had to be done on our side than initially expected.

By exploring the field more deeply, we were able to build a strong face detection and classification pipeline that achieves very decent results and competes with SotA performance. Having the platform at our disposal also helped make the development and deployment processes as quick and convenient as possible.