key: cord-0539208-gi6bzymv
authors: Yang, Dongfang; Yurtsever, Ekim; Renganathan, Vishnu; Redmill, Keith A.; Ozguner, Umit
title: A Vision-based Social Distancing and Critical Density Detection System for COVID-19
date: 2020-07-07
journal: nan
DOI: nan
sha: 4bb879922bc80dfe682cacf6ce3efa49a0982d2b
doc_id: 539208
cord_uid: gi6bzymv

Social distancing has been proven as an effective measure against the spread of the infectious COronaVIrus Disease 2019 (COVID-19). However, individuals are not used to tracking the required 6-feet (2-meters) distance between themselves and their surroundings. An active surveillance system capable of detecting distances between individuals and warning them can slow down the spread of the deadly disease. Furthermore, measuring social density in a region of interest (ROI) and modulating inflow can decrease social distancing violation occurrence chance. On the other hand, recording data and labeling individuals who do not follow the measures will breach individuals' rights in free-societies. Here we propose an Artificial Intelligence (AI) based real-time social distancing detection and warning system considering four important ethical factors: (1) the system should never record/cache data, (2) the warnings should not target the individuals, (3) no human supervisor should be in the detection/warning loop, and (4) the code should be open-source and accessible to the public. Against this backdrop, we propose using a monocular camera and deep learning-based real-time object detectors to measure social distancing. If a violation is detected, a non-intrusive audio-visual warning signal is emitted without targeting the individual who breached the social distancing measure. Also, if the social density is over a critical value, the system sends a control signal to modulate inflow into the ROI. We tested the proposed method across real-world datasets to measure its generality and performance. The proposed method is ready for deployment, and our code is open-sourced.

Social distancing is an effective measure [1] against the novel COronaVIrus Disease 2019 (COVID- 19) pandemic. However, the general public is not used to keep an imaginary safety bubble around themselves. An automatic warning system [2, 3, 4, 5] can help and augment the perceptive capabilities of individuals.

Deploying such an active surveillance system requires serious ethical considerations and smart system design. The first challenge is privacy [6, 7, 8] . If data is recorded and stored, the privacy of individuals may be violated intentionally or unintentionally. As such, the system must be realtime without any data storing capabilities.

Second, the detector must not discriminate. The safest way to achieve this is by building an AI-based detection system. Removing the human out of the detection loop may not be enough -the detector must also be design free. Domainspecific systems with hand-crafted feature extractors may lead to malign designs. A connectionist machine learning system, such as a deep neural network without any featurebased input space, is much fairer in this sense, with one caveat: the distribution of the training data must be fair.

Another critical aspect is being non-intrusive. Individuals should not be targeted directly by the warning system. A non-alarming audio-visual cue can be sent to the vicinity of the social distancing breach to this end.

The system must be open-sourced. This is crucial for establishing trust between the active surveillance system and society.

Against this backdrop, we propose a non-intrusive augmentative AI-based active surveillance system for sending omnidirectional visual/audio cues when a social distancing breach is detected. The proposed system uses a pre-trained deep convolutional neural network (CNN) [9, 10] to detect individuals with bounding boxes in a given monocular camera frame. Then, detections in the image domain are transformed into real-world bird's-eye view coordinates. If Figure 1 . Overview of the proposed system. Our system is real-time and does not record data. An audio-visual cue is emitted each time an individual breach of social distancing is detected. We also make a novel contribution by defining a critical social density value ρc for measuring overcrowding. Entrance into the region-of-interest can be modulated with this value. a distance smaller than the threshold is detected, the system emits a non-alarming audio-visual cue. Simultaneously, the system measures social density. If the social density is over a critical threshold, the system sends an advisory inflow modulation signal to prevent overcrowding. v Our main contributions are:

• A novel vision-based real-time social distancing and critical social density detection system

• Definition of critical social density and a statistical approach to measuring it

• Measurements of social distancing and critical density statistics of common crowded places such as the New York Central Station, an indoor mall, and a busy town center in Oxford.

Social distancing for COVID-19. COVID-19 has caused severe acute respiratory syndromes around the world since December 2019 [11] . Recent work showed that social distancing is an effective measure to slow down the spread of COVID-19 [1] . Social distancing is defined as keeping a minimum of 2 meters (6 feet) apart from each individual to avoid possible contact. Further analysis [12] also suggests that social distancing has substantial economic benefits. COVID-19 may not be completely eliminated in the short term, but an automated system that can help monitoring and analyzing social distancing measures can greatly benefit our society.

Pedestrian detection. Pedestrian detection can be regarded as either a part of a general object detection problem or as a specific task of detecting pedestrians only. A detailed survey of 2D object detectors, as well as datasets, metrics, and fundamentals, can be found in [13] . Another survey [14] focuses on deep learning approaches for both generic object detection and pedestrian detection. State-ofthe-art object detectors use deep learning approaches, which are usually divided into two categories. The first one is called two-stage detectors, mostly based on R-CNN [15, 16, 9, 17] , which starts with region proposals and then performs the classification and bounding box regression. The second one is called one-stage detectors, of which the famous models are YOLOv1-v4 [18, 19, 20, 10] , SSD [21] , RetinaNet [22] , and EfficientDet [23] . In addition to these anchor-based approaches, there are also some anchor-free detectors: CornerNet [24] , CenterNet [25] , FCOS [26] , and RepPoints [27] . These models were usually evaluated on datasets of Pascal VOC [28, 29] and MS COCO [30] . The accuracy and real-time performance of these approaches are good enough for deploying pre-trained models for social distancing detection.

Social distancing monitoring. Emerging technologies can assist in the practice of social distancing. A recent work [2] has identified how emerging technologies like wireless, networking, and artificial intelligence (AI) can enable or even enforce social distancing. The work discussed possible basic concepts, measurements, models, and practical scenarios for social distancing. Another work [3] has classified various emerging techniques as either humancentric or smart-space categories, along with the SWOT analysis of the discussed techniques. A specific social distancing monitoring approach [4] that utilizes YOLOv3 and Deepsort was proposed to detect and track pedestrians followed by calculating a violation index for non-socialdistancing behaviors. The approach is interesting but results do not contain any statistical analysis. Furthermore, there is no implementation or privacy-related discussion other than the violation index. Social distancing monitoring is also defined as a visual social distancing (VSD) problem in [5] . The work introduced a skeleton detection based approach for inter-personal distance measuring. It also discussed the effect of social context on people's social distancing and raised the concern of privacy. The discussions are inspirational but again it does not generate solid results for social distancing monitoring and leaves the question open.

Very recently, several prototypes utilizing machine learning and sensing technologies have been developed to help social distancing monitoring. Landing AI [31] has proposed a social distancing detector using a surveillance camera to highlight people whose physical distance is below the recommended value. A similar system [32] was deployed to monitor worker activity and send real-time voice alerts in a manufacturing plant. In addition to surveillance cameras, LiDAR based [33] and stereo camera based [34] systems were also proposed, which demonstrated that different types of sensors besides surveillance cameras can also help.

The above systems are interesting, but recording data and sending intrusive alerts might be unacceptable by some people. On the contrary, we propose a non-intrusive warning system with softer omnidirectional audio-visual cues. In addition, our system evaluates critical social density and modulates inflow into a region-of-interest.

Object detection with deep learning. Object detection in the image domain is a fundamental computer vision problem. The goal is to detect instances of semantic objects that belong to certain classes, e.g., humans, cars, buildings. Recently, object detection benchmarks have been dominated by deep convolutional neural networks (CNNs) models [15, 16, 9, 17, 18, 19, 20, 10, 21, 22] . For example, top scores on MS COCO [30] , which has over 123K images and 896K objects in the training-validation set and 80K images in the testing set with 80 categories, have almost doubled thanks to the recent breakthrough in deep CNNs.

These models are usually trained by supervised learning, with techniques like data augmentation [35] to increase the variety of data.

Model Generalization. The generalization capability [36, 37] of the state-of-the-art is good enough for deploying pre-trained models to new environments. For 2D object detection, even with different camera models, angles, and illumination conditions, pre-trained models can still achieve good performance.

Therefore, a pre-trained state-of-the-art deep learning based pedestrian detector can be directly utilized for the task of social distancing monitoring.

We propose to use a fixed monocular camera to detect individuals in a region of interest (ROI) and measure the interpersonal distances in real time without data recording. The proposed system sends a non-intrusive audio-visual cue to warn the crowd if any social distancing breach is detected. Furthermore, we define a novel critical social density metric and propose to advise not entering into the ROI if the density is higher than this value. The overview of our approach is given in Figure 1 , and the formal description starts below.

We define a scene at time t as a 6-tuple S = (I, A 0 , d c , c 1 , c 2 , U 0 ), where I ∈ R H×W ×3 is an RGB im-age captured from a fixed monocular camera with height H and width W . A 0 ∈ R is the area of the ROI on the ground plane in real world and d c ∈ R is the required minimum physical distance. c 1 is a binary control signal for sending a non-intrusive audio-visual cue if any inter-pedestrian distance is less than d c . c 2 is another binary control signal for controlling the entrance to the ROI to prevent overcrowding. Overcrowding is detected with our novel definition of critical social density ρ c . ρ c ensures social distancing violation occurrence probability stays lower than U 0 . U 0 is an empirically decided threshold such as 0.05. Problem 1. Given S, we are interested in finding a list of pedestrian pose vectors P = (p 1 , p 2 , · · · , p n ), p ∈ R 2 , in real-world coordinates on the ground plane and a corresponding list of inter-pedestrian distances D = (d 1,2 , · · · , d 1,n , d 2,3 , · · · , d 2,n , · · · , d n−1,n ), d ∈ R + . n is the number of pedestrians in the ROI. Also, we are interested in finding a critical social density value ρ c . ρ c should ensure the probability p(d > d c |ρ < ρ c ) stays over 1 − U 0 , where we define social density as ρ := n/A 0 .

Once Problem 1 is solved, the following control algorithm can be used to warn/advise the population in the ROI.

Algorithm 1. If d ≤ d c , then a non-intrusive audiovisual cue is activated with setting the control signal c 1 = 1, otherwise c 1 = 0. In addition, If ρ > ρ c , then entering the area is not advised with setting c 2 = 1, otherwise c 2 = 0.

Our solution to Problem 1 starts below.

First, pedestrians are detected in the image domain with a deep CNN model trained on a real-world dataset:

(1)

f cnn : I → {T i } n maps an image I into n tuples T i = (l i b i , s i ), ∀i ∈ {1, 2, · · · , n}. n is the number of detected objects. l i ∈ L is the object class label, where L, the set of object labels, is defined in f cnn .

is the associated bounding box (BB) with four corners. b i,j = (x i,j , y i,j ) gives pixel indices in the image domain. The second sub-index j indicates the corners at top-left, topright, bottom-left, and bottom-right respectively. s i is the corresponding detection score. Implementation details of f cnn is given in Section 5.1.

We are only interested in the case of l = 'person'. We define p i , the pixel pose vector of person i, with using the middle point of the bottom edge of the BB:

(2)

The next step is obtaining the second mapping function h : p → p. h is an inverse perspective transformation function that maps p in image coordinates to p ∈ R 2 in real-world coordinates. p is in 2D bird's-eye-view (BEV) coordinates by assuming the ground plane z = 0. We use the following well-known inverse homography transformation [38] for this task:

where M ∈ R 3×3 is a transformation matrix describing the rotation and translation from world coordinates to image coordinates. p im = [p x , p y , 1] is the homogeneous representation of p = [p x , p y ] in image coordinates, and p bev = [p bev x , p bev y , 1] is the homogeneous representation of the mapped pose vector.

The world pose vector p is derived from p bev with p = [p bev x , p bev y ].

After getting P = (p 1 , p 2 , · · · , p n ) in real-world coordinates, obtaining the corresponding list of inter-pedestrian distances D is straightforward. The distance d i,j for pedestrians i and j is obtained by taking the Euclidean distance between their pose vectors:

And the total number of social distancing violations v in a scene can be calculated by:

where

Finally, we want to find a critical social density value ρ c that can ensure the social distancing violation occurrence probability stays below U 0 . It should be noted that a trivial solution of ρ c = 0 will ensure v = 0, but it has no practical use. Instead, we want to find the maximum critical social density ρ c that can still be considered safe.

To find ρ c , we propose to conduct a simple linear regression using social density ρ as the independent variable and the total number of violations v as the dependent variable:

where β = [β 0 , β 1 ] is the regression parameter vector and is the error term which is assumed to be normal. The regression model is fitted with the ordinary least squares method. Fitting this model requires training data. However, once the model is learned, data is not required anymore. After deployment, the surveillance system operates without recording data.

Once the model is fitted, critical social density is identified as:

where ρ pred lb is the lower bound of the 95% prediction interval (ρ pred lb , ρ pred ub ) at v = 0, as illustrated in Figure 2 . Keeping ρ under ρ c will keep the probability of social distancing violation occurrence near zero with the linear regression assumption.

We conducted 3 case studies to evaluate the proposed method. Each case utilizes a different pedestrian crowd dataset. They are Oxford Town Center Dataset (an urban street) [39] , Mall Dataset (an indoor mall) [40] , and Train Station Dataset (New York City Grand Central Terminal) [41] . Table 1 shows detailed information about these datasets.

The first step was finding the perspective transformation matrix M for each dataset. For Oxford Town Center Dataset, we directly used the transformation matrix available on its official website. For Train Station Dataset, we found the floor plan of NYC Grand Central Terminal and measured the exact distances among some key points that were used for calculating the perspective transformation. For Mall Dataset, we first estimated the size of a reference object in the image by comparing it with the width of detected pedestrians and then utilized the key points of the reference object to calculate the perspective transformation.

The second step was applying the pedestrian detector on each dataset. The experiments were conducted on a regular PC with an Intel Core i7-4790 CPU, 32GB RAM, and an Nvidia GeForce GTX 1070Ti GPU running Ubuntu 16.04 LTS 64-bit operating system. Once the pedestrians were detected, their positions were converted from the image coordinates into the real-world coordinates.

The last step was conducting social distancing monitoring and finding the critical density ρ c . Only the pedestrians within the ROI were considered. The statistics of the social density ρ, the inter-pedestrian distances d i,j , and the number of violations v were recorded over time. The analysis of statistics is described in the following section.

We experimented with two different deep CNN based object detectors: Faster R-CNN and YOLOv4. Figure 3 shows the pedestrian detection results using Faster R-CNN [9] and the corresponding social distancing in world coordinates. The detector performances are given in Table 2. As can be seen in the table, both detectors achieved real-time performance. In terms of detection accuracy, we provide the results of MS COCO dataset from the original works [9, 10].

For pedestrian i, the closest physical distance is d min i = min(d i,j ), ∀j = i ∈ {1, 2, · · · n}. Based on d min i , we further calculated two metrics for social distancing monitoring: the minimum closest physical distance d min = 

To find the critical density ρ c , we first investigated the relationship between the number of social distancing violations v and the social density ρ in 2D histograms, as shown in Figure 6 . As can be seen in the figure, v increases with an increase in ρ, which indicates a linear relationship with a positive correlation.

Then, we conducted the simple linear regression, using the regression model of equation (6), on the data points of v versus ρ. The skewness values of ρ for Oxford Town Center Dataset, Mall Dataset, and Train Station Dataset are 0.25, 0.16, and -0.07, respectively, indicating the distributions of ρ are symmetric. This satisfies the normality assumption of the error term in linear regression. The regression result is displayed in figure 4 . The critical density ρ c was identified as the lower bound of the prediction interval at v = 0. Table 3 summaries the identified critical densities ρ c as well as the intercepts β 0 of the regression models. The obtained critical density values for all datasets are similar. They also follow the patterns of the data points as illustrated in Figure 4 . This verified the effectiveness of our method.

This work proposed an AI and monocular camera based real-time system to monitor the social distancing. We use Table 3 . Critical social density of each dataset. The critical density was identified as the lower bound of the prediction interval at the number of social distancing violations v = 0.

the proposed critical social density to avoid overcrowding by modulating inflow to the ROI. The proposed method was verified using 3 different pedestrian crowd datasets. There are some missing detections in the Train Station Dataset, as in some areas the pedestrian density is extremely high and occlusion happens. However, after some qualitative analysis, we concluded that most of the pedestrians were captured and the idea of finding critical social density is still valid.

Pedestrians who belong to a group were not considered as a group in the current work, which can be a future direction. Nevertheless, one may argue that even individuals who have close relationships should still try to practice social distancing in public areas. Figure 7 . The change of minimum closest physical distance dmin, average closest physical distance davg, and the social density ρ over time.

Strong social distancing measures in the united states reduced the covid-19 growth rate: Study evaluates the impact of social distancing measures on the growth rate of confirmed covid-19 cases across the united states

Enabling and emerging technologies for social distancing: A comprehensive survey

Unleashing the power of disruptive and emerging technologies amid covid 2019: A detailed review

Monitoring covid-19 social distancing with person detection and tracking via fine-tuned yolo v3 and deepsort techniques

The visual social distancing problem

Privacy preserving crowd monitoring: Counting people without people models or tracking

Object-video streams for preserving privacy in video surveillance

A track-based human movement analysis and privacy protection system adaptive to environmental contexts

Faster r-cnn: Towards real-time object detection with region proposal networks

Yolov4: Optimal speed and accuracy of object detection

The continuing 2019-ncov epidemic threat of novel coronaviruses to global healththe latest 2019 novel coronavirus outbreak in wuhan, china

Does social distancing matter

Object detection in 20 years: A survey

Object detection with deep learning: A review

Rich feature hierarchies for accurate object detection and semantic segmentation

Fast r-cnn

R-fcn: Object detection via region-based fully convolutional networks

You only look once: Unified, real-time object detection

Yolo9000: better, faster, stronger

Yolov3: An incremental improvement

Ssd: Single shot multibox detector

Focal loss for dense object detection

Efficientdet: Scalable and efficient object detection

Cornernet: Detecting objects as paired keypoints

Centernet: Keypoint triplets for object detection

Fcos: Fully convolutional one-stage object detection

Reppoints: Point set representation for object detection

The pascal visual object classes (voc) challenge

The pascal visual object classes challenge: A retrospective

Microsoft coco: Common objects in context

Landing AI Creates an AI Tool to Help Customers Monitor Social Distancing in the Workplace

Using computer vision to enhance safety of workforce in manufacturing in a post covid world

Social Distance Monitoring

Using 3D Cameras to Monitor Social Distancing

Learning data augmentation strategies for object detection

Understanding deep learning requires rethinking generalization

Understanding why neural networks generalize well through gsnr of parameters

Computer vision: a modern approach

Stable multi-target tracking in realtime surveillance video

Feature mining for localised crowd counting

Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents