key: cord-0954184-cyt2lmwb
authors: Yadav, Sangeeta; Gulia, Preeti; Gill, Nasib Singh; Chatterjee, Jyotir Moy
title: A Real-Time Crowd Monitoring and Management System for Social Distance Classification and Healthcare Using Deep Learning
date: 2022-04-05
journal: J Healthc Eng
DOI: 10.1155/2022/2130172
sha: 57edbcfb1f89f2302500b74af53fd32cb3693fbd
doc_id: 954184
cord_uid: cyt2lmwb

COVID-19, the disease caused by the coronavirus, has spread across the whole world. It is primarily spread by physical contact. As a preventive measure, crowd monitoring and management systems need to be installed in public places to limit sudden outbreaks and improve healthcare. The number of new infections can be significantly reduced by adopting social distancing measures early. Motivated by this notion, a real-time crowd monitoring and management system for social distance classification is proposed in this research paper. In the proposed system, people are separated from the background using the YOLO v4 object detection technique, and the detected people are then tracked by bounding boxes using the Deepsort technique. The system supports COVID-19 prevention by detecting and classifying social distance in public places from surveillance images and videos captured by the cameras installed there. Its performance has been assessed using the mean average precision (mAP) and frames per second (FPS) metrics. It has also been evaluated by deploying it on Jetson Nano, a low-cost embedded system. The observed results show its suitability for real-time deployment in public places for COVID-19 prevention by social distance monitoring and classification.

Coronavirus has shaken the whole world since it was first observed in China. Its symptoms comprise shortness of breath, fever, chills, loss of taste and smell, body aches, and cough. Coronavirus is primarily spread by physical contact. Its first case was discovered in Wuhan, China, in December 2019. It then spread across the whole world and resulted in a pandemic that has caused a large number of infections and many fatalities. To save lives, severe lockdowns were imposed in various countries. The vaccination efforts and preventive measures have contributed a lot to controlling its spread. Industries, workplaces, and travel are returning to their normal state. But due to mutations in this virus, a number of new variants are emerging. The WHO (World Health Organization) has declared this disease a pandemic due to its fatality and severity. The complexities and the eruption of new variants have made the spread and duration of this virus unpredictable. No vaccine with full efficacy has been developed yet [1]. The disease can only be prevented by maintaining social distancing, frequently washing hands, and wearing masks. As it spreads fast by close contact, infected people are quarantined either in their homes or in hospitals to prevent its further spread to others. To mitigate its mass spread, nations had to impose lockdowns, close their borders, stop public gatherings, and close schools, colleges, and workplaces. It has been observed that the adoption of such strict measures has resulted in a reduced number of infections and fatalities [2].

It has also been reported that fever is the primary symptom of this virus; studies in China found that 99 percent of infected people have a high temperature. Noncontact infrared thermometers and thermal cameras are being used to measure a person's surface temperature.

This massive move significantly limits the spread of this virus.

According to the WHO, social distancing is the most significant preventive measure: people keep a certain distance from one another, hence minimizing physical contact with virus carriers. The employment of technological tools for the enforcement of social distancing is of main concern. ICT (Information and Communications Technology) and Artificial Intelligence tools could play a significant role in addressing the challenge of implementing social distancing. Artificial Intelligence tools can help in the early identification of infections and in diagnosing them with AI-equipped tools and medical imaging techniques. To predict and monitor the spread of this disease automatically, a number of intelligent neural-network-based models have been designed. Moreover, AI is also helpful in contact tracing by identifying hotspots and clusters. AI tools can also be utilized to find the most sensitive places and people so that preventive measures can be taken accordingly. In this way, AI is playing an important role in providing more preventive and predictive healthcare.

Social distancing is the primary prevention measure to lessen the mass spread of this virus. Motivated by this notion, a real-time crowd monitoring and management system is proposed to detect social distancing in common places, hence improving healthcare by reducing the number of infections. The system utilizes YOLO v4 person detection and the Deepsort technique along with a social distancing classification algorithm. The paper is organized as follows. Research background and related work are detailed in Section 2. The details of the proposed system are given in Section 3. Section 4 contains the experimental results and their analysis. The whole work is concluded in Section 5.

Temperature screening and social distancing are effective preventive measures in mitigating the mass spread of the coronavirus. These are highly recommended by the World Health Organization and several other medical organizations [3]. The efficacy of social distancing on the transmission of this virus has been studied by Russel et al. [4]. They have presented the trajectory of the outbreak produced by location-specific contact patterns employing SEIR (susceptible-exposed-infected-removed) techniques. It has also been shown that a sudden removal of social distancing could result in a wider spread of the infection. The effect of social distancing on other domains has been studied and presented by Nabil Kahale [5].

The main motive of this research is to present an approximation showing how new infections and economic loss can be significantly reduced by early social distancing. During the coronavirus outbreak, research focused mainly on finding and developing efficient measures to diminish its spread in society [6, 7]. Technological advancements could play a prominent role in preventing its mass spread by detecting infected people at the earliest. A COVID-19 infected person can be tracked via GPS and smartphone-enabled applications [8]. This technology has limited applicability and cannot be used to track people who do not have cell signals or Wi-Fi. Moreover, mass gatherings in open spaces can be tracked using drones equipped with video cameras [9, 10]. Such technological advancements could help curb or prevent the outbreak. The advancements in computer vision and deep learning have made classification and object detection tasks solvable and easier. This research has contributed to different areas of vision, comprising segmentation, neural style transfer, and object tracking in addition to detection [11].

Deep Learning, an important branch of Artificial Intelligence, has evolved into a significant tool for object detection and data processing. Deep learning primarily comprises sophisticated neural network designs. Its concept dates back to the 1940s [12]. Learning problems can be solved efficiently with such neural network designs [13]. Object detection models based on the CNN (Convolutional Neural Network) are widely used.

These networks take an image as input, assign learnable weights and biases to the different classes in the image, and distinguish them from each other accordingly. The advancements in CNNs have made it possible to deploy them on low-complexity embedded systems with low-resolution input. A number of object detection techniques based on different deep learning models like YOLO, Single-Shot Detector (SSD), and R-CNN are available these days. These algorithms are based on efficient motion estimation in the video stream. An efficient technique for person detection in the video stream has been proposed by Ebrahim et al. [14]. The technique comprises a deep learning person detector combined with a Gaussian mixture model and background subtraction. Another method of person detection has been presented in [15]. Here, a combination of machine learning and deep learning methods is utilized to attain more precise results with less computation. But this model slows down when used in real-time detection-based applications. Researchers have also proposed a model to detect static crowds [16]. In this method, an SVM (Support Vector Machine) is used to categorize spots as groupings of people, and texture features are then used to extract these grouping spots. A pedestrian detection system is proposed in [17], which detects the walking person by background subtraction and then performs real-time classification. Similar works pertaining to smart cities have also been carried out in [18, 19]. To track the objects in a scene, a number of deep learning-based tracking techniques like POI (Person Of Interest), SORT (Simple Online and Real-Time Tracking), and EMATT (Expendable Mobile ASW Training Target) have also been proposed. These techniques show variation in their results depending on the performance metrics used.

Keeping in view the limitation of different techniques, a novel real-time crowd monitoring and management system is proposed for imparting improved healthcare by monitoring the social distance among people in the common places utilizing YOLO v4 and Deepsort techniques for person detection and their tracking.

Object detectors detect an object by placing a bounding box around it and then assigning the corresponding label. Deep learning-based object detection has outperformed all earlier traditional schemes and is now widely used for object detection and classification tasks. An efficient deep learning-based detector, R-CNN (Regional Convolutional Neural Network), was proposed by Girshick [20]. This network works in four steps. Firstly, the picture is given as input to the detector, and then region proposals are extracted. The CNN computes the features of the extracted region proposals and classifies them. The schematic diagram is given in Figure 1, depicting the whole process of detection and classification by R-CNN. The region proposals are extracted using selective search algorithms. This process is time-consuming, taking approximately 47 seconds to perform regional classification of each image, and hence is not suitable for real-time applications. Moreover, this network cannot be trained end-to-end; each part has to be trained separately.

To overcome the disadvantages of R-CNN, the same author proposed a faster variant, Fast R-CNN [21]. Its basic working algorithm is the same as that of R-CNN, but some changes are made to speed up detection. In this model, the input image is given to a CNN, which generates a convolutional feature map of it. An ROI (Region of Interest) pooling layer is utilized to reshape the proposed regions into a fixed size. These regions are then fed to an FC (Fully Connected) layer with softmax for predicting the label of the region. Figure 2 presents the design of the R-CNN detector.
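As an illustration of the ROI pooling step described above, the following minimal Python sketch max-pools an arbitrary rectangular region of a feature map into a fixed output grid. The function name `roi_max_pool`, the end-exclusive `(row0, col0, row1, col1)` region format, and the integer binning rule are assumptions made for this example; they are not taken from the Fast R-CNN implementation.

```python
from typing import List, Tuple


def roi_max_pool(feature_map: List[List[float]],
                 roi: Tuple[int, int, int, int],
                 out_h: int, out_w: int) -> List[List[float]]:
    """Max-pool the region roi = (row0, col0, row1, col1), end-exclusive,
    of a 2D feature map into a fixed out_h x out_w grid."""
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one row.
        rs = r0 + (i * roi_h) // out_h
        re = max(rs + 1, r0 + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            # Column range of bin j; every bin covers at least one column.
            cs = c0 + (j * roi_w) // out_w
            ce = max(cs + 1, c0 + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# A 4 x 4 feature map holding the values 0..15 in row-major order.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled_full = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
pooled_part = roi_max_pool(fmap, (1, 1, 4, 4), 2, 2)
```

Whatever the size of the input region, the output always has the same shape, which is what allows regions of different sizes to be passed to the same fully connected layer.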

To extract the region proposals, Fast R-CNN uses selective search algorithms. These algorithms are time-consuming and slow, hence affecting the overall performance of the detection network.

YOLO emerged as one of the efficient object detectors based on deep learning. Recent advancements have resulted in several versions, including YOLO v2, v3, and v4. YOLO was initially proposed by Redmon et al. [23]. This technique applies a single neural network to the whole image. The input image to this network is divided into different regions. Bounding boxes are predicted for each region, and the corresponding probabilities are computed. YOLO scans the full image in a single pass, and hence, the predictions are informed by the image context. Unlike R-CNN, YOLO works with a single network evaluation. It divides the input image into an S × S grid, and the features are then extracted individually from the grid cells. The bounding boxes are predicted along with their corresponding labels. These labels are associated with their predicted confidence scores. This process of object detection and prediction is depicted in Figure 3.

YOLO predicts the confidence score of all bounding boxes detected from the grid cells. Each bounding box prediction is represented by five parameters, namely, w, h, x, and y, along with the confidence score. Here, h and w give the height and width of the bounding box, expressed relative to the whole image. The center of the bounding box is represented by the (x, y) parameters. The confidence score expresses how confident the detector is in the predicted object.
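The following minimal sketch shows how one such five-parameter prediction could be converted into pixel coordinates. It assumes the convention of the original YOLO formulation, in which (x, y) is the box center given as an offset inside its grid cell and w and h are fractions of the full image size; the function name `decode_prediction` and the example numbers are hypothetical.

```python
from typing import Tuple


def decode_prediction(row: int, col: int, grid_size: int,
                      x: float, y: float, w: float, h: float,
                      img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert one grid-cell prediction to pixel corners (xmin, ymin, xmax, ymax)."""
    # Assumed convention: (x, y) is the box center as an offset in [0, 1]
    # inside the grid cell located at (row, col).
    cx = (col + x) / grid_size * img_w
    cy = (row + y) / grid_size * img_h
    # Assumed convention: w and h are fractions of the full image size.
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# Center cell of a 7 x 7 grid on a 700 x 700 image.
box = decode_prediction(row=3, col=3, grid_size=7,
                        x=0.5, y=0.5, w=0.2, h=0.4,
                        img_w=700, img_h=700)
```

In this example the box is centered at pixel (350, 350) and spans 140 × 280 pixels. Later YOLO versions decode boxes relative to anchor boxes, so this sketch applies only under the stated assumption.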

A number of bounding boxes are predicted for every grid cell. During the training phase, YOLO requires only one bounding box predictor to be responsible for each object. The predictor whose box has the highest IoU (Intersection over Union) with the ground truth is trained to predict that object, which leads to more specialized bounding box predictions. The sum-squared error between the predicted and actual values is used for loss estimation, from which the localization, classification, and confidence losses of the network are computed. Equation (1) presents the loss function used by YOLO to optimize its performance during the training phase.

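$$
\begin{aligned}
L ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\tag{1}
$$

where the parameters are as follows: $\lambda_{coord}$ is a constant value used to increase the weight of the first two terms of the loss function, $\lambda_{noobj}$ is used to weigh down the loss when detecting the background, $S^2$ is the number of cells, $B$ is the number of box predictions for each cell, $\mathbb{1}_{ij}^{obj}$ is one if there is an object in cell $i$ and, among all the predictors of this cell, the confidence of the $j$th predictor is the highest ($\mathbb{1}_{ij}^{noobj}$ is its complement), $x_i$ and $y_i$ are the anchor box's centroid, $w_i$ is the anchor box's width, $h_i$ is the anchor box's height, $C_i$ is the confidence score, $\hat{C}_i$ is box $j$'s confidence score in cell $i$, $\mathbb{1}_{i}^{obj}$ is one if an object appears in cell $i$, the last term is the classification loss, and $\hat{p}_i(c)$ is class $c$'s conditional class probability in cell $i$.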
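The IoU criterion used above to choose the responsible predictor can be illustrated with a short Python sketch. The corner format `(xmin, ymin, xmax, ymax)` and the helper names `iou` and `responsible_predictor` are assumptions made for this example.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(predictions: Sequence[Box], ground_truth: Box) -> int:
    """Index of the predicted box with the highest IoU against the ground truth."""
    return max(range(len(predictions)),
               key=lambda j: iou(predictions[j], ground_truth))


gt = (0.0, 0.0, 2.0, 2.0)
preds = [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 1.5)]
best = responsible_predictor(preds, gt)
```

Here the second prediction overlaps the ground truth far more than the first (IoU 0.75 against 1/7), so it would be the one held responsible for the object during training.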

YOLO v2, v3, and v4 are the subsequent versions of YOLO. Several enhancements and improvements were made for real-time processing [24, 25]. The later versions resulted in a faster network along with a significant accuracy improvement. Batch normalization is employed in all CNN layers to address the localization errors. YOLO v4 is the most recent version of YOLO [26]. It comprises three components, namely, the backbone network component, the neck component, and the prediction head component. The backbone component plays an important role in reducing the input dimension and translating the input into more complex features. The schematic diagram is given in Figure 4.

We have used the CSPDarknet53 backbone for our implementation. The neck component takes features from the backbone network and mixes them. It can consist of components like Spatial Pyramid Pooling (SPP) and Path Aggregation Network (PANet). It results in better spatial information preservation and in extracting features at various resolutions. For the single-stage prediction head of YOLO, the bounding boxes to the detected objects are created using

The emergence of deep learning has produced the best-performing algorithms for problems and tasks in diverse domains, including object detection and tracking, medical diagnosis, and much more. A crowd monitoring and management system is proposed to improve healthcare by using deep learning techniques to monitor and detect social distancing in public places. To maintain a balance of precision and speed, the YOLO v4 and Deepsort techniques are used. People are marked by bounding boxes around each detected object. The circle of influence is calculated for each person. Later, all circles of influence are accumulated for a frame, and locations with more overlap result in higher intensity. The result of each surveillance frame is color-coded, depicting the statistical density of people. The details of the whole system regarding social distance detection, network design, Deepsort-based tracking, and the algorithm for distance classification are given in the following subsections.

The process of detecting social distance from the video stream captured by the cameras installed in public places comprises the following steps, and the corresponding flowchart is given in Figure 5:

(1) The video stream is captured from a camera, and it contains people.

Algorithm for social distance detection. The following algorithm is used to detect social distance and finally generate alerts according to the intensity of the breach:

(1) Input image stream is taken from the installed camera.

(2) People are extracted from images using a YOLO v4 object detector. 

The original YOLO v4 comprises an SPP module, a CSPDarknet53 backbone, and an anchor-based detection head along with a PANet path-aggregation neck. CSPDarknet53 has a 725 × 725 receptive field, 29 convolutional layers of size 3 × 3, and 27.6 M parameters.

The computation of such a network is quite expensive. Hence, we have used a lightweight model variant, YOLO v4-tiny, to achieve the desired goals. The input layer of the system feeds images of size 416 × 416 × 3 to the network. It contains an optimized CSP backbone with fewer layers and parameters. The Spatial Pyramid Pooling (SPP) block, or the equivalent Fast Spatial Pyramid Pooling (SPPF) block, is applied on top of the CSPDarknet53 to extract the most notable context features and to enlarge the receptive field without compromising speed. The parameter aggregation from the different backbone and detection levels is achieved using the PANet method. Finally, the pyramid features are processed by the detection head to provide the detections. Figure 6 depicts the design of the YOLO v4-tiny neural network.

Deepsort is an efficient deep learning-based technique used for tracking objects in videos.

The proposed system utilizes Deepsort to track the detected people in the video stream. To predict the trajectories of the objects, the patterns learned from the identified objects in the images are coupled with temporal information. Each object is tracked by assigning it a unique ID, and these IDs mapped to the objects of interest are utilized for later statistical analysis. Several issues pertaining to object tracking, like nonstationary cameras, numerous views, occlusion, and annotating training data, are effectively addressed by Deepsort. The Hungarian algorithm and the Kalman filter are used for accurate tracking. The Kalman filter is applied recursively for enhanced association and prediction of the future locations of the object. Along with association, the Hungarian method is employed for ID attribution and for identifying the same object in the current and past frames. A linear constant-velocity Kalman filter model is used for tracking, and the corresponding target object to be tracked is described with eight dimensions, as presented by equation (2).

$$
x = [u, v, \lambda, h, \dot{u}, \dot{v}, \dot{\lambda}, \dot{h}]^{T} \tag{2}
$$

where x is the target object, (u, v) is the centroid of the bounding box, h is the height of the bounding box, and λ is the aspect ratio. The respective velocities of these variables are represented by the other variables. Later, a Kalman filter is utilized where the u, v, λ, and h parameters are used as the bounding coordinates for the object state. For each track k, the number of frames since the most recent successful measurement association is counted. The counter is incremented when the Kalman filter makes a prediction, and it is reset to 0 when the track is associated with a measurement. If a track is older than a predetermined limit, the object is judged to have left the scene, and the corresponding track is removed from the collection. Table 1 compares the accuracy and speed of Deepsort in object tracking. The comparison uses metrics like multiobject tracking precision (MOTP), multiobject tracking accuracy (MOTA), mostly tracked (MT) as tracks with more than 80% tracking, mostly lost (ML) as tracks with less than 20% tracking, identity switches (ID), and fragmentation (FM). ID is the number of times the ground truth identity changes. FM gives a count of interruptions due to missing detections.
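The constant-velocity prediction and the track-age bookkeeping described above can be sketched in plain Python as follows. This is a deliberately simplified illustration, not the Deepsort implementation: it propagates only the state mean, omits the covariance and Kalman gain, and the measurement update simply overwrites the position. The names `Track`, `predict_constant_velocity`, and `prune`, as well as the `max_age` parameter, are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List


def predict_constant_velocity(state: List[float], dt: float = 1.0) -> List[float]:
    """Advance the 8-dim state (u, v, lam, h, du, dv, dlam, dh) by one step."""
    position, velocity = state[:4], state[4:]
    new_position = [p + dt * v for p, v in zip(position, velocity)]
    return new_position + list(velocity)  # velocities stay constant


@dataclass
class Track:
    track_id: int
    state: List[float]
    frames_since_update: int = 0

    def predict(self) -> None:
        # The counter grows with every prediction step.
        self.state = predict_constant_velocity(self.state)
        self.frames_since_update += 1

    def update(self, measurement: List[float]) -> None:
        # Simplification: overwrite (u, v, lam, h) with the measurement.
        # A full Kalman filter would blend prediction and measurement.
        self.state = list(measurement) + self.state[4:]
        self.frames_since_update = 0  # reset on successful association


def prune(tracks: List[Track], max_age: int) -> List[Track]:
    """Drop tracks that have gone unmatched for more than max_age frames."""
    return [t for t in tracks if t.frames_since_update <= max_age]


track = Track(1, [10.0, 20.0, 0.5, 100.0, 2.0, -1.0, 0.0, 0.0])
track.predict()
predicted_position = track.state[:4]
track.update([12.5, 19.0, 0.5, 101.0])
```

With a unit time step, the predicted center moves from (10, 20) to (12, 19) while the aspect ratio and height are unchanged, and the age counter returns to 0 once the measurement is associated.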

The proposed system classifies and decides the safe distancing among the detected people and then represents this information visually. The top view of the scene is required to compute the distance between people. In the standard approach for distance classification, the distance between people is measured as the distance between the bounding boxes of the detected people.

The proposed system employs a novel approach for monitoring social distancing in a crowded scene. In this approach, unsafe social distancing is represented by a red bounding box, and the rest of the colors denote safe distancing. A screenshot is given in Figure 7. Firstly, the total number of people in an image is identified. If the number of people's bounding boxes exceeds two, the distance between them is calculated as the gap between the centers of their corresponding bounding boxes. Equation (3) presents CP(x, y), the center point computation of the bounding box.

$$
CP(x, y) = \left( \frac{X_{min} + X_{max}}{2}, \frac{Y_{min} + Y_{max}}{2} \right) \tag{3}
$$

where the parameters denote the different dimensions of the bounding box as follows: CP is the center point, Xmin is the minimum value of the bounding box along its width, Ymin is the minimum value of the bounding box along its height, Xmax is the maximum value of the bounding box along its width, and Ymax is the maximum value of the bounding box along its height.
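In code, the center point of Equation (3) is simply the midpoint of the box corners. The function name `center_point` is an assumption for this small sketch.

```python
from typing import Tuple


def center_point(xmin: float, ymin: float,
                 xmax: float, ymax: float) -> Tuple[float, float]:
    """Center of a bounding box given by its corner coordinates."""
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)


cp = center_point(10, 20, 30, 60)
```

For a box spanning x from 10 to 30 and y from 20 to 60, the center is (20, 40).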

The Euclidean formula is used to compute the distance between the centers of bounding boxes. Here, the distance in pixels is transformed into a metric distance, which is then compared to the threshold value. If the computed value is less than the threshold value, the boxes appear in red. Otherwise, they appear in different colors. The distance between people can be computed effectively from the top view of the scene, which is obtained by a homographic transformation of the scene. To map to the real-world distance between people, the computed distance is scaled by a scaling factor S, which is determined by the number of pixels of the image corresponding to one meter in the real world.
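A minimal sketch of this distance classification is given below, assuming that a 3 × 3 homography matrix `H` mapping camera pixels to the top view is already known (in practice it could come from a camera calibration step, for example with OpenCV's `cv2.getPerspectiveTransform`). The function names, the identity homography, and the numeric values are hypothetical and only serve the illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float]


def apply_homography(H: Sequence[Sequence[float]], point: Point) -> Point:
    """Map an image point to the top view using a 3 x 3 homography."""
    x, y = point
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xs / w, ys / w)  # divide out the homogeneous coordinate


def find_violations(centers: List[Point], H: Sequence[Sequence[float]],
                    pixels_per_meter: float, min_distance_m: float) -> Set[int]:
    """Indices of people closer than min_distance_m to at least one other person."""
    top_view = [apply_homography(H, c) for c in centers]
    violating: Set[int] = set()
    for i, j in combinations(range(len(top_view)), 2):
        # Pixel distance in the top view, converted to meters with the factor S.
        distance_m = math.dist(top_view[i], top_view[j]) / pixels_per_meter
        if distance_m < min_distance_m:
            violating.update((i, j))
    return violating


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centers = [(0.0, 0.0), (50.0, 0.0), (400.0, 0.0)]
red_boxes = find_violations(centers, identity,
                            pixels_per_meter=100.0, min_distance_m=1.0)
```

With 100 pixels per meter and a 1 m threshold, the first two people are 0.5 m apart and would be drawn in red, while the third, more than 3 m away, would keep a different color.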

The system focuses on identifying bad actors who are breaching the social distance and taking action on them. Real-time monitoring of and acting on people on a large scale poses different types of social, operational, privacy, and morale-related challenges. Also, given the size and changing nature of bounding boxes in a video frame, it is difficult to monitor this in real time. To address the above concerns, we implemented a novel social distance monitoring method that focuses more on the system or physical structure where the breach is happening and provides an incentive to improve that system. We represent the physical system as a color-coded heatmap showing where the breach is happening and how severe it is. The corresponding screenshots are given in Figures 8 and 9.

Figure 6: YOLO v4-tiny neural network.

To obtain the structural heatmap of the social distancing breach, first, the circle of influence for each tracked bounding box is calculated. It is calculated by scaling the corresponding bounding box with a scaling factor S, which is determined by the number of pixels of the image corresponding to one meter in the real world. Then, the overlap between the circles of influence of the detected people is computed to identify a breach of social distancing. Finally, the overlap frames are aggregated over a time window of T seconds. The aggregation of overlaps results in a heatmap-like image in which a denser crowd yields higher intensity. It can be used to identify violations and hotspots and to generate real-time alerts.
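The heatmap aggregation can be sketched as follows. This is a minimal illustration under several assumptions that are not specified in the text: each circle of influence is given as `(center_x, center_y, radius)` in pixels, two circles breach when the distance between their centers is smaller than the sum of their radii, each breach is counted in the grid cell containing the midpoint of the two centers, and the list `frames` holds the circles of all frames inside the time window.

```python
import math
from itertools import combinations
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (center_x, center_y, radius) in pixels


def circles_overlap(a: Circle, b: Circle) -> bool:
    """True when two circles of influence intersect."""
    return math.dist(a[:2], b[:2]) < a[2] + b[2]


def accumulate_heatmap(frames: List[List[Circle]], width: int, height: int,
                       cell: int) -> List[List[int]]:
    """Count overlaps per grid cell over all frames of the time window."""
    cols, rows = math.ceil(width / cell), math.ceil(height / cell)
    heat = [[0] * cols for _ in range(rows)]
    for circles in frames:
        for a, b in combinations(circles, 2):
            if circles_overlap(a, b):
                # Attribute the breach to the midpoint between the two people.
                mid_x, mid_y = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
                col = min(int(mid_x // cell), cols - 1)
                row = min(int(mid_y // cell), rows - 1)
                heat[row][col] += 1
    return heat


frames = [
    [(10.0, 10.0, 8.0), (20.0, 10.0, 8.0), (90.0, 90.0, 8.0)],
    [(12.0, 10.0, 8.0), (22.0, 10.0, 8.0)],
]
heat = accumulate_heatmap(frames, width=100, height=100, cell=50)
```

Cells with larger counts correspond to locations where breaches recur during the window; mapping the counts to colors would give a heatmap-like image, and a threshold on the counts could drive an alert.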

We have used two different datasets for evaluating the performance of the system. For testing the accuracy and performance of different object detection techniques for person detection tasks, the standard MS-COCO dataset is used. It comprises 1.5 million annotations of 80+ object categories. It is one of the standard datasets used for benchmarking object detection. In order to test the end-to-end application with tracking and social distance heatmap visualisation, we have used the surveillance footage of the Oxford town center. This video contains 7500 frames annotated with the person category. Figure 10 presents the comparative results of the performance of YOLO v4 with other object detectors. For the MS-COCO dataset, it reports 43.5% AP with ∼65 FPS on Tesla V100. The experimental results of the models evaluated with mAP are given in Table 2. For the Faster R-CNN evaluation, we resized images to a resolution of 600 × 600 pixels by scaling the shorter dimension to 600 pixels and then taking a center-crop of size 600 × 600 pixels. In YOLO, the images are scaled to a fixed size of 416 × 416 pixels. The maximum mAP with the lowest FPS has been observed with the Faster R-CNN model. The required model should have high enough accuracy with low compute requirements to make it suitable for real-time operation on embedded devices. Its low speed makes the Faster R-CNN model unsuitable for real-time applications. The performance of YOLO v3 and YOLO v4-tiny is the highest, meeting the real-time application requirements. Compared to YOLO v3-tiny, YOLO v4-tiny shows better results with balanced FPS and mAP scores and is hence used for social distance detection among people in the surveillance video.

As per the performance metrics, YOLO v4-tiny is approximately 8 times as fast as YOLO v4 at inference time, and on the very hard MS-COCO dataset, its accuracy is about two-thirds that of YOLO v4. If only person detection tasks are concerned, where people are readily visible to the CCTV camera, there is even less performance degradation.

For large-scale implementation, the algorithm needs to be deployed on low-cost embedded devices such as Jetson Nano and Jetson TX2. We selected only YOLO variants for the implementation because of their higher speed combined with good accuracy. For benchmarking, we selected the Jetson Nano and Jetson TX2 platforms. The FPS for the YOLO v3 and YOLO v4 models in full and tiny variants at different image scales during real-time detection on embedded SoCs is given in Table 3.

After careful evaluation, YOLO v4-tiny is selected in our final implementation for person detection at 416 × 416 resolution and then combined with Deepsort for tracking and with our social distance monitoring approach. Our system achieved an end-to-end performance of 6 FPS on Jetson Nano, which met the requirement of real-time monitoring of social distancing. Jetson Nano is also very power efficient. This kind of large-scale automatic monitoring of public spaces can play an essential role in mitigating the spread and impact of subsequent COVID-19 waves. Since this application is intended to be used in public spaces with various environmental and lighting conditions, high end-to-end accuracy is required. Along with the accuracy of the system, the privacy and individual rights of the observed people are also a genuine concern. The system should not disclose a person's identity in general and should not assist in targeted tracking and surveillance of common people. Maintaining transparency about its fair use by its stakeholders is also essential.

The paper proposes a real-time crowd monitoring and management system for improving healthcare by social distance detection and classification in public places using deep learning-based YOLO v4 object detection and Deepsort techniques. The bounding boxes generated around people help in detecting groups, and their closeness is computed with the help of the circle of influence approach.

The system also generates a color-coded heatmap of the physical structure depicting where the breach of social distancing is happening and how severe it is. The proposed system has also been deployed, tested, and evaluated on Jetson Nano, a low-cost embedded system, to meet the requirements of real-time large-scale deployments. The observed results show the suitability of this system for COVID-19 prevention in public places by social distance detection and classification via crowd monitoring in real time [27–30].

Data Availability. The data are available upon request from the authors.

The authors declare that they have no conflicts of interest.

Advice for the public on Covid-19-World Health Organization

The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modeling study

On the economic impact of social distancing measures

Target specific mining of covid-19 scholarly articles using the one-class approach

Automated diagnosis of covid-19 with limited posteroanterior chest x-ray images using fine-tuned deep neural networks

EMByour global connection to the biomedical eng

The use of drones during mass events

LaPlace, A.: megapixels.cc: origins, ethics, and privacy implications of publicly available face recognition image datasets

Object detection and tracking in 2020

How we know universals the perception of auditory and visual forms

Object detection with deep learning: a review

People detection and finding attractive areas by the use of movement detection analysis and deep learning approach

Computer vision and deep learning techniques for pedestrian detection and tracking: a survey

Detection of static groups and crowds gathered in open spaces by texture classification

Robust real-time pedestrians detection in urban environments with low-resolution cameras

Design of cloud-based green IoT architecture for smart cities

A prototype of IoT-based real time smart street parking system for smart cities

Rich feature hierarchies for accurate object detection and semantic segmentation

Proceedings of the IEEE Int. Conf. Comput. Vis., IEEE

Implementing a real-time, AI-based, people detection and social distancing measuring system for Covid-19

You only look once: unified, real-time object detection

YOLO9000: better, faster, stronger

Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques

YOLOv4: optimal speed and accuracy of object detection

coronavirus: a visual guide to the outbreak

WHO coronavirus disease (COVID-19) dashboard

Lapped convolutional neural networks for embedded systems

Shape similarity for 3d video sequences of people