key: cord-0230775-dq91xvi3 authors: Dai, Zhirui; Jiang, Yuepeng; Li, Yi; Liu, Bo; Chan, Antoni B.; Vasconcelos, Nuno title: BEV-Net: Assessing Social Distancing Compliance by Joint People Localization and Geometric Reasoning date: 2021-10-10 journal: nan DOI: nan sha: a455dd6682cea9261bd8964329452a5f96e42d96 doc_id: 230775 cord_uid: dq91xvi3

Social distancing, an essential public health measure to limit the spread of contagious diseases, has gained significant attention since the outbreak of the COVID-19 pandemic. In this work, the problem of visual social distancing compliance assessment in busy public areas, with wide field-of-view cameras, is considered. A dataset of crowd scenes with people annotations under a bird's eye view (BEV) and ground truth for metric distances is introduced, and several measures for the evaluation of social distance detection systems are proposed. A multi-branch network, BEV-Net, is proposed to localize individuals in world coordinates and identify high-risk regions where social distancing is violated. BEV-Net combines detection of head and feet locations, camera pose estimation, a differentiable homography module to map image into BEV coordinates, and geometric reasoning to produce a BEV map of the people locations in the scene. Experiments on complex crowded scenes demonstrate the power of the approach and show superior performance over baselines derived from methods in the literature. Applications of interest for public health decision makers are finally discussed. Datasets, code and pretrained models are publicly available at GitHub.

Social distancing, the strategy of maintaining a safe distance between people in public spaces, has been shown to be an effective measure against the transmission of contagious pathogens, including influenza virus and coronavirus [57, 7, 23]. However, the monitoring of social distancing by human observers is neither practical in many settings nor scalable. This has motivated an interest in methods to detect and count social distancing violations automatically. While non-vision-based methods are available (the DP-3T contact tracing protocol [59], for example, estimates distances using Bluetooth signals on smartphones), they typically require users to install certain applications on their mobile devices, and are limited in the precision of their distance estimates. Computer vision offers a viable alternative for the collection of social distance measurements. In particular, it has several advantages for the monitoring of public spaces. First, it can leverage surveillance cameras that are already available in many public locations; no expensive infrastructure changes are required. Second, anonymization of visual data is straightforward, by removing all facial identities, as the system has no access to other sensitive information about pedestrians. This makes it much more privacy preserving than the monitoring of mobile devices, or similar approaches. On the other hand, it can produce complete statistics of social distancing violations, as there are no prerequisites on the population (e.g. a smartphone with Bluetooth enabled). While not suited for contact tracing, these statistics can be very useful to decision makers, e.g. to enable the implementation of highly localized "lock-downs," time-varying control of pedestrian access to certain areas, etc. Computer vision has a long history of sensing humans in images.
Object detection [14, 17, 16, 52, 37] and instance segmentation [8, 21, 35] recognize and localize objects by bounding boxes or pixel-wise masks. While effective for close objects and sparsely populated scenes [12, 39], detection quality degrades substantially for distant cameras and busy spaces with significant degrees of occlusion. In fact, for such scenes it can be impossible to even collect accurate bounding box annotations. Measuring social distancing is more closely related to crowd counting methods [66, 5, 42], which are trained to produce a heatmap that highlights the head location of every person in the scene. However, because these methods do not explicitly reason about scene geometry, they are unsuitable for estimating distances between individuals. Furthermore, head locations are not ideal for estimating scene distances, since heads do not lie on a shared plane in 3D, due to variation in people's heights. Much more accurate distance estimates can usually be obtained by reconstructing the scene ground plane and measuring feet distances. This, however, presents additional challenges, due to occlusion. In this work, we consider the problem of social distancing compliance assessment (SDCA), which aims to measure the distances between individuals in a scene and detect violations of social distancing thresholds. Based on the above observations, we argue that SDCA requires a geometry-aware approach, where the relevant extrinsic parameters of the camera are estimated and used to generate a bird's eye view (BEV) map of people locations, as illustrated in Figure 1. We propose a novel benchmark for SDCA, CityUHK-X-BEV, which repurposes the CityUHK-X dataset [28] for the SDCA problem by adding ground-plane annotations. Specifically, for each head location in the dataset, the corresponding feet position is annotated and mapped to the BEV, using the known intrinsic and extrinsic camera parameters. A novel set of evaluation criteria is introduced, each focusing on a different aspect of the task: a localization metric that measures the accuracy of detecting real-world locations of people in the ground plane; a local risk metric that evaluates the capability to discover regions with a high chance of infection; and a global risk metric that assesses the overall risk level of the captured scene. Figure 1 shows examples of these tasks. Unlike many vision problems that localize objects in the image plane, these geometry-aware criteria directly evaluate the capacity of models to make predictions in the 3D ground plane, leading to outputs that are much more informative for real-world applications that require metric information. A multi-branch convolutional architecture, BEV-Net, is then proposed to solve the SDCA task. BEV-Net follows an encoder-decoder structure, using a projective transformation module to convert convolutional feature maps from image view to BEV. The decoder is implemented with a pose regression branch that estimates camera parameters and three separate branches to predict feet, head and BEV heatmaps. These branches are trained with individual losses, in a multi-task manner. To compensate for the height variations in the crowd, a group transformation module with spatial self-attention is used to group people by head height and independently align the feet and head feature maps of the resulting groups. Experiments show that BEV-Net outperforms all detection (DET) and crowd counting (CC) baselines under all proposed SDCA evaluation metrics.
It is further shown, through ablation experiments, that both head and feet annotations are essential to achieve the best prediction quality. A number of applications of potential interest for public health decision makers are then illustrated. These range from the characterization of risks for a single image, as illustrated in Figure 1, to global measures of scene risk, integrated over image datasets, as shown in Figures 7 and 8. The latter can be used to identify events of unusually large risk or inform the deployment of risk mitigation measures, such as the introduction of obstacles in the scene to modify walking patterns and other crowd behaviors. The paper makes four major contributions. First, we introduce the idea of using computer vision for joint geometric reasoning and social distancing compliance assessment in public spaces. Second, a novel benchmark for SDCA in crowd scenes, CityUHK-X-BEV, is introduced, with person-level annotations in bird's eye view. Third, a multi-branch convolutional network, BEV-Net, is proposed and shown to achieve the best SDCA results by learning to perform both heatmap prediction and geometric reasoning. Finally, we show promising results for several potential applications of the SDCA framework in the public health domain. Object detection (DET). Object detection methods recognize and localize multiple classes of objects with bounding boxes. While early algorithms relied on hand-crafted visual features [44, 9], the introduction of CNN-based detectors [17, 16, 52, 4, 51, 38] trained on large-scale image databases [10, 39] has enabled dramatic performance gains. Existing approaches to SDCA have mostly relied on pretrained DET models, making few technical advances to their architectures. [13] proposed to detect social distancing violations by regressing head and feet locations from the bounding box of each detected person. While effective for sparsely populated environments, such an approach does not scale to busy spaces and distant cameras with a large field of view, as is usually the case for large public spaces. In addition, [13] requires external homography calculation based on markers manually placed in the scene. By contrast, the proposed BEV-Net automatically estimates camera geometry and is able to estimate social distances for crowded environments and wide field-of-view cameras. Crowd counting (CC). Crowd counting focuses on counting the number of people in an image. State-of-the-art methods either learn to regress head counts from the image data directly [46], or predict a people density map which is then integrated to obtain the people count [66, 56, 5, 36, 40]. Crowd scene datasets tend to be collected in public spaces and focus on busy scenes [63, 25, 60, 28], where the estimation of head locations is difficult due to the small size of people and significant occlusion. These are the scenes that we emphasize in this work, where we augment a popular crowd counting dataset with the rich annotations required for SDCA. Most CC methods are trained to produce density maps in the camera view, a simpler problem than the proposed combination of SDCA tasks. An exception is WACC [64], which learns to predict ground-plane heatmaps directly. BEV-Net differs from this work in three main aspects. First, WACC requires inputs from multiple cameras, while BEV-Net is designed to work with a single camera view.
Second, WACC is supervised by head annotations only, leading to inaccurate ground-plane locations due to varying head heights; BEV-Net addresses this by using feet annotations, which, unlike heads, lie on a common ground plane. Third, WACC assumes known geometry for all camera views, while BEV-Net jointly learns to predict the extrinsic camera parameters. Geometry in computer vision. The recovery of scene geometry is a classical problem in computer vision [20]. It can be decomposed into the estimation of camera parameters and of scene geometry, i.e. the depth of different parts of the scene. Since the introduction of deep learning, both components have been estimated by neural networks, typically by using two dedicated branches that are trained jointly, in an end-to-end manner [32, 67, 18, 27]. The scene reconstruction required by SDCA amounts to recovering the ground plane and the 3D feet locations of all individuals. This is not trivial because feet locations are frequently occluded in the camera view. BEV-Net addresses this by leveraging head locations and the regularizing geometric constraint that a standing person's head and feet are co-located in BEV. Social behavior analysis. Computer vision methods have been applied to modeling human behavior in public spaces. One line of work achieves this through the task of trajectory prediction [34, 49, 29, 1, 19], which requires generating plausible motion paths for pedestrians in the image plane. Early work used physics models such as the Social Force model [22] to account for interactions between humans [47, 50, 62, 53]; more recently, neural network modules have been used to capture such dependencies between agents, e.g. with recurrent networks [1, 3, 55] or graph convolutional networks [48, 58]. Other tasks for social behavior modeling have also been explored, including early action detection [54, 30, 45] and group activity recognition [6, 33, 24]. All of the above tasks require temporal modeling of video data and semantic understanding of human activities; this work instead focuses on sensing the spatial locations of people, which can be efficiently recovered from individual image frames. To the best of our knowledge, no prior work has attempted to evaluate the quality of visual SDCA in busy public spaces and large field-of-view scenes. In this section, we propose a new dataset for this task. A dataset for SDCA should satisfy several requirements. First, it should contain a wide range of scenes with varying people densities. Second, it should include ground-truth locations for the people in the scene, either in 3D world coordinates or in the form of a BEV heatmap. Optionally, it could provide ground-truth for the intrinsic and extrinsic camera parameters, allowing direct camera pose supervision during training and facilitating the recovery of the ground plane and the homography between camera view and BEV. DET datasets [11, 39, 65] are not suitable for SDCA, since the number of people per image tends to be low. CC datasets are more relevant, as they contain abundant examples of people gathering in clusters, often within 1 to 2 meters, the range of droplet transmission [61, 15, 31], and their scene/camera configurations are most suitable for monitoring social distances in public spaces. However, geometric meta-information is unavailable in most CC datasets. CityUHK-X [28] is an exception, providing extrinsic camera parameters, including height and pitch angle, which make it a potential SDCA benchmark. Nevertheless, it has limitations.
Like other CC datasets, it only provides image-plane annotations of each head in the scene. Even with known camera parameters, image head locations are not enough to recover locations in world coordinates, as each individual's height is variable and unknown. Hence, additional annotations are needed for the feet locations of each person in the image, as well as head-feet correspondences. The Amazon MTurk platform was used to collect feet annotations for CityUHK-X with correspondences between feet and heads. Given a crop of the scene image with an annotated head location, MTurk workers were asked to annotate the center point between both feet, and to verify whether the feet were clearly visible (feet location is precise) or occluded (feet location is estimated). 87,746 feet locations were annotated in the 2,982 scene images of CityUHK-X. Among them, 63,669 (72.6%) were clearly visible and 24,077 (27.4%) occluded. A detailed description of the annotation procedure can be found in the supplemental. As illustrated in Figure 3, ground-truth is provided in the form of three density maps: feet and head maps in image view (IV maps), and a person location map in BEV coordinates (BEV map). Image view maps. Following [66], the ground-truth head and feet locations in image view are represented with heatmaps $M_{\text{head}}$ and $M_{\text{feet}}$, each composed of a mixture of Gaussians. Each Gaussian in the head (feet) heatmap is centered at the annotated head (feet) location of a person in the image, with a fixed standard deviation $\sigma = 5$ pixels. Bird's eye view map. Given the image coordinates of the annotated keypoints and the camera geometry, the BEV map $M_{\text{BEV}}$ is generated using a homography. This is achieved by projecting the keypoints to world coordinates, choosing a suitable region of interest within the ground plane, and then resampling into a fixed-size BEV heatmap, as illustrated in Figure 2. First, the world coordinate frame is defined by setting the ground plane to $z = 0$ and the camera sensor to $(0, 0, h)$, where $h$ denotes the camera height above ground. The camera's yaw angle is zero in this setting. Further assuming that the camera has pitch angle $\theta$ and zero roll (the roll angle can be made zero by appropriately transforming the input image), the transformation between world coordinates $(x, y)$ in the ground plane and image coordinates $(u^I, v^I)$ is given by the homography
$$\begin{bmatrix} u^I \\ v^I \\ 1 \end{bmatrix} \simeq {}^{I}H_{W} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad (1)$$
(derivations in the supplemental), where ${}^{I}H_{W}$ is constructed as a function of $h$, $\theta$ and intrinsic camera parameters such as the focal lengths $(f_u, f_v)$. Second, to define a proper region of interest (RoI) on the ground plane, we require that the center $(u^B_c, v^B_c)$ of the BEV map be projected to the image center $(u^I_c, v^I_c)$, and that the bottom-center pixel of the BEV map be aligned with the bottom-center image pixel, as indicated in Figure 3. The scale factor $s$ of equation (2), measuring the distance in meters spanned per BEV pixel, is then determined by these two constraints and the height $H$ of the BEV map. The transformation from BEV map coordinates $(u^B, v^B)$ to world coordinates in the ground plane is then given by the homography
$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \simeq {}^{W}T_{B} \begin{bmatrix} u^B \\ v^B \\ 1 \end{bmatrix}, \qquad (3)$$
where ${}^{W}T_{B}$ is constructed to align the BEV map with the image center and to rescale to $s$ meters per BEV pixel. Finally, the homography between image and BEV map coordinates is obtained by combining (1) and (3) into
$$ {}^{I}H_{B} = {}^{I}H_{W}\, {}^{W}T_{B}. \qquad (4)$$
Given a set of feet locations $\{q_j\}_{j=1}^{d}$ in image $I$, the corresponding locations in BEV coordinates are then given by $({}^{I}H_{B})^{-1} q_j$ in homogeneous coordinates. Similar to the IV maps, the BEV map $M_{\text{BEV}}$ is generated using a Gaussian kernel with $\sigma = 5$ px. Example heatmaps are shown in Figure 3.
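To make the heatmap construction concrete, the sketch below rasterizes a set of point annotations into a mixture-of-Gaussians density map with $\sigma = 5$ px; the same routine applies to the head, feet and BEV maps once the points are expressed in the corresponding coordinate frame. This is a minimal illustration, not the released code; the function name and the choice of normalizing each Gaussian to unit mass (so that the map integrates to the people count) are our assumptions.

```python
import numpy as np

def points_to_heatmap(points, height, width, sigma=5.0):
    """Rasterize point annotations into a mixture-of-Gaussians density map.

    points: iterable of (column, row) pixel coordinates.
    Returns a (height, width) float32 map. Each Gaussian is normalized to
    unit mass, so the map approximately integrates to the number of points
    (an assumption; the benchmark may use unnormalized Gaussians).
    """
    heatmap = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for px, py in points:
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        heatmap += np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return heatmap

# Example: three annotated feet locations in a 384x512 image-view map.
feet_points = [(120.0, 300.0), (140.0, 310.0), (400.0, 250.0)]
m_feet = points_to_heatmap(feet_points, height=384, width=512)
print(round(float(m_feet.sum()), 2))  # close to 3 when points are away from the border
```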
We propose two types of criteria for evaluating the quality of predicted BEV heatmaps for SDCA. Localization error. Models are required to identify the BEV locations, in real-world distance units (e.g. meters or feet), of all individuals in the scene. Given a set of predicted locations $\hat{X} = \{\hat{x}_i\}_{i=1}^{M}$ and a set of ground-truth locations $X$, the chamfer distance $D(\hat{X}, X)$ is used to evaluate localization error in terms of real-world distances. Predicted locations $\hat{X}$ are determined from the BEV heatmap using non-maximum suppression of size $5 \times 5$ pixels, followed by pixel-wise thresholding at the heatmap value $10^{-3}$. The non-zero entries of the post-processed BEV map are then extracted and converted to world coordinates using (3). We also evaluate the normalized chamfer distance $D_n(\hat{X}, X) = \frac{D(\hat{X}, X)}{2 d_0}$, which measures the localization error as a percentage of the safe distance threshold. A minimum safe distance of $d_0 = 1.5$ m is used, but this can be adjusted per public health guidelines. Risk estimation error. Models are expected to measure compliance with social distancing by estimating risk levels in the scene, either locally or globally. Local risk levels are represented as a heatmap $R$ on the ground plane, with greater values indicating locations with higher risk of infection. Risk is estimated from the BEV localization map $M_{\text{BEV}}$ of Section 3.3 by applying a scale-adaptive kernel $K$ determined by a chosen infection risk model. When the infection risk is defined simply as the number of people within the safe distance of a person, $K$ is a disk-shaped kernel of radius $r = d_0 / s$, where $s$ is the scale parameter of equation (2). In this case, $R(u, v)$ represents the count of people within radius $r$ of location $(u, v)$. Evaluating the accuracy of the risk map by comparing pixel-wise values can be sensitive to overcrowded areas and fail to capture borderline cases. Instead, we pose local risk estimation as a segmentation problem: for a given image, the network outputs a binary mask by thresholding the risk heatmap, with positive regions indicating areas where transmission is likely to occur; the prediction quality is evaluated by the intersection-over-union (IoU) with respect to the ground-truth mask. Global risk levels are defined by counting occurrences of social distancing violations, i.e. the total number of people that fail to maintain a minimum distance $d_0$ from one another, and normalizing by the area covered by the BEV map. This is estimated by multiplying the BEV heatmap with the binary risk mask and integrating over the ground plane, where the mask is obtained by thresholding the risk map at $r_0$, a threshold on the acceptable risk level. Risk above $r_0$ is considered unsafe, indicating the possibility of infection due to violation of social distancing recommendations. Global risk error is measured by the mean squared error (MSE) between estimated and ground-truth risks. Figure 1 shows sample outputs for the tasks described above. The BEV heatmap $M_{\text{BEV}}$ relies on accurate detection of each individual, while local and global risk estimates are more robust to minor localization errors, as the influence of each person is spread out in the ground plane. This diverse set of criteria ensures that models are evaluated with respect to metrics of interest both for computer vision (localization) and for public health practitioners (risk levels).
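The following sketch illustrates one way to compute the quantities above: a symmetric chamfer distance between predicted and ground-truth ground-plane locations, a local risk map obtained by convolving a BEV occupancy map with a disk kernel of radius $d_0/s$, and a global risk score from the thresholded risk map. For simplicity it starts from a binary occupancy map rather than the Gaussian BEV heatmap, and the exact averaging and normalization used by the benchmark may differ; all names are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

def chamfer_distance(pred, gt):
    """Symmetric chamfer distance (in meters) between two (N, 2) point sets."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if len(pred) == 0 or len(gt) == 0:
        return float("inf")
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # pairwise distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def risk_maps(bev_points, d0=1.5, s=0.05, r0=1.0):
    """Local risk map and global risk from a binary BEV occupancy map.

    bev_points: (H, W) map with 1 at each person's BEV pixel.
    s: meters per BEV pixel; d0: safe distance in meters; r0: acceptable risk.
    With r0 = 1, a location is unsafe when two or more people (including the
    person at that location) fall within d0, matching the "risk >= 2" reading.
    """
    r = int(round(d0 / s))
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = (xx ** 2 + yy ** 2 <= r ** 2).astype(np.float32)        # disk-shaped kernel
    local_risk = convolve(bev_points.astype(np.float32), disk, mode="constant")
    unsafe = local_risk > r0                                        # binary risk mask
    area_m2 = bev_points.size * s ** 2
    global_risk = (bev_points * unsafe).sum() / area_m2             # violators per square meter
    return local_risk, unsafe, global_risk
```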
In this section we present BEV-Net, a unified framework for the solution of crowd counting, camera pose estimation and social distancing compliance assessment. The design of BEV-Net is based on the encoder-decoder architecture commonly used in CC models [5, 42]. However, we have found that directly training an encoder-decoder to generate BEV heatmaps leads to poor results, since a fully convolutional architecture has difficulty modeling the large and non-uniform displacements that exist between a pixel location in the input image and the corresponding location in the BEV map. BEV-Net addresses this problem through the multi-branch architecture of Figure 4. The head and feet branches are trained to predict heatmaps for heads and feet in the image view (IV), respectively. Standard convolutional encoder and decoder layers suffice to implement these branches, as the input image and the IV heatmaps are aligned. For SDCA, these heatmaps are not of interest per se, since they contain no metric information. However, the addition of the two branches enables supervision for head and feet locations, which is critical to let the network select which image features to pay attention to. In this sense, they can be seen as a top-down attention mechanism. All metric information is recovered by the central BEV branch. This branch has two stages. The first is a pose regression network that runs in parallel with the head and feet branches, enabling geometric reasoning by learning to predict the height $h$ and pitch angle $\theta$ of the camera. This is implemented with a CNN feature extractor followed by MLP layers, and supervised by a pose-estimation loss. The second stage uses the camera parameters to rectify the IV head and feet feature maps, denoted $F_{\text{IV,head}}$ and $F_{\text{IV,feet}}$, into BEV coordinates. Given the predicted camera pose $(\hat{h}, \hat{\theta})$, the IV feature maps are first aligned in BEV space through projective transformations $T_{\text{head}}$ and $T_{\text{feet}}$ (details in Sections 4.2 and 4.3). This produces a pair of feature maps $F_{\text{BEV,head}}$ and $F_{\text{BEV,feet}}$ in BEV, which are then concatenated along the channel dimension and fed to the BEV decoder, eventually producing the predicted BEV map. The projective transformation between IV and BEV (see Figure 3) creates a spatially varying displacement between IV and BEV feature map locations. This makes it difficult to predict the BEV map from the IV feature maps, since the convolution operation is not naturally suited to model spatially varying displacements. Inspired by [26], we address this problem by designing a differentiable BEV homography transformation module based on (4), called BEV-Transform, which performs feature-level homography mapping. Given the predicted camera pose $(\hat{h}, \hat{\theta})$ and a plane at height $h_0$, BEV-Transform calculates the homography ${}^{I}\hat{H}_{B} = H(\hat{h} - h_0, \hat{\theta})$ that maps the BEV map grid $G_B$ into the IV sampling grid $G_I$, where $(u^B, v^B)$ and $(u^I, v^I)$ denote coordinates in $G_B$ and $G_I$, respectively. Note that $h_0$ differs between the head and feet planes. Hence, two matrices $({}^{I}\hat{H}_{B})_i$, $i \in \{\text{head}, \text{feet}\}$, are needed to transform the head and feet feature maps, respectively, in order to align them in BEV. In practice, more matrices are used, as discussed in Section 4.3. Given these matrices, the feature maps are transformed from image to BEV coordinates by mapping the BEV grid into the image view and sampling the IV features there. As in [26], this is implemented with a differentiable bilinear interpolation layer (a sketch is given below). It should be noted that, when feature maps are internally resized by the network, the coordinates $(u^I_c, v^I_c)$ of the image center and the focal lengths $(f_u, f_v)$ are scaled proportionally to match the map size.
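The sketch below shows a minimal PyTorch version of this feature-level warping: the BEV pixel grid is mapped into the image view with the homography and the IV features are sampled with a differentiable bilinear sampler. It assumes the $3 \times 3$ homography from BEV pixel coordinates to IV pixel coordinates has already been built from the predicted pose, as in (4); the names and the handling of points falling outside or behind the camera are simplifications, not the released implementation.

```python
import torch
import torch.nn.functional as F

def bev_transform(feat_iv, H_bev_to_iv, bev_size):
    """Warp image-view features into bird's eye view with a homography.

    feat_iv:      (B, C, Hi, Wi) image-view feature maps.
    H_bev_to_iv:  (B, 3, 3) homography mapping BEV pixel coords to IV pixel coords.
    bev_size:     (Hb, Wb) size of the output BEV grid.
    """
    B, _, Hi, Wi = feat_iv.shape
    Hb, Wb = bev_size
    # BEV pixel grid in homogeneous coordinates (u, v, 1).
    vs, us = torch.meshgrid(
        torch.arange(Hb, dtype=feat_iv.dtype, device=feat_iv.device),
        torch.arange(Wb, dtype=feat_iv.dtype, device=feat_iv.device),
        indexing="ij",
    )
    grid_bev = torch.stack([us, vs, torch.ones_like(us)], dim=-1).reshape(-1, 3)
    # Map every BEV pixel into the image view, one homography per batch element.
    grid_iv = torch.einsum("bij,nj->bni", H_bev_to_iv, grid_bev)      # (B, Hb*Wb, 3)
    w = grid_iv[..., 2:3].clamp(min=1e-6)   # naive guard; points behind the camera not handled
    u = grid_iv[..., 0:1] / w
    v = grid_iv[..., 1:2] / w
    # Normalize to [-1, 1], the coordinate convention expected by grid_sample.
    u = 2.0 * u / (Wi - 1) - 1.0
    v = 2.0 * v / (Hi - 1) - 1.0
    grid = torch.cat([u, v], dim=-1).reshape(B, Hb, Wb, 2)
    # Differentiable bilinear sampling of IV features at the warped locations.
    return F.grid_sample(feat_iv, grid, mode="bilinear", align_corners=True)
```

In BEV-Net, one such warp would be applied per plane height (the feet plane and each of the head planes of Section 4.3), each with its own homography.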
The determination of the feet and head plane heights has different complexity. For feet, the height is determined trivially as $h_0 = 0$. For heads, however, since people's heights differ, the head annotations do not lie in a single horizontal plane in the world frame. One possibility would be to simply ignore head locations. However, the co-location of the vertical projections of head and feet is a strong regularization constraint for camera pose estimation. To take advantage of this, BEV-Net relies on a set of head planes that quantize the range of person heights. To cover both adult and child heights, planes are placed at heights 1.1 m, 1.2 m, ..., 1.8 m above the ground. People in different regions of the image are then automatically assigned to different height planes by a self-attention mechanism, as illustrated in Figure 5: the IV head feature maps $(F_{\text{IV}})_{\text{head}}$ are split into height groups by spatial self-attention, each group is transformed into BEV with the homography of its assigned head plane, and the transformed maps are then merged. As shown in Figure 4, BEV-Net is trained with four loss functions. The losses for the head, feet and BEV branches are all MSE losses between predicted and ground-truth heatmaps, where integrating a heatmap gives the global people count. We amplify the ground-truth heatmaps by a factor of $m = 100$ for better convergence, following practices in [66]. The pose loss combines two MSE losses, for camera height and pitch angle, with weight factors $\lambda_{\text{height}}$ and $\lambda_{\text{angle}}$. The final loss is a weighted sum of the above four loss functions. The loss weights are set to $\lambda_{\text{height}} = 0.02$, $\lambda_{\text{angle}} = 2.0$, $\lambda_{\text{head}} = \lambda_{\text{feet}} = 1.0$, and $\lambda_{\text{BEV}} = 8.0$ in all experiments. In this section we present experimental evaluations of SDCA on the CityUHK-X-BEV dataset. Training procedure. The head, feet and pose branches are pretrained for 50 epochs before training the full BEV-Net model. All training uses AdamW [43] with learning rate $lr = 0.0008$, exponentially decayed by a factor of 0.98 per epoch. A batch size of 8 and a train-validation split of 4:1 were used in all experiments. After pre-training, BEV-Net is trained end-to-end. The BEV branch is first trained for 5 epochs with the pre-trained branches frozen. All branches are then unfrozen step by step and jointly trained for 195 epochs. Comparisons. We compare BEV-Net to two types of baselines. A detection-based approach (DET) utilizes a person detector combined with the pretrained pose branch used by BEV-Net. The bottom center of each bounding box is used as the feet location $q_j$. These locations are then converted to world coordinates in BEV using $({}^{I}H_{B})^{-1} q_j$, with the projection matrix ${}^{I}H_{B}$ estimated from the predicted camera pose (see Section 3.3). We use pretrained CSPNet [41], Mask R-CNN [21] and Faster R-CNN [52]. The Faster R-CNN is finetuned on pseudo ground-truth bounding boxes generated from head/feet locations with aspect ratio $1/(3\cos\theta)$. A counting-based approach (CC) uses standard crowd counting networks to generate IV head heatmaps, which we project to BEV using the BEV-Transform and the same pose branch used in the DET methods. We use four networks: finetuned CSRNet [36] and DSSINet [40], and our IV-Net, which is the head/feet branch of BEV-Net. The vertical displacement between head locations and the ground plane is compensated for by subtracting the average pedestrian height used in [28] (1.75 meters) from the predicted camera height. To explore the upper bound of counting-based methods for SDCA, we introduce a CC oracle, which uses the ground-truth camera pose parameters and head maps.
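For reference, the weighted multi-task objective described above can be sketched as follows. This is a minimal illustration with the weights listed in the text; the module and variable names are ours, and applying the amplification factor $m = 100$ to the ground-truth heatmaps follows the description above.

```python
import torch.nn.functional as F

# Loss weights from the paper; M_AMPLIFY scales ground-truth heatmaps for convergence.
WEIGHTS = {"head": 1.0, "feet": 1.0, "bev": 8.0, "height": 0.02, "angle": 2.0}
M_AMPLIFY = 100.0

def bev_net_loss(pred, gt):
    """Weighted sum of the four BEV-Net losses.

    pred/gt are dicts with heatmaps 'head', 'feet', 'bev' of shape (B, 1, H, W)
    and pose scalars 'height', 'angle' of shape (B,). All terms are MSE losses.
    """
    loss = 0.0
    for k in ("head", "feet", "bev"):
        loss = loss + WEIGHTS[k] * F.mse_loss(pred[k], M_AMPLIFY * gt[k])
    # Pose loss: MSE on camera height and pitch angle, each with its own weight.
    loss = loss + WEIGHTS["height"] * F.mse_loss(pred["height"], gt["height"])
    loss = loss + WEIGHTS["angle"] * F.mse_loss(pred["angle"], gt["angle"])
    return loss
```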
Various ablations of the proposed BEV-Net are also evaluated. To study the effect of the feet and head branches, we remove each of them from the architecture, leading to "Head only" and "Feet only" variants. We further ablate the head branch by replacing its group BEV-Transform module with a naive BEV homography ("No group transf."). Model performance. Table 1 summarizes the performance of all methods in terms of the evaluation metrics of Section 3.4. BEV-Net outperforms all detection- and crowd counting-based methods by a significant margin. Among the different evaluation criteria, the local and global risk estimates and the normalized chamfer distance showed the most significant differences between models: BEV-Net obtains over 25% higher IoU in local risk prediction, 7x lower error in global risk, and a 20% reduction in normalized chamfer distance. While detection methods like Faster R-CNN can achieve relatively good localization performance with tight bounding boxes after finetuning, their risk estimates remain unreliable due to low recall from missed pedestrians. CC networks like DSSINet suffer from poor ground-plane localization, which hurts their SDCA performance under all criteria. Notably, even the CC oracle, with ground-truth head locations and camera pose, and the IV-Net that predicts feet heatmaps fail to match the accuracy of BEV-Net, due to poor localization performance and feet occlusion, respectively. This performance gap indicates that ground-plane modeling is essential for SDCA tasks, and cannot be effectively addressed by conventional detection or crowd counting approaches that operate solely in the camera view. Ablation study. Also reported in Table 1 are ablated variants of BEV-Net. First, both the "Head only" and "Feet only" models performed worse than the multi-branch BEV-Net. This suggests that both feet and head locations are important for SDCA, which is intuitive: head locations do not lie on a shared 2D plane, due to height variations among the crowd, making it challenging to estimate real-world distances; feet locations lie on the ground plane, but are often occluded in the camera view. Among the two, feet modeling is the more effective: the "Head only" model struggles to produce accurate BEV maps and localization results. This is likely to hold for all but very crowded scenes with large amounts of feet occlusion. Second, the group BEV-Transform module of Section 4.3 enabled considerable gains over the vanilla implementation ("No group transf.") in the IoU of local risk prediction and in global risk error. This confirms that modeling height variations within the population is beneficial for transforming image coordinates into BEV with high precision. Other ablation studies are presented in the supplemental material. We next evaluate the visual quality of the BEV-Net results and discuss its possible applications. BEV heatmaps. Figure 6 compares BEV heatmaps and risk maps predicted by different methods. Mask R-CNN [21], Faster R-CNN [52] and CSPNet [41] suffer from low recall in people detection, so their risk maps fail to identify all regions with a high risk of infection. CSRNet [36] and DSSINet [40] capture more of the risky areas, but are unable to predict the risk level correctly due to the ambiguity in head heights. BEV-Net produces the localization and risk heatmaps closest to the ground-truth. Note how it accurately predicts the risk "hot-spots" inside clusters of people. Risk-based retrieval. The multi-modal outputs of BEV-Net enable retrieval of the images, individuals and clusters with the highest risk of infection. Figure 7 shows the distribution of individual and scene risks in the dataset and examples of events of different risk levels.
Individual risk is measured by the local risk level at the ground-plane location of each person. The graph shows that only 30% of detected individuals are in compliance with a social distancing rule that prevents two or more people from gathering together (i.e. risk >= 2). Under a less restrictive rule that relaxes the threshold to five people, compliance rises to 72%. We believe this type of analytics is of interest to public health experts, e.g. to estimate transmission factors in real-world scenes. Similarly, the global risk measure can be used to detect events of high risk, where viral transmission is most likely to occur. Scene risk analysis. While we have so far focused on image measurements, BEV-Net can also be used to estimate intrinsic risk profiles of scenes. Figure 8 shows the average risk map, over the test set, of several scenes. It can be observed that high-risk areas coincide with entrances, corners, passages or escalators. These risk maps could be used by public health decision makers to identify potential infection hot-spots and to place obstacles or warning signs in the scene to mitigate infection risks. In this work, we have introduced the problem of social distancing compliance assessment in busy public spaces, from wide field-of-view cameras, without the need for manual camera calibration or the introduction of scene markers. A novel benchmark was proposed for this problem, where models are evaluated on their capability to localize people in the ground plane through geometric reasoning, and to identify regions where social distancing is violated. A multi-branch architecture, BEV-Net, was then presented, which fuses information from head and feet annotations to generate a BEV reconstruction of pedestrian locations. Experiments have shown that BEV-Net exceeds baseline methods under all evaluation metrics. Several applications of interest for public health decision makers have also been discussed. Supplementary material: Annotation procedure. The original CityUHK-X dataset [4] contained the head annotations of all people in the scene, as well as extrinsic camera parameters in the form of the height $h$ and pitch angle $\theta$ relative to the ground plane. The intrinsic parameters were assumed available at training and test time. As the height of each individual is unknown, head locations are not sufficient to recover pedestrian locations in world coordinates. Therefore, we used Amazon Mechanical Turk to annotate the feet location of each person, with one-to-one correspondence to the head locations. As the number of people per scene varies greatly (minimum 1 to maximum 121), the scene images are preprocessed into rectangular crops around each head location. The size of the rectangles is selected adaptively to ensure that each crop contains the whole selected person. Given each crop with a marked head location, workers are required to locate the midpoint between the feet of that person (Figure 1). In crowded areas, where one or both feet are occluded by objects or other pedestrians, workers are expected to provide their best estimate of the feet location, or to indicate that too little information is available to do so. Each crop is assigned to three workers. The annotated coordinates from the workers are averaged after the exclusion of outliers. If at least two workers report that they can clearly see the feet of the given person in the crop, the crop is marked 'valid' (clearly visible). Otherwise, the feet of the given person are marked as occluded.
Annotation outcome. 87,746 feet locations were annotated using the procedure described above; among them, 63,669 (72.6%) were marked as clearly visible and 24,077 (27.4%) as occluded. Figure 2 shows the percentage of estimated annotations due to occluded body parts as a function of camera height and angle. The statistics reveal that occlusion occurs more frequently at low camera heights and small pitch angles, making social distancing detection particularly challenging in these scenarios. Homography derivation. The camera setup is shown in Figure 2 of the main text. The origin of the world coordinate frame is set to the camera's perpendicular projection on the ground plane, and the yaw angle of the camera is set to 0 by aligning it with the x-axis of the world frame. We further assume that the camera has zero roll angle, i.e. its view is leveled with respect to the horizon. This is a reasonable setting for most surveillance systems. Given the camera's height $h$ and pitch angle $\theta$, the transformation from the world frame to the optical frame can be composed as ${}^{O}T_{W} = {}^{O}T_{C}\,({}^{W}T_{C})^{-1}$, where ${}^{O}T_{C}$ is the transformation from the camera frame to the optical frame, and ${}^{W}T_{C}$ is the transformation from the camera frame to the world frame. In the CityUHK-X-BEV dataset, the camera focal lengths $(f_u, f_v)$ are given and, for generality, we suppose there is no optical skew nor image center displacement, so the intrinsic matrix contains only the focal lengths and the image center $(u^I_c, v^I_c)$. Denoting by $P$ the canonical projection matrix, the transformation from a point $(x, y, z)$ in the world frame to coordinates $(u, v)$ in the image frame is given by the composition of the intrinsic matrix, $P$, and ${}^{O}T_{W}$. For a plane at $z = h_0$, the projection of points in the plane is obtained simply by using the camera's relative height $h' = h - h_0$. Setting $z = 0$, the projection thus reduces to the $3 \times 3$ ground-plane homography of equation (1), whose entries are functions of $\alpha = \cos\theta$, $\beta = \sin\theta$, the horizontal and vertical focal lengths $(f_u, f_v)$, and the image center $(u^I_c, v^I_c)$. Since the BEV map is scaled as in equation (2) of the main text, the transformation between BEV map coordinates and the world frame is a similarity determined by the scale $s$ and by $H$ and $W$, the height and width of the BEV map. Network architecture. Figure 3 summarizes the architecture of each branch of BEV-Net. The image-view (IV) branches estimate head or feet locations from the input image using an encoder-decoder structure. The IV encoders follow the same design as the first four convolutional blocks of VGG-16 [9] with batch normalization [3]. The head and feet feature maps are then processed by a fully convolutional decoder network into the IV heatmaps. The pose branch uses fully connected layers stacked on top of a ResNet-101 [2] feature extractor to regress camera height and pitch angle. The head and feet feature maps are projected into bird's eye view (BEV) using the BEV-Transform module (Section 4.2 of the main text), then fed into the BEV decoder, which predicts the final BEV heatmap.
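Returning to the homography derivation above, the sketch below builds the ground-plane homography ${}^{I}H_{W}$ from the camera height, pitch angle and intrinsics, and uses its inverse to lift annotated feet pixels onto the ground plane. The axis conventions (camera looking along the world x-axis, image y pointing down) are our assumptions, chosen only for illustration; only $h$, $\theta$ and the intrinsics come from the data, and the released code may lay the matrix out differently.

```python
import numpy as np

def ground_homography(h, theta, fu, fv, uc, vc, h0=0.0):
    """Homography I_H_W mapping plane coords (x, y, 1) to image coords (u, v, 1).

    World frame: z up, ground at z = 0, camera at (0, 0, h), zero yaw/roll,
    looking along +x and pitched down by theta (radians).
    """
    # Rotation from world to optical coordinates (rows: image right, image down, forward).
    R = np.array([
        [0.0, -1.0, 0.0],
        [-np.sin(theta), 0.0, -np.cos(theta)],
        [np.cos(theta), 0.0, -np.sin(theta)],
    ])
    K = np.array([[fu, 0.0, uc], [0.0, fv, vc], [0.0, 0.0, 1.0]])
    h_rel = h - h0  # relative height of the camera above the plane z = h0
    # For points on the plane, the homography columns are r1, r2 and -h_rel * r3.
    return K @ np.column_stack([R[:, 0], R[:, 1], -h_rel * R[:, 2]])

def image_to_ground(H, pixels):
    """Lift image pixels (u, v) onto the plane using the inverse homography."""
    pts = np.concatenate([np.asarray(pixels, float), np.ones((len(pixels), 1))], axis=1)
    g = pts @ np.linalg.inv(H).T
    return g[:, :2] / g[:, 2:3]

# Self-check: a ground point 10 m ahead and 2 m to the side should round-trip exactly.
H = ground_homography(h=5.0, theta=np.deg2rad(30), fu=800, fv=800, uc=320, vc=240)
uv = H @ np.array([10.0, 2.0, 1.0])
uv = uv[:2] / uv[2]
print(np.allclose(image_to_ground(H, uv[None, :]), [[10.0, 2.0]]))  # True
```

The BEV map coordinates then differ from these ground-plane coordinates only by the similarity ${}^{W}T_{B}$ of the main text, i.e. a translation to the chosen RoI and a scaling by $s$ meters per pixel.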
Performance in the split-scene setting. As shown in Figure 4, camera poses vary even within the same scenes of the CityUHK-X-BEV dataset. In the paper, we use the setting of PoseNet [5], which trains and tests on the same scenes. We believe that this is the setting most suited to a public health application, where there is usually some planning of the locations to monitor, and data can be collected at those locations. In this setting, parameter variation is mostly due to camera motion (e.g. pan-zoom cameras), wind effects, etc., and is usually less severe than even in Figure 4. A more drastic generalization to completely unseen scenes is a much more challenging task. We also test BEV-Net on some scenes unseen during training. In this case, the chamfer distance increases to 2.41 (80.33% normalized), the IoU of local risk drops to 54.86%, and the global risk MSE is $50.14 \times 10^{-4}$. We can see that BEV-Net still outperforms most baselines. Additional qualitative results. We further compare BEV-Net with baseline approaches using detection [1, 8] and crowd counting [6, 7] backbones. The results confirm the observations in the main paper that detection methods have low recall for pedestrians far away, while counting methods fail to produce accurate localization in the ground plane. In contrast, BEV-Net captures more people in crowded scenes, especially in areas far from the camera, where occlusion is common, as well as in scenes with extreme (close to 90 degrees) camera angles. This advantage translates to better localization and risk estimation performance, both in the visualizations and in the quantitative results (Table 1 of the main text).

References (main text).
Social lstm: Human trajectory prediction in crowded spaces
Parametric correspondence and chamfer matching: Two new techniques for image matching
Context-aware trajectory prediction
Cascade r-cnn: Delving into high quality object detection
Scale aggregation network for accurate and efficient crowd counting
Learning context for collective activity recognition
Strong social distancing measures in the united states reduced the covid-19 growth rate: Study evaluates the impact of social distancing measures on the growth rate of confirmed covid-19 cases across the united states
Instance-aware semantic segmentation via multi-task network cascades
Histograms of oriented gradients for human detection
Imagenet: A large-scale hierarchical image database
Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art
The pascal visual object classes (voc) challenge
Interhomines: Distance-based risk estimation for human safety
Object detection with discriminatively trained part-based models
Aerobiology and its role in the transmission of infectious diseases
Fast r-cnn
Rich feature hierarchies for accurate object detection and semantic segmentation
Brostow. Unsupervised monocular depth estimation with left-right consistency
Social gan: Socially acceptable trajectories with generative adversarial networks
Multiple View Geometry in Computer Vision
Piotr Dollár, and Ross Girshick. Mask r-cnn
Social force model for pedestrian dynamics
The effect of large-scale anti-contagion policies on the covid-19 pandemic
A hierarchical deep temporal model for group activity recognition
Multi-source multi-scale counting in extremely dense crowd images
End-to-end recovery of human shape and pose
Incorporating side information by adaptive convolution. Neural Information Processing Systems
Activity forecasting
Anticipating human activities using object affordances for reactive robotic response
Transmission routes of respiratory viruses among humans. Current opinion in virology
Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks
Social roles in hierarchical models for human activity recognition
Crowds by example
Fully convolutional instance-aware semantic segmentation
Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes
Feature pyramid networks for object detection
Kaiming He, and Piotr Dollár. Focal loss for dense object detection
Microsoft coco: Common objects in context
Crowd counting with deep structured scale integration network
High-level semantic feature detection: A new perspective for pedestrian detection
Context-aware crowd counting
Decoupled weight decay regularization
Distinctive image features from scale-invariant keypoints
Learning activity progression in lstms for activity detection and early detection
Fully convolutional crowd counting on highly congested scenes
Abnormal crowd behavior detection using social force model
Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction
You'll never walk alone: Modeling social behavior for multi-target tracking
Improving data association by joint modeling of pedestrian trajectories and groupings
You only look once: Unified, real-time object detection
Faster r-cnn: Towards real-time object detection with region proposal networks
Learning social etiquette: Human trajectory understanding in crowded scenes
Human activity prediction: Early recognition of ongoing activities from streaming videos
Sophie: An attentive gan for predicting paths compliant to social and physical constraints
Switching convolutional neural network for crowd counting
Guideline for isolation precautions: preventing transmission of infectious agents in healthcare settings
Recursive social behavior graph for trajectory prediction
Apostolos Pyrgelis, Daniele Antonioli
Nwpucrowd: A large-scale benchmark for crowd counting
On air-borne infection. study ii. droplets and droplet nuclei
Who are you with and where are you going?
Cross-scene crowd counting via deep convolutional neural networks
Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns
Citypersons: A diverse dataset for pedestrian detection
Single-image crowd counting via multi-column convolutional neural network
Unsupervised learning of depth and ego-motion from video

References (supplemental).
[1] Piotr Dollár, and Ross Girshick. Mask r-cnn
[2] Deep residual learning for image recognition
[3] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[4] Incorporating side information by adaptive convolution. Neural Information Processing Systems
[5] Posenet: A convolutional network for real-time 6-dof camera relocalization
[6] Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes
[7] Crowd counting with deep structured scale integration network
[8] High-level semantic feature detection: A new perspective for pedestrian detection
[9] Very deep convolutional networks for large-scale image recognition
[10] Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research

Acknowledgements. This work was partially funded by NSF awards IIS-1924937, IIS-2041009, NVIDIA GPU donations, and a gift from Amazon. We also gratefully acknowledge the use of the Nautilus platform for some of the experiments discussed above. ABC acknowledges support from a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11212518).