title: Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting
authors: Wu, Qiming; Zou, Zhikang; Zhou, Pan; Ye, Xiaoqing; Wang, Binghui; Li, Ang
date: 2021-04-22
DOI: 10.1145/3474085.3475378

Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. In particular, deep neural network (DNN) methods have significantly reduced estimation errors for crowd counting tasks. Recent studies have demonstrated that DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible perturbations can mislead DNNs into making false predictions. In this work, we propose a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically evaluate the robustness of crowd counting models, where the attacker's goal is to create an adversarial perturbation that severely degrades their performance, thus leading to public safety accidents (e.g., stampede accidents). In particular, the proposed attack leverages the extreme-density background information of input images to generate robust adversarial patches via a series of transformations (e.g., interpolation, rotation, etc.). We observe that by perturbing less than 6% of image pixels, our attacks severely degrade the performance of crowd counting systems, both digitally and physically. To better enhance the adversarial robustness of crowd counting models, we propose the first regression-model-based Randomized Ablation (RA), which is more effective than Adversarial Training (ADT): the Mean Absolute Error of RA is 5 lower than that of ADT on clean samples and 30 lower on adversarial examples. Extensive experiments on five crowd counting models demonstrate the effectiveness and generality of the proposed method. The supplementary materials and certificate-retrained models are available at https://www.dropbox.com/s/hc4fdx133vht0qb/ACM_MM2021_Supp.pdf?dl=0

Due to the global outbreak of the COVID-19 virus, many public places require people to keep social distance. Therefore, video surveillance systems built on crowd counting models [60] are undoubtedly gaining prominence in public administration. Crowd counting is one of the most significant applications of deep neural networks (DNNs) and has been adopted in many safety-critical scenarios such as video surveillance [60] and traffic control [16]. However, recent works have demonstrated the vulnerability of DNNs to adversarial examples [13, 49], which provides attackers with a new interface for performing attacks with malicious purposes. Naturally, attackers may intend to generate perturbations that fool DNN models into counting the crowd inaccurately so as to increase the possibility of causing public safety accidents (e.g., viral infection, stampedes and severe traffic accidents). To achieve this adversarial goal, we design the Adversarial Patch Attack with Momentum (APAM) algorithm against crowd counting systems.

Figure 1 (caption, partially recovered): the second row shows that after retraining with ablated images, the output density maps of normal and adversarial images have little difference; the third row illustrates the imperceptibility of the generated adversarial patch, which is hard for human eyes to find in a congested scene when the patch size is 20 × 20 or 40 × 40.
Compared with classic adversarial examples restricted by ℓ0, ℓ2 or ℓ∞ distance metrics [4, 13, 41, 42], APAM attacks are not limited by a norm bound (the norm bound only serves to guarantee the imperceptibility of adversarial examples, and we claim that the adversarial patch generated by the APAM algorithm can achieve this goal by reducing the patch size; e.g., in the third row of Fig. 1, the 20×20 and 40×40 patches are hard to find in dense crowd scenes), and our proposed algorithm can accelerate the optimization with momentum to obtain robust adversarial patches. Therefore, performing adversarial patch attacks to evaluate the robustness of crowd counting systems is a promising research direction, which is the major focus of this work.

Concurrent Work. There is only one recent work [30] that introduces a defense strategy to study the robustness of crowd counting systems. However, the proposed defense strategy relies on the depth information of RGBD datasets, which is not generally available in crowd counting applications. Additionally, they ignore the importance of background information in crowd counting, whereas such information is effectively exploited by our attacks.

Motivation. We observe that the extremely dense background has been the key obstacle in crowd counting. Numerous works are dedicated to overcoming this challenge by utilizing deeper convolutional neural networks (CNNs). However, due to the inherent vulnerability of CNNs, such CNN-based methods expose a new attack interface to potential attackers. This motivates us to design novel adversarial attacks so as to better understand and improve the robustness of CNN-based crowd counting systems.

APAM Attacks. Our proposed attacks aim to leverage the congested background information for generating the adversarial patches. In addition, we enhance the patch generation algorithm with momentum and remove the norm bound to strengthen the attack capability (Fig. 1 shows the generated adversarial examples on a typical dataset).

Certified Defense via Randomized Ablation. We propose a certified defense strategy against APAM attacks on crowd counting, namely randomized ablation. Our defense strategy consists of two parts: image ablation and certificate retraining of crowd counting models. The first step is inspired by the recent advance in image classifier certification [23]. Specifically, randomized ablation is effective against APAM attacks because the ablation results of a normal image x and an adversarially perturbed image x̃ are likely to be the same (e.g., retaining 45 pixels for each image in Fig. 1). Note that several other methods have been proposed to certify robustness, such as the dual approach [10], interval analysis [14], and abstract interpretation [40]. Compared with these methods, randomized ablation is simpler and, more importantly, scalable to complicated models.

Our major contributions are summarized as follows:
• To the best of our knowledge, this is the first work to propose a systematic and practical method for evaluating the robustness of crowd counting models via adversarial patch attacks together with a certified defense strategy (i.e., randomized ablation).
• We design a robust adversarial patch attack framework called Adversarial Patch Attack with Momentum (APAM) to create effective adversarial perturbations against mainstream CNN-based crowd counting models.
• We implement the APAM attack algorithm in two forms: white-box attack and black-box attack. We evaluate the proposed attacks in both digital and physical spaces. Qualitative and quantitative results demonstrate that our attacks significantly degrade the performance of the models and hence pose severe threats to crowd counting systems.
• We provide the first theoretical guarantee of the adversarial robustness of crowd counting models via randomized ablation. More practically, after training the verification models with this strategy, we achieve a significant robustness enhancement. Meanwhile, our proposed method outperforms traditional adversarial (patch) training in both clean-sample and adversarial-example evaluations.

Crowd analysis is an interdisciplinary research topic involving researchers from different domains [48], and the approaches to crowd counting are likewise characterized by multidisciplinary integration [6, 18]. Initial research approaches are divided into three categories [48]: detection-based methods, regression-based methods and density-estimation-based methods. A survey of classical crowd counting approaches is available in [32]. However, the quality of the density maps predicted by these classical crowd counting methods is limited when applied in congested scenes. Thanks to the success of CNNs in other fields [5, 51, 52], researchers have recently proposed CNN-based density estimation approaches [24, 29, 47, 59, 61] to find a way out of this dilemma. A survey of CNN-based crowd counting methods is available in [48]. Although more and more impressive models have yielded exciting results [25, 27, 28, 53, 56-58] on the benchmark datasets, their robustness has not been reasonably understood. In particular, we focus on the systematic evaluation and understanding of the robustness of five such models [24, 29, 47, 59, 61] in this article.

Norm-Bounded Adversarial Perturbation. Recent works have demonstrated the existence of adversarial examples in deep neural networks [49], and a variety of methods such as FGSM [13], DeepFool [41], C&W [4] and JSMA [42] have been proposed to generate adversarial examples bounded by norms. Basically, the problem is formulated as ||x̃ − x||_p ≤ ε, where ε is the parameter controlling the strength of the perturbation. Researchers often choose the ℓ0, ℓ2 and ℓ∞ metrics in practice. The ℓ0 norm counts the number of changed pixels in x̃, the ℓ2 norm is formulated as ||x̃ − x||_2 = √((Δx_1)^2 + (Δx_2)^2 + · · · + (Δx_n)^2), and the ℓ∞ norm is formulated as ||x̃ − x||_∞ = max{Δx_1, Δx_2, · · · , Δx_n}. The attacker aims to find the optimal adversarial example x̃ = x + δ to fool the neural network. A survey of adversarial examples in deep learning is available in [55].

Empirical Defense Strategies. Many defense strategies [1] have been proposed, such as network distillation [43], adversarial training [13, 17], adversarial detection [26, 33, 39], input reconstruction [15], classifier robustifying [2], network verification [21] and ensemble defenses [38]. However, these defense strategies share a common major flaw: almost all of them are effective against only part of the adversarial attacks, and some offer no defense at all against unseen and powerful attacks.

Adversarial Patch Attacks. As numerous empirical defense strategies have been proposed to defend against norm-bounded adversarial attacks, researchers have explored adversarial patch attacks to further fool DNNs [3]. Attackers obtain the adversarial patch by optimizing the objective p̂ = arg max_p E_{x,t,l} log Pr(ŷ | A(p, x, l, t)) in [3]. Specifically, A(p, x, l, t) is a patch application operator, where x is the input image, l is the patch location, p is the patch, and t is the image transformation.
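To make the patch application operator concrete, the following is a minimal sketch (not the implementation of [3] or of this paper; the rotation angle, scale and location arguments are illustrative placeholders):

```python
import torchvision.transforms.functional as TF

def apply_patch(image, patch, location, angle_deg=0.0, scale=1.0):
    """Paste a scaled and rotated patch onto an image: a sketch of A(p, x, l, t).

    image:    (C, H, W) tensor in [0, 1]
    patch:    (C, h, w) tensor in [0, 1]
    location: (row, col) of the top-left corner where the patch is pasted
    """
    _, h, w = patch.shape
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    patch_t = TF.resize(patch, [new_h, new_w])         # scale the patch
    patch_t = TF.rotate(patch_t, angle_deg)            # rotate within the same canvas
    out = image.clone()
    row, col = location
    out[:, row:row + new_h, col:col + new_w] = patch_t  # overwrite the image pixels
    return out
```

In the digital attacks described later, the patch pixels are the variables being optimized, while the transformation parameters (e.g., rotation and scaling) are varied to make the patch robust to such changes.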
Because adversarial patch attacks are image-independent, attackers can launch attacks easily regardless of scenes and victim models. Moreover, the state-of-the-art empirical defenses, which focus on small perturbations, may not be effective against large adversarial patches. Meanwhile, the adversarial patch attack has been widely applied to many safety-sensitive applications, such as face recognition [44, 54] and object detection [22, 31, 46], which also inspires this work.

We first define regression models, then introduce the background of crowd counting, and after that define our attacks against crowd counting models.

Definition 3.1 (Regression Models). In statistical modeling, regression models refer to models that estimate the relationships between a dependent variable and independent variables.

Given a set of labeled images {(x_i, y_i)}, where x_i ∈ R^{H×W×C} and H, W and C are the height, width and channel number of the image, respectively, and y_i is the i-th ground-truth density map of image x_i, a crowd counting system aims to learn a model f_θ, parameterized by θ, by using these labeled images and solving the following optimization problem:
min_θ Σ_i ||f_θ(x_i) − y_i||_2^2. (1)
Note that researchers have recently adopted more effective loss functions in crowd counting [7, 35]; we consider the most commonly used ℓ2 loss function. Moreover, different crowd counting models [24, 29, 47, 59, 61] use different architectures. For instance, MCNN [59] uses multi-column convolutional neural networks to predict the density map. The learned model f_θ can be used to predict the crowd count in a testing image x: f_θ takes x as input and outputs the predicted density map f_θ(x), and the crowd count in x is then estimated by summing up all values of the density map.

As crowd counting systems are of great importance in safety-critical applications, such as video surveillance, an adversary is motivated to fool the systems into counting the crowd inaccurately so as to increase the possibility of causing public safety accidents. Next, we introduce the threat model and formally define our problem.

Adversary's knowledge. Depending on how much information an adversary knows about the crowd counting system, we characterize an adversary's knowledge in terms of the following two aspects:
• Full knowledge. In this scenario, the adversary is assumed to have all knowledge of the targeted crowd counting system, for example, model parameters, model architecture, etc.
• Limited knowledge. In this scenario, the adversary has no access to the model parameters of the targeted crowd counting system. In practice, however, there exist various crowd counting systems different from the targeted system. We assume the adversary can adopt these crowd counting systems as substitutes and perform an attack on these substitute systems.

Adversary's capability. We consider different capabilities for an adversary to launch attacks. In the full knowledge setting, an adversary can launch a white-box attack, i.e., it can generate an adversarial patch for a testing image by directly leveraging the model parameters of the targeted crowd counting system. In the limited knowledge setting, an adversary can launch a black-box attack, i.e., it cannot leverage the model information of the targeted system, but can use several substitute crowd counting systems to generate an adversarial patch. In addition, we also study the physical attack, in which we aim to attack crowd counting systems in a real-world scene.
Attacking crowd counting systems in the real world is much more difficult than in the digital space. With less information available, the attacker directly places the generated adversarial patch in the scene to fool the networks. This requires the adversarial patch to generalize well across various crowd counting systems. To this end, we randomly select a shopping mall to evaluate our generated adversarial patch.

Adversary's goal. Given a set of testing images with ground-truth crowd counts and a targeted crowd counting system, an adversary aims to find an adversarial patch for each testing image such that the perturbed image has a predicted crowd count, produced by the targeted system, that largely deviates from the ground truth.

Given a crowd counting model f_θ and a testing image x with ground-truth density map y, we aim to add an adversarial perturbation δ (i.e., an adversarial patch in our work) to the testing image such that the model predicts the crowd count in the testing image as the adversary desires. Note that since the crowd count is calculated by summing all values of the density map, modifying the predicted crowd count is equivalent to modifying the density map. Suppose the adversary aims to reach a targeted density map y* with a perturbation δ. Then, our attack can be defined as follows:
min_δ D(δ), s.t. f_θ(x + δ) = y*, (2)
where D is a distance function. Directly solving Eq. 2 is challenging because the equality constraint involves a highly nonlinear model f_θ. An alternative is to put the constraint into the objective function. Specifically,
arg min_δ ℓ(f_θ(x + δ), y*), s.t. D(δ) ≤ ε, (3)
where ε is the budget constraint and ℓ is a loss function (e.g., cross-entropy loss).

In this section, we introduce the proposed APAM attack in the following order: we first design it under the white-box setting, and then under the black-box setting. Our proposed white-box attack consists of two phases: adversarial patch initialization and adversarial patch optimization with momentum. Our patch initialization process includes two steps: image transformation and interpolation smoothness.

Image Transformation. Researchers have demonstrated that cyber-physical systems can destroy perturbations crafted using digital-only algorithms [34], and that physical perturbations can be affected by environmental factors, including viewpoints [11]. To solve this problem, we manipulate the image transformation function to make the patch more robust through Eq. 4 and Eq. 5. The pipeline of the image transformation function is illustrated in Fig. 2. Inspired by [45], we propose to interpolate the generated adversarial patch with a tensor α, where α ∈ R^{H×W×C} is a tensor in the image space and α_{h,w,c} ∈ (0, 1) for h ∈ [1, H], w ∈ [1, W] and c ∈ [1, C]. Manipulating this kind of interpolation guarantees the quality of the adversarial patch. We then formulate the initialization of the adversarial example x̃ through the transformation function T in Eq. 4, where
P̂ = I ∘ R ∘ (P(δ) · α). (5)

Interpolation Smoothness. As the interpolation tensor α contains many parameters, we simplify the problem by setting a smoothness constraint on α. As defined in Eq. 6, the smoothness loss is widely used in image processing as a pixel-wise de-noising objective [20, 37].

Adversarial Patch Generation. We acquire the final adversarial patch by minimizing the objective L in Eq. 7, starting from the initial adversarial example x̃ generated through the image transformation function T. In particular, our objective function L has two parts: the adversarial loss L_adv and the smoothness regularization L_s(α). The smoothness term aims to smooth the optimization of adversarial patches and guarantee the perceptual quality of the patch, since the patch is scaled and rotated during the image transformation process. To make the final generated adversarial patch robust, we propose to minimize the following objective function:
L = L_adv + λ · L_s(α), (7)
where λ is a hyperparameter to balance the two terms.
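The following is a minimal PyTorch-style sketch of this two-part objective, using a mean-squared error toward the target density map as a stand-in for the adversarial loss of Eq. 3 and a pixel-wise squared-difference penalty as a stand-in for the smoothness loss of Eq. 6 (the function names and exact forms are illustrative, not the paper's definitions):

```python
import torch.nn.functional as F

def smoothness_loss(alpha):
    # Pixel-wise smoothness penalty on the interpolation tensor alpha (C, H, W):
    # squared differences between vertically and horizontally adjacent entries.
    dh = alpha[:, 1:, :] - alpha[:, :-1, :]
    dw = alpha[:, :, 1:] - alpha[:, :, :-1]
    return (dh ** 2).sum() + (dw ** 2).sum()

def apam_objective(model, x_patched, target_density, alpha, lam=0.01):
    # Adversarial loss: push the predicted density map of the patched image
    # toward the attacker-chosen target density map (e.g., a scaled ground truth).
    pred = model(x_patched)                  # predicted density map
    adv_loss = F.mse_loss(pred, target_density)
    return adv_loss + lam * smoothness_loss(alpha)
```

In the experiments reported below, the balancing weight is set to λ = 0.01 and the attack target is ten times the ground-truth density map.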
Momentum-based Optimization. The momentum method is usually used in gradient descent algorithms to accelerate the optimization process by memorizing previous gradients. As the adversary searches for the optimal adversarial example x̃ in a high-dimensional space, there is a high possibility of being trapped in small humps, narrow valleys and poor local minima or maxima [8, 9]. To break this dilemma, we integrate momentum into the optimization of the adversarial patch so as to update more stably and further enhance the potential capability of the attacker. Specifically, researchers have proposed to boost traditional adversarial attacks with momentum, generating perturbations iteratively as x̃_{t+1} = x̃_t + η · sign(g_{t+1}) (Eq. 8), where η is the step size and g_{t+1} the accumulated gradient [8]. Inspired by their work, we extend the optimization process of the adversarial patch with momentum. By adding variables to control the exponentially weighted average of gradients, the optimization process can be smoothed and accelerated. Therefore, the generated adversarial patch is capable of transferring across various models while its attack ability simultaneously remains strong, thus demonstrating robust adversarial perturbations.

In a black-box attack, an adversary has no access to the internal structure of the victim models. However, the adversary can adopt substitute models to generate an adversarial patch. In order to make the adversarial patch robust, we consider jointly attacking multiple crowd counting systems. Specifically, the objective function of our black-box attack, defined in Eq. 9, aggregates the attack objective over the substitute models, where each f_{θ_i} denotes a substitute crowd counting model with parameters θ_i.
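The sketch below combines these pieces: one momentum-accumulated update of the patch in the style of [8], with the black-box case handled by summing the adversarial objective over a list of substitute models (this summation is our reading of Eq. 9; apply_patch and apam_objective are the hypothetical helpers sketched earlier, and the step size and momentum factor are illustrative):

```python
import torch

def apam_step(patch, momentum, models, x, target_density, alpha,
              step=0.01, mu=0.9, lam=0.01):
    """One momentum-based update of the adversarial patch.

    models: a list with one model (white-box) or several substitutes (black-box).
    Initialize momentum with torch.zeros_like(patch) before the first call.
    """
    patch = patch.detach().requires_grad_(True)
    x_adv = apply_patch(x, patch, location=(0, 0)).unsqueeze(0)   # patch operator sketched earlier
    loss = sum(apam_objective(m, x_adv, target_density, alpha, lam) for m in models)
    loss.backward()
    grad = patch.grad
    # Exponentially weighted accumulation of normalized gradients, as in [8].
    momentum = mu * momentum + grad / (grad.abs().sum() + 1e-12)
    # Descent step on the objective; keep patch pixels in the valid range.
    patch = (patch - step * momentum.sign()).clamp(0.0, 1.0).detach()
    return patch, momentum
```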
In this section, we clarify the principles of randomized ablation. A vivid process is depicted in Fig. 3.

Figure caption (partially recovered): this image process is done during the certificate training, and we can obtain the specific ablated image when given the pristine one. Moreover, the model will not overfit on the obtained ablated image dataset, since the ablated images vary across training epochs (i.e., the results of sampling pixels of the image vary from epoch to epoch).

We use S to denote the set of all possible pixel values, X to denote the set of all images, and k to denote the retention constant (the number of retained pixels). When encoding ablated pixels with the null symbol NULL, we adopt the mean pixel encoding method proposed in [23] for simplicity and efficiency. Then, we extend the randomized ablation scheme from the classifier setting (functions mapping to {0, 1}) to the real-valued function setting (functions mapping to [0, 1]). To achieve this goal, we introduce the top-K overlap metric ω(x, x̃, K) proposed in [12], defined in Eq. 10, to measure the adversarial robustness of crowd counting models, where ω ∈ [0, K] denotes the number of overlapping elements among the top-K largest values of the output density maps of A(x) and A(x̃). Intuitively, we can derive an upper bound probability p_u and a lower bound probability p_l (Theorem 1), where k is the number of retained pixels in the image.

Proof. We follow the notations above to complete the proof. Note that ω in Eq. 10 counts the number of overlapping top-K largest pixels in the two output density maps (of the normally ablated and adversarially ablated images). We next derive the upper and lower probabilities p_u and p_l that bound this event. If ω = K, then we have T ∩ (x ⊖ x̃) = ∅, which indicates that x and x̃ are identical at all indices in T (that is, the ablated results of x and x̃ are identical, A(x, T) = A(x̃, T)). In this case, we have
p_u = C(n − s, k) / C(n, k),
where C(n − s, k) and C(n, k) represent the total numbers of ways of uniformly choosing k elements from the n − s non-adversarial pixels and from all n pixels of the image, respectively (s denotes the number of pixels covered by the adversarial patch). If ω < K, we have T ∩ (x ⊖ x̃) ≠ ∅. Then, similarly,
p_l = C(s, k) / C(n, k),
where the final inequality corresponds to the worst case: the retained pixels in the adversarial image x̃ are all picked from the adversarial perturbation region of the image. □

Table 3: Theoretical probability analysis of the four selected adversarial patch sizes. Note that the upper bound probability p_u is the probability of picking only non-adversarial pixels from the image. The lower bound p_l denotes the probability of the worst case (described in the proof of Theorem 1) happening, and we find that the randomized ablation method can almost always avoid it (p_l is always close to zero).

Dataset and Networks. We select five well pre-trained crowd counting models, which are publicly available: CSRNet [24], DA-Net [61], MCNN [59], CAN [29] and CMTL [47], as our verification models, thanks to their prevalent usage in the field. Additionally, we select the ShanghaiTech dataset [59] for retraining and evaluation because it is the most representative dataset in the field, containing 1198 images with over 330,000 annotated people. Besides, we adopt the mean absolute error (MAE) and root mean squared error (RMSE), defined in Eq. 16, as the evaluation metrics:
MAE = (1/N) Σ_i |C_i − C̃_i|, RMSE = √((1/N) Σ_i (C_i − C̃_i)^2), (16)
where N is the number of images, and C_i and C̃_i denote the i-th ground-truth count and the count predicted on the corresponding adversarial image, respectively.

We apply the following settings in our implementations: we set the attack target to y* = 10·y and set λ = 0.01 in Eq. 7 to balance the two terms. Moreover, four patch sizes are selected in our experiments: 20 × 20, 40 × 40, 81 × 81 and 163 × 163, representing 0.07%, 0.31%, 1.25% and 5.06% of the image size, respectively. Training details of randomized ablation are summarized in the Supplementary Materials.

The upper part of Table 1 shows the results of white-box APAM attacks. We observe that the effectiveness of the attacks strengthens as the patch size climbs from 20 to 163. All five crowd counting models suffer most when the patch size reaches 163 × 163, since the adversarial perturbation reaches its maximum. However, the robustness varies from model to model. For example, compared with other models, DA-Net [61] remains relatively robust against white-box APAM attacks. For MCNN [59] and CAN [29], the performance degrades significantly even when the adversarial patch is small. Moreover, we compute the relative increase percentages of these values and find that the vulnerability of a network depends on whether a specific patch size is reached. For instance, the fastest relative increases in the MAE and RMSE values of CMTL [47] are 40.09% and 40.70%, occurring when the patch size changes from 20 to 40. We therefore infer that the threshold patch size of CMTL [47] lies between 20 and 40.

We find that the effectiveness of APAM attacks is also affected by the attack target y*. In this experiment, we use three different attack targets, y* = 5y, 20y and 100y, to study the effect of the attack target. From Table 2, we observe that the MAE and RMSE values increase as y* becomes larger, and the behavior of the victim networks is basically in accord with that of the attack experiments with y* = 10y.
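As a small worked reference for these numbers, the predicted count of a model is the sum of its output density map, and the MAE and RMSE of Eq. 16 follow directly (a minimal sketch):

```python
import numpy as np

def crowd_count(density_map):
    # The predicted crowd count is the sum of all density-map values.
    return float(np.asarray(density_map).sum())

def mae_rmse(gt_counts, adv_counts):
    # Eq. 16: MAE and RMSE between ground-truth counts and the counts
    # predicted on the corresponding adversarial images.
    gt = np.asarray(gt_counts, dtype=np.float64)
    adv = np.asarray(adv_counts, dtype=np.float64)
    mae = float(np.mean(np.abs(gt - adv)))
    rmse = float(np.sqrt(np.mean((gt - adv) ** 2)))
    return mae, rmse
```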
Since the mainstream crowd counting models have three popular structures (dilated convolution, context-aware and multi-scale structures), we select three representative networks as substitute models: CSRNet [24] (dilated convolution structure), CAN [29] (context-aware structure) and DA-Net [61] (multi-scale structure). We jointly optimize Eq. 9 with the substitute models to make the generated black-box patch incorporate contextual information and be generally robust towards different model structures. The targeted models are MCNN [59] and CMTL [47], which are also representative models with multi-column structures.

The black-box APAM attack results are summarized in the lower part of Table 1. Compared with the white-box APAM attacks, black-box APAM attacks are somewhat weaker. But we still find some intriguing phenomena: consistent with the white-box attacks, CSRNet [24] and CAN [29] are still vulnerable to the adversarial patch and their performance is severely degraded. Simultaneously, DA-Net [61] is relatively robust against the adversarial patch in both the white-box and black-box settings. In particular, we find that DA-Net [61], MCNN [59] and CMTL [47] stay relatively robust when the patch size is small (e.g., 20 × 20 and 40 × 40). One possible explanation is that training a black-box adversarial patch amounts to a form of adversarial retraining, which strengthens the involved models.

Evaluation. We define the error rate to evaluate the performance of crowd counting models:
error rate = |C̃ − C| / C × 100%, (17)
where C denotes the count predicted on the normal image and C̃ denotes the count predicted on the adversarial image. We first choose a well-trained adversarial patch and print it out. Then, we select a large shopping mall as our physical scene and take a group of photos to study (see Fig. 4). In Fig. 4, our proposed attack reaches error rates of 526.2% for CSRNet [24], 982.9% for CAN [29], 252.2% for DA-Net [61], 100% for MCNN [59] and 76.9% for CMTL [47]. From Fig. 4, the physical APAM attack does degrade the predictions of the networks even though the adversarial patch area is less than 6% of the image area. We observe two kinds of network estimation: an extremely large number or a very small number. For instance, Fig. 4 shows that CSRNet [24], DA-Net [61] and CAN [29] incline to predict a larger crowd count, while MCNN [59] and CMTL [47] tend to estimate the count as approximately zero, even though the attack target is set to y* = 10·y. We attribute this phenomenon to physical environmental factors (e.g., light, patch size and location). It does not mean the physical attack fails, since by the error rate defined in Eq. 17 a model is considered successfully attacked physically when its error rate is high. Last but not least, the experimental results demonstrate that the physical adversarial patch can severely degrade the performance of the models.

To begin with, we summarize the training parameters in the Supplementary Materials for better reproducibility. The final training results of the five crowd counting models via randomized ablation are summarized in Table 4. In general, we find that the adversarial robustness enhancement is accompanied by the loss of some clean accuracy (i.e., the MAE and RMSE values decrease in the adversarial setting while they increase in the clean setting). This phenomenon is well explained in [50]: there exists a balance between adversarial robustness and clean accuracy.
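The per-image ablation applied during this certificate retraining can be sketched as follows (a minimal sketch assuming mean-pixel encoding of the ablated positions, following [23]; the retention constant k is supplied by the caller):

```python
import numpy as np

def randomized_ablation(image, k, rng=None):
    """Keep k randomly chosen pixels and encode every other pixel with the image mean.

    A fresh pixel subset is drawn on every call, so the ablated version of an
    image differs from epoch to epoch during certificate retraining.
    """
    rng = np.random.default_rng() if rng is None else rng
    img = np.asarray(image, dtype=np.float32)                 # (H, W, C)
    h, w, _ = img.shape
    keep = rng.choice(h * w, size=k, replace=False)           # retained pixel indices
    rows, cols = np.unravel_index(keep, (h, w))
    ablated = np.broadcast_to(img.mean(axis=(0, 1)), img.shape).copy()  # mean-pixel encoding
    ablated[rows, cols, :] = img[rows, cols, :]               # restore the k retained pixels
    return ablated
```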
Comparing Table 1 with Table 4, we find that the randomized ablation training method helps the crowd counting models become more robust against APAM attacks (e.g., 11.28 in Table 4). For the other models, we also observe the same decreasing trend, which demonstrates the practical effectiveness of the proposed method. Moreover, we observe that the clean accuracy loss is indeed acceptable: in Table 4, the MAE(0) value of MCNN [59] increases from 110.20 to 117.32, and the RMSE(0) value of MCNN [59] increases from 173.20 to 185.44. Compared with the popularly used adversarial training [13] in Table 5, the MAE(0) and RMSE(0) values of MCNN [59] are 121.26 and 191.81, respectively. The adversarial training results on clean examples are worse than those of randomized ablation, which further demonstrates the effectiveness of our method.

The most intuitive approach to enhancing the adversarial robustness of DNNs is Adversarial Training [13, 36]. In the adversarial patch scenario, we generate white-box adversarial patches during the model training loop to enhance robustness. Note that there are other improved defense methods based on RobustBench; their effectiveness on regression models remains an open problem. The experimental results are summarized in Table 5. We now formulate the loss function used in the adversarial training loop, in which x denotes the clean sample images and y the corresponding ground-truth maps.

Theoretical Probability of the Adversarial Patch via Randomized Ablation. We use Theorem 1 to predict the practical probability of retaining pixels only from non-adversarial regions (upper bound probability p_u) and only from adversarial regions (lower bound probability p_l). The results for the four patches (20 × 20, 40 × 40, 81 × 81 and 163 × 163) are summarized in Table 3. We find that p_u decreases as the patch size increases. Specifically, p_u drops sharply (from 0.6611 to 0.1827) when the patch size increases from 81 × 81 to 163 × 163. Meanwhile, although the lower bound probability p_l increases dramatically (from 1.7×10^−147 to 3.9×10^−65) as the patch size increases, p_l is still close to zero; this indicates that the worst case (described in the proof of Theorem 1) is unlikely to happen, which guarantees the stability of the proposed randomized ablation method.
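Under the counting argument of the proof of Theorem 1, both probabilities are ratios of binomial coefficients, so entries like those in Table 3 can be reproduced with a few lines of code; the image resolution and retention constant below are placeholder values, not necessarily those used in the paper:

```python
from math import comb

def ablation_bound_probabilities(num_pixels, patch_pixels, k):
    """p_u: all k retained pixels avoid the patch; p_l: all k fall inside the patch."""
    p_u = comb(num_pixels - patch_pixels, k) / comb(num_pixels, k)
    p_l = comb(patch_pixels, k) / comb(num_pixels, k)
    return p_u, p_l

# Example with placeholder values: a 768 x 1024 image, a 40 x 40 patch, k = 45 retained pixels.
p_u, p_l = ablation_bound_probabilities(768 * 1024, 40 * 40, 45)
```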
In this article, we introduce an adversarial patch attack framework named APAM, which poses a severe threat to crowd counting models. We further propose a general defense method to certify the robustness of crowd counting models via randomized ablation. We theoretically and experimentally demonstrate the effectiveness of the proposed method.

Physical Evaluations. Despite the effectiveness of the framework, we now present some limitations. We mainly design our patch in the digital space; therefore, we find our physical evaluations somewhat weak, since elaborately designing a physical patch for attack is another matter (involving light conditions, patch angles, human impact and so on). Nevertheless, the experimental results are quite encouraging and motivate further research on the physical evaluation of the adversarial robustness of regression models.

Experiments. We evaluate the attack and defense framework mainly on the ShanghaiTech dataset [59] (indeed, our method is general to other popular datasets such as UCF-CC-50 [18] and UCF-QNRF [19]). The reasons that we only evaluate on one dataset are as follows: 1) the ShanghaiTech dataset is one of the most representative and challenging datasets in crowd counting [59], as detailed in Section 6.1; 2) for the real-life adversary, successfully attacking victim models with various structures is definitely more significant. Since we lack an attack baseline for comparison, we have added random patch experiments in the Supplementary Materials for better understanding.

References
Threat of adversarial attacks on deep learning in computer vision: A survey
Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks
Towards evaluating the robustness of neural networks
DeepDriving: Learning affordance for direct perception in autonomous driving
Cumulative attribute space for age and crowd density estimation
Learning spatial awareness to improve crowd counting
Boosting adversarial attacks with momentum
Optimization and global minimization methods suitable for neural networks
A Dual Approach to Scalable Verification of Deep Networks
Robust physical-world attacks on deep learning visual classification
Interpretation of neural networks is fragile
Explaining and harnessing adversarial examples
On the effectiveness of interval bound propagation for training verifiably robust models
Towards deep neural network architectures robust to adversarial examples
Extremely overlapping vehicle counting
Learning with a strong adversary
Multi-source multi-scale counting in extremely dense crowd images
Composition loss for counting, density map estimation and localization in dense crowds
Perceptual losses for real-time style transfer and super-resolution
Towards proving the adversarial robustness of deep neural networks
On physical adversarial patches for object detection
Robustness Certificates for Sparse Adversarial Attacks by Randomized Ablation
CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes
Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization
Detecting adversarial attacks on neural network policies with visual foresight
Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization
Crowd Counting With Deep Structured Scale Integration Network
Context-aware crowd counting
Using Depth for Pixel-Wise Detection of Adversarial Attacks in Crowd Counting
DPatch: An adversarial patch attack on object detectors
Crowd counting and profiling: Methodology and evaluation. In Modeling, Simulation and Visual Analysis of Crowds
SafetyNet: Detecting and rejecting adversarial examples robustly
No need to worry about adversarial examples in object detection in autonomous vehicles
Bayesian loss for crowd count estimation with point supervision
Towards Deep Learning Models Resistant to Adversarial Attacks
Generating images from captions with attention
MagNet: A two-pronged defense against adversarial examples
Differentiable abstract interpretation for provably robust neural networks
DeepFool: A simple and accurate method to fool deep neural networks
The limitations of deep learning in adversarial settings
Distillation as a defense to adversarial perturbations against deep neural networks
On adversarial patches: Real-world attack on ArcFace-100 face recognition system
SemanticAdv: Generating adversarial examples via attribute-conditional image editing
Adversarial Patches Exploiting Contextual Reasoning in Object Detection
CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting
A survey of recent advances in CNN-based single image crowd counting and density estimation
Intriguing properties of neural networks
Robustness May Be at Odds with Accuracy
Deep belief networks for spam filtering
Fingerprint classification based on depth neural network
Perspective-Guided Convolution Networks for Crowd Counting
Design and Interpretation of Universal Adversarial Patches in Face Detection
Adversarial examples: Attacks and defenses for deep learning
Relational Attention Network for Crowd Counting
Attentional Neural Fields for Crowd Counting
Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs
Single-image crowd counting via multi-column convolutional neural network
Crowd counting in public video surveillance by label distribution learning
DA-Net: Learning the fine-grained density distribution with deformation aggregation network

This work is supported by the National Natural Science Foundation of China (NSFC) under grant no. 61972448. (Corresponding author: Pan Zhou.) We thank the anonymous reviewers for their constructive feedback. We thank Xiaodong Wu, Hong Wu and Haiyang Jiang for their help with the physical experiments. We thank Shengqi Chen, Chengmurong Ding, Kexin Zhang and Chencong Ren for their valuable discussions on this work.