key: cord-0919600-zmq4e0k1 authors: Wu, Chongke; Shao, Sicong; Tunc, Cihan; Satam, Pratik; Hariri, Salim title: An explainable and efficient deep learning framework for video anomaly detection date: 2021-11-23 journal: Cluster Comput DOI: 10.1007/s10586-021-03439-5 sha: e32bb9503fa9debda46dd4a464749ede7477a4d2 doc_id: 919600 cord_uid: zmq4e0k1

Deep learning-based video anomaly detection methods have drawn significant attention in the past few years due to their superior performance. However, almost all the leading methods for video anomaly detection rely on large-scale training datasets with long training times. As a result, many real-world video analysis tasks still cannot be deployed quickly. On the other hand, the leading methods cannot provide interpretability because the uninterpretable feature representations hide the decision-making process when anomaly detection models are treated as a black box. However, interpretability is crucial for anomaly detection since the appropriate response to an anomaly in the video is determined by its severity and nature. To tackle these problems, this paper proposes an efficient deep learning framework for video anomaly detection that also provides explanations. The proposed framework uses pre-trained deep models to extract high-level concept and context features for training a denoising autoencoder (DAE), requiring little training time (i.e., within 10 s on the UCSD Pedestrian datasets) while achieving detection performance comparable to the leading methods. Furthermore, this framework presents the first use of combining an autoencoder and SHapley Additive exPlanations (SHAP) for model interpretability in video anomaly detection. The framework can explain each anomaly detection result in surveillance videos. In the experiments, we evaluate the proposed framework's effectiveness and efficiency while also explaining the anomalies behind the autoencoder's predictions. On the UCSD Pedestrian datasets, the DAE achieved 85.9% AUC with a training time of 5 s on UCSD Ped1 and 92.4% AUC with a training time of 2.9 s on UCSD Ped2.

Security cameras are becoming widely used and are powered by networking technologies, improved surveillance capabilities, and advancements in storage systems. It has been observed that the installation of surveillance cameras significantly reduces the crime rate. For example, total crime in downtown Baltimore (Maryland, USA) fell by about a quarter within four months of installing surveillance cameras. Similarly, violent crime declined by about 20% in Chicago (USA) [1]. Besides the public security field, surveillance cameras are also applied to business operations, health care, smart home applications, etc. The industry research firm IHS Markit reported that there were approximately 770 million security cameras worldwide in 2019 and that the total number of security cameras would increase to 1 billion before 2022 [2]. However, storing and manually evaluating the large amount of data from many surveillance cameras is no longer practical, which has motivated discussions and studies of anomaly detection in surveillance. Moreover, detection using opaque models, such as deep learning models, lacks explanations of how the model decides the results. Therefore, the interpretability of anomaly detection in video has become a major challenge in surveillance systems. In this paper, an anomaly is defined as an abnormal behavior or event in surveillance videos.
The objective of explainable video anomaly detection is to autonomously detect an anomalous event in a video recording with supportive explanations (i.e., not only reporting whether an anomaly occurs but also explaining why it is considered an anomaly; current studies mainly focus on detection without sufficient explanation). Powered by the huge performance improvement of deep learning methods, many autoencoder-based video anomaly detection approaches have been studied in the last few years. For example, Appearance and Motion DeepNet (AMDN) [3] trains deep convolutional neural networks to process the raw RGB (Red, Green, Blue) image and the optical flow map. Two-Stream Variational AutoEncoder (VAE) [4] improves detection accuracy by adapting the VAE. STAN (Spatio-Temporal Adversarial Network) [5] detects anomalies by using a Generative Adversarial Network (GAN). For explaining anomaly detection results, many current methods rely on highlighting the suspicious region without further description. For instance, the University of California San Diego (UCSD) Ped1 and Ped2 datasets provide pixel-level anomaly localization as ground truth. Some methods [3, 4, 6-8] show not only the frame-level anomaly detection but also the pixel-level anomaly detection result. However, current approaches face a number of challenges. First, many deep-learning-based approaches need to train models on large-scale datasets [9] and require large model complexity [10]. For example, STAN has 17 convolutional layers [5]. However, these approaches pay little attention to reducing model complexity. Hence, these methods may lead to high overhead and long training times, and therefore impede development and slow deployment. The high model complexity also requires careful parameter tuning [11, 12]. Second, many real-world video anomaly detection tasks still suffer from insufficient training data (i.e., anomaly detection requires enough training data to represent regular patterns). As a result, it is hard for complex models to reach the performance claimed on benchmark datasets when they are applied to many real tasks. Third, deep learning models are mostly treated as a ''black box'' whose decision-making process is hard to interpret. In video anomaly detection, this problem manifests as insufficient explanation of the anomaly detection results. Pixel-level anomaly detection can be used to interpret the anomalies. Yet, the detection performance is much lower when using it to explain the anomaly detection result than at the frame level. For example, AMDN [3] achieves a frame-level AUC (area under the receiver operating characteristic curve) of 92.1% on the Ped1 dataset, but only a pixel-level AUC of 67.2%; Lu et al. [8] achieve a frame-level AUC of 91.8% and a pixel-level AUC of 63.8% on the Ped1 dataset. Also, pixel-level anomaly detection usually imposes a higher computational workload on the hardware because of the patch-based testing scheme. In Two-Stream VAE [4], the running time of pixel-level anomaly detection is 50-100 times slower than the frame-level. Furthermore, anomaly localization only reflects the abnormal spatial relationships within the same image. The localization is less explainable when presenting temporal anomalies such as a sequence of unusual activities, combinations of objects, and crowd activities, where contextual features with self-explaining descriptive features could be more explainable.
Therefore, it is imperative to provide a deep learning framework that has a lightweight model, works with a small training dataset, and explains the anomaly detection results. Inspired by recent studies showing that SHapley Additive exPlanations (SHAP) are capable of interpreting model predictions [13], we propose a novel deep learning framework that uses high-level features from existing pre-trained CNN models to train the anomaly detection model and combines SHAP and an autoencoder to explain the anomaly alerts. This leads to a significant complexity reduction in our anomaly detection model without losing model interpretability. Further, we integrate contextual features in our video analysis by exploring inter-object relationships, further improving detection accuracy and performance. In video analysis, context is used to define the semantics (meaning) of the observed motion and interactions between humans and objects [14]. Hence, we combine the features derived from pre-trained Convolutional Neural Networks (CNNs) (such as object position category in background segmentation, multi-object tracking, and object classification) to obtain the context information. To our knowledge, this is the first work using SHAP of autoencoders to explain video anomaly alerts. Our contextual mining provides high-level features as the autoencoder input for SHAP interpretation. The integration of SHAP into video anomaly detection provides a more transparent and interpretable decision-making process.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the related research on exploring contextual information in video anomaly detection and interpreting video anomaly detection results. In Sect. 3, we describe our anomaly surveillance system architecture, the anomaly detection model, as well as the video anomaly explanation approach. In Sect. 4, we present the experimental results of our video anomaly analysis. Finally, we conclude this paper in Sect. 5.

In this section, we first introduce the current research and applications of video anomaly detection. Then, we discuss the interpretability of video anomaly detection and the approaches for deep learning model interpretation. Finally, we show the related fundamental work on generating meaningful contextual features. Traditional video anomaly detection methods proposed non-deep-learning models using low-level features, such as a probability model with dynamic textures [6] or optical flow [15], a Social Force model with grid particles on the image [16], and a Gaussian Mixture model with a compact feature set [17]. Here, optical flow is the motion of objects between consecutive frames, and the grid particle is the anchor point for tracking the motion. These features are hard to explain since they do not contain descriptive information about anomalous events. In recent years, deep learning-based approaches have gained popularity due to their excellent model accuracy. The deep learning methods introduce CNNs for feature extraction and autoencoders for anomaly detection [3, 4, 18]. Based on the CNN and autoencoder, applying Generative Adversarial Networks (GAN) achieves state-of-the-art performance with 97.4% AUC on the UCSD Ped1 dataset [5, 19], although GANs are notoriously computationally intensive. These deep learning methods focus more on detection accuracy but suffer from insufficient explanation of the model decision due to the ''black-box'' nature of the deep learning network.
They only provide the suspicious region of the anomaly but miss further explanation. In real-world tasks, such as city surveillance, companies like Hikvision embed anomaly detection capabilities in their video surveillance products, providing capabilities to detect abnormal behavior like sudden running or wandering [20]. Their solution also includes face recognition for blacklist alarms (e.g., triggering an alert when detecting a fugitive's face). However, they use simple anomaly detection logic and cannot handle complex scenarios: if an event is abnormal but has not been listed on the blacklist, then this event will never be alerted. The interpretation of the detected event is decided only by the user-defined blacklist [20]. In video anomaly explanation, most deep-learning methods explain anomalies by displaying error maps (i.e., the distance map between the reconstructed input and the original input). Zhao et al. [21] display the reconstruction error map while highlighting the anomaly regions with rectangles. Nguyen et al. [22] show the error map of optical flow, where the anomalous regions have a deeper color. Xu et al. [3] use both the image and the optical flow map as the input, then compute the error map by pixel-level fusion. Instead of using the error map, some other methods use explainable features to explain video anomalies. Mahadevan et al. [6] locate the anomaly regions with the discriminant saliency criteria [23] and provide the spatial abnormality map by computing the saliency at each location. Zhu et al. [14] propose a structural model to learn the patterns of the interrelationships between activity classes. A scene graph consists of object nodes and the relationships between nodes, and has better interpretability and reasoning capability. Chen et al. [24] propose an interpretable video anomaly detection approach by using scene graphs as input. The anomaly detection approach in [24] is also more transparent than deep learning models since it consists of a multiclass SVM and a multinomial Naïve Bayes classifier. Those methods provide interpretability, but their performance is relatively low since there is no methodology to adopt the deep learning method while maintaining interpretability. Our approach achieves performance comparable to state-of-the-art methods while keeping interpretability by integrating SHAP. SHAP is a unified explanation method to interpret model predictions, and it has been widely used for model-agnostic prediction interpretation, especially for deep learning model interpretation. Bulathwela et al. use SHAP to explain the model prediction of video lecture engagement [25]. Zhou et al. explain the model prediction of factors affecting injury severity by using SHAP [26]. Kristjanpoller et al. interpret the model prediction for evaluating the quarantine policy for COVID-19 by using SHAP plots [27]. The method proposed by Antwarg et al. [28] is the first work to use SHAP to explain autoencoders for anomaly detection. The method attributes the anomaly detection to the SHAP values of the input features. It verifies the effectiveness of the method with four datasets and expert evaluation. SHAP for autoencoders proposed by Antwarg et al. [28] provides us a perspective to convert existing deep video anomaly detection algorithms to improve interpretability. Since SHAP for autoencoders explains the model output by spotlighting the important input features, understandable features should be considered first.
The input features can be classified into low-level features and high-level features [14] by the content of semantic information. For example, the RGB value and optical flow are low-level features since the user cannot get meaningful information from those values; as a comparison, the object label and annotation are high-level features. The high-level features provide semantically meaningful activities, though they could have a higher error rate in classification tasks. With the development of convolutional neural network (CNN)-based computer vision applications, image classification, object detection, and image tracking have achieved better performance compared to traditional methods like the post-processing method proposed by Gao et al. [29, 30]. This fact inspires many researchers to use CNNs to extract features [22, 31]. Using high-level features for anomaly detection can reduce model complexity and improve anomaly alert interpretability [24]. Contextual features are semantically meaningful features that can be mined from other high-level features. They capture relationships among basic events, such as the semantic relationships between actions, activities, human poses, social roles, etc. Wang and Ji propose event recognition methods using contextual features [32]. Zhang et al. use semantic context information, such as motion patterns and paths, to improve abnormal event detection in traffic scenes, where an abnormal event is defined as vehicles breaking the traffic rules, by considering the trajectories [33]. Pasini et al. present a semantic anomaly detection method that detects anomalies and provides an interpretable explanation [34]. They construct the semantic vector from the textual labels obtained from pre-trained image labeling software. To reduce the training workload and improve model performance, many deep learning approaches integrate pre-trained models. Computer vision tasks with meaningful output (object detection, object tracking, background segmentation, etc.) widely use pre-trained models. For object detection, He et al. introduced ResNet [35] in 2015, and the model was extremely successful, winning first place in several object detection tasks including ILSVRC 2015 (with a 3.57% top-5 error rate) and COCO 2015 (with 48.4% mean average precision). It has been widely used as a pre-trained model and can be found in the machine learning platforms TensorFlow and PyTorch. For background segmentation, Kirillov et al. propose Panoptic Feature Pyramid Networks (PFPN) [36] to solve the panoptic segmentation task (unifying instance segmentation and semantic segmentation). This model and its variants show great segmentation performance and have been used as pre-trained models in tomography diagnosis [37], real-time object detection [38], person detection [39], etc. Although ResNet and PFPN are popular as pre-trained models, to the best of our knowledge, our work is the first video anomaly detection approach that uses them directly as pre-trained models without a further fine-tuning process. Most of the existing deep learning studies for video anomaly detection require a large volume of normal video stream training data, resulting in high model complexity and a long training time [11]. Also, the explanation process for detection is difficult since there are no semantic features that can be easily interpreted.
Most of them only provide the abnormal event localization, which cannot reflect temporal causality or unusual human-object relationships [22]. To address these limitations, we propose an explainable and efficient deep learning framework for video anomaly detection. This framework uses pre-trained models with meaningful outputs for visualization and interpretability and captures the required features related to abnormal events. Our proposed architecture is shown in Fig. 1, where our system is divided into three layers: the hardware, processing, and application layers. We consider the hardware layer as a set of distributed cameras and related drivers, which transfer raw video streams into the system. The camera selection, position, and orientation decide the overall monitoring area and provide the associated coordinates of the region of interest. For example, if the user needs to monitor car plates, then a high-resolution camera will be selected. However, these problems of camera orientation and focus areas are not the main focus of this work. The camera operation-related tasks like camera hand-off and data fusion are handled in the processing layer. In the processing layer, the raw video data are preprocessed into an appropriate representation based on the deep learning model selection, as further required by the surveillance task. The surveillance task in the processing layer may vary depending on the application layer's explanation requirements and the task definition. For example, the main focus in a supermarket is preventing shoplifting, whereas, in a train station, we may use multi-object tracking to provide crowd statistics. Then, the outputs of the selected surveillance tasks are combined and sent to the application layer for anomaly detection and explanation. The application layer provides the user interface of the surveillance system, which includes functionalities like video visualization and camera control. The user can provide more information about anomalous behavior based on defined rules. For instance, in the traffic system, there must be rules that govern the movement of vehicles; for example, when the traffic light is red, the car should stop. The rules can be implemented as relationships between traffic light colors and vehicles in the object classification task [40]. The high-level features are fed into the anomaly detection module to generate alerts to the user whenever an abnormal event is detected. Furthermore, the sensor tasking module in the application layer receives commands from the user to control the behavior of the cameras, like turning and zooming in/out to receive more detailed information on the region of interest. To illustrate the deep learning framework for video anomaly detection shown in Fig. 1, we select the outdoor surveillance task. Compared to the indoor surveillance task [41], the outdoor surveillance task is more complex, with more objects to be analyzed, a larger region to monitor, and more variations in the background (parking lot, avenue, playground, etc.). To address the outdoor surveillance task with the proposed deep learning framework, we present a novel explainable video anomaly detection method (summarized in Fig. 2). This method processes the contextual features (such as the relationship between human locations and background categories) directly from the pre-trained model outputs. For our case of crowd surveillance, we choose pre-trained models for background segmentation, object tracking, and object classification.
By learning features from the pre-trained model outputs, we focus our research effort only on developing the anomaly detection method that studies the individual frames, reducing the complexity of the anomaly detection model. We also propose an algorithm to explain the video anomaly detection results of abnormal behaviors and events by integrating SHAP for the autoencoder [28]. Two problems need to be addressed: (1) the proper representation of anomalies in videos; for example, a keyframe containing the most anomalous activity information can be selected as the anomaly representation; and (2) the explanation of the decision-making of the anomaly detection. To address these problems, our anomaly explanation utilizes video summarization and generates interpretations of the abnormal keyframes with SHAP for the autoencoder. We denote a video as H = {h_1, h_2, ..., h_n}, where h_i represents the i-th video frame and i = 1, 2, ..., n. The problem of frame-level video anomaly detection can then be defined as

Score_i = F(h_i),

where F denotes the prediction function and Score_i represents the prediction score of the video frame. (Fig. 2 shows the architecture of the proposed explainable anomaly detection framework: in the upper part, the CNN pre-trained models generate the high-level features from the video stream and feed them into the denoising autoencoder with temporal denoising; in the lower part, the video anomaly explanation consists of keyframe selection, SHAP for the autoencoder, and the sorted explainable features.) Conventional deep learning-based video anomaly detection methods obtain F by learning an end-to-end deep model [11]. More specifically, the models use the input video frame to directly obtain Score_i. However, deep model training in an end-to-end fashion lacks interpretability and needs a long training time. To attain accurate, explainable, and efficient results, we extract high-level features from pre-trained models when designing F. Generally, video analysis tasks need to perform image segmentation, object identification, and tracking. Besides, context mining is often used for video analysis. With this kind of data, we can predict and interpret a video frame comprehensively. Hence, we design a function F that first uses pre-trained CNN models to obtain the high-level concept and context features based on background segmentation, object classification, multi-object tracking, and semantic context information, and then uses a DAE with a temporal denoising process to score the video frame based on these features. The features are also used by the DAE to explain anomalies through SHAP and the video summary. According to the above considerations, an explainable and efficient deep learning framework is proposed. The architecture of the proposed framework for video anomaly detection is shown in Fig. 2. There are many possible causes of abnormal events, such as abnormal object appearance, abnormal motion, and abnormal object location. We use pre-trained models such as background segmentation, object classification, and multi-object tracking to extract the anomalies in a video. To build the background segmentation feature, we consider the Panoptic Feature Pyramid Network (PFPN) [36]. As discussed in Sect. 2, PFPN provides instance segmentation and background segmentation and has proved to be a stable solution, being widely used as a pre-trained model in many other fields [37-39]. We run this CNN-based model on the Detectron2 platform (the Facebook AI Research software system) [42].
It provides state-of-the-art detection and segmentation algorithms and a large set of baseline results and pre-trained models. PFPN solves the unified task of instance segmentation and semantic segmentation (for stuff classes: amorphous background regions, e.g., rivers, walls). The model is pre-trained on the COCO train2017 dataset and validated on COCO val2017 [43]. The COCO dataset is a large-scale object detection dataset providing over 330,000 images and 1.5 million object instances. The large volume of training data provides better accuracy and generalization for the pre-trained model. This model has an inference speed of 0.067 s per image and a mask average precision (AP) of 38.5 on COCO val2017 with a V100 GPU. The speed allows us to have near-real-time (up to 15 FPS) visualization of the background segmentation results. We only select semantic segmentation for background segmentation. The output can be written as

F_bg(T) = L ∈ {S_1, S_2, ..., S_{C_b}}^{M_b × N_b},

where, for the input image at time T, the PFPN model F_bg outputs a label matrix L over C_b classified background labels with height M_b and width N_b. Here we note that this model can be trained on different datasets to improve the segmentation result. For the video anomaly detection task, the background segmentation only updates its results when the visual content changes (e.g., changes in ambient light, turning the camera direction, or switching the camera). We do not directly feed the matrix output of background segmentation into the anomaly detection model. Instead, we perform a contextual feature extraction method to process the output and then convert it into scalar outputs. Considering that most outdoor activities involve pedestrian movement, we use the Joint Detection and Embedding (JDE) model [44] to get the pedestrian detection and tracking feature. JDE is a variant of the real-time object detector YOLOv3 (you only look once, version 3) [45] for real-time multi-object tracking. The JDE model is pre-trained on the MOT-16 training set. The model inference speed is around 38 FPS with an input frame size of 576 × 320 pixels on an Nvidia Titan Xp GPU. The output is the person tracking results, which can be written as

F_ot(T) = {(p_ĉ, s_ĉ, v_ĉ) : ĉ = 1, ..., n_ot},

where p_ĉ, s_ĉ, and v_ĉ represent the box coordinates, size (width and height), and velocity of the person with ID ĉ, and n_ot is the number of tracked persons. Given an image at time T as the input of the multiple-object tracking model F_ot, we obtain the above outputs for each person. The tracking feature can provide statistical information for each person (trajectories and average speed). We use these features for the crowd activity analysis in the context mining module. For the appearance feature, we consider the ResNet-101 (R101) model [35] implemented on the Detectron2 platform. As mentioned in Sect. 2, ResNet is one of the most successful object detection architectures and has been integrated into many official machine learning platforms, such as TensorFlow and PyTorch. We choose ResNet as the object detection backbone of our deep learning framework considering its extensive usage and outstanding performance. It has been pre-trained on the COCO train2017 dataset. The output includes 80 object categories. The R101 model is a CNN-based model that is 101 layers deep. The pre-trained model has an inference speed of 0.051 s per image and a box AP of 42.0 on COCO val2017 with a V100 GPU. The output of the R101 model is written as

F_od(T) ∈ R^{K_od},

a vector with a length equal to the number of output categories K_od. When given the frame input at time T, the R101 model F_od(T) produces the category outputs as a vector. We directly use this vector as an input for the anomaly detection model. We note here that the object classification model is crucial to the performance of video anomaly detection since many abnormal frames are accompanied by the appearance of unseen objects. We choose the COCO dataset to make it the baseline for the context mining comparison.
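To make this feature-extraction step concrete, the following sketch shows one way the pre-trained Detectron2 models described above could be loaded from the model zoo and run on a single frame, without any fine-tuning. It is a minimal illustration rather than the authors' exact pipeline; the model-zoo config names, the score threshold, and the aggregation of detections into a per-frame category-count vector are our own assumptions.

```python
import numpy as np
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

def build_predictor(config_path: str, score_thresh: float = 0.5) -> DefaultPredictor:
    """Load a pre-trained Detectron2 model from the model zoo (no fine-tuning)."""
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_path))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_path)
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_thresh
    return DefaultPredictor(cfg)

# Assumed model-zoo configs: PFPN for background segmentation, R101 for object classes.
seg_predictor = build_predictor("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml")
det_predictor = build_predictor("COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml")

frame = cv2.imread("frame_T.jpg")  # one video frame at time T (hypothetical file name)

# F_bg(T): per-pixel panoptic labels; the "stuff" segments give the background classes.
panoptic_seg, segments_info = seg_predictor(frame)["panoptic_seg"]

# F_od(T): a K_od-dimensional vector of per-category object counts in the frame.
instances = det_predictor(frame)["instances"].to("cpu")
K_od = 80  # COCO object categories
f_od = np.zeros(K_od, dtype=np.float32)
for cls in instances.pred_classes.numpy():
    f_od[cls] += 1.0
```

In practice, the three pre-trained models can run independently (and, as noted later, in parallel on separate GPUs), with their outputs joined per frame before context mining.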
Even though pre-trained models provide useful features, we still need inter-relationship information between objects. Hence, we process contextual features to improve anomaly detection performance. For that, we classify the extracted contexts as spatial context, temporal context, and group context. The contextual features can reflect prior knowledge from the user who evaluates the pre-trained models' visualization results. If the visualization shows that a pre-trained model result is wrong, then the related erroneous context should be adjusted or removed. For example, the user can add a weapon appearance to a blacklist to trigger an alert when a weapon shows up in the video frame. By allowing users to add self-defined contextual features, the search space for anomalous events can be significantly reduced. Features that capture the relative spatial relationships among persons or objects of interest are defined as the spatial context. We denote the process of mining the spatial relationship between different pre-trained model results as R_spatial. The spatial relationship includes the intra-spatial relationship and the inter-spatial relationship. The intra-spatial relationship represents the inclusion result, over the region classes S_1, S_2, ..., S_{C_b}, of the regional classification L with height M_b and width N_b and the n_ot object detection/tracking results with coordinates p_i, i = 1, ..., n_ot. The inter-spatial relationship consists of the adjacent object combinations. One type of spatial anomaly is a certain type of object that is not allowed to appear in a certain type of region. For instance, ''trucks are not allowed to drive on the sidewalk''. In our case, we use the following formula to represent the spatial relationship between object tracking and background segmentation:

R_spatial(F_bg, F_ot) ∈ R^{C_b × C_t},

where the output represents the regional relationship between the C_b types of regions and the C_t types of tracked objects. R_spatial(F_bg, F_ot) denotes considering the intra-spatial relationship R_spatial between the models F_bg and F_ot. Some approaches learn trajectories in the training data to determine feasible areas, which means that regions without moving objects are treated as prohibited regions. For example, Zhao et al. predict car trajectories and label the moving cars on the Traffic dataset [21]. This kind of mapping has two major shortcomings. Firstly, it needs to collect enough trajectories in the training data to cover the feasible region, which is hard, especially when the monitoring area is large. Secondly, the location mapping degenerates when the camera position or orientation is adjusted. By using the spatial relationship between tracked objects and the background type, the above shortcomings are overcome since we do not consider the absolute coordinates but the categorized relationship.
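As an illustration of the spatial context R_spatial(F_bg, F_ot), the sketch below looks up the background class under each tracked box's footprint and accumulates the results into a C_b × C_t count matrix. This is a minimal reading of the formula above under our own assumptions about the data layout (a label matrix for the segmentation and a list of (class, box) pairs for tracking); it is not the authors' exact implementation.

```python
import numpy as np

def spatial_context(label_map: np.ndarray,
                    tracked: list,
                    num_regions: int, num_obj_types: int) -> np.ndarray:
    """R_spatial(F_bg, F_ot): co-occurrence of background classes and tracked-object types.

    label_map: M_b x N_b array of background class ids (0 .. C_b-1).
    tracked:   list of (object_type_id, (x1, y1, x2, y2)) detections for one frame.
    Returns a C_b x C_t count matrix; entry (r, t) counts how many objects of type t
    currently sit on a region of class r.
    """
    counts = np.zeros((num_regions, num_obj_types), dtype=np.float32)
    h, w = label_map.shape
    for obj_type, (x1, y1, x2, y2) in tracked:
        # Use the bottom-center of the box as the object's "footprint" on the ground.
        col = int(np.clip((x1 + x2) / 2.0, 0, w - 1))
        row = int(np.clip(y2, 0, h - 1))
        region = int(label_map[row, col])
        counts[region, obj_type] += 1.0
    return counts

# Example: one pedestrian (type 0) standing on a region labeled class 3 (e.g., grass).
labels = np.full((158, 238), 3, dtype=np.int64)
feat = spatial_context(labels, [(0, (100.0, 40.0, 120.0, 90.0))],
                       num_regions=54, num_obj_types=2)
```

A nonzero count in a (region, object) cell that never occurs in the training data (e.g., vehicle on sidewalk) is exactly the kind of categorized relationship the DAE can flag, independently of absolute coordinates.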
Features that capture the relative temporal relationships among the temporal attributes of persons or objects of interest are defined as the temporal context. We denote the process of mining the temporal relationship among the pre-trained model results with timestamps as R_temporal. The temporal context is widely used in activity recognition tasks since the current action can imply the next action. For example, ''getting off the car'' is likely to be followed by a ''closing the door'' behavior. In our case, we consider the speed history of each person and then update the Overspeed sign:

S_Temp = R_temporal(F_ot),

where S_Temp is the frame-level Overspeed sign in the time range T, and R_temporal(F_ot) denotes the relative relationship R_temporal among the results of the object tracking output F_ot. This feature smooths the speed measurement of the object tracking output. In frame-level anomaly detection, the object speed in each frame is not a reliable feature since many movement speeds are periodic (walking, running, riding a bicycle with changing direction, etc.). In this case, we consider the maximal average speed for each person and find the corresponding appearance in each frame. Finally, we consider mining the group context R_group(F_ot) (frame-level crowd activity statistics) from the object tracking features. It includes the min, max, and median values of the coordinates and speeds. We also use the sum of residuals in the least-squares solution of coordinates and speeds to measure the crowd sparsity. When all persons move in the same direction, the sum of residuals equals zero since the moving directions fall on a line (each residual is zero).
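The temporal and group contexts above can be sketched as simple statistics over the per-frame tracking output. The snippet below is our own reading of that description: a per-person overspeed flag based on the maximal windowed average speed, plus frame-level crowd statistics with a least-squares line-fit residual over the tracked coordinates as a rough measure of whether the crowd lies along one line. The window length and speed threshold are placeholders, not values from the paper.

```python
import numpy as np

def overspeed_sign(speed_history: dict, window: int = 10, threshold: float = 5.0) -> dict:
    """R_temporal(F_ot): flag a person when the maximal windowed average speed exceeds a threshold."""
    flags = {}
    for pid, speeds in speed_history.items():          # {person_id: [speed per frame, ...]}
        s = np.asarray(speeds, dtype=np.float32)
        if len(s) < window:
            avg = float(s.mean()) if len(s) else 0.0
        else:
            kernel = np.ones(window) / window
            avg = float(np.convolve(s, kernel, mode="valid").max())
        flags[pid] = avg > threshold
    return flags

def group_context(positions: np.ndarray, speeds: np.ndarray) -> np.ndarray:
    """R_group(F_ot): frame-level crowd statistics from tracked positions (n x 2) and speeds (n,)."""
    x, y = positions[:, 0], positions[:, 1]
    # Sum of squared residuals of a least-squares line fit; zero when all points fall on one line.
    _, residuals, _, _ = np.linalg.lstsq(
        np.stack([x, np.ones_like(x)], axis=1), y, rcond=None)
    res = float(residuals[0]) if residuals.size else 0.0
    return np.array([x.min(), x.max(), np.median(x),
                     y.min(), y.max(), np.median(y),
                     speeds.min(), speeds.max(), np.median(speeds), res],
                    dtype=np.float32)
```

These scalar statistics are concatenated with the object-category vector and the spatial-context counts to form the high-level feature vector fed to the DAE.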
For anomaly detection, we mainly focus on the behavior analysis of pedestrians by applying a denoising autoencoder (DAE), which is a variant of the basic autoencoder (AE) [46]. A DAE is trained to reconstruct a clean input x from a corrupted input x̂, where x̂ = x + s · t, s is the noise factor, and t is the noise data distribution. For a basic AE with one hidden layer, the forward propagation is

h = Q(W^(1) x̂ + b^(1)),
y = Q(W^(2) h + b^(2)),

where h is the vector of hidden layer unit activities, y is the reconstructed feature vector in the output layer, Q is an activation function, W^(1) is the weight matrix between the input layer and the hidden layer, W^(2) is the weight matrix between the hidden layer and the output layer, and b^(1) and b^(2) are the offset vectors. A basic DAE is learned by minimizing the loss function L(x, y). A deep DAE can be built by using multiple hidden layers, which can learn complicated distributions from the given samples due to its multiple feature representation spaces [47]. The backpropagation algorithm [48] is used to train the DAE. Our DAE uses the sigmoid activation function for each hidden layer and the identity function for the output layer. One important aspect of our version of the DAE is that we use batch normalization (BN), which enables performance improvement and more stable training of the DAE [49]. BN uses the mean and variance of batches of training data to perform the normalization. For a single unit in the DAE, its output is given by

u = g(w' x' + b'),

where w' is the learned weight, b' is the learned bias, and x' is the input. After applying BN, its output is given by

u = g(γ · (w' x' − μ) / σ + β),

where x' is a batch of training data from which the mean μ and the standard deviation σ are computed. In the test phase, the parameters γ and β, learned together with the original model parameters, are used to represent the ranges of the inputs to g. Our DAE architecture is shown in Fig. 3. The number of units in the input layer is determined by the input feature space. To reconstruct observations, the output layer has the same number of nodes as the input layer. We add three fully connected hidden layers into the DAE to form a deep DAE. The numbers of layer nodes are 50, 30, and 50, respectively (this configuration provided the best results based on our experiments). The code layer (the middle layer with 30 nodes) stores the compressed representation space for the input features. A Gaussian noise matrix is added to the input vector. Our version of the DAE learns the parameters using the Adam gradient-based optimization algorithm [50] with mini-batch training to minimize the mean squared error (MSE) used as the reconstruction error. After completing the training phase with Gaussian denoising, our DAE can detect anomalies. Whether an observation is normal or abnormal is determined by the reconstruction error: during the test phase, an observation is normal if it has a low reconstruction error, while it is abnormal if its reconstruction error is large. The frame-level output anomaly scores are determined by the reconstruction error between the feature input and the autoencoder feature output. In our pre-trained-CNN method, feature values are easily affected by false alerts of the pre-trained models, such as misclassifications in the object detection results. An outlier value in the anomaly score curve is more likely introduced by a false alert in the pre-trained model input. To remedy this problem, we apply a post-hoc temporal denoising scheme to the autoencoder output, based on the assumption that consecutive frames have similar feature distributions. Compared with other smoothing functions (such as triangular filtering and average smoothing), the Savitzky-Golay filter (S-G filter) better preserves the area, position, and width of the peaks. Hence, in our temporal denoising process, the reconstruction errors of a series of frames e_1, e_2, ..., e_n are filtered by the Savitzky-Golay filter [51]:

ẽ_j = (1 / N_norm) · Σ_{i = −(w_s − 1)/2}^{(w_s − 1)/2} a_i · e_{j+i},

where N_norm is the normalizing factor, a_i is the convolution coefficient determined by the polynomial degree, and w_s is the window size. Note that the window size and the polynomial degree of the S-G filter are decided by the pre-trained model accuracy. If the false-alert rate of the pre-trained model is higher, the required smoothness decided by a_i and w_s can be increased accordingly.
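The following sketch shows one way the DAE and the temporal denoising step could be assembled with Keras and SciPy. The 50-30-50 hidden layers, sigmoid activations, batch normalization, Gaussian input noise, Adam optimizer, and MSE loss follow the description above; the noise factor, window size, polynomial degree, and the exact layer ordering are our own placeholder assumptions rather than the authors' released code.

```python
import numpy as np
from scipy.signal import savgol_filter
from tensorflow import keras
from tensorflow.keras import layers

def build_dae(input_dim: int) -> keras.Model:
    """Denoising autoencoder with 50-30-50 hidden layers, BN, and sigmoid activations."""
    inp = layers.Input(shape=(input_dim,))
    x = inp
    for units in (50, 30, 50):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("sigmoid")(x)
    out = layers.Dense(input_dim, activation="linear")(x)  # identity output layer
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

def train_dae(model, x_train: np.ndarray, noise_factor: float = 0.1,
              epochs: int = 25, batch_size: int = 120) -> None:
    """Train by reconstructing the clean features from Gaussian-corrupted inputs."""
    x_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
    model.fit(x_noisy, x_train, epochs=epochs, batch_size=batch_size, shuffle=True)

def anomaly_scores(model, x_test: np.ndarray,
                   window: int = 9, polyorder: int = 2) -> np.ndarray:
    """Per-frame reconstruction error, smoothed with a Savitzky-Golay filter."""
    recon = model.predict(x_test, verbose=0)
    errors = np.mean((x_test - recon) ** 2, axis=1)
    return savgol_filter(errors, window_length=window, polyorder=polyorder)
```

A larger window or lower polynomial order yields stronger smoothing, matching the remark above that noisier pre-trained model outputs call for more aggressive temporal denoising.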
To increase the interpretability of our video anomaly detection method, we propose a video anomaly explanation method using SHAP (see the lower part of Fig. 2). We first introduce the background of our method, including SHAP and the use of SHAP to explain the autoencoder. Then, we demonstrate how our method integrates SHAP and the autoencoder to explain the anomaly detection results. With the rapid growth of deep learning research, model accuracy has been significantly improved. However, there is also an urgent need for more transparent models that explain the model decision-making. In some applications that emphasize interpretability, researchers prefer to use a simple model, like a linear model, for prediction even if its accuracy is lower than that of other, more complex models. For explaining the prediction results of black-box models, many methods have been proposed to interpret the model output, such as DeepLIFT [52] and LIME [53]. To generalize those related methods, Lundberg and Lee propose SHAP (SHapley Additive exPlanations) as a unified approach to interpreting model predictions [13]. An explanation model g_ex can be expressed as

g_ex(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i,

where z' is the simplified (binary) input, M is the number of simplified input features, and φ_i are the feature attribution values. The weighting kernel π_{x'} can be used to approximate the Shapley values [13], and it is given by

π_{x'}(z') = (M − 1) / [ C(M, |z'|) · |z'| · (M − |z'|) ],

where |z'| is the number of non-zero elements of z'. The loss function L for the optimization is defined as

L(f, g_ex, π_{x'}) = Σ_{z' ∈ Z} [ f(h_x(z')) − g_ex(z') ]² · π_{x'}(z'),

where f is the model being explained and h_x maps the simplified inputs back to the original input space. By minimizing the loss function L over the training dataset Z, the approximation of the Shapley values can be calculated as

ξ = argmin_{g_ex} { L(f, g_ex, π_{x'}) + Ω(g_ex) },

where Ω(g_ex) is the penalty term for the complexity of g_ex. Autoencoders have been widely used in anomaly detection tasks [54]. However, there is little research on explaining the results of the autoencoder. Based on the model-agnostic explanation method kernel SHAP, Antwarg et al. propose a method to explain the anomalies detected by an autoencoder [28]. The main procedure of using SHAP to explain the autoencoder is summarized as follows: (1) Given the trained autoencoder and an input instance, the features with the top reconstruction errors are selected as the target output features. (2) For each selected high-error feature, fit the SHAP explainer with the training background set, then use the SHAP explainer to attribute the input features for predicting these high-error features. In this step, the target function for SHAP to approximate is the selected feature element in the autoencoder output. This method inspired us to design and implement an explainable autoencoder for video anomaly detection. We propose a novel method for explaining video anomaly detection. Our method first uses a video summary to find the representative frames in a video with anomalies, then explains the autoencoder output of the keyframes using SHAP. The output is the features sorted by the importance of their contribution to the anomalies in the video, which explains the video anomaly detection decision-making process. The algorithm calculates the explainable video features using the pre-trained models and the trained denoising autoencoder to process the raw video frames (see Algorithm 1). Usually, consecutive frames are similar; thus, in practice, to reduce the complexity, it is widely accepted to remove consecutive frames when they have a minimal amount of difference. Hence, the first step of Algorithm 1 is to uniformly sample the input video at a fixed interval to eliminate the redundant frames. In video summarization, the algorithm can still summarize the major information of the video even if the FPS is lowered to 5 [55]. Next, we get the feature set from the sampled frames using the pre-trained models. Then, the feature set is clustered by a K-means algorithm [56]. The frame nearest to each cluster center is the representative frame. Finally, the frames are filtered with the autoencoder anomaly scores since we focus on explaining the anomalies. We then explain the filtered output keyframes. Once we find the keyframes that we are interested in, we use Algorithm 2 to get the most important features contributing to the anomalies for each frame. Algorithm 2 requires the input frame features, the trained autoencoder, and the background set. The background set is the part of the samples that represents the dataset for training the SHAP model. In our case, it consists of samples of the high-level and contextual features. The reconstruction errors are first calculated by comparing the distance between the autoencoder input and output. Then we get the top-error features by sorting the reconstruction errors. The features with a high error are considered the significant factors in deciding whether the instance is abnormal. For each high-error feature, we treat the autoencoder as a multi-input-single-output function and use SHAP to explain the prediction with respect to the given input instance. Depending on the sign of the reconstruction error, the features related to the high-error feature can be classified into contributing features and offsetting features. We sort the features by two different methods: the mean SHAP value and the mean absolute SHAP value. Intuitively, the sum of the SHAP values should indicate the importance of a certain feature's contribution to the anomaly decision. However, many factors can affect the accuracy of the explainable model (such as the background set selection, the selection of the number of error features, etc.). Some major features have a small mean SHAP value because the contributing and offsetting values from different error features neutralize each other. So we consider the feature importance from the sorted orders of both the mean SHAP value and the mean absolute SHAP value. Finally, the important features of each key abnormal frame are summarized as the explanation of the anomalies in the video. We present the workflow of video anomaly explanation in Fig. 4. In short, our proposed method can be summarized as follows. First, select keyframes by uniformly sampling the raw video and clustering the remaining frames by their high-level features. Then, select the anomalous keyframes by the anomaly score and explain the keyframes with SHAP.
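A compact sketch of this explanation step is given below: it picks the output features with the largest reconstruction errors and runs a kernel SHAP explainer on each of them, treating the autoencoder as a multi-input-single-output function. It follows the procedure of Antwarg et al. [28] as described above, but the code itself (function names, background-set handling, and the top-k and nsamples choices) is our own illustrative assumption, not the authors' implementation.

```python
import numpy as np
import shap

def explain_anomalous_frame(autoencoder, frame_features: np.ndarray,
                            background: np.ndarray, feature_names: list,
                            top_k: int = 3, nsamples: int = 200):
    """Attribute the top-k reconstruction-error features of one frame to the input features."""
    x = frame_features.reshape(1, -1)
    recon = autoencoder.predict(x, verbose=0)[0]
    errors = (x[0] - recon) ** 2
    top_error_idx = np.argsort(errors)[::-1][:top_k]

    shap_per_feature = {}
    for j in top_error_idx:
        # Multi-input-single-output view: the j-th reconstructed feature is the target.
        target = lambda data, j=j: autoencoder.predict(data, verbose=0)[:, j]
        explainer = shap.KernelExplainer(target, background)
        shap_per_feature[feature_names[j]] = explainer.shap_values(x, nsamples=nsamples)[0]

    # Rank input features by mean SHAP value and by mean absolute SHAP value.
    stacked = np.stack(list(shap_per_feature.values()))   # (top_k, n_features)
    mean_rank = np.argsort(stacked.mean(axis=0))[::-1]
    abs_rank = np.argsort(np.abs(stacked).mean(axis=0))[::-1]
    return ([feature_names[i] for i in mean_rank[:6]],
            [feature_names[i] for i in abs_rank[:6]])
```

Because the autoencoder inputs are semantically named (e.g., "bicycle", "speed_std"), the two ranked lists returned here can be read directly as the explanation of why a keyframe was flagged.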
One of the most common outdoor activities is the movement of pedestrians. To evaluate our proposed method for outdoor activity surveillance, we show the anomaly detection results on the UCSD Ped1 and Ped2 datasets [6]. The UCSD datasets provide videos of people on pedestrian walkways at the University of California San Diego. As a popular public video anomaly detection benchmark, they have been widely used for the evaluation of video anomaly detection algorithms. The Ped1 dataset has 34 training videos and 36 testing videos. Each video consists of 200 frames of 238 × 158 pixels at 30 FPS. The Ped2 dataset has 16 training videos and 12 testing videos. The number of frames in the Ped2 videos ranges from 120 to 180, with 360 × 240 pixels. The training videos only include pedestrians. Both Ped1 and Ped2 provide complete frame-level abnormal labels and partial pixel-level abnormal labels. In this experiment, we only consider the frame-level samples since our work mainly considers the contextual features. The abnormal events include unexpected entities (bicycle, skateboard, motorcycle, etc.), irregular trajectories (deviating from the major moving direction), and entering prohibited regions (walking on the grass). We get high-level features from the pre-trained models. The details are demonstrated in Sect. 3. The inference of the pre-trained models is run on the Google Colaboratory [57] server. The DAE is implemented with TensorFlow and Keras. We use the Adam optimizer and the MSE loss function to optimize the model. The number of training epochs for Ped1 and Ped2 was set to 25 and 28, respectively. The batch size was set to 120. In the experiments, we also set the noise factor for the Gaussian input noise. To understand the outputs of context mining, we visualize the results of the embedded computer vision tasks on both datasets. Figures 5 and 6 present examples of the visualization results on both training datasets. For each figure, the images in the first row show the background segmentation results (note that the first row uses the same image in each figure since the background segmentation remains constant when the camera is fixed).
In the implementation, the user is supposed to select frames with clear segmentations, whose segmentation results are not affected by the ambient light. The background segmentation only needs to be updated when the camera position is adjusted. The images in the second row show the multiple object tracking results. The model assigns a unique ID to each pedestrian. By calculating the difference between frames, we can get the movement of each person. In the images of the third row, we present the object classification results used as the baseline features of our video anomaly detection model; their accuracy determines the lower bound of our model performance since most anomalous events come from the occurrence of abnormal objects. When the embedded pre-trained model results are visualized, the user can evaluate the quality of the outputs and decide the principle of formulating contextual features. For instance, if the background segmentation results are unqualified (obvious boundary mismatch or misclassification in the pre-trained model evaluation), we should not consider the relative position context as an anomaly detection feature. In our case, we keep all the pre-trained model outputs to generate the contextual features on the Ped1 dataset, and we discard the background segmentation results on the Ped2 dataset since the visualization shows that most of the background segmentation results are unsatisfactory. Since we removed the background segmentation, the relevant mined spatial contexts are also removed from the features. In Ped1, the dimension of the input features is 100, while in Ped2 it is 81, since we remove the unreliable features by checking the visualization results. We evaluate the performance of our video anomaly detection method by considering the effect of the contextual features and the training data volume. The Receiver Operating Characteristic curve (ROC curve), the Area Under the ROC curve (AUC), and the Equal Error Rate (EER) are the metrics used, since they are widely used metrics for the UCSD Ped1 and UCSD Ped2 datasets [6, 58]. To study the effectiveness of our approach, we compare it with state-of-the-art approaches. The ROC curve results are shown in Figs. 7 and 8. The AUC and EER results are summarized in Table 1. Here we use bold text to highlight the state-of-the-art performance. For the method without contextual features, we only keep the appearance feature (for more information, refer to the approach in [31]). The results show that the contextual features effectively integrate the movement information and the semantic results and improve the performance of the anomaly detection method. Without contextual features, the AUC of our approach is 73.1% and 80.1% on the UCSD Ped1 dataset and the UCSD Ped2 dataset, respectively. Our approach with contextual features achieves an AUC of 85.9% and 92.4% on the UCSD Ped1 dataset and the UCSD Ped2 dataset. As shown in Table 1, our model outperforms the approaches with low model complexity (MDT [6], Adam [6], Social force [16], Compact feature set [17], convex polytope ensemble [59], and RBM [60]) and several approaches with large model complexity added through convolutional layers (ConvAE [11], ConvLSTM-AE [12], Two-Stream R-ConvVAE [4]), and it achieves performance comparable to ST-AE [58] and AMDN [3].
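For reference, the frame-level AUC and EER reported here can be computed from the per-frame anomaly scores and ground-truth labels as sketched below. This is a standard evaluation recipe using scikit-learn's ROC utilities, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(labels: np.ndarray, scores: np.ndarray):
    """Return (AUC, EER) for frame-level anomaly detection.

    labels: 1 for abnormal frames, 0 for normal frames.
    scores: higher means more anomalous (e.g., smoothed reconstruction error).
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # EER is the operating point where the false positive rate equals the false negative rate.
    fnr = 1.0 - tpr
    eer_idx = int(np.argmin(np.abs(fpr - fnr)))
    eer = float((fpr[eer_idx] + fnr[eer_idx]) / 2.0)
    return roc_auc, eer
```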
Our method achieves 92.4% AUC on the Ped2 dataset and 85.9% AUC on the Ped1 dataset. Hence, the DAE with relatively low model complexity can achieve comparable results using the features derived from the pre-trained deep models. Our model without contextual features achieves 80.1% AUC on Ped2 and 73.1% on Ped1, which means an accurate pre-trained model will improve our final model performance. Most of the competing methods in this study train large models, while we only use the high-level and contextual features derived from pre-trained models to reduce the complexity of the anomaly detection model. For example, on the Ped1 dataset, the ConvAE model uses a fully convolutional autoencoder [11]. It has 6 convolutional layers and 4 pooling layers in the encoder and decoder. The input layer dimension is 238 × 158 × 10. The training process requires up to 16,000 epochs to converge. The ConvLSTM-AE model adds 10 interconnected convolutional long short-term memory layers in addition to the convolutional layers [12]. Its training process requires up to 60,000 epochs. In our case, we only use 3 fully connected layers with a low-dimensional high-level feature input (100 features for Ped1 and 81 for Ped2, as described above). We also list the state-of-the-art approaches (STAN [5], ST-CaAE [18], and Optical flow-GAN [19]). In addition to training CNNs to learn the spatial features, STAN and Optical flow-GAN take the Generative Adversarial Network architecture to improve the performance. However, it increases the model complexity. For example, STAN has 17 convolutional layers with kernel sizes between 5 × 5 and 3 × 3, where the number of layers has almost tripled compared to ConvAE. ST-CaAE consists of the adversarial network ST-AAE and the convolutional network ST-CAE. ST-AAE has four 3D convolutional layers and the corresponding four 3D deconvolutional layers, while ST-CAE has three 3D convolutional layers and three 3D deconvolutional layers. Each convolutional layer uses kernels of size 3 × 3 × 3, and the number of kernels is 16 in the input convolutional layer. ST-CaAE also needs to be trained on the appearance stream and the motion stream separately, which further increases the model complexity. Compared to the above models, our approach extracts the complicated part into pre-trained models and only needs to train the decision model with fully connected layers. Our model also shows the advantage of interpretability of abnormal event decisions. Other models, such as Two-Stream R-ConvVAE, use the reconstruction error of each pixel to locate the anomaly region [4]. This method only reflects the spatial features of decision-making and cannot explain temporal or group anomalies. Since our input features are high-level, semantically meaningful features, we can directly show the reconstruction error vector to explain the decision-making process. Note that here we just use three pre-trained deep models to extract features, and we have shown in the experiments that they are already beneficial. It is expected that further gains can be attained by using more pre-trained models to derive varied features. We leave these possibilities for future exploration. We compare our method's running time with several algorithms, as shown in Table 2. It reports the average running time per frame during the test phase. Our method is significantly faster than MDT [6], AMDN [3], Xu et al.'s method without GPU [61], and the Hierarchical framework [62]. Our method is also faster than ST-CNN [63], AED [64], and ICN [65].
Compared to state-of-the-art methods like Xu et al.'s method with GPU and Two-Stream R-ConvVAE [4], our method achieves a comparable run-time speed. Note that the GPU is only used for the pre-trained models to obtain the high-level features in our approach. We calculate the pre-trained models' running time by taking the maximum inference time per frame among PFPN, JDE, and R101, since these pre-trained models can run simultaneously on their corresponding GPUs described in Sect. 3.2.1, which is 0.067 s. The pre-trained model inference speed can be improved by advances in computer vision research and in GPU technologies. Our lightweight denoising autoencoder does not need a GPU, and the average inference time per frame is 2.18 × 10⁻⁵ s on the UCSD Ped1 dataset and 3.71 × 10⁻⁵ s on the UCSD Ped2 dataset. We also show the training time comparison among algorithms in Table 3. Since our method relies on the pre-trained model features, a large amount of training time can be saved. We only need to focus on the training of the lightweight denoising autoencoder. We take the mean value of 10 repeated measurements, and our method only requires 5 s, 2.9 s, and 9.6 s for the Ped1, Ped2, and Avenue datasets (here the Avenue dataset is only used for the training time analysis, with the number of epochs set to 28), which is significantly faster than AMDN, ConvAE, TSC, and sRNN-AE [66]. The experimental results of the video anomaly explanation method are discussed next. We demonstrate our method by displaying three cases from the UCSD Pedestrian datasets. We set the upper limit of the number of keyframes to 3 and the number of top important features to 6. It is worthwhile to mention that the number of keyframes is a hyperparameter, and the value varies with the video settings, such as the video length, the number of events, etc. As described in Algorithm 2, we skip the keyframes without abnormal events since we are only interested in explaining anomalies. The video summary result of video Ped1_Test017 is shown in Fig. 9 (the anomaly score of video Ped1_Test017; the unfilled star marker is the discarded frame, and the red shaded interval represents the ground-truth abnormal interval). In this video, one anomaly occurred. The video summary presents three keyframes: frames 11, 76, and 151, where frames 76 and 151 are discarded because of their lower anomaly scores. The remaining keyframe 11 summarizes the major activities in video Ped1_Test017. The ground truth of the anomaly in this video consists of a person riding a bicycle on the pavement, as presented in Fig. 10. Here, Fig. 11 shows the explanation result of frame 11 for demonstration. We select the top 3 most important features from Fig. 11a and b to explain the frame. Therefore, frame 11 can be explained by the important features, including ''tracked_occur'', ''bicycle'', ''on_tree'', ''speed_std'', ''x_max'', and ''person''. Obviously, ''bicycle'' matches the ground truth description. In addition, other output features also assist the anomaly decision-making. For example, ''tracked_occur'' means some people are moving much faster over a period and are tracked automatically; this feature is highly positively related to ''bicycle'' and ''skateboard''. The features ''person'' and ''on_tree'' show how many people are in this frame and how many people are walking near the tree. The feature ''speed_std'' is the standard deviation of the speeds of the moving objects in this frame. It implies this scenario has an abnormal event since the moving objects (such as a skateboarder, bicycle, or motorcycle) are faster than walking persons.
The video summary result of video Ped2_Test004 yields three keyframes: frames 26, 46, and 96, as shown in Fig. 12 (the anomaly score of video Ped2_Test004; the unfilled star marker is the discarded frame, and the red shaded interval represents the ground-truth abnormal interval). Frame 26 is discarded since it represents the normal case. The abnormal keyframes correctly fall into the ground-truth abnormal interval and can represent the major anomalous activities in video Ped2_Test004. Here we discuss the explanation results of frame 46. Figure 13 shows the ground truth content of frame 46, which is the abnormal keyframe of video Ped2_Test004: a car appears on the right side of the pavement. Both sorted graph results explain this frame, as shown in Fig. 14. In this case, the sorted mean absolute SHAP value features in Fig. 14a and the sorted mean SHAP value features in Fig. 14b have the same top 3 features, including ''car'', ''bicycle'', and ''speed_min''. The feature ''car'' correctly reflects the ground truth of the anomaly of frame 46. However, the feature ''bicycle'' is a false alert since frame 46 does not include a bicycle. The feature ''speed_min'' is the minimum speed of all objects in the frame. In this frame, the ''car'' feature is ranked 1st in both Fig. 14a and b, which increases the confidence of the decision to mark the ''car'' feature as the anomaly. The video summary result of video Ped2_Test005 consists of three keyframes: frame 21, frame 76, and frame 136, as shown in Fig. 15 (the anomaly score of video Ped2_Test005; the unfilled star marker is the discarded frame, and the red shaded interval represents the ground-truth abnormal interval). It covers the major normal and abnormal events in the video. Similarly, frame 136 is skipped during the explanation process because of its low anomaly score. Here we present the anomaly explanation results of frame 76. The ground truth content consists of a person riding a bicycle from right to left, as presented in Fig. 16 (the ground truth of anomalies in video Ped2_Test005, frame 76). The pre-trained object detection model successfully detected the bicycle. Figure 17a shows that ''bicycle'' is one of the most contributing features to the anomaly detection result of frame 76. Other features like ''speed_min'', ''speed_mean'', and ''speed_max'' also support the possible occurrence of the moving object from a contextual perspective.

This work presents a novel design of an explainable and efficient video anomaly detection framework based on high-level features from pre-trained models, using a denoising autoencoder to detect anomalous video events and provide anomaly explanations. Our method selects three pre-trained models (background segmentation, object classification, and object tracking) to get the appearance features and spatio-temporal features. The UCSD Pedestrian datasets are used to evaluate our approach and to compare it with several state-of-the-art methods. Our experimental results show that contextual features improve model performance and interpretability. Moreover, our proposed model achieves comparable results and provides more accurate anomaly explanations with low model complexity, short training time, and low computational overhead.
This work presents a novel design of an explainable and efficient video anomaly detection framework that uses high-level features from pre-trained models and a denoising autoencoder to detect anomalous video events and provide anomaly explanations. Our method selects three pre-trained models (background segmentation, object classification, and object tracking) to obtain appearance and spatio-temporal features. The UCSD Pedestrian datasets are used to evaluate our approach and to compare it with several state-of-the-art methods. Our experimental results show that contextual features improve model performance and interpretability. Moreover, our proposed model achieves comparable results and provides more accurate anomaly explanations with low model complexity, short training time, and low computational overhead.
Our approach is not developed to replace state-of-the-art approaches; instead, it offers a better understanding of how pre-trained deep learning models can be used for video anomaly detection, especially when a large volume of training data is unavailable for complex models. Our method also increases model interpretability, which is crucial to modern machine learning. In addition, the run-time analysis shows that our method is highly efficient in the training process.
References
Evaluating the use of public surveillance cameras for crime control and prevention. US Department of Justice, Office of Community Oriented Policing Services
A world with a billion cameras watching you is just around the corner
Learning deep representations of appearance and motion for anomalous event detection
Abnormal event detection from videos using a two-stream recurrent variational autoencoder
STAN: spatio-temporal adversarial networks for abnormal event detection
Anomaly detection in crowded scenes
Observe locally, infer globally: a spacetime MRF for detecting abnormal activities with incremental updates
Abnormal event detection at 150 FPS in MATLAB
End-to-end learning of driving models from large-scale video datasets
An ensemble of ensembles approach to author attribution for internet relay chat forensics
Learning temporal regularity in video sequences
Remembering history with convolutional LSTM for anomaly detection
A unified approach to interpreting model predictions
Context-aware activity recognition and anomaly detection in video
Robust real-time unusual event detection using multiple fixed-location monitors
Abnormal crowd behavior detection using social force model
Video anomaly detection with compact feature sets for online performance
Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes
Abnormal event detection in videos using generative adversarial nets
Spatiotemporal autoencoder for video anomaly detection
Anomaly detection in video sequence with appearance-motion correspondence
Decision-theoretic saliency: computational principles, biological plausibility, and implications for neurophysiology and psychophysics
Scene graphs for interpretable video anomaly classification. Conference on Neural Information Processing Systems Workshop on Visually Grounded Interaction and Language
Predicting engagement in video lectures
Comparing factors affecting injury severity of passenger car and truck drivers
A causal framework to determine the effectiveness of dynamic quarantine policy to mitigate COVID-19
Explaining anomalies detected by autoencoders using SHAP
A post-processing scheme for the performance improvement of vehicle detection in wide-area aerial imagery
Performance evaluation of automatic object detection with post-processing schemes under enhanced measures in wide-area aerial imagery
Deep appearance features for abnormal behavior detection in video
Hierarchical context modeling for video event recognition
Mining semantic context information for intelligent video surveillance of traffic scenes
Detecting anomalies in image classification by means of semantic relationships
Deep residual learning for image recognition
Panoptic feature pyramid networks
Segmentation of COVID-19 pneumonia lesions: a deep learning approach
Recursive hybrid fusion pyramid network for real-time small object detection on embedded devices
Scale match for tiny person detection
Spatio-temporal processing for automatic vehicle detection in wide-area aerial video
SeVA: an AI solution for age-friendly care of hospitalized older adults
Detectron2
Microsoft COCO: Common Objects in Context. Computer Vision - ECCV
Towards real-time multi-object tracking
YOLOv3: an incremental improvement
A connection between score matching and denoising autoencoders
Speech enhancement based on deep denoising autoencoder
Deep learning
Understanding batch normalization
Adam: a method for stochastic optimization
Smoothing and differentiation of data by simplified least squares procedures
Not just a black box: learning important features through propagating activation differences
Why should I trust you? Explaining the predictions of any classifier
One-class classification with deep autoencoder neural networks for author verification in internet relay chat
Unsupervised video summarization framework using keyframe extraction and video skimming
Automated Twitter author clustering with unsupervised learning for social media forensics
Google Colaboratory. Building Machine Learning and Deep Learning Models on Google Cloud Platform
Abnormal event detection in videos using spatiotemporal autoencoder
Convex polytope ensembles for spatio-temporal anomaly detection
Energy-based models for video anomaly detection
An efficient anomaly detection system for crowded scenes using variational autoencoders
Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts
Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Process.
Abnormal event detection and localization via adversarial event prediction
An intra-frame classification network for video anomaly detection and localization
Video anomaly detection with sparse coding inspired deep neural networks
Video anomaly detection using pre-trained deep convolutional neural nets and context mining