key: cord-0032774-jxgmepdt authors: Ingle, Palash Yuvraj; Kim, Young-Gab title: Real-Time Abnormal Object Detection for Video Surveillance in Smart Cities date: 2022-05-19 journal: Sensors (Basel) DOI: 10.3390/s22103862 sha: 6c493cb44594a08d90ec4b456bcbe96749a5b3e7 doc_id: 32774 cord_uid: jxgmepdt

With the adaptation of video surveillance in many areas for object detection, monitoring abnormal behavior across several cameras requires constant human tracking by a single camera operative, which is a tedious task. In multiview cameras, accurately detecting different types of guns and knives and classifying them apart from other video surveillance objects in real-time scenarios is difficult. Most detecting cameras are resource-constrained devices with limited computational capacity. To mitigate this problem, we propose a resource-constrained lightweight subclass detection method based on a convolutional neural network to classify, locate, and detect different types of guns and knives effectively and efficiently in a real-time environment. In this paper, the detection classifier is a multiclass subclass detection convolutional neural network used to classify object frames into different subclasses, such as abnormal and normal. The mean average precision achieved by the best state-of-the-art framework for detecting either a handgun or a knife is 84.21% and 90.20%, respectively, on a single camera view. After extensive experiments, the best precision obtained by the proposed method for detecting different types of guns and knives was 97.50% on the ImageNet dataset and IMFDB, 90.50% on the Open Image dataset, 93% on the Olmos dataset, and 90.7% on multiview cameras. On a resource-constrained device, the method showed a satisfactory result, with a precision score of 85.5% for detection in a multiview camera.

Recently, advancements in video surveillance have driven many new challenges, such as storage, security, and content extraction from videos. Monitoring the videos requires manual human-powered resources, which are error-prone. Many criminal activities, such as robberies, terrorist activities, bomb blasts, hijackings, and crowd brawls, could have been prevented by predicting the threats in advance using video surveillance in real time. Smart city infrastructure is highly dependent on the surveillance system for smooth traffic management and public space monitoring, making pathways safer and more efficient for every user. Video surveillance, coupled with an object detection mechanism, is a vital tool for analyzing road networks, intersections, and how people move in the city. In addition, abnormal object detection enables monitoring and tracking of the object of interest more efficiently, so real-time decisions can be made to safeguard or alter the occurring event cautiously. Frequently used objects to commit crimes are guns and knives; globally, there are significant concerns due to increased gun usage for criminal activities, which is validated by statistical reports issued by the United Nations Office on Drugs and Crime (UNODC) [1]. One way to prevent such incidents is the early detection of guns and knives at a potential crime scene. A traditional method based on machine learning to detect a gun uses X-rays or scanners (i.e., low-power millimeter-wave radar) [2,3].
Recently, advancements in deep learning, specifically in convolutional neural networks (CNNs), have achieved significant results compared with traditional machine learning algorithms, such as the corner detection and color segmentation techniques employed for object classification, localization, and detection [4-6]. Unlike manually selected features, CNNs focus on extracting rich features automatically [7,8]. When detecting different subclasses of guns and knives in videos/images in a real-time situation, various issues can arise, such as contrast in viewpoints, posture estimation, occlusion, and lighting conditions [9]. These issues lead to difficulties in adequately accomplishing object detection. A large dataset is required to make CNNs more robust for object detection.

Existing studies dealing with these problems are based on different types of deep learning and non-deep learning/classical algorithms for detecting handguns or knives [10-12]. Although these studies have produced convincing results for simple cases, their ability to handle compelling circumstances with tricky situations is somewhat limited. Most of the existing studies performed training with a small dataset with fixed constraints for detecting particular objects (e.g., handguns or knives), and most of the techniques were applied on a single camera view. Using an existing algorithm on resource-constrained devices increases the computational cost, leading to poor performance during object detection. Most video surveillance panels comprise multiple cameras; thus, a simultaneous object detection mechanism is essential.

In this study, to deal with the problems mentioned above, extensive sequences of experiments were carried out to propose a customized lightweight CNN architecture called multiclass subclass detection CNN (MSD-CNN). We specifically designed the MSD-CNN architecture to detect an abnormal frame in a multicamera view using a graphics processing unit (GPU) and dynamic programming for efficiency. The MSD-CNN detects abnormal and normal frames in real time, as shown in Figure 1, and its architecture is depicted in Section 2. To make the MSD-CNN model robust, we significantly increased the training dataset by performing data augmentation [13] on an ImageNet dataset, as described in Section 4.

The proposed model can detect and classify the abnormal and the normal class. There are two prominent (i.e., normal and abnormal) classes for classification, which is referred to as multiclass. The subclass states the significant class corresponding to an appropriate subclass activity, whereas the abnormal class is a frame in which abnormal object activities are present; based on the characteristics of these activities, it is categorized into two different abnormal subclasses. Similarly, the normal class is a frame where normal object activities are present, further segregated into two normal subclasses.
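For concreteness, the class hierarchy described above can be summarized as a small label map. This is a sketch using the subclass names given in this paper; the actual label encoding used for training is not published:

```python
# Hypothetical label hierarchy for the multiclass-subclass scheme:
# two top-level classes (the "multiclass"), each split into subclasses.
LABEL_HIERARCHY = {
    "abnormal": {
        "gun": ["handgun", "automatic/semi-automatic rifle"],
        "knife": ["kitchen knife", "army knife"],
    },
    "normal": {
        "activity": ["walking", "cycling"],
        "work": ["office work", "housework"],
    },
}
```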
The primary purpose of our model is to detect the different types of subclasses of guns and knives, such as handguns, automatic/semi-automatic rifles, kitchen knives, and army knives, considered as abnormal objects. The main contributions of this study are as follows:

• Our study introduces a new lightweight multiclass-subclass detection CNN model to effectively and efficiently extract and detect abnormal features in a real-time scenario from both multiview and single-view cameras;
• To facilitate the learning of the model for real-time detection, we constructed a custom dataset for training;
• We summarize the insights of different algorithms used to detect handguns and knives and construct a taxonomy;
• We introduce a new evaluation method, Detection Time per Interval (DTpI), to evaluate an object's emplacement with respect to the multiclass real-time evaluation score in a multiview camera;
• The proposed model achieves a better result than the state-of-the-art frameworks for detecting abnormal frames in real time.

To the best of our knowledge, this is the first study to consider a multiclass-subclass detection classifier for detecting different types of guns and knives in real time. In Section 2, we discuss the existing detection algorithms and separate them according to their respective characteristics. A detailed description of the proposed MSD-CNN architecture and the learning algorithm is given in Section 3. In Section 4, we evaluate the model on existing standard datasets and compare the results with state-of-the-art algorithms, whilst future enhancements and weaknesses are mentioned in Section 5. Section 6 concludes the study.

Object detection [14] in videos or images can be classified into two broad research areas. The first addresses gun and knife detection using classical/non-deep learning algorithms, whilst the second focuses on improving object detection accuracy using deep learning algorithms. Basically, non-deep learning/classical algorithms are based on color-based segmentation, corner detectors, and appearance. A disadvantage of the existing classical algorithms is that they are highly dependent on the quality of frames/images. Frames with occlusion and noise are difficult to interpret; in addition, when the foreground and background color segments match, interpretation is difficult when using color-based segmentation [15]. Deep learning algorithms, in turn, are based on neural networks. An advantage of using a neural network model is that it learns feature extraction automatically through training, and a model trained on larger data can detect occluded frames. In a model such as CNN or R-CNN, the data must be labeled before training, which is referred to as supervised learning. The algorithms discussed in the following subsections elaborate on the methods and mechanisms used for detection.

The classical algorithms used for detecting guns and knives are based on various frame segmentation methodologies for detecting and extracting key features from images. In this section, we describe them in detail. The AAM [16] uses a statistical model to match features and is primarily used in facial detection. AAM annotates the image and then represents the image in vector form [17]. It uses principal component analysis to normalize the images. T. Rohit et al. [15] used the AAM and trained it on a customized image dataset for detecting knives, which led to a maximum number of false positives during detection. Specifically, the AAM could detect knives with a sharp edge in an image; it requires clear visibility of objects in the image.
A disadvantage of AAM is that it poorly detects objects in noisy images. The Harris corner detector (HCD) extracts features from the corners of images. The steps involved in Harris detection are as follows: First, the image is converted into grayscale. Second, using spatial derivatives, the corners of the image are identified. The detected object's structure tensor is generated using the Harris calculation, and finally, the object is determined using non-maximum suppression. A. Glowacz et al. [18] synergistically used the AAM and HCD for detecting guns and knives; for training, they used a customized image dataset and achieved better results than T. Rohit et al. [15]. However, the approach is time-intensive for processing and thus slow when performing real-time detection.

CBS uses k-means [19] to find the cluster of the subset. This helps to remove the unwanted colors from the images, after which the HCD algorithm is used to find the object in the image. In their respective studies, T. Rohit et al. [15] and P. Pratihar et al. [16] used both CBS and HCD to detect knives and guns. The model was trained on a customized image dataset; after excluding the unwanted colors, HCD was used to detect the appropriate object. The model was only used to detect X-ray images, with a maximum number of false positives, as the model was trained on a less significant dataset.

DNN learning algorithms are built on top of neurons [5]. A DNN comprises several layers, and each of these layers contains neurons; each neuron is defined by input points, hidden points, and output points. These layers are interconnected with each other based on the weights of the neurons. The previous neuron's output is the input of the next neuron in the layer, which is multiplied by the corresponding weight. All values are summed and added to the defined bias value, and the obtained sum becomes an input to the next neuron. The resultant value is passed to an activation function that transforms the parameters and passes them to the next neuron. Likewise, all input values are propagated through the entire neural network, and the network is then used to predict the result. The difference between the predicted result and the actual result is called the error, which is calculated by an error function; based on the error value, the weights are updated, and this process is repeated until the obtained error is minimized. A deep learning algorithm detects an object based on the features on which it is trained. The neural network architectures for detecting handguns and knives are CNN, Overfeat, Region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN. The taxonomy of these algorithms is shown in Figure 2.

Overfeat is based on the sliding window approach of CNN. The sliding window classifier is trained to detect the object at the center and then other parts of the image. Based on this, L. Justin et al. [20] achieved a satisfactory result for detecting a handgun. The model was trained on the standard ImageNet dataset. Notably, it is still significantly slow when detecting frames in real time.
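To illustrate the sliding-window idea behind Overfeat, the following minimal sketch scans an image at a fixed stride; `classifier` is a hypothetical callable (standing in for a trained CNN) that returns the probability that the object is centered in a patch:

```python
def sliding_window_detect(image, classifier, win=96, stride=32, thresh=0.5):
    """Illustrative sliding-window detection in the spirit of Overfeat.
    `image` is a NumPy array (H x W x C); every window is scored by the
    classifier, and high-scoring windows are kept as candidate detections."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classifier(image[y:y + win, x:x + win])
            if score >= thresh:  # window likely contains the object
                detections.append((x, y, win, win, score))
    return detections
```

In practice, the candidate boxes would then be merged by non-maximum suppression, as in the Harris pipeline described earlier.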
Asrith et al. [21] used low-resolution images for training CNNs to detect faces and weapons; however, their study focused on face detection and did not contemplate weapon detection. A. Castillo et al. [10] proposed a CNN algorithm to detect cold metallic weapons in video surveillance. Similar work was carried out by F. Gelana et al. [22] to detect a handgun using the edge information of the object as a feature, based on a CNN; the accuracy achieved for classifying a frame in CCTV videos was 97.78%. Dhillon et al. [23] proposed a handgun detector trained on the Internet Movie Firearms Database (IMFDB) using an R-CNN model whose classification head was constructed on top of the VGG16 architecture; they used a support vector machine and an ensemble tree classifier for classification, regression, and outlier detection. In their respective studies, Mikolaj et al. [24,25] and Akcay et al. [26] proposed X-ray baggage screening systems to classify and detect objects. These studies explored multiple object detection mechanisms such as Faster R-CNN, sliding window CNN, and You Only Look Once (YOLO). The system was proposed to classify objects into six classes, such as laptops, guns and their parts, and knives and their parts. However, their proposed model could not detect objects accurately under high occlusion and different lighting conditions. An action recognition method has also been used for detecting anomalies using a timed image-based CNN [27]. Kanehisa et al. [28] applied the YOLO algorithm to handgun detection, with IMFDB as the training dataset. The most pertinent result for knife detection was obtained from the Common Objects in Context (COCO) challenges released in 2017; object detection in COCO [29] is based on a very large-scale dataset. Bhatti et al. [30] constructed a customized dataset for training the YOLOv4 model and compared the results with state-of-the-art methodologies, achieving a satisfactory result by testing their methods on a few videos. Their study mainly focused on detecting a pistol, revolver, wallet, metal detector, and cell phone. They compared the results with the single shot multibox detector (SSD), R-CNN, and different versions of YOLO; some of the classification models showed promising results in static mode, but in real-time scenarios, the models were slow and less accurate when converging on a resource-constrained device. These studies demonstrated an excellent F1 score on the initial dataset, but the models are not suitable for scenarios with background objects. For detecting only handguns, the most prevalent result was obtained using Faster R-CNN [31]; for knives, this was obtained using a CNN [32]. Most existing algorithms struggle to detect smaller objects, and when tested on constrained lightweight devices, the test time significantly increases. As a novelty, this study focuses on classifying different subclasses of guns and knives; Figure 3 shows the sequential flow of the different subclass detection.
We first describe the entire network of the MSD-CNN architecture in this section, as shown in Figure 4, followed by the architecture implementation flow. Finally, we describe the detailed MSD-CNN methodology and the proposed learning algorithm.

MSD-CNN stands for multiclass subclass detection convolutional neural network architecture, as shown in Figure 4. Two fully connected (FC) heads are present in MSD-CNN, and each head has a specific classification task. As there are two separate branches, the branch at the first edge is responsible for classifying the abnormal subclass images such as guns (e.g., handguns and automatic/semi-automatic rifles) and knives (e.g., kitchen knives and army knives). The second edge branch is responsible for classifying the normal subclass images such as activities (e.g., walking and cycling) and work (e.g., office work and housework). The architecture of the CNN used here is a simpler version of VGGNet [33,34]. Table 1 shows the notation for presenting the quantities and their respective definitions used in this study.
In the proposed architecture, two forks are stacked on top of each other. The right branch/second fork in the network is shallower than the left branch/first fork: predicting the normal class is easier than predicting the abnormal class; thus, the right branch is more superficial. The stacking of multiple convolutional (CONV) and rectified linear unit (RELU) layers helps the system to learn features with richer characteristics. The architecture accepts input images with dimensions 96 × 96 × 3. In the MSD-CNN network, each branch is responsible for its set of tasks, such as convolution, activation, batch normalization, max pooling, and dropout. There are 32 filters in the CONV layer, with a 3 × 3 kernel and a RELU activation function. The RELU activation function is mathematically stated in Equation (1):

f(x) = max(0, x) (1)

A dropout of 25% is applied in the MSD-CNN network to regularize the dense connectivity of the neural network. Dropout randomly disconnects nodes between one layer and the next; this helps to reduce overfitting, as no single node becomes solely responsible for predicting a class, edge, or object. In tandem, the filter kernels and pool size are progressively changed to reduce the spatial size and increase the depth. While training the first fork, we used grayscale images because we were concerned with very small objects (e.g., handguns and knives) for detection. When training the second fork, we used colored images and were concerned with larger objects (e.g., houses and buildings); if we had converted these to grayscale, the color information would have been lost, so retaining the color is essential to preserve the objects' information. The stacking of convolutional layers on top of each other helps to increase the depth of the network, and max pooling is used to reduce the volume size. Pooling layers use a 3 × 3 pool size to quickly reduce the spatial dimension from 96 × 96 to 32 × 32. The number of filters is then increased from 32 to 64 as the spatial dimension of the volume shrinks: the smaller the volume becomes, the deeper we go into the network, and the more filters we learn.
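As a concrete illustration, a minimal Keras sketch of a two-branch network of this kind could look as follows. This is our reconstruction from the description above, not the authors' released code; the layer counts and sizes of the deeper blocks beyond those stated are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_msd_cnn(num_abnormal=4, num_normal=4):
    # Input dimensions stated in the paper: 96 x 96 x 3.
    inputs = layers.Input(shape=(96, 96, 3))

    # Shared stem: CONV -> RELU -> batch norm -> max pooling -> dropout,
    # 32 filters with a 3 x 3 kernel, 3 x 3 pooling (96 -> 32), 25% dropout.
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(3, 3))(x)
    x = layers.Dropout(0.25)(x)

    # First (deeper) fork: abnormal subclasses; filters grow from 32 to 64.
    a = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    a = layers.BatchNormalization()(a)
    a = layers.MaxPooling2D(pool_size=(2, 2))(a)
    a = layers.Dropout(0.25)(a)
    a = layers.Flatten()(a)
    abnormal_head = layers.Dense(num_abnormal, activation="softmax",
                                 name="abnormal")(a)

    # Second (shallower) fork: normal subclasses.
    n = layers.Flatten()(x)
    n = layers.Dense(64, activation="relu")(n)
    normal_head = layers.Dense(num_normal, activation="softmax",
                               name="normal")(n)

    return Model(inputs, [abnormal_head, normal_head], name="msd_cnn")

model = build_msd_cnn()
model.compile(optimizer="adam",
              loss={"abnormal": "categorical_crossentropy",
                    "normal": "categorical_crossentropy"})
```

Keeping the normal branch shallow reflects the design choice stated above: the normal class is easier to predict, so fewer layers suffice there.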
Mathematically, an image represented in tensor form is $a^{[0]} \in \mathbb{R}^{n_H \times n_W \times n_C}$. At the $l$th layer, it can be denoted as $a^{[l]} \in \mathbb{R}^{n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}}$. Thus, at the $l$th layer, the learned parameters are the $n_C^{[l]}$ convolution filters of size $f^{[l]} \times f^{[l]} \times n_C^{[l-1]}$ and their biases, i.e., $n_C^{[l]} (f^{[l]} \cdot f^{[l]} \cdot n_C^{[l-1]} + 1)$ parameters. The pooling layer is responsible for downsampling the features of the input without affecting the channels; pooling layers have no parameters to learn. FC layers are a finite number of neurons that accept input as a vector and return output as a vector. The input $a^{[l-1]}$ is the result of a convolution or pooling layer with a dimension of $n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$; to pass the value to the FC layer, we flatten the tensor to one dimension, $n = n_H^{[l-1]} \cdot n_W^{[l-1]} \cdot n_C^{[l-1]}$. Thus, the resultant learned parameters are the FC layer's weight matrix and bias.

The MSD-CNN methodology, as shown in Figure 5, comprises frame classification, localization, and detection; these components work together in sequence to detect the desired object. Frame detection is basically used to accurately find the location and dimension of an object in an image, which is essential for frame classification [35]. The position and scale of an object are determined by frame localization [36].

After defining the architecture, the learning algorithm is conducted in the following steps. The neural network learning algorithm is basically a stepwise calculation of each defined layer's parameter weights, with the goal of achieving the best parameters for prediction. The loss function, denoted by $J$, defines the difference between the actual values and the predicted values over the entire training set. To minimize the value of $J$, first, in forward propagation, the data is iterated through the entire network in a sequence of batches, so that the loss function $L$ for each batch is calculated, where $m$ is the size of the training set and $\theta$ is the model parameter; consequently, $J$ is the sum of the errors committed at the predicted outcome of each batch. Second, backpropagation is used for calculating the gradients of the cost function, so that the parameters can be updated using the gradient descent algorithm.
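Written out explicitly (a standard rendering of the description above; the learning rate $\alpha$ is our addition), the objective and the parameter update are:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(\hat{y}^{(i)}, y^{(i)}\right),
\qquad
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J(\theta)
```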
Algorithm 1 defines object detection in multiview cameras using the GPU. For detecting an abnormal object in multiview cameras in real time, we use the concept of threading and the GPU for computation. MSD-CNN is a considerably lightweight network; it is possible to create multiple instances of the MSD-CNN model and thus simultaneously apply the model individually to each video sequence using dynamic programming. Thus, we are able to detect the abnormal object in multiple cameras simultaneously without increasing the computational overhead. The video sequence is transferred from the main memory to the global memory to implement the threading and improve the optimization of computational resources. Each instance is responsible for detecting, classifying, and localizing the object in the desired frame for a particular video sequence; the methods are evaluated in the experimental Section 4. The core of Algorithm 1 is:

Detected_frame_video_sequence_1 = MSD-CNN applied on video_1 /* object classification and localization */
...
Detected_frame_video_sequence_n = MSD-CNN applied on video_n /* object classification and localization */

To evaluate the proposed MSD-CNN, we used three different standardized datasets: ImageNet, the Open Image Dataset V4 [11], and the Olmos dataset [36]. We used TensorFlow [37], which is a platform for training neural networks, along with the Keras API (as a library for different transformations). Python was used as a front end, and training was conducted on an Nvidia GeForce RTX 2060. For testing the model in real time, a Logitech C920 Pro HD webcam was used. We also used the Raspberry Pi 4 to evaluate the feasibility of the MSD-CNN model on resource-constrained devices. For the Raspberry Pi cameras, the camera feed was extracted from multiple slave Raspberry Pis and processed on the master Raspberry Pi for abnormal object detection in sequence. After data augmentation, the dataset, denoted as M, was split into three parts: the Mtrain set was used for training the algorithm, the Mdev set was used for fine-tuning and for evaluating the variance and bias, and the Mtest set was used for checking the trained model's precision. First, we used a smaller batch (100-200) of images for training; for this, the accuracy ranged from 20 to 30%. We slowly increased the dataset batch size, and after reaching a size of 800 images with 4000 iterations, the accuracy achieved ranged from 90 to 99%, as shown in Figure 6 (the accuracy of each subclass with respect to the trained batches of images).
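Returning to the multiview pipeline of Algorithm 1, the following minimal Python sketch shows one way to run one MSD-CNN instance per camera in its own thread. The helper names and the capture loop are ours; the authors' implementation is not published:

```python
import threading

import cv2  # assumption: OpenCV is used to read the camera feeds
import numpy as np

def detect_stream(camera_id, model, results):
    """Run one MSD-CNN instance on one camera feed, mirroring Algorithm 1:
    each video sequence is classified and localized independently."""
    cap = cv2.VideoCapture(camera_id)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        patch = cv2.resize(frame, (96, 96)).astype("float32") / 255.0
        abnormal, normal = model.predict(patch[np.newaxis], verbose=0)
        results[camera_id] = (abnormal, normal)  # latest scores per camera
    cap.release()

# One model instance and one thread per camera view, as in Algorithm 1;
# build_msd_cnn is the architecture sketch from Section 3.
results = {}
threads = [threading.Thread(target=detect_stream,
                            args=(cam, build_msd_cnn(), results))
           for cam in (0, 1, 2)]
for t in threads:
    t.start()
```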
In this section, we describe how the data augmentation technique, as shown in Figure 7, significantly increases the training and testing datasets for detection. The datasets for the abnormal class and the normal class are extracted from the ImageNet dataset and IMFDB. The ImageNet dataset comprises 15 million labeled images belonging to 220,000 categories; from it, a dataset comprising 58,000 different-view images of guns was used. Thus, the data augmentation method used on the dataset increases the detection accuracy and robustness of the MSD-CNN model. Similarly, after augmentation of the IMFDB images, we obtained 75,000 gun images. From the total obtained dataset, 75% of the data was used for training, 15% was assigned to training validation, and the remaining 10% was used for testing. The augmentation transformations are as follows:

• Flipping: Flipping an image is mirror-reversing it horizontally or vertically;
• Shearing: Shearing an image is shifting part of an image;
• Scaling: Image scaling is the resizing of an image.

With data augmentation, the gun dataset increased to 315,000 images; likewise, for knives, the dataset increased from 25,000 to 175,000 images. The images were extracted from the handgun, rifle, kitchen knife, army knife, cycling, walking, office work, and housework classes of the ImageNet dataset. Thus, the obtained dataset was an extensive custom dataset used for training. Most of the images selected in the corresponding dataset for training comprise different viewpoint variations, illumination, deformation, occlusion, and interclass variation.
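A minimal sketch of these three transformations using the Keras API mentioned in the experimental setup (the parameter ranges are illustrative; the paper does not report the exact values):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Flipping, shearing, and scaling (zoom) as described above; the exact
# ranges used in the paper are not reported, so these values are assumptions.
augmenter = ImageDataGenerator(
    horizontal_flip=True,   # mirror-reversal, horizontal
    vertical_flip=True,     # mirror-reversal, vertical
    shear_range=0.2,        # shearing: shift part of the image
    zoom_range=0.2,         # scaling: resize the image content
    rescale=1.0 / 255.0,    # normalize pixel values
)

# Example: stream augmented 96 x 96 batches from a directory of class folders
# (the directory layout here is hypothetical).
train_flow = augmenter.flow_from_directory(
    "dataset/train", target_size=(96, 96), batch_size=32)
```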
For guns and knives, testing sets were generated from existing datasets, with a total of almost 48,000 images:

• The Open Image Dataset V4 comprised 55,000 handgun images [11] and 26,700 knife images;
• The small test set of Olmos [36] comprised 608 images.

Evaluation and comparison of the existing studies with the proposed methodology are presented in Table 2. The parameters used for evaluation are the type of deployment [38]: offline [15,16,18] means the model used static data stored on the hard drive, whereas real time [20,23,31] implies that the model runs where the data is a live camera video feed. We must also consider the detection type: different types of guns and knives (i.e., the model can detect and classify different types of guns and knives), handguns [15,16] (the model can only detect handguns), or knives [18,32] (the model can only detect knives). Notably, until now, there has been no study on efficiently detecting and classifying different types of guns and knives; the neural networks could not differentiate between different types of guns and knives with similar characteristics. Most models lack robustness because they are trained on a very small dataset, ranging from 50 to 2400 images only. Considering the camera view, most detection algorithms work properly on a single-view camera [32] (use of a single camera with a single-object view); existing studies' algorithms do not work on multiview cameras (using multiple cameras with multiple-object views) [32]. The proposed MSD-CNN model achieved 90.7% precision with respect to parameters such as different types of guns and knives, real-time deployment, and multiview cameras.

We used a confusion matrix, which is a table for describing the performance of the classification model, as shown in Table 3. Additionally, the mean average precision (mAP) was calculated for the test set; the mAP is the average precision over multiple intersection-over-union thresholds, and the obtained mAP is 97.50%. The result obtained on the Olmos [35] benchmark dataset is shown in Table 4. Notably, the model was not trained on the OpenImage Net V4 and Olmos datasets, and still, the model could detect and classify the guns and knives efficiently and effectively, with accuracy ranging from 90 to 98%. When the model was tested on these datasets, only the classes of data for handguns, knives, and walking were used. The OpenImage Net V4 knife class mainly comprises kitchen knives, with the sharp edges of the knives defined in the image, which is also the reason for the good accuracy. However, in the Olmos dataset, occluded images were also detected accurately. The maximum number of false positives during detection was observed in occluded army knife images with noise.

Since the model was trained on a larger dataset, the results obtained for each major subclass during training are as follows. First, for the abnormal subclass, the training set (Mtrain) accuracy was 98.90%, and 96.20% accuracy was obtained for the testing set (Mtest). Second, for the normal subclass, the training set (Mtrain) accuracy obtained was 98.60%, and 95.20% accuracy was obtained for the testing set (Mtest). The plot in Figure 8 shows multiple accuracies for the abnormal subclass and the normal subclass. In Figure 9, the plots are given for multiple losses for both the abnormal and normal subclasses, including the total loss during training.
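For reference, the precision, recall, and F1 measures used throughout this evaluation follow the standard confusion-matrix definitions; a minimal sketch:

```python
def precision(tp, fp):
    # Precision: fraction of predicted positives that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Recall: fraction of actual positives that are detected.
    return tp / (tp + fn)

def f1(tp, fp, fn):
    # F1: harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```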
(Figure 9: experimental training losses of the multiclass subclass output classification, plotted using matplotlib; the losses are plotted separately for analysis.)

We propose a new parameter, DTpI, to evaluate the model in a real-time environment or on a video. DTpI determines how accurately a model can detect the normal and abnormal subclasses in a live video feed. DTpI can be denoted as i, and it indicates the TP frames that are detected. For analysis, we used a publicly available YouTube video containing 20 frames of guns; in the video, the guns are clearly visible to a viewer. The MSD-CNN successfully detected i = 18 frames, with an average time of DTpI = 0.3 s.
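A minimal sketch of how such a measurement could be computed (our interpretation of DTpI as described above; `detect` is a hypothetical per-frame detector returning True for a true-positive detection):

```python
import time

def dtpi(frames, detect):
    """Count true-positive detections (i) and the average detection time
    per interval over a sequence of frames containing the target object."""
    detected, elapsed = 0, 0.0
    for frame in frames:
        start = time.perf_counter()
        if detect(frame):          # True if the abnormal subclass is found
            detected += 1
        elapsed += time.perf_counter() - start
    return detected, elapsed / len(frames)

# Example: i, avg_time = dtpi(video_frames, detect)
```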
We intentionally tested the model on challenging videos, which contain occlusion and illumination variations. We tested the model on live camera feeds and videos, and the model showed good performance in different scenarios. Figure 10 illustrates the detection accuracy of the abnormal subclass in the test videos. The detection result obtained on the test videos shows two rectangular boxes: the red rectangular box indicates an abnormal frame and its subclass, gun, and the blue rectangular box indicates the handgun subclass. The detection confidence score is 85% for Figure 10a. Similarly, the other images depict the bounding box and its confidence score for the automatic gun and knife subclasses. The false negatives arise in frames that are low in brightness and contrast (as depicted in Figure 11), specifically when the image quality is unclear or when the automatic gun moves very fast against the background pixels. We can conclude that the accuracy of detecting smaller objects depends on the quality of the frame sequence.

Table 5 shows which images the model was trained on, which images the model was tested on, and the confidence score generated for each output image. For the first, the confidence score generated for detecting a handgun is 98%, while for the other remaining subclasses, the scores are in the range of 0.1-18%. Thus, we can say that the model has perfectly detected the handgun. Similarly, for automatic/semi-automatic rifle output images, the score is 96.5%; for kitchen knife output images, the score is 98.54%; for army knife output images, the score is 96.01%; and for all tested images, the score was in the range of 95-98.90% for the abnormal subclasses. For the normal subclasses, the confidence scores for detection are as follows: for walking output images, the score is 96.01%; for cycling output images, the score is 95.90%; for office output images, the score is 95.32%; and for work output images, the score is 95.20%. For the normal subclasses, the confidence was in the range of 94-97%. Thus, most images were detected with a satisfactory score, i.e., one where the model is confident in the detection, with a score greater than or equal to 50%.
To evaluate the feasibility and efficiency of the proposed model, we compared the results with state-of-the-art networks such as YOLO [25] (defined as R1), SSD [39] (R2), RFCN [24] (R3), R-CNN [31] (R4), and FRCNN [23] (R5), which were tested on the customized dataset. For object detection, the R1 and R2 methods use regression techniques. R1 was fast while detecting the object, but it performed poorly in detecting tiny objects. In comparison with R1, R2 showed a minute difference in detecting the desired object. The techniques of R3, R4, and R5 showed better results than R1 and R2. R3 employs a position-sensitive score map for classification; a major disadvantage of these techniques is the computational cost, as they use two-stage detectors. A comparison of results for each subclass is shown in Table 6.

Note to Table 6 (confidence scores during detection): Ab means abnormal subclass; No means normal subclass; G means the gun subclass of the abnormal class; K means the knife subclass of the abnormal class; A means the activity subclass of the normal class; and W means the work subclass of the normal class. For the gun and knife classes, there are two subclasses each: handguns and automatic/semi-automatic rifles, and kitchen knives and army knives, respectively. Similarly, for the activity and work classes, there are two subclasses each: walking and cycling, and office work and housework, respectively.

As the study's primary goal was to create a lightweight abnormal object detection model that can detect smaller objects such as guns and knives, we evaluated the proposed model on different platforms such as the GPU, the CPU, and resource-constrained devices such as the Raspberry Pi 4. To gain insights on different platforms, we compared the study with lightweight networks such as MobileNet [39] (defined as T1) and Tiny-YOLO [40] (T2). The rate at which the object is predicted can be determined by the inference and loading times, which taken together represent the computational cost. The T2 method performed better than T1 and MSD-CNN when loading was considered. In contrast, MSD-CNN (2.10 s) performed slightly better than T1 (2.31 s) and T2 (3.0 s) when inference was considered. The DTpI method was employed to check the inference in multiview cameras at the same time. We tested the MSD-CNN model on the Raspberry Pi 4 for object detection in multiple camera feeds, and the model outperformed the existing algorithms. A comparison of computational costs for the different platforms is shown in Table 7. In summary, we evaluated the model on different platforms for effectiveness and efficiency and compared the results with different algorithms on a custom dataset for insights. The model exhibits good performance on different datasets.
According to our results, a deep neural network can achieve good results on a challenging dataset. Notably, the performance of the network degrades when even a single convolution layer is removed; for example, removing a mid-section convolution layer drops accuracy by 4%. Network depth is therefore necessary for accurate results, ensuring that the network is large enough to reach the appropriate conclusion. For comparison, we revisited most of the existing detection studies and used standard measures such as precision, recall, F1 score, and mAP on the benchmark datasets to evaluate the proposed methodology. We also proposed a new evaluation metric (DTpI) for better interpretability. Finally, we observed a considerable drop in accuracy when the training dataset is too small, particularly for small objects, to which detection is most sensitive.
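For concreteness, the short sketch below computes precision, recall, and F1 from detection counts. It is a minimal illustration of the standard measures named above, not the paper's evaluation code; the function name and the example counts are made up, and mAP (which additionally averages per-class precision over recall levels) and DTpI are not reproduced here.

```python
# Minimal, self-contained sketch of precision, recall, and F1 from
# true-positive (tp), false-positive (fp), and false-negative (fn) counts.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 90 correct detections, 10 false alarms, 5 missed objects.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
# precision=0.900 recall=0.947 F1=0.923
```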
Besides this, a major limitation is that the model works well only on high-definition camera feeds; detection accuracy is noticeably lower on low-quality video feeds and low-resolution images. Detection was performed on high-quality video, and because the model is lightweight, less processing power is required. The use of a GPU and threading helps to achieve significantly better results than the other algorithms (a minimal sketch of such per-camera threading is given at the end of this section), and DTpI is employed to analyze the inference for detected objects per camera. The model runs exceptionally well on resource-constrained devices when detecting objects in real time across multiview cameras; compared with FRCNN, MSD-CNN showed slightly better performance when detecting abnormal objects in multiview cameras on resource-constrained devices. More abnormal objects can be classified into additional subclasses for detection, and more extensive training can further increase the robustness of the network. Detection works reliably in static camera views; this study can be extended to continuously moving cameras whose view direction changes. When video frames are too dense or overcrowded and most views overlap, detection becomes difficult.

Many criminal activities that occur in public areas involve the use of a gun or knife, and video surveillance can be used for the early detection of such events. In this study, we discussed the different methods used for detecting knives and guns under various parameters. We proposed the MSD-CNN model and tested it on benchmark datasets for detecting abnormal frames (dangerous events such as carrying guns or knives) and normal frames (events such as walking or office work). An essential advantage of the MSD-CNN is that it takes full account of tiny image features during training, thereby detecting smaller objects more efficiently. Furthermore, as the MSD-CNN is a lightweight model, multiple model instances can be executed in parallel on low-powered computational devices. In the proposed model, we considered two primary tasks: classification and automated detection of abnormal frames. To train the model, we significantly enlarged the existing datasets to make it robust. The proposed model showed outstanding results in real-time scenarios, and we evaluated it on different computing platforms to check its feasibility. A major limitation of the model is that it can detect only the specific set of objects on which it was trained.

Future work includes improving our method further, such as deploying the model directly on edge devices for computation, and expanding this study to cloud and edge devices to distribute the computational load and make the model more robust for real-time scenarios. We plan to extend this algorithm to video synopsis to develop robust CCTV surveillance solutions, testing the model on different sets of cameras, including infrared, low-resolution, and continuously moving cameras. We also plan to include more abnormal subclasses (e.g., hockey sticks, cylinders, and explosive materials) for detection.
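As a rough illustration of the multiview setup discussed above, the sketch below runs one detector worker per camera feed on its own thread. This is an assumption-laden outline, not the authors' implementation: detect_frame is a hypothetical stand-in for the MSD-CNN inference call, and the queue-based wiring and camera count are invented for the example.

```python
# Minimal sketch: one worker thread per camera view, each pulling frames from
# its feed queue and pushing (camera_id, label) results to a shared queue.
import threading
import queue

def detect_frame(frame):
    # Placeholder: a real system would run MSD-CNN inference here.
    return "No:A:walking"

def camera_worker(cam_id, frames, results):
    while True:
        frame = frames.get()
        if frame is None:                  # sentinel: stream closed
            break
        results.put((cam_id, detect_frame(frame)))

results = queue.Queue()
feeds = {cam_id: queue.Queue() for cam_id in range(4)}   # four camera views
workers = [threading.Thread(target=camera_worker, args=(cid, q, results))
           for cid, q in feeds.items()]
for w in workers:
    w.start()

# Simulate one frame per camera, then shut the workers down.
for cid, q in feeds.items():
    q.put(b"frame-bytes")
    q.put(None)
for w in workers:
    w.join()
while not results.empty():
    print(results.get())                   # e.g. (0, 'No:A:walking')
```

In CPython, threads of this kind mainly overlap camera I/O; GPU inference calls in common frameworks typically release the interpreter lock during compute, which is what makes per-camera threading worthwhile on a single device.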
References
United Nations Office on Drugs and Crime (UNODC)
Automatic image analysis process for the detection of concealed weapons
A comparison of 3D interest point descriptors with application to airport baggage object detection in complex CT imagery. Pattern Recognit.
Visual place recognition: A survey from deep learning perspective
Explainable Deep Learning for Efficient and Robust Pattern Recognition: A Survey of Recent Developments
Attentive Layer Separation for Object Classification and Object Localization in Object Detection
Crime Scene Prediction by Detecting Threatening Objects Using Convolutional Neural Network
Bag of words-based surveillance system using support vector machines
A computer vision-based framework for visual gun detection using SURF
Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv 2018
Object Detection in Videos by High Quality Object Linking
Improving the accuracy of global forecasting models using time series data augmentation
A review and an approach for object detection in images
A computer vision-based framework for visual gun detection using Harris interest point detector
Detection techniques for human safety from concealed weapon and harmful EDS
Active appearance models
Visual detection of knives in security applications using Active Appearance Models
Colour-Based Image Segmentation Using Hybrid K-means with Watershed Segmentation
Developing a Real-Time Gun Detection Classifier
Face Recognition and Weapon Detection from Very Low-Resolution Images
Firearm detection from surveillance cameras using image processing and machine learning techniques
A handheld gun detection using Faster R-CNN deep learning
On using feature descriptors as visual words for object detection within X-ray baggage security screening
Terahertz image detection with the improved faster region-based convolutional neural network
Using deep convolutional neural network architectures for object classification and detection within X-ray baggage security imagery
Timed-image based deep learning for action recognition in video sequences
Firearm Detection using Convolutional Neural Networks. ICAART 2019
Microsoft COCO: Common Objects in Context
Weapon detection in real-time CCTV videos using deep learning
Automatic Handgun Detection Alarm in Videos Using Deep Learning
Crime Scene Prediction by Detecting Threatening Objects Using Convolutional Neural Network
Deep Neural Networks Ensemble to detect COVID-19 from CT Scans
Image retrieval using BIM and features from pretrained VGG network for indoor localization
Object detection with discriminatively trained part-based models
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Video synopsis: A survey
Single shot multibox detector
Real-Time Object Detection Using Pre-Trained Deep Learning Models MobileNet-SSD
Tiny-YOLO object detection supplemented with geometrical data
Author Contributions: The authors contributed to this paper as follows: P.Y.I. wrote the article, designed the system framework, and conducted the experimental evaluation; Y.-G.K. supervised and coordinated the investigation. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.