1 Introduction

Violence against women is widely recognized as a serious public health problem and a violation of women's human rights [21, 22]. The COVID-19 pandemic, combined with economic and social tensions and with measures restricting contact and movement, has increased violence against women and girls globally. Before the pandemic, it was estimated that one in every three women would experience violence during her lifetime. During the pandemic, these women are confined at home with their abusers, who exploit the woman's inability to make a call asking for help or to escape; at the same time, health services are overloaded, and non-governmental organizations and shelters are crowded, closed, or have been repurposed as health units [19].

In the face of this problem, detecting physical violence is a challenging task, since it depends on detecting changes in behavior caused by a disagreement of ideas, an injustice, or a serious conflict [34]. One way to make such a detection mechanism readily accessible is to embed it in a mobile application that helps detect violence and calls the police, preventing the situation from escalating through rapid intervention. Several papers on this topic have been published recently [25, 34], in which various techniques are used to detect violence in videos. However, there are few studies on the detection of physical violence through audio. According to [3], only two works related to the theme were found between 2015 and 2020, evidencing the complexity of and lack of studies in the area.

At the same time, it is necessary to respect the privacy of the data collected, stored, and processed in a Machine Learning application, in accordance with the European General Data Protection Regulation and the Brazilian General Data Protection Law, which in some cases require the data owner's authorization before the information can be processed. In addition, high computational power is needed to keep Machine Learning (ML) models running, as well as for experiments, training, and retraining with new data. Federated Learning (FL) [18] has been gaining a lot of attention lately, since it decentralizes the learning process to the users' own devices. Collective knowledge is then aggregated into a centralized model, built from the models of the several users on the federated network. Thus, each user's data remains private on the device, and no sensitive information is stored in a centralized location. In view of this, the following research question guides the development of this work: is it possible to maintain similar results using the FL approach [18], compared with the traditional approach, in identifying scenes of physical violence through audio mel-spectrograms and CNN architectures?

The experiments were conducted on the HEAR Dataset [15], a synthetic dataset with 70,000 instances of 10-second audio clips, divided equally between two classes: presence or absence of physical violence. To keep the experiments computationally feasible, only 12,500 records were used. The experiments considered the following CNN architectures: Inception [32], MobileNet [27], ResNet152 [8], and VGG-16 [29]. The results showed that MobileNet was the best of the models evaluated on the HEAR dataset, reaching 66.8% accuracy, with a loss of 8.6% when compared to the non-FL experiments.

2 Background

This section presents the concepts of Convolutional Neural Networks (CNN), FL, and the Mel-spectrogram, which are used throughout this work.

2.1 Convolutional Neural Networks (CNN)

A Convolutional Neural Network (CNN) [5] is a deep neural network architecture that is very popular in image classification tasks. It can learn to assign importance to relevant aspects of an image, allowing these characteristics to be learned and distinguished from one another. Unlike traditional algorithms, where filters/features had to be implemented manually, a CNN is able to learn a large number of filters, finding the best representation by itself. After the emergence of the AlexNet [14] architecture, presented in the ImageNet challenge in 2012, and GoogleNet [31], presented in the ImageNet challenge in 2014, convolutional neural networks became popular, influencing the emergence of architectures deeper than GoogleNet's 22 layers, as is the case of ResNet [8] with 152 layers.

2.2 Federated Learning (FL)

Federated Learning (FL) [18] consists of decentralizing the model learning process to clients (user devices), allowing a global model to be trained from the various models of the users of a network. In addition, each user's data, often sensitive, remains private on the device, with no raw data shared with the central server. Only the resulting trained models are shared on the federated network [7].

Furthermore, the FL technique can be applied when the training data is spread across other devices, so the centralizing server does not require high computational power. Another scenario is when the data is sensitive and privacy must be respected: only the training results are shared, and the raw data remains on the source device. The FL technique also supports continuous model updates, since new data can be collected by the network of devices and used to improve the model.

The training process using FL works through learning rounds. The network devices receive the updated global model and start the learning process with the data on their respective devices; at the end of the round, each device sends its training results to the centralizing server, which aggregates them into a global model using an aggregation algorithm such as Federated Averaging (FedAvg) [18]. This process is repeated until the defined number of training rounds is reached or until a certain evaluation metric is achieved.
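The aggregation step above can be sketched as a dataset-size weighted average of the clients' layer weights, which is the core idea of FedAvg. The following is a minimal NumPy illustration; the function and variable names are ours, not from the original paper or any framework:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Aggregate per-client model weights into a global model using a
    dataset-size weighted average (the core of FedAvg).

    client_weights: one model per client, each a list of layer arrays.
    client_sizes:   number of training examples held by each client.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Toy round: two clients with one-layer "models" and equal data sizes.
client_a = [np.array([1.0, 3.0])]
client_b = [np.array([3.0, 5.0])]
global_model = fed_avg([client_a, client_b], client_sizes=[1, 1])
print(global_model[0])  # [2. 4.]
```

With unequal `client_sizes`, the client holding more data pulls the average toward its own weights, which is what makes FedAvg robust to clients with very different amounts of data.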

2.3 Mel-spectrogram

The problem addressed in this work belongs to Sound Event Classification (Acoustic Scene Classification), which consists of adding markers to identify whether a particular characteristic is present at any point in the audio. One instance of this problem is the detection of the presence of violence in audio samples. For this, the HEAR Dataset relies on the extraction of Mel-spectrograms from the audio: a Mel-scale spectrogram [11] represents the sound components in the frequencies accessible to human hearing, attenuating the low-frequency noise observed in the linear spectrogram. From there, a two-dimensional image can be created and fed to image classifiers. Fig. 1 shows the mel-spectrograms extracted from two short-lived audio clips containing physical violence (a "punch in the face" and a "slap in the face", respectively).

Fig. 1. Examples of two short-lived, high-intensity events of different types of physical violence.

3 Related Works

The Acoustic Scene Classification problem has been receiving a lot of attention lately. In [15], the authors feed Mel-spectrogram sound representations into a deep neural network to detect physical violence in audio, converting the audio into images. For that, the authors built a synthetic dataset with about 70,000 images, called the HEAR dataset, resulting from the extraction of audio features using the Mel-spectrogram technique. Finally, a comparison of the MobileNet, VGG16, ResNet-152, and InceptionV3 architectures was performed, in which MobileNet achieved the best results, with an accuracy of 78.9% and an \(f_{1}\) score of 78.1%.

[28] presents an experiment to detect violence inside a car using the audio signal. For that context, the author built the In-Car video dataset, including ambient sounds and music to simulate the desired environment, and used the Mel-spectrogram to generate images representing each audio clip. The author then submitted the dataset to the ResNet, DenseNet, SqueezeNet, and MobileNet architectures, obtaining the best result with the ResNet-18 model, at 92.95% accuracy.

[9] presents an emotion recognition system that applies a deep learning approach to emotional audiovisual big data. To this end, the audio was processed to obtain Mel-spectrograms, allowing the task to be treated as an image problem, and frames were extracted from the videos, with each modality feeding its own CNN. The outputs of the two CNNs were then joined using two consecutive extreme learning machines (ELMs). The result of this union is the input for a support vector machine (SVM) that performs the final classification of the emotions. The experiment demonstrates the efficacy of the proposed system involving CNNs and ELMs.

[20] presents a content-based audio auto-tagging experiment using deep learning. The author used deep neural network architectures such as the Fully Convolutional Neural Network (FCN) and the Convolutional Recurrent Neural Network (CRNN) in conjunction with the MagnaTagATune (MTT) dataset. As input for the models, the author used Mel-spectrogram extraction, transforming the audio into images. As a result, the CRNN model performed better, with 88.56% AUC-ROC (area under the receiver operating characteristic curve), as opposed to 85.53% for the FCN model.

The authors in [3] present two papers related to the detection of violence through audio published between 2015 and 2020, in addition to other works that used only video, or audio and video (multi-modal). The first work, by [26], describes a scream detection system for public transport vehicles that uses Mel Frequency Cepstral Coefficients (MFCC), energy, delta, and delta-delta features. The author uses Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) models to build the experiments on their own dataset with about 2,500 seconds of audio. The experiment demonstrated that the SVM generated a low rate of false alarms, while the GMM approach had a better identification rate. It should be noted that [3] includes this work even though it falls outside the established period.

[30] proposes the classification and detection of acoustic scenes involving domestic violence using machine learning. The author proposes the use of an SVM classifier to detect scenes of domestic violence by a man against a woman through the low-level acoustic parameters MFCC (Mel Frequency Cepstral Coefficients), ZCR (Zero Crossing Rate), and Energy. As a result, an average accuracy of 73.14% was achieved for the MFCC parameter, against 71.3% and 66.54% for Energy and ZCR, respectively.

[24] also presents two papers on audio signal processing with deep learning using the Mel-spectrogram, published between 2016 and 2017. The first work, by [2], presents an automatic music tagging system using fully convolutional neural networks (FCNs) in conjunction with the MagnaTagATune and Million Song datasets. The author also used the Mel-spectrogram technique to extract characteristics from the audio and convert it to images. The experiment demonstrated that using the Mel-spectrogram as the input representation resulted in better performance compared to STFTs (Short-Time Fourier Transforms) and MFCCs.

[16] demonstrates a comparison between the Mel-spectrogram, a frame-level strided convolution layer, and a sample-level strided convolution layer as inputs for models using deep convolutional neural network (DCNN) architectures. The author used the MagnaTagATune and Million Song datasets as input sources, adjusting all audio clips by reducing them to 29.1 s of duration and resampling to 22,050 Hz when necessary. The Mel-spectrogram outperformed the other representations, obtaining an AUC (Area Under the Curve) of 0.9059, against 0.8906 and 0.9055 for the frame-level and sample-level strided convolution layers, respectively.

However, to the best of our knowledge, all previous works evaluated architectures that required either uploading raw user data to train the classifiers, or training the classifiers on public or synthetic datasets to produce a model to be distributed to users. Such models tend to become obsolete over time, due to changes in the examples or to the lack of personalization to the users' real usage.

Therefore, the goal of this experiment is to use the FL technique and the HEAR dataset (audio mel-spectrograms) to train CNN architectures to identify scenes of physical violence, and to compare the results with the traditional approach to verify whether similar results are maintained. Table 1 summarizes everything that has been presented and positions our work in relation to the others.

Table 1. Related works and our experiment, described by Algorithm, Feature extractor, Data type, and Dataset.

4 Materials and Methods

As shown in the previous sections, FL and CNN architectures make it possible to decentralize model training across a network of clients (devices), permitting a global model to be trained by them while respecting the privacy of every client's data and lessening the need for a large infrastructure to train it. Hence, in light of the previous works, this paper trains four CNNs based on a transfer learning approach using FL and compares the results to standard Deep Learning models.

Fig. 2. The proposed study workflow.

Fig. 2 shows the research process for this work. To detect violence in audio, we proposed to evaluate four CNN architectures under a transfer learning setting. We defined the FL setting with three clients and a central server, which permits the decentralized training of models from a network. The HEAR dataset was set up as the data source for the experiments and was processed to extract mel-spectrograms from the audio. For the FL approach, the dataset was split into three equal parts, one for each client. We used the Google Colab platform, with a high-RAM environment, to run the experiments. Finally, we compare the results of the FL approach using the Accuracy, \(f_{1}\) score, Precision, and Recall metrics. We also apply Friedman's non-parametric statistical test [4] and pairwise comparisons, and compare the FL and non-FL results to analyze the metrics of both approaches.

Fig. 3. The schematic of the proposed training procedure.

Fig. 4. The FL schematic of the training process.

The experiment scheme shown in Fig. 3 describes the step that prepares the dataset as input to the models: the extraction of the mel-spectrogram from the audio. After that, the problem can be handled as an image classification task. For the FL approach, as displayed in Fig. 4, the mel-spectrograms were split equally into three folders, one per FL client, and each client trains the models on the mel-spectrograms in its folder. Each client then sends its results to the centralized server after each round to build a global model for classifying the presence or absence of violence.

4.1 Dataset

The HEAR Dataset [15] is a synthetic dataset featuring 70,000 mel-spectrogram audio representations, constructed from transformations of 35,045 audio clips combined from three open audio collections drawn from Google AudioSet and Freesound: 15,000 Google AudioSet foreground audios from the inside, conversation, and speech classes; 20,000 background audios, also from Google AudioSet, of the classes pets, television, toilet flush, door, radio, water, vacuum cleaner, sobbing, noise, sink, and frying; and 45 short-lasting physical violence sounds from Freesound.

From this set of mel-spectrograms, the dataset was divided into 60,000 training audios, 50% with the presence of physical violence and 50% without, plus 5,000 for testing and 5,000 for evaluation (50% of each class, respectively). For the present experiment, due to computational limitations, only 12,500 records were used, 80% for training and 20% for testing (1,250 for validation and 1,250 for testing), both with 50% of the audios labeled violence and 50% non-violence.

4.2 Compared Methods

For the experiment, this paper compared the performance of the FL and non-FL approaches using four CNN architectures, verifying which one presents the best result on the HEAR dataset. The selection criteria were recency (from 2015 onward) and representativeness according to the taxonomy produced by [12]. They are:

  • Inception v3 [32]: With a complex architecture, Inception v3 is the third edition of Google's Inception Convolutional Neural Network, which began as a module of GoogleNet. The architecture achieves high-quality results and uses bottleneck layers and asymmetric filters to lower the computational cost.

  • MobileNet v2 [27]: Proposed as a light yet deep neural network, also published by Google, MobileNet v2 allows the construction of highly efficient models with high precision and low memory consumption, enabling use on mobile devices.

  • ResNet152 v2 [8]: A Residual Neural Network, ResNet v2 is distinguished by introducing identity connections, proposed to avoid the vanishing gradient problem. Its residual blocks use so-called "skip connections", which allow the model to be extremely deep, with up to 152 layers.

  • VGG-16 [29]: A convolutional neural network model with increased depth (the family reaches 19 layers), of simple and homogeneous topology, which uses three fully connected layers; it has a high computational cost, but can deal with large-scale images.

To perform the experiment, this paper used pre-trained models and, through transfer learning, took advantage of weights trained in other contexts, adding only a few layers at the end of each architecture. First, a Flatten layer was added, followed by a dropout layer with a 20% rate to prevent overfitting. Next, the experiment included a dense layer with 1,024 neurons and a ReLU activation function. Finally, a fully connected output layer with two neurons and a softmax activation function was added, corresponding to the two classes (existence or not of violence), to output the class probabilities.
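The classification head described above can be sketched in TensorFlow/Keras. The input size, optimizer, and loss below are our assumptions (the paper does not state them), and `weights=None` keeps the sketch self-contained, whereas the paper reuses pre-trained weights (`weights="imagenet"` in practice):

```python
import tensorflow as tf

def build_model(input_shape=(224, 224, 3)):
    """Transfer-learning sketch: a convolutional base topped by
    Flatten -> Dropout(0.2) -> Dense(1024, ReLU) -> Dense(2, softmax)."""
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights=None, input_shape=input_shape)
    base.trainable = False  # keep the transferred weights frozen
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.2),                    # 20% dropout
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # violence / no violence
    ])
    model.compile(optimizer="adam",  # optimizer and loss are our assumptions
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
print(model.output_shape)  # (None, 2)
```

The same head can be attached to any of the four architectures by swapping the `base` constructor (e.g. `tf.keras.applications.InceptionV3`).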

4.3 Flower Framework

Flower is an open-source FL framework that enables experiments to be run with machine learning frameworks such as PyTorch, TensorFlow, PyTorch Lightning, MXNet, scikit-learn, and TFLite, among others. It was specifically designed to advance FL research, enabling heterogeneous FL workloads at scale [1]. Using Flower, one can run experiments simulating clients/devices that, connected to the federated network, train on their own datasets and send the parameters of what has been learned to the centralizing server, which in turn aggregates each update into a single model.

The framework also allows different configurations to be customized for an experiment, such as waiting for a minimum number of clients before starting a training cycle, selecting which clients participate in a cycle, measuring progress both on the client and on the server, and sending the initial parameters to each client when starting an experiment. This flexibility allows for a diversity of scenarios in a given experiment.

Settings. To perform the experiment, this paper used the FedAvg [18] strategy, with 3 clients running 1 epoch per round, for 40 rounds, each client holding 1/3 of the training dataset. In addition, we required at least 3 clients to be available at the beginning of each round, enabled client-side validation, enabled the sending of the initial parameters to each client, and configured validation of the updated model on the test dataset, performed at the end of each round.
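A server-side configuration along these lines can be sketched with Flower as follows. This is a sketch assuming the Flower 1.x API; the server address is a placeholder, and the commented-out `initial_parameters` and `evaluate_fn` callbacks (initial weights and centralized test-set evaluation) are omitted for brevity:

```python
import flwr as fl

# FedAvg strategy mirroring the paper's settings:
# all 3 clients required per round, 40 federated rounds.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,          # use every available client each round
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=3,   # wait until 3 clients have connected
    # initial_parameters=...,  # initial model weights sent to the clients
    # evaluate_fn=...,         # centralized evaluation on the test split
)

fl.server.start_server(        # blocks, serving the 40 federated rounds
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=40),
    strategy=strategy,
)
```

Each client then connects with `fl.client.start_client`, runs its local epoch per round, and returns its updated weights for aggregation.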

4.4 Experimental Setup

The experiment was conducted on the Google Colab platform, Pro version, with 27 GB of RAM and an Nvidia Tesla P100 PCIe GPU with 16 GB. MLFlow v1.20, an open-source platform that integrates with several libraries used in the life cycle of machine learning solutions, was used for annotations, logging, experiment tracking, and metric and model storage in a central repository. The Python 3 language and the TensorFlow v2.6 framework were used to build the CNN models.

5 Results

The four chosen models were evaluated on the HEAR Dataset, and the results obtained are shown in Table 2. The table also reports the number of clients and rounds executed, as well as the total execution time (in hours). The MobileNet architecture performed best in terms of accuracy (71.9%), but Inception V3 showed better results in \(f_{1}\) score, precision, and recall: 89.1%, 85.8%, and 92.7%, respectively.

Table 2. Results obtained, along with the number of rounds and the runtime.

Table 2 shows that the MobileNet and Inception V3 architectures differed by 1 h in run time, a much lower value than the time spent by the VGG16 and ResNet-152 architectures, which averaged 12.3 h. An experiment was also performed with the same architectures without the FL technique, presented in Table 3, to check for performance loss when using the technique.

Another important point is the standard deviation of the accuracy, presented in Table 2, which indicates how close the accuracy values are to the average: ResNet-152 has the lowest value at 0.013, followed by Inception V3 with 0.029, VGG16 with 0.052, and MobileNet with 0.069.

Table 3. Comparison between training with and without FL, along with the number of rounds and the execution time.

In addition, Table 3 shows a loss of almost 10 percentage points in accuracy for the MobileNet and ResNet-152 architectures when comparing the FL and non-FL results. For the VGG16 and Inception V3 architectures, however, the loss is not as significant: VGG16 achieved 68.4% without FL and 65.2% with FL, while Inception V3 achieved 61.5% without FL and 58.7% with FL. Another point to highlight is the execution time of the two approaches: in the FL approach, the MobileNet architecture had the shortest execution time, while in the non-FL approach the MobileNet and VGG16 architectures had the same execution time.

Friedman's non-parametric statistical test [4] was applied to compare multiple models on a single dataset; the computed significance (p-value) was below the significance level \(\alpha \) = 0.050, so we conclude that the distributions differ and the null hypothesis can be rejected.
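The Friedman test can be reproduced with `scipy.stats.friedmanchisquare`, which ranks the models within each repeated measurement and tests whether the mean ranks differ. The accuracy samples below are illustrative values chosen by us, not the paper's actual measurements:

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-run accuracy samples for three of the models
# (illustrative values only, not the paper's numbers).
mobilenet = [0.70, 0.72, 0.71, 0.73, 0.72]
inception = [0.58, 0.60, 0.59, 0.57, 0.61]
vgg16     = [0.65, 0.64, 0.66, 0.65, 0.63]

stat, p_value = friedmanchisquare(mobilenet, inception, vgg16)
print(p_value < 0.05)  # True: the distributions differ, reject the null hypothesis
```

A p-value below the chosen significance level (here 0.05, as in the paper) justifies moving on to pairwise post hoc comparisons to find which models differ.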

Additionally, the MobileNet architecture presents better performance in terms of accuracy (75.5%), \(f_{1}\) score (75.0%), precision (76.3%), and recall (79.0%) when FL was not used (Tables 4 and 5).

Table 4. Friedman Test Statistics for FL.
Table 5. Friedman Test Statistics for Non-FL.
Table 6. Pairwise Comparisons for Non-FL.
Table 7. Pairwise Comparisons for FL.

Wilcoxon's [35] post hoc method was applied to test the null hypothesis on the samples; as a result, critical difference diagrams were generated, illustrated in Fig. 5. Another way to visualize the data and identify the most critical differences is through pairwise comparisons, displayed in Table 6 for the non-FL approach and Table 7 for FL.
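Each pairwise comparison corresponds to a Wilcoxon signed-rank test on paired samples, available as `scipy.stats.wilcoxon`. As before, the numbers are illustrative placeholders, not the paper's measurements:

```python
from scipy.stats import wilcoxon

# Hypothetical paired accuracy samples for two models (illustrative only).
mobilenet = [0.70, 0.72, 0.71, 0.73, 0.74, 0.75]
inception = [0.60, 0.61, 0.59, 0.60, 0.60, 0.60]

stat, p_value = wilcoxon(mobilenet, inception)
print(p_value < 0.05)  # True: the paired difference is significant
```

Running this test for every model pair (with a multiple-comparison correction such as Holm's) yields the pairwise comparison tables and the critical difference diagram.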

Fig. 5. Critical difference diagram.

6 Conclusion and Future Work

This work investigates the performance of CNN architectures in the detection of physical violence using FL on the HEAR Dataset [15], a synthetic dataset with 70,000 audios converted to mel-spectrograms (images). Due to computational limitations, only 10,000 images, divided equally into two classes (presence or absence of physical violence), were used. In this way, it was possible to apply consolidated techniques and tools to the audio context. [15] presented the investigation and application of mel-spectrograms in the field of audio violence detection, opening a gap for their application using FL. In this experiment, the dataset was divided among 3 clients, each running 1 epoch per round, in a 40-round cycle, considering four CNN models: Inception v3, MobileNet v2, ResNet 152 v2, and VGG-16. Finally, the results showed that MobileNet obtained the best result when used with the FL technique, achieving 71.9% accuracy, a loss of 3.6% when compared to the experiment without FL.

For future work, we aim to test the models on audio recordings of real violence, apply techniques that provide privacy for the data exchanged between clients and server [6, 10, 17, 23, 33], and use the model with a larger sample of clients. In addition, we intend to build comparisons with other models developed via transfer learning from Large-Scale Pretrained Audio Neural Networks (PANNs) [13].