1 Introduction

Violence against women is widely recognized as a serious public health problem and a violation of women's human rights [21, 22]. The COVID-19 pandemic, combined with economic and social tensions and with measures restricting contact and movement, has increased violence against women and girls globally. Before the pandemic, it was estimated that one in every three women would experience violence during her lifetime. During the pandemic, these women are confined at home with their abusers, who exploit the woman's inability to make a call asking for help or to escape; at the same time, health services are overloaded, and non-governmental organizations and shelters are crowded, closed, or have been repurposed as health units [19].

In the face of this problem, detecting physical violence is a challenging task, since it depends on detecting changes in behavior caused by a disagreement of ideas, an injustice, or a serious conflict [34]. One way to make such a detection mechanism readily accessible is to embed it in a mobile application that helps detect violence and calls the police, preventing the situation from escalating through rapid intervention. Several papers on this topic have been published recently [25, 34], in which various techniques are used to detect violence in videos. However, there are few studies on the detection of physical violence through audio. According to [3], only two works related to the theme were found between 2015 and 2020, evidencing the complexity of and lack of studies in the area.

At the same time, it is necessary to respect the privacy of the data collected, stored, and processed in a Machine Learning application, in accordance with the European General Data Protection Regulation and the Brazilian General Data Protection Law, which in some cases require the data owner's authorization before the information can be processed. In addition, high computational power is needed to keep Machine Learning (ML) models running, as well as for experiments, training, and retraining with new data. Federated Learning (FL) [18] has been gaining a lot of attention lately, since it decentralizes the learning process to the users' own devices. Collective knowledge is then aggregated into a centralized model, built from the models of the several users on the federated network. Thus, each user's data remains private on the device, and no sensitive information is stored in a centralized location. In view of this, the following research question guides the development of this work: is it possible to maintain similar results using the FL approach [18], compared with the traditional approach, in identifying scenes of physical violence through audio mel-spectrograms and CNN architectures?

The experiments were conducted on the HEAR Dataset [15], a synthetic dataset with 70,000 instances of 10-second audio clips, divided equally between two classes: presence or absence of physical violence. To keep the experiments computationally feasible, only 12,500 records were used. The experiments considered the following CNN architectures: Inception [32], MobileNet [27], ResNet152 [8], and VGG-16 [29]. The results showed that MobileNet was the best of the models evaluated on the HEAR dataset, reaching 66.8% accuracy, with a loss of 8.6% when compared to the non-FL experiments.

2 Background

This section presents the concepts of Convolutional Neural Networks (CNN), FL, and the Mel-spectrogram, which are used throughout this work.

2.1 Convolutional Neural Networks (CNN)

A Convolutional Neural Network (CNN) [5] is a deep neural network architecture that is very popular in image classification tasks. It can learn to assign importance to relevant aspects of an image, allowing these characteristics to be learned and distinguished from one another. Unlike traditional algorithms, where filters/features had to be implemented manually, a CNN is able to learn a large number of filters, finding the best representation by itself. After the emergence of the AlexNet [14] architecture, presented in the ImageNet challenge in 2012, and GoogleNet [31], presented in the ImageNet challenge in 2014, convolutional neural networks became popular, influencing the emergence of architectures deeper than GoogleNet's 22 layers, as is the case of ResNet [8] with 152 layers.

2.2 Federated Learning (FL)

Federated Learning (FL) [18] consists of decentralizing the model learning process to clients (user devices), allowing a global model to be trained from the various models of the users of a network. In addition, each user's data, often sensitive, remains private on the device, with no raw data shared with the central server. Only the resulting trained models are shared on the federated network [7].

Furthermore, the FL technique can be applied when the training data is spread across other devices, so the centralizing server does not require high computational power. Another scenario is when the data is sensitive and privacy must be respected: only the training results are shared, and the raw data remains on the source device. The FL technique also supports continuous model updates, since new data can be collected by the network of devices and used to improve the model.

The training process using FL works through learning rounds. The network devices receive the updated global model and start the learning process with the data on their respective devices; at the end of the round, each device sends its training results to the centralizing server, which aggregates them into a global model using an aggregation algorithm such as Federated Averaging (FedAvg) [18]. This process is repeated until the defined number of training rounds is reached or until a certain evaluation metric is achieved.
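The aggregation step above can be sketched as a dataset-size weighted average of the clients' layer weights, which is the core idea of FedAvg. The following is a minimal NumPy illustration; the function and variable names are ours, not from the original paper or any framework:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Aggregate per-client model weights into a global model using a
    dataset-size weighted average (the core of FedAvg).

    client_weights: one model per client, each a list of layer arrays.
    client_sizes:   number of training examples held by each client.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Toy round: two clients with one-layer "models" and equal data sizes.
client_a = [np.array([1.0, 3.0])]
client_b = [np.array([3.0, 5.0])]
global_model = fed_avg([client_a, client_b], client_sizes=[1, 1])
print(global_model[0])  # [2. 4.]
```

With unequal `client_sizes`, the client holding more data pulls the average toward its own weights, which is what makes FedAvg robust to clients with very different amounts of data.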

2.3 Mel-spectrogram

The problem addressed in this work belongs to Sound Event Classification (Acoustic Scene Classification), which consists of adding markers to identify whether a particular characteristic is present at any point in the audio. One instance of this problem is the detection of the presence of violence in audio samples. For this, the HEAR Dataset relies on the extraction of Mel-spectrograms from the audio: a Mel-scale spectrogram [11] represents the sound components in the frequencies accessible to human hearing, attenuating the low-frequency noise observed in the linear spectrogram. From there, a two-dimensional image can be created and fed to image classifiers. Fig. 1 shows the mel-spectrograms extracted from two short-lived audio clips containing physical violence (a "punch in the face" and a "slap in the face", respectively).

Fig. 1. Examples of two short-lived, high-intensity events of different types of physical violence.

3 Related Works

The Acoustic Scene Classification problem has been receiving a lot of attention lately. In [15], the authors feed Mel-spectrogram sound representations into a deep neural network to detect physical violence in audio, converting the audio into images. For that, the authors built a synthetic dataset with about 70,000 images, called the HEAR dataset, resulting from the extraction of audio features using the Mel-spectrogram technique. Finally, a comparison of the MobileNet, VGG16, ResNet-152, and InceptionV3 architectures was performed, in which MobileNet achieved the best results, with an accuracy of 78.9% and an \(f_{1}\) score of 78.1%.

[28] presents an experiment to detect violence inside a car using the audio signal. For that context, the author built the In-Car video dataset, including ambient sounds and music to simulate the desired environment, and used the Mel-spectrogram to generate images representing each audio clip. The author then submitted the dataset to the ResNet, DenseNet, SqueezeNet, and MobileNet architectures, obtaining the best result with the ResNet-18 model, at 92.95% accuracy.

[9] presents an emotion recognition system that applies a deep learning approach to emotional audiovisual big data. To this end, the audio was processed to obtain Mel-spectrograms, allowing the task to be treated as an image problem, and frames were extracted from the videos, with each modality feeding its own CNN. The outputs of the two CNNs were then joined using two consecutive extreme learning machines (ELMs). The result of this union is the input for a support vector machine (SVM) that performs the final classification of the emotions. The experiment demonstrates the efficacy of the proposed system involving CNNs and ELMs.

[20] presents a content-based audio auto-tagging experiment using deep learning. The author used deep neural network architectures such as the Fully Convolutional Neural Network (FCN) and the Convolutional Recurrent Neural Network (CRNN) in conjunction with the MagnaTagATune (MTT) dataset. As input for the models, the author used Mel-spectrogram extraction, transforming the audio into images. As a result, the CRNN model performed better, with 88.56% AUC-ROC (area under the receiver operating characteristic curve), as opposed to 85.53% for the FCN model.

The authors in [3] present two papers related to the detection of violence through audio published between 2015 and 2020, in addition to other works that used only video, or audio and video (multi-modal). The first work, by [26], describes a scream detection system for public transport vehicles that uses Mel Frequency Cepstral Coefficients (MFCC), energy, delta, and delta-delta features. The author uses Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) models to build the experiments on their own dataset with about 2,500 seconds of audio. The experiment demonstrated that the SVM generated a low rate of false alarms, while the GMM approach had a better identification rate. It should be noted that [3] includes this work even though it falls outside the established period.

[30] proposes the classification and detection of acoustic scenes involving domestic violence using machine learning. The author proposes the use of an SVM classifier to detect scenes of domestic violence by a man against a woman through the low-level acoustic parameters MFCC (Mel Frequency Cepstral Coefficients), ZCR (Zero Crossing Rate), and Energy. As a result, an average accuracy of 73.14% was achieved for the MFCC parameter, against 71.3% and 66.54% for Energy and ZCR, respectively.

[24] also presents two papers on audio signal processing with deep learning using the Mel-spectrogram, published between 2016 and 2017. The first work, by [2], presents an automatic music tagging system using fully convolutional neural networks (FCNs) in conjunction with the MagnaTagATune and Million Song datasets. The author also used the Mel-spectrogram technique to extract characteristics from the audio and convert it to images. The experiment demonstrated that using the Mel-spectrogram as the input representation resulted in better performance compared to STFTs (Short-Time Fourier Transforms) and MFCCs.

[16] demonstrates a comparison between the Mel-spectrogram, a frame-level strided convolution layer, and a sample-level strided convolution layer as inputs for models using deep convolutional neural network (DCNN) architectures. The author used the MagnaTagATune and Million Song datasets as input sources, adjusting all audio clips by reducing them to 29.1 s of duration and resampling to 22,050 Hz when necessary. The Mel-spectrogram outperformed the other representations, obtaining an AUC (Area Under the Curve) of 0.9059, against 0.8906 and 0.9055 for the frame-level and sample-level strided convolution layers, respectively.

However, to the best of our knowledge, all previous works evaluated architectures that required either uploading raw user data to train the classifiers, or training the classifiers on public or synthetic datasets to produce a model to be distributed to users. Such models tend to become obsolete over time, due to changes in the examples or to the lack of personalization to the users' real usage.

Therefore, the goal of this experiment is to use the FL technique and the HEAR dataset (audio mel-spectrograms) to train CNN architectures to identify scenes of physical violence, and to compare the results with the traditional approach to verify whether similar results are maintained. Table 1 summarizes everything that has been presented and positions our work in relation to the others.

Table 1. Related works and our experiment, described by Algorithm, Feature extractor, Data type, and Dataset.

4 Materials and Methods

As shown in the previous sections, FL and CNN architectures make it possible to decentralize model training across a network of clients (devices), permitting a global model to be trained by them while respecting the privacy of every client's data and lessening the need for a large infrastructure to train it. Hence, in light of the previous works, this paper trains four CNNs based on a transfer learning approach using FL and compares the results to standard Deep Learning models.

Fig. 2. The proposed study workflow.

Fig. 2 shows the research process for this work. To detect violence in audio, we proposed to evaluate four CNN architectures under a transfer learning setting. We defined the FL setting with three clients and a central server, which permits the decentralized training of models from a network. The HEAR dataset was set up as the data source for the experiments and was processed to extract mel-spectrograms from the audio. For the FL approach, the dataset was split into three equal parts, one for each client. We used the Google Colab platform, with a high-RAM environment, to run the experiments. Finally, we compare the results of the FL approach using the Accuracy, \(f_{1}\) score, Precision, and Recall metrics. We also apply Friedman's non-parametric statistical test [4] and pairwise comparisons, and compare the FL and non-FL results to analyze the metrics of both approaches.

Fig. 3. The schematic of the proposed training procedure.

Fig. 4. The FL schematic of the training process.

The experiment scheme shown in Fig. 3 describes the step that prepares the dataset as input to the models: the extraction of the mel-spectrogram from the audio. After that, the problem can be handled as an image classification task. For the FL approach, as displayed in Fig. 4, the mel-spectrograms were split equally into three folders, one per FL client, and each client trains the models on the mel-spectrograms in its folder. Each client then sends its results to the centralized server after each round to build a global model for classifying the presence or absence of violence.

4.1 Dataset

The HEAR Dataset [15] is a synthetic dataset featuring 70,000 mel-spectrogram audio representations, constructed from transformations of 35,045 audio clips combined from three open audio collections drawn from Google AudioSet and Freesound: 15,000 Google AudioSet foreground audios from the inside, conversation, and speech classes; 20,000 background audios, also from Google AudioSet, of the classes pets, television, toilet flush, door, radio, water, vacuum cleaner, sobbing, noise, sink, and frying; and 45 short-lasting physical violence sounds from Freesound.

From this set of mel-spectrograms, the dataset was divided into 60,000 training audios, 50% with the presence of physical violence and 50% without, plus 5,000 for testing and 5,000 for evaluation (50% of each class, respectively). For the present experiment, due to computational limitations, only 12,500 records were used, 80% for training and 20% for testing (1,250 for validation and 1,250 for testing), both with 50% of the audios labeled violence and 50% non-violence.

4.2 Compared Methods

For the experiment, this paper compared the performance of the FL and non-FL approaches using four CNN architectures, verifying which one presents the best result on the HEAR dataset. The selection criteria were recency (from 2015 onward) and representativeness according to the taxonomy produced by [12]. They are:

  • Inception v3 [32]: With a complex architecture, Inception v3 is the third edition of Google's Inception Convolutional Neural Network, which began as a module of GoogleNet. The architecture achieves high-quality results and uses bottleneck layers and asymmetric filters to lower the computational cost.

  • MobileNet v2 [27]: Proposed as a light yet deep neural network, also published by Google, MobileNet v2 allows the construction of highly efficient models with high precision and low memory consumption, enabling use on mobile devices.

  • ResNet152 v2 [8]: A Residual Neural Network, ResNet v2 is distinguished by introducing identity connections, proposed to avoid the vanishing gradient problem. Its residual blocks use so-called "skip connections", which allow the model to be extremely deep, with up to 152 layers.

  • VGG-16 [29]: A convolutional neural network model with increased depth (the family reaches 19 layers), of simple and homogeneous topology, which uses three fully connected layers; it has a high computational cost, but can deal with large-scale images.

To perform the experiment, this paper used pre-trained models and, through transfer learning, took advantage of weights trained in other contexts, adding only a few layers at the end of each architecture. First, a Flatten layer was added, followed by a dropout layer with a 20% rate to prevent overfitting. Next, the experiment included a dense layer with 1,024 neurons and a ReLU activation function. Finally, a fully connected output layer with two neurons and a softmax activation function was added, corresponding to the two classes (existence or not of violence), to output the class probabilities.
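The classification head described above can be sketched in TensorFlow/Keras. The input size, optimizer, and loss below are our assumptions (the paper does not state them), and `weights=None` keeps the sketch self-contained, whereas the paper reuses pre-trained weights (`weights="imagenet"` in practice):

```python
import tensorflow as tf

def build_model(input_shape=(224, 224, 3)):
    """Transfer-learning sketch: a convolutional base topped by
    Flatten -> Dropout(0.2) -> Dense(1024, ReLU) -> Dense(2, softmax)."""
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights=None, input_shape=input_shape)
    base.trainable = False  # keep the transferred weights frozen
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.2),                    # 20% dropout
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # violence / no violence
    ])
    model.compile(optimizer="adam",  # optimizer and loss are our assumptions
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
print(model.output_shape)  # (None, 2)
```

The same head can be attached to any of the four architectures by swapping the `base` constructor (e.g. `tf.keras.applications.InceptionV3`).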

4.3 Flower Framework

Flower is an open-source FL framework that enables experiments to be run with machine learning frameworks such as PyTorch, TensorFlow, PyTorch Lightning, MXNet, scikit-learn, and TFLite, among others. It was specifically designed to advance FL research, enabling heterogeneous FL workloads at scale [1]. Using Flower, one can run experiments simulating clients/devices that, connected to the federated network, train on their own datasets and send the parameters of what has been learned to the centralizing server, which in turn aggregates each update into a single model.

The framework also allows different configurations to be customized for an experiment, such as waiting for a minimum number of clients before starting a training cycle, selecting which clients participate in a cycle, measuring progress both on the client and on the server, and sending the initial parameters to each client when starting an experiment. This flexibility allows for a diversity of scenarios in a given experiment.

Settings. To perform the experiment, this paper used the FedAvg [18] strategy, with 3 clients running 1 epoch per round, for 40 rounds, each client holding 1/3 of the training dataset. In addition, we required at least 3 clients to be available at the beginning of each round, enabled client-side validation, enabled the sending of the initial parameters to each client, and configured validation of the updated model on the test dataset, performed at the end of each round.
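A server-side configuration along these lines can be sketched with Flower as follows. This is a sketch assuming the Flower 1.x API; the server address is a placeholder, and the commented-out `initial_parameters` and `evaluate_fn` callbacks (initial weights and centralized test-set evaluation) are omitted for brevity:

```python
import flwr as fl

# FedAvg strategy mirroring the paper's settings:
# all 3 clients required per round, 40 federated rounds.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=1.0,          # use every available client each round
    min_fit_clients=3,
    min_evaluate_clients=3,
    min_available_clients=3,   # wait until 3 clients have connected
    # initial_parameters=...,  # initial model weights sent to the clients
    # evaluate_fn=...,         # centralized evaluation on the test split
)

fl.server.start_server(        # blocks, serving the 40 federated rounds
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=40),
    strategy=strategy,
)
```

Each client then connects with `fl.client.start_client`, runs its local epoch per round, and returns its updated weights for aggregation.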

4.4 Experimental Setup

The experiment was conducted on the Google Colab platform, Pro version, with 27 GB of RAM and an Nvidia Tesla P100 PCIe GPU with 16 GB. MLFlow v1.20, an open-source platform that integrates with several libraries used in the life cycle of machine learning solutions, was used for annotations, logging, experiment tracking, and metric and model storage in a central repository. The Python 3 language and the TensorFlow v2.6 framework were used to build the CNN models.

5 Results

The four chosen models were evaluated on the HEAR Dataset, and the results obtained are shown in Table 2. The table also reports the number of clients and rounds executed, as well as the total execution time (in hours). The MobileNet architecture performed best in terms of accuracy (71.9%), but Inception V3 showed better results in \(f_{1}\) score, precision, and recall: 89.1%, 85.8%, and 92.7%, respectively.

Table 2. Results obtained, along with the number of rounds and the runtime.

Table 2 shows that the MobileNet and Inception V3 architectures differed by 1 h in run time, a much lower value than the time spent by the VGG16 and ResNet-152 architectures, which averaged 12.3 h. An experiment was also performed with the same architectures without the FL technique, presented in Table 3, to check for performance loss when using the technique.

Another important point is the standard deviation of the accuracy, presented in Table 2, which indicates how close the accuracy values are to the average: ResNet-152 has the lowest value at 0.013, followed by Inception V3 with 0.029, VGG16 with 0.052, and MobileNet with 0.069.

Table 3. Comparison between training with and without FL, along with the number of rounds and the execution time.

In addition, Table 3 shows a loss of almost 10 percentage points in accuracy for the MobileNet and ResNet-152 architectures when comparing the FL and non-FL results. For the VGG16 and Inception V3 architectures, however, the loss is not as significant: VGG16 achieved 68.4% without FL and 65.2% with FL, while Inception V3 achieved 61.5% without FL and 58.7% with FL. Another point to highlight is the execution time of the two approaches: in the FL approach, the MobileNet architecture had the shortest execution time, while in the non-FL approach the MobileNet and VGG16 architectures had the same execution time.

Friedman's non-parametric statistical test [4] was applied to compare multiple models on a single dataset; the computed significance (p-value) was below the significance level \(\alpha \) = 0.050, so we conclude that the distributions differ and the null hypothesis can be rejected.
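The Friedman test can be reproduced with `scipy.stats.friedmanchisquare`, which ranks the models within each repeated measurement and tests whether the mean ranks differ. The accuracy samples below are illustrative values chosen by us, not the paper's actual measurements:

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-run accuracy samples for three of the models
# (illustrative values only, not the paper's numbers).
mobilenet = [0.70, 0.72, 0.71, 0.73, 0.72]
inception = [0.58, 0.60, 0.59, 0.57, 0.61]
vgg16     = [0.65, 0.64, 0.66, 0.65, 0.63]

stat, p_value = friedmanchisquare(mobilenet, inception, vgg16)
print(p_value < 0.05)  # True: the distributions differ, reject the null hypothesis
```

A p-value below the chosen significance level (here 0.05, as in the paper) justifies moving on to pairwise post hoc comparisons to find which models differ.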

Additionally, the MobileNet architecture presents better performance in terms of accuracy (75.5%), \(f_{1}\) score (75.0%), precision (76.3%), and recall (79.0%) when FL was not used (Tables 4 and 5).

Table 4. Friedman Test Statistics for FL.
Table 5. Friedman Test Statistics for Non-FL.
Table 6. Pairwise Comparisons for Non-FL.
Table 7. Pairwise Comparisons for FL.

Wilcoxon's [35] post hoc method was applied to test the null hypothesis on the samples; as a result, critical difference diagrams were generated, illustrated in Fig. 5. Another way to visualize the data and identify the most critical differences is through pairwise comparisons, displayed in Table 6 for the non-FL approach and Table 7 for FL.
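Each pairwise comparison corresponds to a Wilcoxon signed-rank test on paired samples, available as `scipy.stats.wilcoxon`. As before, the numbers are illustrative placeholders, not the paper's measurements:

```python
from scipy.stats import wilcoxon

# Hypothetical paired accuracy samples for two models (illustrative only).
mobilenet = [0.70, 0.72, 0.71, 0.73, 0.74, 0.75]
inception = [0.60, 0.61, 0.59, 0.60, 0.60, 0.60]

stat, p_value = wilcoxon(mobilenet, inception)
print(p_value < 0.05)  # True: the paired difference is significant
```

Running this test for every model pair (with a multiple-comparison correction such as Holm's) yields the pairwise comparison tables and the critical difference diagram.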

Fig. 5. Critical difference diagram.

6 Conclusion and Future Work

This work investigates the performance of CNN architectures in the detection of physical violence using FL on the HEAR Dataset [15], a synthetic dataset with 70,000 audios converted to mel-spectrograms (images). Due to computational limitations, only 10,000 images, divided equally into two classes (presence or absence of physical violence), were used. In this way, it was possible to apply consolidated techniques and tools to the audio context. [15] presented the investigation and application of mel-spectrograms in the field of audio violence detection, opening a gap for their application using FL. In this experiment, the dataset was divided among 3 clients, each running 1 epoch per round, in a 40-round cycle, considering four CNN models: Inception v3, MobileNet v2, ResNet 152 v2, and VGG-16. Finally, the results showed that MobileNet obtained the best result when used with the FL technique, achieving 71.9% accuracy, a loss of 3.6% when compared to the experiment without FL.

For future work, we aim to test the models on audio recordings of real violence, apply techniques that provide privacy for the data exchanged between clients and server [6, 10, 17, 23, 33], and use the model with a larger sample of clients. In addition, we intend to build comparisons with other models developed via transfer learning from Large-Scale Pretrained Audio Neural Networks (PANNs) [13].