Audio-Based Deep Learning Frameworks for Detecting COVID-19
Authors: Ngo, Dat; Pham, Lam; Hoang, Truong; Kolozali, Sefki; Jarchi, Delaram
Date: 2022-02-10

Abstract: This paper evaluates a wide range of audio-based deep learning frameworks applied to breathing, cough, and speech sounds for detecting COVID-19. In general, the audio recording inputs are transformed into low-level spectrogram features, which are then fed into pre-trained deep learning models to extract high-level embedding features. Next, the dimension of these high-level embedding features is reduced before fine-tuning with a Light Gradient Boosting Machine (LightGBM) as the back-end classifier. Our experiments on the Second DiCOVA Challenge achieved the highest Area Under the Curve (AUC), F1 score, sensitivity score, and specificity score of 89.03%, 64.41%, 63.33%, and 95.13%, respectively. Based on these scores, our method outperforms the state-of-the-art systems and improves the challenge baseline by 4.33%, 6.00%, and 8.33% in terms of AUC, F1 score, and sensitivity score, respectively.

The COVID-19 pandemic continues to spread around the world, with more than 300 million confirmed cases and more than five million deaths across almost 200 countries [1]. Although many countries are making relentless efforts to roll out their vaccination programs, the number of global daily new cases is still increasing as COVID-19 restrictions are eased and people can again travel worldwide. To deal with the rapid spread of infection across populations, many countries have to conduct massive daily tests of their citizens to control the spread of the virus. In particular, molecular testing approaches, namely the reverse transcription polymerase chain reaction test (RT-PCR) [2] and the rapid antigen test (RAT) [3], are now widely applied as the primary testing methodologies in most countries. However, these methodologies present various limitations: the procedure of sample collection violates physical distancing, and in many countries the option of ordering lateral flow tests online for use at home is limited. Additionally, analysing samples and delivering results requires costly chemical equipment and labour-intensive work. As a result, there is an urgent need for a non-invasive, scalable, and cost-effective tool to detect infected individuals in a decentralized manner.

As COVID-19 is associated with primary symptoms such as fever, sore throat, cough, and chest pain, many researchers and practitioners are motivated to use acoustic signals from the human respiratory system, such as breathing, cough, and speech, for early detection of COVID-19 positive and negative individuals [4]. Indeed, calls for the development of such diagnostic tools were announced at Interspeech 2021 as a special session titled 'Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge' and at ICASSP 2022 as a call for papers, referred to as the First and Second DiCOVA Challenges, respectively. In this paper, we explore all types of human respiratory sounds (breathing, cough, and speech) provided by the recent Second DiCOVA Challenge, and propose a robust framework for detecting COVID-19.
Our contributions include: (1) conducting extensive experiments to pinpoint the most effective approach for extracting well-represented features for each type of input sound (breathing, cough, and speech); (2) evaluating how oversampling of positive samples and dimension reduction of the represented features affect system performance; and (3) demonstrating that our proposed framework for detecting COVID-19 is reliable and robust, outperforms the state-of-the-art systems, and has potential for real-life application.

In this paper, we evaluate the dataset released for the Second DiCOVA Challenge [5]. This dataset provides audio recordings of three sound categories (breathing, cough, and speech) collected from both COVID-19 positive and negative patients in the age group of 15.

To explore the DiCOVA dataset, we first present the high-level architecture of the proposed framework, as shown in Figure 1. The framework is separated into three main steps: low-level spectrogram feature extraction, high-level embedding feature extraction, and back-end classification.

In the first step, shown in the upper stream of Figure 1, the raw audio recordings are re-sampled to 16,000 Hz and duplicated so that all recordings have an equal duration of 10 seconds. These recordings are then transformed into spectrograms in which both temporal and spectral features are presented. The low-level spectrogram features are then fed into pre-trained deep learning models to extract embeddings (i.e. vectors), referred to as high-level features. The pre-trained deep learning models used for extracting the high-level embedding features come from two different approaches: (I) pre-trained models trained directly on the DiCOVA dataset, and (II) pre-trained models trained on the large-scale AudioSet dataset [7].

(I) In the first approach, we construct deep learning models and train them directly on the DiCOVA dataset mentioned in Section II, using cross-entropy loss and the Adam optimizer [8]. After training, we reuse these models to extract the feature map at the global pooling layer, which serves as the high-level embedding feature mentioned above. As the quality of the high-level embedding features in this approach depends on both the low-level spectrograms and the model architectures, we build on our previous work [9]-[13] to evaluate: three types of spectrograms, log-Mel [14], Gammatonegram [15], and Scalogram [16], which have proven effective for representing respiratory-related sounds; and a wide range of benchmark deep learning models, from low-footprint models such as a LeNet-based network [17] to highly complex architectures such as Xception [18] and InceptionV3 [19].

To identify which spectrogram features are well-represented for breathing, cough, or speech inputs, we evaluate the three proposed spectrograms using the LeNet-based network architecture shown in Table I. To ensure that the spectrograms fed into the LeNet-based model have the same size, we use the same settings (window size = 2048, hop size = 1024, and number of filters = 128), which yields spectrograms of size 128×154. The LeNet-based architecture in Table I comprises convolutional layers (Conv [kernel size]), batch normalization (BN) [20], rectified linear units (ReLU) [21], average pooling (AP), global average pooling (GAP), dropout [22] (Dr(percentage)), fully connected (FC) layers, and a Softmax layer.
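As a concrete illustration of this front-end, the sketch below computes a log-Mel spectrogram with the settings above using librosa [14]. It is a minimal sketch under stated assumptions: the function name and file path are illustrative, and the Gammatonegram and Scalogram front-ends would require separate toolboxes.

```python
import librosa
import numpy as np

def extract_logmel(path, sr=16000, duration=10.0, n_fft=2048, hop=1024, n_mels=128):
    """Load a recording, repeat-pad it to a fixed duration, and return a log-Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    target_len = int(sr * duration)
    # Repeat the waveform until it reaches 10 seconds, then truncate
    # (the duplication step described in the paper).
    if len(y) < target_len:
        y = np.tile(y, int(np.ceil(target_len / len(y))))
    y = y[:target_len]
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    # Roughly 128 x 155 frames; the paper reports 128x154, and the exact
    # frame count depends on the framing/padding convention used.
    return librosa.power_to_db(mel)
```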
Since different low-level spectrograms suit different types of audio input (i.e. breathing, cough, or speech), we evaluate these spectrograms with a range of benchmark neural network architectures, VGG16 [23], VGG19 [23], MobileNetV1 [24], ResNet50 [25], Xception [18], InceptionV3 [19], and DenseNet121 [26], to find the best framework configuration (i.e. which low-level spectrogram and which network architecture). As the evaluated architectures are reused from the Keras library [27], the final fully connected layer of each network is modified from 1000 outputs (the number of image object classes in the ImageNet dataset) to 2, matching the number of classes in the DiCOVA dataset. It is worth noting that we only reuse the network architectures from the Keras library rather than the available weights trained on ImageNet: all trainable parameters of these networks are initialized from a normal distribution with mean 0 and variance 0.1.

(II) In the second approach, we leverage three pre-trained models already trained on the large-scale AudioSet dataset: PANN [28], OpenL3 [29], and TRILL [30]. We evaluate whether these up-stream pre-trained models are beneficial for the down-stream task of detecting COVID-19 positives in the DiCOVA dataset. While PANN [28] and OpenL3 [29] leverage a VGGish architecture and cross-entropy loss for training on the large-scale AudioSet dataset, TRILL [30] is based on ResNet and a triplet loss. Additionally, while PANN [28] was trained on input signals of 10-second duration, both TRILL [30] and OpenL3 [29] analyse short segments of 1 second. As they use different network architectures and loss functions and analyse different audio durations, these pre-trained models may perform differently on the three types of audio input (breathing, cough, and speech) in the DiCOVA dataset.

Regarding the high-level embedding features extracted from these pre-trained models, OpenL3 [29] and PANN [28] extract the feature map at the global pooling layer, whereas TRILL [30] extracts the feature map at the final layer, which proved effective for the different down-stream tasks reported in [30]. Notably, since the PANN pre-trained model works on 10-second inputs from AudioSet, which matches the input duration of our proposed system in Figure 1, only one embedding (i.e. one vector) is extracted from each 10-second audio recording. Meanwhile, as OpenL3 and TRILL were trained on 1-second audio segments of AudioSet, multiple embeddings are obtained when one 10-second DiCOVA audio sample is fed into these two pre-trained models. An average of these embeddings across the time dimension is therefore computed to obtain a single embedding that represents each 10-second audio input. Additionally, only the low-level log-Mel spectrogram is used in this approach, as all three pre-trained models use this type of spectrogram for training on the large-scale AudioSet dataset.

Given these two approaches to extracting high-level embedding features, we refer to two main frameworks: (I) three low-level spectrograms (log-Mel, Gammatonegram, Scalogram), pre-trained models trained directly on the DiCOVA dataset, and LightGBM back-end classification; and (II) the low-level log-Mel spectrogram, pre-trained models trained on the large-scale AudioSet, and LightGBM back-end classification.
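For approach (I), a minimal TensorFlow/Keras sketch of preparing one of the benchmark architectures (Xception here) with a 2-class output and randomly initialized weights is shown below. The input shape, the 3-channel replication of the spectrogram, and the Adam learning rate are illustrative assumptions, not values stated in the paper.

```python
import numpy as np
import tensorflow as tf

# Xception architecture only (weights=None), with the 1000-way ImageNet head
# replaced by a 2-way softmax for COVID-19 positive/negative classification.
model = tf.keras.applications.Xception(
    weights=None,
    input_shape=(128, 154, 3),   # spectrogram replicated to 3 channels (assumption)
    classes=2,
    classifier_activation="softmax",
)

# Re-initialize every trainable parameter from a normal distribution with
# mean 0 and variance 0.1, as described in the paper.
stddev = float(np.sqrt(0.1))
for var in model.trainable_variables:
    var.assign(tf.random.normal(var.shape, mean=0.0, stddev=stddev))

# Cross-entropy loss with the Adam optimizer, as in the paper
# (the learning rate here is an assumed value).
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```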
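For approach (II), the sketch below shows how a clip-level embedding could be obtained from OpenL3 by averaging its frame-level embeddings over time. The content type, input representation, embedding size, and file name are assumptions for illustration; the PANN and TRILL features would be extracted analogously with their own toolkits.

```python
import numpy as np
import soundfile as sf
import openl3

# Load a (repeat-padded) 10-second DiCOVA recording; the file name is a placeholder.
audio, sr = sf.read("speech_sample.wav")

# OpenL3 analyses roughly 1-second windows, so a 10-second clip yields a
# sequence of frame-level embeddings with shape (num_frames, embedding_dim).
emb, timestamps = openl3.get_audio_embedding(
    audio, sr,
    content_type="env",     # assumed setting
    input_repr="mel128",    # assumed setting
    embedding_size=512,     # assumed setting
)

# Average across the time dimension to obtain one embedding per recording,
# as described in the paper for the OpenL3 and TRILL features.
clip_embedding = np.mean(emb, axis=0)    # shape: (512,)
```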
In this paper, we use the Light Gradient Boosting Machine (LightGBM) [31] as the final back-end classification model to fine-tune the high-level embedding features. LightGBM is implemented with the available toolkit [31], and the parameters are set as follows: learning rate = 0.02, objective = 'binary', metric = 'auc', subsample = 0.68, colsample_bytree = 0.28, early_stopping_rounds = 1000, num_iterations = 10000, subsample_freq = 1. As Track-4 of the Second DiCOVA Challenge suggests using all audio inputs (breathing, cough, and speech), the high-level embedding features extracted from the different input types are concatenated before being fed into the back-end LightGBM classifier.

A. Performance comparison of frameworks (I): three low-level spectrograms, pre-trained models directly trained on the DiCOVA dataset, and LightGBM back-end classification

We first evaluate how the low-level spectrograms affect performance in frameworks (I); the results are shown in Table II.

B. Performance comparison of frameworks (II): low-level log-Mel spectrogram, pre-trained models trained on the AudioSet dataset, and LightGBM back-end classification

As Table IV shows, different pre-trained models work well on different types of sound input. In particular, the best scores of 79.82/48.33 are achieved with TRILL for the cough input, PANN achieves the best scores of 84.05/55.00 for breathing, and the speech input yields the best scores of 86.65/61.67 with OpenL3. Notably, using models pre-trained on AudioSet outperforms the frameworks (I) analysed in Section IV-A for detecting COVID-19 from breathing, cough, or speech alone. Comparing performance across audio inputs, breathing and speech show greater potential for detecting COVID-19 than cough. However, when we concatenate the high-level embedding features extracted from breathing, cough, and speech for Track-4 of the challenge, the performance of TRILL improves significantly to 85.83/51.67, while OpenL3 achieves the highest scores of 86.14/65.00 among the models in this track.

As the proposed frameworks with concatenated high-level features prove effective for detecting COVID-19 in Track-4 of the Second DiCOVA Challenge, we conduct further experiments to evaluate how oversampling of positive samples and dimension reduction of the high-level features affect performance. In these experiments, we concatenate (1) three TRILL-based high-level features, (2) three PANN-based high-level features, (3) three OpenL3-based high-level features, (4) three Xception-based high-level features, and (5) the (TRILL(c)-PANN(b)-OpenL3(s)) high-level features for cough, breathing, and speech, respectively. We evaluate the (TRILL(c)-PANN(b)-OpenL3(s)) combination because each of these features is effective for its corresponding audio input, as shown in Table IV. To identify the less significant dimensions of the high-level features, we first compute the average of each dimension across all features associated with COVID-19 positive and negative samples independently, obtaining two vectors representing the positive and negative groups. Based on the absolute difference between these two vectors in each dimension (a lower absolute difference indicates a less significant dimension), the dimension of the feature set is reduced by 10% to 90%. Note that the dimension of the high-level features is reduced before concatenation.
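The dimension-selection step just described can be sketched as follows. The function name and drop ratio are illustrative; the same reduction would be applied to each embedding type before concatenation.

```python
import numpy as np

def reduce_dimensions(features, labels, drop_ratio=0.4):
    """Drop the least significant embedding dimensions.

    features:   (num_samples, num_dims) high-level embeddings
    labels:     (num_samples,) with 1 = COVID-19 positive, 0 = negative
    drop_ratio: fraction of dimensions to remove (the paper sweeps 10% to 90%)
    """
    pos_mean = features[labels == 1].mean(axis=0)
    neg_mean = features[labels == 0].mean(axis=0)

    # Dimensions where the two class means are close carry little
    # discriminative information and are removed first.
    significance = np.abs(pos_mean - neg_mean)
    num_keep = int(features.shape[1] * (1.0 - drop_ratio))
    keep_idx = np.argsort(significance)[::-1][:num_keep]

    return features[:, keep_idx], keep_idx
```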
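A minimal sketch of the LightGBM back-end described at the beginning of this section, with the hyper-parameters reported in the paper, is given below. The training and validation arrays are random placeholders standing in for the concatenated embeddings and labels prepared elsewhere in the pipeline.

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the concatenated high-level embeddings and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 512)), rng.integers(0, 2, 800)
X_val, y_val = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)

# Hyper-parameters reported in the paper.
params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.02,
    "subsample": 0.68,
    "colsample_bytree": 0.28,
    "subsample_freq": 1,
}

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=10000,                                   # num_iterations
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(stopping_rounds=1000)],    # early_stopping_rounds
)

# Probability of the COVID-19 positive class for the validation samples.
val_scores = booster.predict(X_val, num_iteration=booster.best_iteration)
```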
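The oversampling experiments apply SVM-SMOTE to the positive class, as detailed in the results that follow. Below is a minimal sketch using the imblearn implementation, which is an assumed toolkit since the paper only cites the SVM-SMOTE method [32]; the embeddings, labels, and oversampling factor are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SVMSMOTE

# Placeholder embeddings and labels standing in for the real high-level features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(900, 512))
y_train = np.array([1] * 100 + [0] * 800)   # imbalanced: few COVID-19 positives

# Double the number of positive samples (the paper evaluates factors of 2 to 5).
num_pos = int((y_train == 1).sum())
sampler = SVMSMOTE(sampling_strategy={1: 2 * num_pos}, random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```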
To oversample the positive cases, we apply SVM-SMOTE [32] to the high-level embedding features, increasing the number of positive samples by factors of two to five. As Figure 2 shows, oversampling the positive samples reduces the AUC scores; it only helps to improve the sensitivity score with PANN (doubling the positive samples) and Xception (five times the positive samples). Meanwhile, reducing the dimension of the high-level embedding features by 20% to 40% improves both the AUC and sensitivity scores in almost all cases. As a result, we achieve our best scores. Note that we keep the TRILL-based embedding (with 20% dimensionality reduction) for cough, and the PANN-based and OpenL3-based embeddings (with no reduction) for breathing and speech, respectively. Similarly, we apply the best concatenation (TRILL(c)-PANN(b)-OpenL3(s)) (with 40% dimensionality reduction) for all audio inputs. The comparison in Table VI again shows that the LightGBM model achieves the best scores and outperforms the other models.

E. Performance comparison across the top-10 systems submitted to the Second DiCOVA Challenge

As the comparison with the state-of-the-art systems [6] in Table V shows, we achieve the top-6 position in Track-1 with the breathing input only and the top-3 position in Track-2 with the cough input only. Notably, our proposed systems outperform the state of the art and achieve the top-1 position in both Track-3 (speech input only) and Track-4 (all audio inputs). To evaluate the best score of the LightGBM with the best-selected embedding features for cough, speech, breathing, and all audio inputs, we ran the experiments ten times, each on a different randomly chosen test set, and computed the average confidence interval.

This paper has presented an exploration of how to effectively extract well-represented features for breathing, cough, and speech sound inputs via pre-trained models. Through extensive experiments, we achieve a robust framework for detecting COVID-19 compared with the state-of-the-art systems, as demonstrated by our rankings in the Second DiCOVA Challenge: top-1 in both Track-3 and Track-4, top-3 in Track-2, and top-6 in Track-1. Our best AUC score of 89.03%, F1 score of 64.41%, and sensitivity score of 66.33% in Track-4 demonstrate the potential of detecting COVID-19 through respiratory-related sounds.
References

[1] Covid map: Coronavirus cases, deaths, vaccinations by country, 2022-01-18.
[2] Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR.
[3] Scaling up COVID-19 rapid antigen tests: promises and challenges.
[4] A generic deep learning based cough analysis system from clinically validated samples for point-of-need COVID-19 test and severity levels.
[5] DiCOVA challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[7] Audio Set: An ontology and human-labeled dataset for audio events.
[8] Adam: A method for stochastic optimization.
[9] An ensemble of deep learning frameworks applied for predicting respiratory anomalies.
[10] Deep learning framework applied for predicting anomaly of respiratory sounds.
[11] CNN-MoE based framework for classification of respiratory anomalies and lung disease detection.
[12] Inception-based network and multi-spectrogram ensemble applied to predict respiratory anomalies and lung diseases.
[13] Robust deep learning framework for predicting respiratory anomalies and diseases.
[14] librosa: Audio and music signal analysis in Python.
[15] Gammatone-like spectrogram.
[16] Higher-order properties of analytic wavelets.
[17] Object recognition with gradient-based learning, in Shape, Contour and Grouping in Computer Vision.
[18] Xception: Deep learning with depthwise separable convolutions.
[19] Rethinking the Inception architecture for computer vision.
[20] Batch normalization: Accelerating deep network training by reducing internal covariate shift.
[21] Rectified linear units improve restricted Boltzmann machines.
[22] Dropout: A simple way to prevent neural networks from overfitting.
[23] Very deep convolutional networks for large-scale image recognition.
[24] MobileNets: Efficient convolutional neural networks for mobile vision applications.
[25] Deep residual learning for image recognition.
[26] Densely connected convolutional networks.
[27] Keras library.
[28] PANNs: Large-scale pretrained audio neural networks for audio pattern recognition.
[29] Look, listen, and learn more: Design choices for deep audio embeddings.
[30] Towards learning a universal non-semantic representation of speech.
[31] LightGBM: A highly efficient gradient boosting decision tree.
[32] Borderline over-sampling for imbalanced data classification.
[33] Scikit-learn: Machine learning in Python.