key: cord-0548284-tfaz0sam
title: Audio feature ranking for sound-based COVID-19 patient detection
authors: Meister, Julia A.; Nguyen, Khuong An; Luo, Zhiyuan
date: 2021-04-14
sha: 617cb5dbbf6ad708222f38ea21a7f5a8020cae80
doc_id: 548284
cord_uid: tfaz0sam

Audio classification using breath and cough samples has recently emerged as a low-cost, non-invasive, and accessible COVID-19 screening method. However, no application has been approved for official use at the time of writing due to the stringent reliability and accuracy requirements of the critical healthcare setting. To support the development of Machine Learning (ML) classification models, we performed an extensive comparative investigation and ranking of 15 audio features, including several less well-known ones. The results were verified on two independent COVID-19 sound datasets. By using the identified top-performing features, we increased the COVID-19 classification accuracy by up to 17% on the Cambridge dataset and up to 10% on the Coswara dataset, compared to the original baseline accuracies obtained without our feature ranking.

A widely accessible, non-invasive, low-cost testing mechanism is the number one priority to support test-and-trace in most pandemics. The advent of COVID-19 has abruptly brought respiratory audio classification into the spotlight as a viable alternative for mass pre-screening, needing only a smartphone to record a breath or cough sample [3]. In just the past 12 months, many universities and research institutions have set up audio data collection systems, generally reliant on voluntary submissions, resulting in a variety of smartphone applications based on audio pre-processing and ML classification. However, at the time of writing, none has yet been officially endorsed for medical use, largely because of the high accuracy and reliability expectations for such a critical healthcare task.

This paper aims to give a holistic overview, evaluation, and ranking of 15 audio features in the context of the binary COVID-19 audio classification task, which, to the best of our knowledge, has not been researched before. The paper makes the following contributions to the binary COVID-19 respiratory audio classification task:

i. Audio feature analysis and ranking. We perform an extensive comparative analysis and ranking of 15 sound features prevalent in speech and non-speech audio classification. The evaluation is carried out on two independent datasets, allowing the findings to be generalised.

ii. Highlighting effective features. We identify ML features with strong discriminative performance that go against common rules of thumb regarding audio feature selection.

iii. Increasing the COVID-19 detection accuracy. A natural culmination of the previous points: compared to the baseline results presented in the datasets' original papers, we increase the classification accuracy by up to 17%, simply by incorporating the additional training features identified through our feature ranking.

The findings described in this paper are directly relevant to the COVID-19 sound-based classification task and would benefit future implementations using the same approach.

The remainder of the paper is organised into four sections. Section 2 provides a thorough description of the relevant audio features. Section 3 describes the implementations and then focuses on the extensive experimental analysis. Section 4 outlines related work in the COVID-19 classification domain.
Finally, Section 5 summarises our findings and outlines further work.

Feature engineering is a vital step in any ML application, as a model's predictive efficiency relies directly on the discriminating information encoded in the input vectors. Before delivering a comprehensive comparison of 15 audio features in the context of binary COVID-19 audio classification, we first provide a detailed overview and intuition of their function. The selected features cover a variety of domains, including those used in speech and non-speech audio tasks. A summary is presented in Table 1.

Low-level features extracted directly from the audio signal without requiring a transformation are grouped in the time domain. While such features are often not meaningful to humans, they are commonly included in a larger feature set in audio classification tasks because they are very efficient to calculate. In the context of lung-sound classification, such features can identify explosive and discontinuous sounds (e.g. crackling) that often occur due to a build-up of fluid or secretions in the throat and lungs [21]. The selected features have been previously extracted for COVID-19 classification [3, 22].

i. Root-mean-square (RMS) energy. The frame-wise RMS value summarises the signal's average power and has commonly been paired with the zero-crossing rate for audio discrimination [17].

ii. Zero-crossing rate (ZCR). The rate of the signal's sign change over time is given by Equation (2). Here x_n is the signal's amplitude at the frame with index n (N frames overall), and sign(a) returns 1 if a > 0, 0 if a = 0, and −1 otherwise [17].

In its original format, digital audio is encoded as a temporal sequence of samples. Decomposing the signal into its constituent frequencies (e.g. with the Fourier Transform) reveals information about the frequency content. Because most frequency-domain features, also called spectral features, describe only a small aspect of the audio signal, they are rarely used individually for audio classification tasks. The selected features describe and compare the signal's intensity, which can provide information about the state of the respiratory tract, e.g. identifying abnormal lung sounds when it is affected by a respiratory disease [21]. A subset of the following features has previously been used for COVID-19 detection [3, 22].

i. Spectral bandwidth (S-BW). Also referred to as spectral spread, S-BW describes the signal's energy concentration around the centroid. Equation (3) defines bandwidth as the variance around the signal's expected frequency E, given the energy P_k and corresponding frequency f_k in 1 ≤ k ≤ K subbands [19].

ii. Spectral centroid (S-CENT). The centroid identifies a signal's mean frequency, i.e. the band with the highest energy concentration. Equation (4) shows its breakdown into the weighted and unweighted sums of spectral magnitudes P_k in the k-th of K subbands, where f_k is the corresponding frequency range [23].

iii. Spectral contrast (S-CONT). The audio signal's contrast is evaluated by comparing the spectral energy peaks P_k and valleys V_k in each frequency subband k, see Equation (5). N represents the number of frames and x'_{k,n} the FFT vector of the k-th subband in frame n, with elements in descending order [7].

iv. Spectral flatness (S-FLAT). Also called a tonality coefficient, flatness measures a signal's similarity to white noise (flat spectrum). It is defined as the ratio between the geometric and arithmetic means, as shown in Equation (6), where P_k is the signal's energy at the k-th frequency band such that 1 ≤ k ≤ K [10].
v. Spectral flux (S-FLUX). A measure of a signal's change in energy between frames, estimated by Equation (7). E_{n,k} represents the k-th normalised DFT (Discrete Fourier Transform) coefficient in frame n across K coefficients [23].

vi. Spectral rolloff (S-ROLL). A description of the relationship between frequency and energy, rolloff represents the minimum frequency f_R such that the energy accumulated below it is not less than the specified proportion S of the total energy. P_k is the spectral energy in one of K frequency subbands [23].

This feature category illustrates a signal's frequency-related information as it varies over time. We consider two types of time-frequency features: cepstral features (encoding timbre or tone colour) and tonal features (describing pitch).

Cepstral features. This paper focuses on the Mel-frequency Cepstrum (MFC), as it is by far the most commonly used cepstral feature variant in audio classification tasks. The MFC mimics the non-linear human perception of sound and is applied ubiquitously in both speech and non-speech classification tasks. While both spectral and cepstral features can facilitate respiratory classification by exploring a signal's frequency content, the latter's benefit is the inclusion of temporal and transitional information. MFC features have previously been used for COVID-19 detection [3, 13].

i. Mel-frequency cepstral coefficients (MFCC). MFCC features are derived from the MFC power spectrum. In Equation (9) the signal is transformed into the time-frequency domain by a discrete cosine transform, where K is the number of coefficients and s(k) calculates the logarithmic energy of the k-th coefficient at frame n [20].

ii. MFCC-∆. As the first-order derivative of MFCC, also referred to as velocity, this feature represents temporal change [5]. It is often included in combination with MFCC, as it has a low extraction cost.

iii. MFCC-∆². Acceleration, MFCC's second-order derivative, is commonly included when MFCC is extracted from an audio signal because it is resource-efficient to calculate and can improve audio classification [5].

Tonal features. Tonal features primarily encode an audio signal's harmonic information in 12 pitch classes and are based on the human perception of periodic pitch [15]. Two feature groups are considered, distinguished by their underlying representation: chroma features (chromagram) and the Tonnetz (lattice graph). While the Tonnetz encodes tone quality and height, chromagrams omit interval information. A common consequence of respiratory diseases is a narrowing of the airways by secretions. The effect is a wheezing sound because the pitch of in- and expiration is altered [21], which can be heard in COVID-19 lung-sound recordings.

i. Chroma energy normalised (C-ENS). A chromagram feature abstraction introduced in [14] that considers short-time statistics over energy distributions within the chroma bands. Normalisation of the feature vectors makes it resistant to dynamic variations such as timbre and articulation [15].

ii. Constant-Q chromagram (C-CQT). Chroma features are extracted from a time-frequency representation of the audio via a filter bank. In this case, the initial audio transformation is the constant-Q transform (CQT), which has a good resolution of low frequencies [8].

iii. Short-time Fourier Transform chromagram (C-STFT). The feature extraction process is similar to that of C-CQT. The difference lies in the audio signal's transformation into the time-frequency domain, which in this case is calculated via the Short-time Fourier Transform (STFT) [8].

iv. Tonnetz (TN). A Tonnetz (German: tone network) encodes harmonic data in a lattice graph. The benefit of a graphical representation is that distances between points are musically meaningful, as pitch is encoded as geometric areas in the graph [6].
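The display equations referenced above (Equations (2)-(9)) are not reproduced in this text. For orientation only, the LaTeX sketch below gives standard textbook forms of several of them that are consistent with the variable definitions in the feature descriptions; the exact formulations used in the original paper may differ.

```latex
% Standard forms assumed from the descriptions above -- not copied from the paper.
% Zero-crossing rate over N frames (cf. Equation (2)):
\mathrm{ZCR} = \frac{1}{2N} \sum_{n=1}^{N-1} \bigl| \operatorname{sign}(x_n) - \operatorname{sign}(x_{n-1}) \bigr|
% Spectral centroid, the energy-weighted mean frequency (cf. Equation (4)):
E = \frac{\sum_{k=1}^{K} f_k P_k}{\sum_{k=1}^{K} P_k}
% Spectral bandwidth as the energy-weighted variance around the centroid E (cf. Equation (3)):
\mathrm{S\text{-}BW} = \frac{\sum_{k=1}^{K} (f_k - E)^2 P_k}{\sum_{k=1}^{K} P_k}
% Spectral flatness, the ratio of geometric to arithmetic mean energy (cf. Equation (6)):
\mathrm{S\text{-}FLAT} = \frac{\bigl( \prod_{k=1}^{K} P_k \bigr)^{1/K}}{\tfrac{1}{K} \sum_{k=1}^{K} P_k}
% Spectral flux between consecutive frames of normalised DFT coefficients (cf. Equation (7)):
\mathrm{S\text{-}FLUX}_n = \sum_{k=1}^{K} \bigl( E_{n,k} - E_{n-1,k} \bigr)^2
% Spectral rolloff: the lowest frequency below which a proportion S of the total energy lies:
f_R = \min \Bigl\{ f_r : \sum_{k \le r} P_k \ge S \sum_{k=1}^{K} P_k \Bigr\}
```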
The ranking of the above 15 selected audio features will be based on the empirical results and analysis of two datasets, to make the findings more generalisable. The assumption is that any distinct patterns repeated across independent datasets are likely inherent to the COVID-19 breath and cough audio recordings, not to the underlying datasets.

Exploring the following questions is the focus of this body of work. They are centred on the binary COVID-19 audio classification task and have informed the experimental design and the subsequent results analysis.

i. What are the most distinguishable ML audio features?

ii. Are the feature rankings comparable across independent datasets?

iii. What is the performance accuracy of the new ML models using the most dominant features?

To answer the above research questions, this section contains a brief description of the datasets underlying the evaluated features, the data preparation and pre-processing steps, and an extensive description and analysis of the results. Finally, we compare our improved results to the baseline ML accuracies presented in the datasets' original papers.

Two independent datasets are considered in parallel throughout the paper to indicate whether identified feature rankings are likely specific to the underlying dataset or generally applicable: the Cambridge and the Coswara COVID-19 audio datasets. The distribution of sample counts can be found in Table 2.

Introduced in [3], the Cambridge dataset is a collection of voluntary web and Android recordings of coughing and breathing sounds from healthy, COVID-positive, and asthmatic people. Only the first two categories are considered, as the latter has only eight samples. The data available for this paper is a curated set of samples collected during April and May 2020. While the paper describes various metadata statistics over the entire dataset (e.g. age, gender, location, and symptom distribution), such information is not included in the curated dataset. The data comes as 2 to 30-second WAV files with a 48 kHz sampling rate.

The Coswara dataset [22] is collected and freely distributed by the Indian Institute of Science and receives its voluntary samples through a web application. The samples considered in this paper were collected between April and December 2020. The available categories and recording types are much more varied, but to remain consistent with the Cambridge dataset, we filter the data for COVID-positive and healthy participants who have submitted breathing and coughing recordings. The 'shallow' and 'deep' variants are considered as two separate datasets. Conveniently, the data is available in the same WAV format at 48 kHz, with recordings 1 to 30 seconds long.

Because the recording devices and environments were not controlled, consistently cleaning the audio data is important to reduce non-discriminatory variance and improve the samples' comparability to each other. The applied pre-processing steps include converting the audio to mono, resampling it to 48 kHz, trimming the leading and trailing silences, and normalising the signal's amplitude to [−1, 1]. The effects can be seen in Figure 1. The Python toolkit librosa [11] (version 0.8) was used for the signal processing tasks.
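To make the pipeline concrete, here is a minimal sketch of the described pre-processing steps using librosa 0.8; the function name, file path, and the 60 dB trimming threshold (reported alongside Figure 2 below) are illustrative assumptions rather than the authors' exact implementation.

```python
import librosa
import numpy as np

def preprocess(path, target_sr=48000, top_db=60):
    """Load a recording, convert it to mono, resample, trim silence, and normalise."""
    # librosa loads as mono float32 and resamples to target_sr in a single call
    y, _ = librosa.load(path, sr=target_sr, mono=True)
    # Trim leading and trailing segments quieter than `top_db` dB below the peak
    y, _ = librosa.effects.trim(y, top_db=top_db)
    # Normalise the amplitude to [-1, 1]
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    return y, target_sr

# Hypothetical usage on a single sample:
# y, sr = preprocess("samples/breath_0001.wav")
```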
The basis of all of our evaluations is the 15 audio features from the time, frequency, and time-frequency domains identified and described in Section 2. In general, ML models require input with a consistent format and dimension. Because the recordings have vastly different lengths (1-30 seconds, see Figure 2) and the selected audio features are calculated over frames, representing the feature vector at a constant dimension was a challenge. A range of summary statistics over frames is taken to capture all available data without resorting to padding (infeasible due to the up to 29-second difference between samples). This leads to a feature vector guaranteed to have the same number of dimensions regardless of the sample length. The statistics we consider are the (i) minimum, (ii) maximum, (iii) mean, (iv) median, (v) variance, (vi) 1st quartile, and (vii) 3rd quartile, giving us a wide range of descriptive information about the features' distribution over frames. The total count of features analysed and ranked individually and by category is 812, as detailed in Table 3. No dimensionality reduction is applied, in order to maintain the interpretability of the features.

[Figure 2: Sample lengths before and after pre-processing. Trimming the leading and trailing silences at 60 dB (an empirically identified cut-off point) removes non-discriminative data; sample lengths are reduced by 1-3 seconds on average.]
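As an illustration of the fixed-dimension representation described above, the sketch below collapses frame-wise features into the seven summary statistics; the particular feature calls and parameters (e.g. n_mfcc=32) are assumptions for demonstration, not the authors' exact configuration.

```python
import librosa
import numpy as np

def summarise(frames):
    """Reduce an (n_features, n_frames) matrix to 7 statistics per feature dimension."""
    return np.concatenate([
        np.min(frames, axis=1), np.max(frames, axis=1),
        np.mean(frames, axis=1), np.median(frames, axis=1),
        np.var(frames, axis=1),
        np.percentile(frames, 25, axis=1), np.percentile(frames, 75, axis=1),
    ])

def feature_vector(y, sr):
    """Frame-wise features -> constant-length vector, regardless of clip duration."""
    frame_features = [
        librosa.feature.zero_crossing_rate(y),           # time domain
        librosa.feature.spectral_centroid(y=y, sr=sr),   # spectral
        librosa.feature.spectral_contrast(y=y, sr=sr),   # 7-D composite spectral feature
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=32),     # cepstral
        librosa.feature.chroma_stft(y=y, sr=sr),         # tonal
    ]
    return np.concatenate([summarise(f) for f in frame_features])
```

Because every statistic is computed over however many frames a clip contains, a 2-second and a 30-second recording produce vectors of identical length.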
The paper's main contribution is an extensive and in-depth analysis and ranking of 15 audio features for the binary COVID-healthy classification task. The goal is to identify particularly informative features and feature categories by carrying out the evaluation on two independent datasets in parallel: the Cambridge and the Coswara (deep and shallow variants) datasets. Due to their independence, we propose that any recurring patterns in predictive efficiency are likely independent of the underlying dataset and should therefore be strongly considered for future ML COVID-19 audio classification applications.

The 15 audio features summarised in Table 3 are analysed over the following configurations to provide a detailed picture of their predictive efficiency:

i. The Cambridge, Coswara-deep, and Coswara-shallow datasets.

ii. 'Breath' (B), 'cough' (C), and 'breathcough' (BC) feature vectors. The latter is a concatenation of the previous two feature vectors, i.e. double the size.

iii. Five ML models, selected for the variety in which they partition the label space. The models are implemented with the scikit-learn [18] package (version 0.24) and optimised with parameter grid searches: AdaBoost with Random Forest (nr. of estimators, criterion), K-Nearest Neighbours (K, weights), Logistic Regression (C, penalty, solver), Random Forest (max. depth, criterion), and Support Vector Machine (C, γ, kernel), referred to as ADA, KNN, LR, RF, and SVM respectively.

To ensure that the generated results are reliable even on the relatively small and very imbalanced available datasets, 5-fold Cross-Validation (CV) stratified by labels is employed. We select three metrics to compare the features' impact on the audio classification task at hand: the Receiver Operating Characteristic (ROC), Precision, and Recall. The latter two counteract ROC's optimism on highly imbalanced datasets; see Figure 3 for a brief overview.

In the following, we provide a detailed analysis and discussion of the previously described audio features' (Section 2) performance on the selected datasets (Section 3.2) for the binary COVID-19 classification task. Finally, we compare our improved results to the baseline ML accuracies presented in the datasets' original papers.

Feature categories. An initial overview of the 'breath', 'cough', and 'breathcough' full feature vectors' COVID-19 discriminatory efficiency shows promising results, as most models outperform their no-skill equivalent. Figure 4 visualises the mean ROC and Precision-Recall (PR) curves over 5 CV folds on the 'breathcough' vector for each of the considered models. It clearly shows SVM and RF outperforming their counterparts across all datasets and metrics, with a similar trend observed for the other two data types. Even though the Cambridge and Coswara datasets have similarly shaped ROC curves, the PR curves make it immediately apparent that the Cambridge dataset significantly outperforms its counterparts. This illustrates ROC-AUC's optimism when applied to vastly imbalanced datasets, justifying our approach of considering multiple metrics throughout the analysis. An influential factor in Coswara's lower overall accuracies is the greater imbalance between healthy and COVID-positive samples, at 13:1 compared to 2:1 in the Cambridge data (see Table 2 for sample counts). Nonetheless, models trained on the Coswara datasets still perform noticeably better than an entirely unskilled classifier, with AP (Average Precision) scores between 13-38% compared to the unskilled 7% (equivalent to the positive sample ratio).

The results contained in Table 4 confirm our selection of SVM and RF as the best-performing models. The table shows the same 'breathcough' feature vector's predictive efficiency, but this time considering only one feature category (time domain, spectral, cepstral, tonal) at a time. Apart from two exceptions, both SVM and RF achieve higher accuracies than the other ML models across the board. Considering SVM's mean ROC-AUC accuracies on the 'breathcough' vector across all datasets, we note that the four feature categories can be broadly ranked in the following order of increasing predictive efficiency (Cambridge, Coswara-deep, Coswara-shallow): time domain (78.78%, 63.94%, 55.90%), tonal (82.59%, 72.98%, 68.81%), spectral (84.84%, 74.46%, 72.32%), and cepstral (87.15%, 75.62%, 70.62%). As evidenced by the results, the spectral and cepstral categories perform comparably well, with spectral slightly outperforming cepstral on Coswara-shallow by about 2%. More noteworthy is that the same ranking pattern is prevalent for all five considered ML models, leading to the conclusion that the cepstral and spectral feature categories encode particularly informative data for COVID-19 classification contained in the breathing and coughing signals.
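A minimal sketch of the evaluation protocol used above, label-stratified 5-fold CV around a grid-searched SVM, reporting both ROC-AUC and Average Precision to counteract ROC's optimism on imbalanced data; the parameter grid and the scaling step are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(X, y):
    """X: (n_samples, n_features) array of 'breathcough' vectors; y: 1 = COVID-positive."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    rocs, aps = [], []
    for train, test in skf.split(X, y):
        model = GridSearchCV(
            make_pipeline(StandardScaler(), SVC(probability=True)),
            param_grid={"svc__C": [0.1, 1, 10],
                        "svc__gamma": ["scale", 0.01],
                        "svc__kernel": ["rbf", "linear"]},
            scoring="roc_auc", cv=3)
        model.fit(X[train], y[train])
        scores = model.predict_proba(X[test])[:, 1]
        rocs.append(roc_auc_score(y[test], scores))
        # On a heavily imbalanced set the no-skill AP equals the positive ratio,
        # which is why Average Precision complements the optimistic ROC-AUC.
        aps.append(average_precision_score(y[test], scores))
    return np.mean(rocs), np.mean(aps)
```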
Individual features. Now turning our attention to the analysis of individual features, the initial focus lies on the previously identified best-performing SVM and RF classifiers before broadening again to include all models, letting us identify generally applicable patterns of predictive efficiency. The feature accuracies on which the following descriptions are based are available in Tables 5 to 7 for the Cambridge, Coswara-deep, and Coswara-shallow datasets respectively.

Taking a general look at the results, we see that the majority of the 15 features significantly outperform random guessing for the binary COVID-19 classification task across datasets and sample types ('breath', 'cough', 'breathcough'), with better accuracy on the Cambridge dataset. The lowest accuracy on average is achieved on Coswara-shallow, matching our previous findings for both the entire feature vector and the individual feature categories. Looking at which underlying sample type performs best further underlines the similarities between the Cambridge and Coswara-deep datasets compared to Coswara-shallow. For the former two, 'breathcough' achieves the highest mean ROC-AUC scores on average (except for the time domain features), whereas Coswara-shallow is split evenly between 'breath' (time domain, spectral) and 'breathcough' (cepstral, tonal). However, given all considered features in a single feature vector, the Coswara-shallow dataset still shows its highest accuracy on 'breathcough' samples, since the cepstral and tonal features are very influential overall.

When comparing the results within the feature categories, MFCC (cepstral), S-CONT (spectral), and C-ENS/C-STFT (tonal) stand out as the highest-scoring features in their respective categories across datasets and models. In contrast, the time domain features are much more varied in which one performs best. It is worth mentioning that spectral contrast (S-CONT) is the only composite feature (7-D) in the spectral category, which could be part of the reason it performs better. However, the heat maps in Figure 5 clearly show that individual S-CONT features perform better than or on par with other top spectral features in a majority of cases across datasets, sample types, and models, leading to the conclusion that S-CONT's overall positive COVID classification accuracy is in fact based on high-scoring sub-features, rather than just its increased dimensionality.

[Table 5: 5-fold CV results on all audio features extracted from the Cambridge dataset. The mean µ and standard deviation σ of the ROC-AUC are reported. The majority of features provide the most accurate results on the 'breathcough' ('BC') vector, and the feature categories can be ranked in the following order of increasing accuracy across both models: time domain, tonal, spectral, and cepstral.]

[Table 7: 5-fold CV results on all audio features extracted from the Coswara-shallow dataset. The mean µ and standard deviation σ of the ROC-AUC are reported. While the other two datasets follow very similar patterns, this one is the most different; for example, there is no single sample type that the majority of features perform best on. Nonetheless, the overall category ranking stays the same: time domain, tonal, spectral, and cepstral.]

Lastly, we note a surprising trend regarding MFCC and its derivative features. A prevalent rule of thumb is that 12 or 13 MFCC features should be included for audio classification tasks [3, 7, 20, 22]. However, Figure 6 shows that the higher-order features actually provide remarkable discriminative information for the identification of COVID-19 respiratory sounds, either on par with (Coswara-deep) or significantly outperforming (Cambridge) the first 13 features. This phenomenon is most noticeable in the 'breathcough' and 'breath' features and in MFCC's derivatives. The intuition for MFCCs is that the lower-order features provide information about the signal's energy distribution between high and low frequencies, while the higher-order features contain information about finer details such as pitch and tone quality [12]. From this, we can extrapolate that timbral information is very relevant to COVID audio classification.

[Figure 6: Surprisingly, and contrary to a common audio feature selection rule of thumb [3, 7, 20, 22], higher-order MFCC features (13+) provide discriminatory efficiency for COVID-19 classification higher than or on par with that of the lower-order features, showing that pitch and timbral information is especially relevant to COVID respiratory classification. 'BC', 'B', and 'C' stand for the 'breathcough', 'breath', and 'cough' sample variants.]
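To illustrate the point about higher-order coefficients, a short sketch of extracting an extended MFCC set together with its first- and second-order derivatives in librosa; the choice of 32 coefficients and the bundled example audio are assumptions for demonstration only.

```python
import librosa
import numpy as np

# Placeholder audio shipped with librosa -- stands in for a pre-processed recording
y, sr = librosa.load(librosa.ex("trumpet"), sr=48000)

# Go beyond the usual 12-13 coefficients to capture finer timbral detail
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=32)     # (32, n_frames)
mfcc_delta = librosa.feature.delta(mfcc)               # velocity (MFCC-delta)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)     # acceleration (MFCC-delta^2)

features = np.vstack([mfcc, mfcc_delta, mfcc_delta2])  # (96, n_frames)
```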
Discussion. In the extensive comparison and ranking of 15 audio features in the previous section, we have found, described, and analysed distinct efficiency patterns that recur across multiple independent datasets.

Starting with the encompassing audio feature categories, there is a distinct order of predictive efficiency that is consistent across the different datasets, models, and sample types (increasing): time domain, tonal, spectral, and cepstral. This does not quite follow the intuitive expectation that more complex features provide more discriminative information (e.g. tonal vs spectral features). On the other hand, it can be justified when considering that tonal features describe pitch and so are more suited to tasks with melodic content. The ranking also underlines the significance of frequency-based features by elevating the spectral and cepstral categories. Features in these categories encode an audio signal's frequency content and describe timbral aspects and tone quality or colour.

In addition to the feature rankings, we have also shown that the common audio feature selection rule of thumb of using only the first 13 MFCC features [3, 7, 20, 22] is not applicable in this case. Indeed, the higher-order features (describing timbre) provide significantly more discriminatory information, especially for the 'breathcough' and 'breath' feature vectors. Taking a step back from the individual features, we note that the most prevailing pattern across all of the previous descriptions is that the concatenated 'breathcough' feature vector outperforms the individual 'breath' and 'cough' vectors in most cases.

Given these insights, it is interesting to compare our ML results to the baselines presented when the datasets were published, summarised in Table 8. The evaluated models are of similar type and complexity; the major difference is our introduction of new training features. We can see that our improved feature vectors significantly outperform both the Cambridge and Coswara baseline accuracies, by over 10%, validating our feature selection.

While it seems that we are constantly surrounded by speech recognition in our day-to-day lives, when was the last time a digital assistant said 'bless you' after hearing and recognising a sneeze? The ubiquity of speech recognition is at least partially driven by commercial value. In contrast, non-speech sound classification, especially body-sound (e.g. sneeze, cough, breathing) classification, has only gained traction over the past few years. The sudden emergence of the COVID-19 respiratory disease and the continued lack of testing availability have given the subfield a significant boost.

COVID-19 is not the first application of respiratory audio classification. It has long been common knowledge that respiratory diseases and disorders affect breathing and coughing by physically altering the respiratory environment.
Because many disease-related abnormalities cause only subtle changes in auditory cues, the inherently subjective manual auscultation (listening to internal body sounds, typically with a stethoscope) can be error-prone even when performed by a trained medical professional [2]. However, a literature review of existing implementations shows that ML can reliably pick up on those subtle signals for a variety of diseases. While the following is by no means a comprehensive list of existing implementations, it provides an overview of the current state of research.

Smartwatches and small wearable devices have made audio monitoring for healthcare purposes feasible. Nguyen et al. apply a dynamically activated respiratory event detection mechanism to non-intrusively detect coughing and sneezing events [16]. When it comes to the diagnosis of respiratory events, Amrulloh et al. present classifiers trained on audio features such as MFCC to distinguish between asthma and pneumonia in paediatric patients, conditions that are commonly misdiagnosed without proper diagnostic tools in developing countries, leading to unnecessary antibiotic prescriptions [1]. Lastly, a method of non-binary classification is presented in [2]. Interestingly, the audio classification task is transformed into an image classification task by using spectrograms as input, and achieves comparable results.

Over the past year, there has been an explosion of COVID-related datasets and promising pre-screening implementations, utilising a wide range of sample types. One of the first was [3], which collected and classified breath, cough, and breathcough samples to identify their suitability for COVID-19 classification with a small selection of common audio features. [22] considers further recording types, including vowel intonation and sequence counting. Laguarta et al. propose a different approach, instead applying classification to four biomarkers (muscular degradation, changes in vocal cords, changes in sentiment/mood, and changes in the lungs/respiratory tract) that have previously been used to identify the progress of Alzheimer's disease. Intriguingly, this approach has a very high success rate at identifying asymptomatic COVID carriers [9]. While many promising applications are already available, the novelty of COVID audio classification means there are still many aspects to be explored, partially because only limited and highly imbalanced datasets are publicly available at the time of writing. Many improvements still have to be made before it is reliable enough to use as a pre-screening and diagnosis tool.

Our extensive comparative analysis of 15 audio features from different domains has provided significant insights into ML feature selection in the context of COVID-19 respiratory sound classification and addressed the research questions laid out in Section 3.1. As the analysis found recurring patterns of predictive efficiency across two completely independent datasets, we have identified a feature ranking and salient feature characteristics that are likely inherent to COVID-19 respiratory signals rather than to the underlying datasets. These findings could be beneficial for future sound-based COVID-19 classification applications. Throughout our analysis, we have introduced new training features that were not considered in the baseline evaluations presented in the datasets' papers. Consequently, we have improved the classification results by almost 17% and 10% on the Cambridge and Coswara datasets respectively, without significant discrepancies or differences between the evaluated ML models.
Although this paper has provided a starting point for the holistic evaluation of respiratory audio features, there remain further opportunities to analyse other relevant aspects. For example, a comprehensive strategy to regularise the different sample lengths while preserving temporal information could benefit COVID-19 classification. Additionally, more advanced models such as Deep Learning should be used as a basis for further feature ranking analysis, as their more complex architectures could reveal thus far hidden relevance of the evaluated audio features.

Although sound-based COVID-19 detection is the primary purpose of this research, many other respiratory diseases and disorders could also benefit from the development and improvement of automatic audio detection systems for diagnosis, treatment, and management purposes. Therefore, the approach described in this paper could be generalised for the detection of other respiratory diseases.

We would like to thank Chris Watkins for the stimulating discussions, and the University of Cambridge for access to the COVID-19 sound dataset. This research is funded by Brighton's Connected Futures and Radical Futures initiatives.

References

[1] Cough sound analysis for pneumonia and asthma classification in pediatric population
[2] Classification of lung sounds using convolutional neural networks
[3] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data
[4] openSMILE: The Munich versatile and fast open-source audio feature extractor
[5] A novel approach for MFCC feature extraction
[6] Learning a robust tonnetz-space transform for automatic chord recognition
[7] Music type classification by spectral contrast feature
[8] Feature learning for chord recognition: The deep chroma extractor
[9] COVID-19 artificial intelligence diagnosis using only cough recordings
[10] Note on measures for spectral flatness
[11] librosa: Audio and music signal analysis in Python
[12] Chapter 3 - Features for content-based audio retrieval
[13] DiCOVA challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
[14] Chroma toolbox: MATLAB implementations for extracting variants of chroma-based audio features
[15] Audio matching via chroma-based statistical features
[16] Cover your cough: Detection of respiratory events with confidence using a smartwatch
[17] A speech/music discriminator based on RMS and zero-crossings
[18] Scikit-learn: Machine learning in Python
[19] The timbre toolbox: Extracting audio descriptors from musical signals
[20] Automatic classification of microseismic signals based on MFCC and GMM-HMM in underground mines
[21] Signal domain in respiratory sound analysis: Methods, application and future development
[22] Coswara: A database of breathing, cough, and voice sounds for COVID-19 diagnosis
[23] Detection of adolescent depression from speech using optimised spectral roll-off parameters