key: cord-0779249-5fwdzpp2 authors: Xue, Hao; Salim, Flora D. title: Exploring Self-Supervised Representation Ensembles for COVID-19 Cough Classification date: 2021-05-17 journal: 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) DOI: 10.1145/3447548.3467263 sha: c18bc6583836a575420203f87f0ac6dba71e37a9 doc_id: 779249 cord_uid: 5fwdzpp2

Using smartphone-collected respiratory sounds with deep learning models to detect and classify COVID-19 has recently become popular. It removes the need for in-person testing procedures, especially for rural regions where related medical supplies, experienced workers, and equipment are limited. However, existing sound-based diagnostic approaches are trained in a fully supervised manner, which requires large-scale, well-labelled data. It is critical to discover new methods to leverage unlabelled respiratory data, which can be obtained more easily. In this paper, we propose a novel self-supervised learning enabled framework for COVID-19 cough classification. A contrastive pre-training phase is introduced to train a Transformer-based feature encoder with unlabelled data. Specifically, we design a random masking mechanism to learn robust representations of respiratory sounds. The pre-trained feature encoder is then fine-tuned in the downstream phase to perform cough classification. In addition, different ensembles with varied random masking rates are also explored in the downstream phase. Through extensive evaluations, we demonstrate that the proposed contrastive pre-training, the random masking mechanism, and the ensemble architecture contribute to improving cough classification performance.

By February 1st 2021, the total number of confirmed coronavirus disease 2019 (COVID-19) cases had exceeded 103 million worldwide 1 , and at the time of writing, the pandemic is still ongoing. Given that the global vaccination effort is still in its early stage, a practical and effective defensive procedure against the highly contagious COVID-19 is large-scale and timely testing, aimed at detecting and isolating infected individuals as soon as possible. Developing a reliable, easily-accessible, and contactless approach for the preliminary diagnosis of COVID-19 is therefore significant. It will also benefit regions where medical supplies/workers and personal protective equipment are limited. As pointed out by Imran et al. [17], cough is one of the major symptoms of COVID-19 patients. Compared to PCR (Polymerase Chain Reaction) tests and radiological images, diagnosis using cough sounds can be easily accessed by people through a smartphone app. In the meantime, however, cough is also a common symptom of many other medical conditions that are not related to COVID-19. Therefore, automatically classifying respiratory sounds for COVID-19 diagnosis is a non-trivial and challenging task. During the pandemic, many crowdsourcing platforms (such as COUGHVID 2 [24], COVID Voice Detector 3 , and COVID-19 Sounds App 4 ) have been designed to gather respiratory sound audios from both healthy and COVID-19 positive groups for research purposes. With these collected datasets, researchers in the artificial intelligence community have started to develop machine learning and deep learning based methods (e.g., [5, 12, 17, 25, 27]) for cough classification to detect COVID-19. Nevertheless, these methods share one common characteristic: they are all designed and trained in a fully-supervised way.
On the one hand, the fully-supervised setting limits the applicability, effectiveness and impact of the collected datasets, since the method has to be trained and tested on the same dataset. This means additional datasets cannot be directly used to boost the predictive performance and the model is limited to the same source dataset. On the other hand, such fully-supervised classification methods inevitably need to rely on well-annotated cough sound data. The annotations come from either experts or user response surveys. There are two inherent limitations of these annotation approaches: (i) Annotation Cost: Annotation of a large-scale dataset comes at a high cost (both financially and in human effort). In addition, unlike the data labelling in other tasks such as image classification, the annotation of respiratory sounds requires specific knowledge from experts. This further aggravates the difficulty of obtaining accurate annotations. (ii) Privacy Concern: Although directly asking participants to report their health status (e.g., whether they have tested positive or negative for COVID-19) during respiratory sound collection avoids annotation cost, the medical information is highly sensitive. Such privacy concerns also limit the distribution and public release of gathered datasets. For example, some datasets can only be accessed through one-to-one legal agreements and specific licences.

In this work, to address the aforementioned shortcomings, we design a novel framework for COVID-19 cough classification, which can easily leverage large-scale unlabelled respiratory sounds. The concept of the proposed framework is illustrated in Figure 1. Overall, it consists of two phases: a pre-training phase and a downstream phase. The first phase is a self-supervised contrastive loss-based representation learning process with only unlabelled respiratory audios as training data. The purpose is to train a feature encoder contrastively so that it can learn discriminative representations from a large amount of unlabelled sounds. In the downstream phase, the weights of the contrastively pre-trained feature encoder are transferred and fine-tuned on the labelled downstream dataset. Except for the loading of pre-trained weights, this phase is similar to other fully-supervised cough classification methods. We pose the question of whether the contrastive pre-trained weights could help the downstream classification performance. While self-supervised contrastive representation learning has been successfully applied to other domains such as images [7, 13], speech [18, 23], and general audio [26], our work is the first attempt to explore self-supervised representations for respiratory sounds and COVID-19 cough classification.

For the audio feature encoder (pre-trained in the first phase and fine-tuned in the downstream phase), we adopt the popular Transformer architecture [31], which has proven effective in many other temporal data analysis tasks such as translation [9, 31], traffic prediction [33], and event forecasting [32]. Considering that the demographic distributions (e.g., age, gender, nationality of participants) of the pre-training data and the downstream dataset may be different, we explicitly design and introduce a random masking mechanism to improve the generalisation of the feature encoder. This mechanism randomly masks the signals of some time steps in the input audio so that the masked values are removed from the attention calculation inside the Transformer.
This helps avoid over-fitting on the pre-training dataset. We also explore applying the same random masking mechanism in the downstream phase in the experiments. Furthermore, we investigate different ensemble configurations with different feature encoder structures and random masking rates to further improve the classification performance.

In summary, our contributions are: (1) We propose a novel framework based on contrastive pre-training to take advantage of unlabelled respiratory audios for representation learning. To the best of our knowledge, this is the first paper using contrastive-based representation learning to leverage unlabelled data for COVID-19 cough classification. This framework provides a new perspective for cough classification research. (2) We design a random masking enabled Transformer structure as the feature encoder to learn the representations. Applying the random masking in the pre-training phase could provide effective and general representations, which further boosts the classification performance in the downstream phase. (3) Through extensive experiments, we demonstrate that the proposed framework outperforms existing methods, and we also discover the ensemble configuration that yields the best performance.

Figure 1: Concept illustration of the proposed framework. It consists of two phases: a contrastive pre-training phase (the upper part, described in Section 4.1) to learn representations from unlabelled respiratory data; and a downstream phase (the lower part, described in Section 4.2) for fine-tuning and performing cough classification for screening COVID-19.

Machine learning as well as deep learning based methods have been introduced to automatically screen and diagnose various respiratory diseases [1, 2, 4, 20, 22]. As for the deep learning neural network architectures, there are two categories for COVID-19 cough classification. The first category is Convolutional Neural Network (CNN) based. Even though typical CNNs such as ResNet [14] and VGG [29] were originally proposed for processing images in computer vision, pre-processing techniques (e.g., Mel Frequency Cepstral Coefficients (MFCC) and log-compressed mel-filterbanks) transform audio signals into 2D matrices and make it possible to directly apply CNN structures to audio analysis. A CNN model with ResNet-18 as the backbone is designed by Bagad et al. [3], whereas Schiller et al. use an ensemble of CNNs to classify whether a person is infected with COVID-19 in [27]. For respiratory sound classification, Brown et al. [5] combine hand-crafted features with deep learning features extracted by VGGish [15] (pre-trained on Audioset [10]). The second category is Recurrent Neural Network (RNN) based. Considering that audio data is inherently a type of temporal sequence data, modelling the recurrence dynamics [21] is another technical road map for cough classification. RNNs and their variants, Long Short-Term Memory (LSTM) networks [16] and Gated Recurrent Units (GRU) [8], are designed for handling temporal sequence data. Following this trend, Hassan et al. [12] and Pahar et al. [25] fully explore LSTM-based COVID-19 cough classification by researching and evaluating different sound features as input and different LSTM hyperparameters. The proposed framework in this work differs from the above summarised CNN or RNN based COVID-19 cough classification methods in the way of pre-training.
Existing methods with a pre-training step depend on conventional fully-supervised pre-training, so large-scale labelled data is required, whereas we introduce and design a contrastive self-supervised pre-training phase that only requires unlabelled data. The core idea of contrastive learning is to learn how to represent an input sample so that the learned representations of positive pairs (samples considered to be similar) are much closer than the representations of negative pairs (samples considered to be different) in the latent space. Recently, contrastive learning based self-supervised pre-training has proved successful and effective for learning representations from unlabelled data in numerous works in other domains such as images [7, 13] and speech [18, 23]. Using such pre-trained representations could improve the performance of downstream supervised tasks. COLA [26], proposed by Saeed et al., is the approach in the literature most relevant to our proposed framework. It is a contrastive learning framework to learn representations from general audio (i.e., Audioset) in a self-supervised manner. However, there are two major differences between our framework and COLA. First, how to leverage unlabelled respiratory data for COVID-19 cough classification remains untouched in the literature. We seek to develop a framework to learn representations for respiratory sound based COVID-19 cough classification instead of representations for general audio as in COLA. Second, COLA uses EfficientNet [30], a CNN, as the feature encoder to extract representations from audio. Instead, our model treats the audio as sequence data and utilises the popular Transformer [31], an effective architecture that has shown great promise in many other tasks, as the backbone for cough classification. In addition, we propose a novel random masking mechanism to work together with the Transformer as the feature encoder in our framework.

The Coswara dataset [28] is part of Project Coswara 5 , which aims to build a diagnostic tool for COVID-19 based on respiratory, cough, and speech sounds. Up until December 21st 2020, there were 1,486 crowdsourced samples (collected from 1,123 males and 323 females) available at the Coswara data repository 6 . The majority of the participants are from India (1,329 participants) and the remaining participants are from other countries across five continents: Asia, Australia, Europe, North America, and South America. Four types of sounds (breathing, coughing, counting, and sustained phonation of vowel sounds) are gathered from each participant. Similar to the Coswara dataset, COVID-19 Sounds is another crowdsourced respiratory sound dataset. Audio samples are collected worldwide with a web-based app, an Android app, and an Apple app. In our work, we choose the same curated dataset that is introduced and used in [5]. After filtering out silent and noisy samples, this released version of the dataset 7 contains 141 COVID-19 positive audio recordings collected from 62 participants and 298 COVID-19 negative audio recordings from 220 participants. Both coughs and breaths appear in these recordings. Positive samples are from participants who claimed that they had tested positive for COVID-19.

As illustrated in Figure 1, the proposed method consists of two phases: (i) Pre-training phase: to pre-train the feature encoder with unlabelled audios through contrastive learning. (ii) Downstream phase: to fine-tune the trained feature encoder with an additional classifier for COVID-19 cough classification.
The details of these phases are given in the following subsections.

The pipeline of the contrastive pre-training is given in Figure 2. The idea of contrastive learning can be summarised as: to encode audios into a latent space through the feature encoder so that the similarity of positive samples is larger than that of negative samples in the latent space. Therefore, the three key components in this contrastive learning phase are: (1) how to obtain positive/negative samples; (2) how to design the feature encoder; and (3) how to measure the similarity in the latent space.

4.1.1 Pre-processing and Sampling. The purpose of pre-processing is to read and transform each raw audio file into a matrix format which can be taken as input by the following feature encoder. Mel Frequency Cepstral Coefficients (MFCC) and log-compressed mel-filterbanks have been widely used in audio analysis [5, 6, 11, 12, 25, 26]. The Python Speech Features package 8 is used for computing log-compressed mel-filterbanks in our framework. After the pre-processing, each raw audio file is mapped to a feature $F \in \mathbb{R}^{N \times T}$, where $N$ stands for the number of frequency bins and $T$ indicates the total number of time frames in this audio. Since different audios in the dataset often have different lengths and thus different values of $T$ after pre-processing, we apply a sliding window with window size $W$ to generate multiple clips for each processed audio. The sampling of positive and negative clips is then straightforward in our task. If clip $x_i \in \mathbb{R}^{N \times W}$ and clip $x_j \in \mathbb{R}^{N \times W}$ come from the same audio file, they are considered a positive clip pair. On the contrary, if they are sampled from different audios, they form a negative pair. It is worth noting that the sampling might be slightly different, depending on the pre-training dataset. Say, for example, there are four respiratory sound files (fast/slow breathing sounds and deep/shallow cough sounds) gathered from the same participant. In that case, if two clips are from the same participant (from any one or two of the four sound files), they form a positive pair. Overall, after contrastive learning, samples from the same person have a larger similarity in the latent space than samples from different persons. Such positive/negative sampling does not involve any annotated labels regarding the health condition of participants.

The feature encoder then encodes each sampled clip $x_i$ into a representation vector $h_i \in \mathbb{R}^{d}$. This step is formulated as:

$$h_i = f(x_i; \mathbf{W}_f) \qquad (1)$$

where $f(\cdot)$ (light yellow box in Figure 2) represents the feature encoder and $\mathbf{W}_f$ is the trainable weights of the feature encoder. Similarly, $h_j \in \mathbb{R}^{d}$ is obtained for clip $x_j$. The dimension of a representation vector is $d$. In the proposed framework, we select the popular and effective Transformer structure [31] as the feature encoder $f(\cdot)$. As shown in Figure 3(a), the typical Transformer structure models the input sequence ($x_i$ is treated as a sequence of $W$ time steps and each time step is an $N$-dimensional vector) through the attention mechanism. For each time step, the scaled dot-product attentions [31] with respect to every other time step are calculated. Given that we need to transfer the pre-trained weights of $f(\cdot)$ to the downstream dataset, such a densely calculated attention mechanism might cause over-fitting on the pre-training dataset. To this end, we introduce a random masking mechanism (see Figure 3(b)) to make the feature encoder robust. For a respiratory sound, the feature at each time step might not always be meaningful. A collected sound sample often contains noise such as a short pause between two coughs. This also motivates us to design this random masking.
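To make the random masking idea concrete, the following PyTorch sketch shows one way such a masked Transformer feature encoder could be implemented. It is a minimal illustration only: the layer sizes, number of heads and layers, mean pooling, and module names are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MaskedTransformerEncoder(nn.Module):
    """Sketch of a Transformer feature encoder f(.) with random time-step masking."""
    def __init__(self, n_bins=64, d_model=64, n_heads=4, n_layers=2, mask_rate=0.5):
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)   # per-time-step input projection
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_rate = mask_rate                # adjustable masking rate hyperparameter

    def forward(self, clip):
        # clip: (batch, W, n_bins) -- W time frames, n_bins frequency bins per frame.
        x = self.proj(clip)
        key_padding_mask = None
        if self.training and self.mask_rate > 0:
            # Randomly choose time steps to mask; True entries are excluded
            # from the attention calculation (they cannot be attended to).
            key_padding_mask = torch.rand(x.shape[:2], device=x.device) < self.mask_rate
        h = self.encoder(x, src_key_padding_mask=key_padding_mask)
        return h.mean(dim=1)                      # (batch, d_model) clip representation h
```

In this sketch the mask is drawn independently on every forward pass; in the proposed framework, the masking matrix is produced by a dedicated masking generator, as described next.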
The masking generator (the blue box in Figure 2) generates a masking matrix $M_i$ with a specific masking rate (this rate is an adjustable hyperparameter). Based on the masking matrix and the masking rate, some of the inputs are randomly masked and removed from the attention calculation in the Transformer. With the random masking, the encoding step in Equation (1) becomes:

$$h_i = f(x_i, M_i; \mathbf{W}_f) \qquad (2)$$

where $M_i$ is the masking matrix for clip $x_i$.

As suggested in many other contrastive learning methods such as [7, 26], a projection head $g(\cdot)$ (see Figure 2) is applied to map representations (e.g., $h_i$ and $h_j$) to the latent space where the similarity is measured. To measure the similarity, two metrics are used in the literature:

• Cosine Similarity: this metric is commonly used in visual representation learning such as [7, 13]. The similarity of a clip pair $\mathrm{sim}(x_i, x_j)$ is calculated by:

$$\mathrm{sim}(x_i, x_j) = \frac{g(h_i)^{\top} g(h_j)}{\lVert g(h_i) \rVert \, \lVert g(h_j) \rVert} \qquad (3)$$

• Bilinear Similarity: this similarity has been used in [23, 26]. The similarity of a clip pair is given as:

$$\mathrm{sim}(x_i, x_j) = g(h_i)^{\top} \mathbf{W} g(h_j) \qquad (4)$$

where $\mathbf{W}$ is the bilinear parameter. Specifically, we conduct an experiment to compare the performance of these two types of similarity metrics in Section 5.3.

The loss function used in this phase for contrastive learning is a multi-class cross-entropy function working together with the similarity metric. During training in this phase, each training instance (consisting of two clips from the same participant) in a batch is a positive pair. Clips from different training instances (from different participants) form negative pairs. Each training instance is then considered as a unique "class" (a unique participant), so the multi-class cross-entropy is applied. This loss function is calculated over the batch (with batch size $B$; $2B$ is the total number of clips in a batch as we have two clips for each training instance) and modelled as:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(x_i, x_j)/\tau)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(x_i, x_k)/\tau)} \qquad (5)$$

where $\tau$ denotes the temperature parameter for scaling. Note that in Equation (5), $x_i$ and $x_j$ are the positive pair whereas $x_i$ and $x_k$ ($k \neq i$) are all the negative pairs.

In the downstream phase, a straightforward network architecture is the feature encoder $f(\cdot)$ with an additional classifier. The feature encoder is initialised with the weights $\mathbf{W}_f$ pre-trained in the previous phase and takes a pre-processed audio clip as input. The encoded feature $h$ is then passed to the classifier. The classifier is a fully-connected layer with $d$ (the feature dimension of the encoded feature $h$) input neurons and one output node with a sigmoid activation function, which outputs a probability indicating whether the input respiratory sound clip is COVID-19 positive (probability larger than a threshold, e.g., 0.5) or negative (probability smaller than the threshold). The network is fine-tuned with labelled data end-to-end with the typical binary cross-entropy classification loss.

Based on this straightforward architecture, we also explore and design an advanced architecture (illustrated in Figure 4) with the random masking mechanism and an ensemble structure. The motivation for introducing an ensemble structure for classification is related to the random masking. Since the masking matrix is generated randomly, the two branches in the ensemble structure ($f_1(\cdot)$ and $f_2(\cdot)$ in Figure 4) would have different masked time steps, which leads the feature encoders to model the input audio and yield encoded features from different perspectives. Thus, the ensemble structure and the random masking mechanism are a harmonised match and mutually beneficial.
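A minimal PyTorch sketch of this ensemble head is given below. It is an illustrative reading of Figure 4 under stated assumptions: the pre-trained encoder is treated as a black box that applies its own random masking internally, and the class name, parameter names, and dropout placement are not the authors' exact implementation.

```python
import copy
import torch
import torch.nn as nn

class EnsembleCoughClassifier(nn.Module):
    """Two-branch ensemble: both encoders start from the same pre-trained weights."""
    def __init__(self, pretrained_encoder, feature_dim=64, dropout=0.2):
        super().__init__()
        self.encoder1 = copy.deepcopy(pretrained_encoder)    # f1(.)
        self.encoder2 = copy.deepcopy(pretrained_encoder)    # f2(.)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * feature_dim, 1)      # 2d input neurons

    def forward(self, clip):
        h1 = self.encoder1(clip)     # each branch draws its own random mask
        h2 = self.encoder2(clip)
        h = self.dropout(torch.cat([h1, h2], dim=-1))
        return torch.sigmoid(self.classifier(h))  # probability of COVID-19 positive
```

Because the two branches hold separate copies of the parameters, fine-tuning can update them differently, as described in the following paragraph.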
Unlike the above straightforward architecture, the classifier in the ensemble architecture has $2d$ input neurons as it takes the concatenated feature (the concatenation of the two encoded vectors from $f_1(\cdot)$ and $f_2(\cdot)$) as input. Note that both $f_1(\cdot)$ and $f_2(\cdot)$ are initialised with the same pre-trained weights from the contrastive pre-training phase. During the fine-tuning process, the two feature encoders may be updated differently.

In this work, we focus on investigating the following research questions: (1) RQ1: Does introducing the contrastive pre-training yield better performance than the conventional fully-supervised setting, and which similarity metric performs better in our cough classification task? (2) RQ2: Does the random masking mechanism help the cough classification performance, and what is the most suitable masking configuration? (3) RQ3: By introducing the ensemble framework in the downstream phase, can we achieve a further improvement in cough classification performance?

5.1.1 Data Processing. As introduced in Section 3, we focus on two public COVID-19 respiratory datasets. Considering that the Coswara dataset [28] has more participants and contains more audio samples than the COVID-19 Sounds dataset [5], the Coswara dataset is adopted as the pre-training dataset in this work. Note that for this pre-training dataset, the annotated labels (indicating whether the user is COVID-19 positive or negative) are not used. Furthermore, since this work is more about respiratory sounds, breathing sounds and cough sounds are selected for pre-processing and sampling (detailed in Section 4.1.1), whereas audios of sustained phonation of vowel sounds and counting sounds are ignored in the pre-training phase. Consequently, COVID-19 Sounds is used as the dataset in the downstream phase. To be more specific, in the downstream phase, the whole COVID-19 Sounds dataset is randomly divided into a training set (70%), a validation set (10%), and a testing set (20%). For each raw audio sample, the same pre-processing procedure (described in Section 4.1.1) is applied as well. In the pre-processing, the shape of a processed clip is $\mathbb{R}^{64 \times 96}$, as the number of mel-spaced frequency bins is set to 64 and the sliding window size is 96, which corresponds to 960 ms. The feature dimension $d$ is set to 64.

In the contrastive pre-training phase, the batch size is selected as a large number (1024). As suggested by other contrastive learning methods (e.g., [7]), contrastive learning benefits from larger batch sizes (within GPU capacity) as a larger batch allows the model to compare the positive pair against more negative pairs. In the downstream network, dropout is also applied to avoid over-fitting in the end-to-end fine-tuning process. The validation set in the downstream dataset is used for tuning the hyperparameter $d$ (the feature dimension of the feature encoder) and the dropout rate. The batch size is 128 for the downstream phase. All experiments (both the contrastive pre-training and the downstream phases) are trained with the Adam optimiser [19] (a 0.001 initial learning rate with a ReduceLROnPlateau 9 decay setting) and executed on a desktop with an NVIDIA GeForce RTX-2080 Ti GPU using PyTorch. To evaluate the performance of different methods, several standard classification evaluation metrics, including the Receiver Operating Characteristic - Area Under Curve (ROC-AUC), Precision, and Recall, are selected.
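As a concrete reading of this training setup, the sketch below outlines a downstream fine-tuning loop under the hyperparameters stated above (Adam with a 0.001 initial learning rate, ReduceLROnPlateau decay, and binary cross-entropy). The model, data loaders, evaluation routine, and epoch count are placeholders rather than the authors' exact training script.

```python
import torch

def fine_tune(model, train_loader, val_loader, evaluate, num_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    criterion = torch.nn.BCELoss()
    for epoch in range(num_epochs):
        model.train()
        for clips, labels in train_loader:        # downstream batch size: 128
            optimizer.zero_grad()
            probs = model(clips).squeeze(-1)      # sigmoid output: P(COVID-19 positive)
            loss = criterion(probs, labels.float())
            loss.backward()
            optimizer.step()
        val_score = evaluate(model, val_loader)   # e.g., validation ROC-AUC
        scheduler.step(val_score)                 # decay the LR when the metric plateaus
    return model
```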
In our experiments, we report the average performance as well as the standard deviation over 5 runs of each method or configuration. In addition, we report the average F1 score, which is calculated from the average Precision and average Recall. To investigate how the dimension of the encoded feature and the dropout rate influence the classification performance, all combinations of the following hyperparameter values (within reasonable ranges) are evaluated on the validation set: (1) Feature Dimension: [64, 128, 256]; (2) Dropout Rate: [0.0, 0.2, 0.5], resulting in 9 combinations in total. Please note that each combination is also run 5 times. Figure 5 shows the average performance on the four metrics, and error bars indicate the standard deviations. Based on these validation results, $d = 64$ and a 0.2 dropout rate achieve the best validation performance and are used for the remaining experiments. Since the feature encoder structure should be identical in both the pre-training and the downstream phases, the same hyperparameter setting is also applied in the pre-training phase.

To evaluate the performance of contrastive pre-training and the Transformer feature encoder, we compare Transformer-CP (the suffix -CP means the method is contrastive pre-training enabled) with several methods under multiple configurations. The other methods being compared include VGGish/GRU/Transformer (without contrastive pre-training) and GRU-CP. Recurrent Neural Networks (RNNs) are designed for handling sequence data and have been adopted for COVID-19 cough classification research in [12, 25]. So, the GRU [8] is also included in the comparison. VGGish [15] is a popular convolutional neural network for audio classification. A version 10 pre-trained on the large-scale general audio dataset Audioset [10] is also widely used in the community. Such a pre-trained VGGish has also been applied in [5] to extract features for COVID-19 cough classification. Note that the pre-training of VGGish is a conventional fully-supervised pre-training with labelled data, which is different from our contrastive pre-training. Whether pre-training is used is summarised in the third column of Table 1. In addition, the second column indicates the pre-training setting: a ✓ means that the proposed self-supervised contrastive pre-training is applied. For example, both the second and third columns are ✓ for our Transformer-CP.

Comparison. The experimental results of the above methods are reported in Table 1. To be more specific, for methods using pre-trained weights (either contrastive pre-training or conventional pre-training for VGGish), we also explore the fine-tuning option. In the fourth column of Table 1, a × represents that the pre-trained weights $\mathbf{W}_f$ are frozen and not updated in the downstream phase, whereas a ✓ indicates that $\mathbf{W}_f$ is allowed to be updated. According to the table, the proposed Transformer-CP with fine-tuning achieves the best performance (shown in bold) against all the other methods. There are several additional findings that can be noticed from the table. First, without pre-training, VGGish has the worst performance (the first row) compared to GRU and Transformer. Using pre-trained VGGish weights (without fine-tuning) provides an almost 6% accuracy gain, which indicates that the pre-trained VGGish representation is well-trained and powerful.
For all configurations that use frozen pre-trained representations (the second, sixth, and eighth rows), although VGGish (the second row) is the top performer, the performance of our Transformer-CP (the eighth row) is very close to that of VGGish. This is remarkable as it shows that our self-supervised feature representation, contrastively pre-trained on a smaller-scale unlabelled dataset, is competitive with a well-trained fully-supervised VGGish representation (pre-trained on the much larger and well-annotated Audioset). Second, fine-tuning in the downstream task is important for all pre-trained models, which is as expected. For both the conventionally pre-trained VGGish and the contrastively pre-trained GRU/Transformer, fine-tuning improves the accuracy by around 3%. Third, if we compare GRU vs. Transformer and GRU-CP vs. Transformer-CP, the Transformer-based methods outperform the GRU-based methods consistently. This justifies the selection of the Transformer as the feature encoder in the proposed framework. Overall, the results show that the proposed framework with contrastive pre-training achieves superior cough classification performance.

In Table 2, the two similarity metrics for contrastive learning are compared. For a fair comparison, two different feature encoder structures, GRU-CP and Transformer-CP, are explored. As shown in the table, using bilinear similarity achieves consistently better performance with both structures on all evaluation metrics, which demonstrates that the bilinear similarity is more suitable for our cough classification task.

In this part of the experiments, we investigate the proposed random masking mechanism and different masking rates in the contrastive pre-training phase. The experimental procedure for this part is as follows: we pre-train several Transformer-CPs with multiple masking rates (0% to 100%), and the pre-trained models are then fine-tuned in the downstream phase. The cough classification performance of these models is listed in Table 3. Please note that in the downstream phase, we do not apply the ensemble architecture, so there is no random masking in the downstream phase for the results reported in the table. As a baseline for comparison, we also include the performance of the Transformer (without any pre-training) in Table 3. In general, all pre-trained models yield better results than the baseline Transformer, and 50% masking outperforms the other masking rates. When the masking rate increases from 0% (no masking at all) to 50%, we observe a performance gain in the table. However, when the masking rate is too large (e.g., 75% and 100%), the performance decreases. This is not surprising. For example, in the extreme 100% masking case, all the inputs are masked, which means there is no attention between any time steps. As a result, the 100% masking has the worst performance among the different masking rate settings.

In this section, we focus on exploring different ensembles. Table 4 summarises three ensemble methods. The first two are ensembles of our base Transformer feature encoder with other feature encoder structures (VGGish and GRU). No pre-trained weights are applied to these two ensembles. The third ensemble combines GRU-CP and Transformer-CP with contrastive pre-trained weights. By jointly comparing the results given in Table 1 and Table 4, it can be seen that the ensemble versions perform better than the single feature encoder based methods.
Moreover, we investigate networks where the random masking is incorporated with the ensemble architecture (as shown in Figure 4). For the ensembles presented in Table 5, both branches are set as Transformer-CP. We vary the masking rate in the downstream phase (rates given in the Masking (DS) column). In addition, the pre-trained weights of the top performer in Table 3 (with a 50% contrastive pre-training masking rate) are used for these ensembles. Similar to the masking in the contrastive pre-training phase, a 50% masking rate in the downstream phase also performs better than the other masking rates. The above results confirm that the proposed ensemble architecture with random masking can further improve the classification performance.

Table 6 lists the inference time (for one input instance) of each model or configuration. Since fine-tuning does not affect the inference time, the fine-tuning configuration is omitted from the comparison in the table. Generally, across the three base feature encoder structures, the inference time of the Transformer is on par with the GRU, whereas VGGish leads the Transformer/GRU by a small margin (only around 0.002 milliseconds). Although the Transformer includes attention computation, it processes all time steps in the input sequence in parallel, whereas the GRU has to process each time step recurrently. This might explain the similar computation cost of the Transformer and the GRU. From the table, we also notice that using contrastive pre-trained weights does not introduce a longer inference time. This is as expected, since the major difference between Transformer and Transformer-CP (or GRU vs. GRU-CP) is whether the pre-trained weights are loaded, and this weight initialisation process has almost no influence on the inference speed. An interesting and surprising finding concerns the inference time of using different downstream random masking rates (the last five rows of Table 6). In theory, a larger masking rate should run faster as more time steps are masked and not used in the attention calculation. According to the table, however, the 75% rate has the largest inference time, and the 0% and 100% rates are both faster than the other masking rates. This can be explained by the implementation of the masking generator. In the implementation, the default masking matrix is an all-ones matrix or an all-zeros matrix (the latter only used for the 100% masking rate), where 0 means being masked and vice versa. For a given masking rate, 1s are updated to 0s in the matrix through a for loop. This loop operation takes longer if more elements need to be updated (e.g., the 75% rate), which causes the larger inference time for the 75% setting. Overall, even the largest time cost in the table is only $32.36 \times 10^{-6}$ seconds (around 0.03 milliseconds). Such a low time cost would not be a bottleneck or limit the application of the proposed framework. From another point of view, without the proposed contrastive pre-training, multiple models need to be trained if multiple datasets are available. As a result, training has to be done per model without domain transfer, which is a potential bottleneck for large-scale deployments. However, our proposed framework is able to address this training bottleneck through the contrastive pre-training phase.

In this paper, we propose a novel framework for respiratory sound based COVID-19 cough classification. This appears to be the first study to leverage unlabelled respiratory audios in this area.
In order to do so, we introduce a contrastive pre-training phase in which the Transformer-based feature encoder is pre-trained with unlabelled data in a self-supervised manner. Moreover, a random masking mechanism is explicitly proposed to work with the Transformer feature encoder, which aims to improve the robustness of the feature encoder. In addition, we have explored an ensemble-based network architecture in the downstream phase. Experimental results demonstrate that the designed ensemble network with random masking achieves top performance. The findings of this research provide a new perspective and insights for cough classification.

References

[1] DeepCough: A deep convolutional neural network in a wearable cough detection system.
[2] Deep neural networks for identifying cough sounds.
[3] Amrita Mahale, Saurabh Rane, Neeraj Agarwal, and Rahul Panicker. 2020. Cough against covid: Evidence of covid-19 signature in cough sounds.
[4] Can machine learning be used to recognize and diagnose coughs.
[5] Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data.
[6] Feature extraction for the differentiation of dry and wet cough sounds.
[7] A simple framework for contrastive learning of visual representations.
[8] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
[9] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[10] Audio set: An ontology and human-labeled dataset for audio events.
[11] An efficient MFCC extraction method in speech recognition.
[12] Covid-19 detection system using recurrent neural networks.
[13] Momentum contrast for unsupervised visual representation learning.
[14] Deep residual learning for image recognition.
[15] CNN architectures for large-scale audio classification.
[16] Long Short-term Memory.
[17] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.
[18] Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning.
[19] Adam: A Method for Stochastic Optimization.
[20] Design of wearable breathing sound monitoring system for real-time wheeze detection.
[21] Robust Detection of COVID-19 in Cough Sounds: Using Recurrence Dynamics and Variable Markov Model.
[22] Energy-efficient respiratory sounds sensing for personal mobile asthma monitoring.
[23] Representation learning with contrastive predictive coding.
[24] The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms.
[25] COVID-19 Cough Classification using Machine Learning and Global Smartphone Recordings.
[26] Contrastive Learning of General-Purpose Audio Representations.
[27] Detecting COVID-19 from Breathing and Coughing Sounds using Deep Neural Networks. 2020.
[28] Coswara - A database of breathing, cough, and voice sounds for COVID-19 diagnosis.
[29] Very deep convolutional networks for large-scale image recognition.
[30] Efficientnet: Rethinking model scaling for convolutional neural networks.
[31] Attention is all you need.
[32] Hierarchically structured transformer networks for fine-grained spatial event forecasting.
[33] TERMCast: Temporal Relation Modeling for Effective Urban Flow Forecasting.

This research is supported by Australian Research Council (ARC) Discovery Project DP190101485. We would also like to thank the COVID-19 Sounds App team of the Department of Computer Science and Technology of the University of Cambridge for access to the COVID-19 sound dataset and Project Coswara by the Indian Institute of Science (IISc) Bangalore for the Coswara dataset.