key: cord-0604666-1kfruri2 authors: Dong, Shunjie; Yang, Qianqian; Fu, Yu; Tian, Mei; Zhuo, Cheng title: RCoNet: Deformable Mutual Information Maximization and High-order Uncertainty-aware Learning for Robust COVID-19 Detection date: 2021-02-22 journal: nan DOI: nan sha: 88c03d355a8599704ebcae0ad4042984f8469e86 doc_id: 604666 cord_uid: 1kfruri2

The novel 2019 Coronavirus (COVID-19) infection has spread worldwide and is currently a major healthcare challenge around the world. Chest Computed Tomography (CT) and X-ray imaging are well recognized as two effective techniques for clinical COVID-19 diagnosis. Because X-ray imaging is faster and considerably cheaper than CT, detecting COVID-19 in chest X-ray (CXR) images is preferred for efficient diagnosis, assessment and treatment. However, given the similarity between COVID-19 and pneumonia, CXR samples with deep features distributed near category boundaries are easily misclassified by the hyper-planes learned from limited training data. Moreover, most existing approaches for COVID-19 detection focus on the accuracy of prediction and overlook uncertainty estimation, which is particularly important when dealing with noisy datasets. To alleviate these concerns, we propose a novel deep network named RCoNet$^k_s$ for robust COVID-19 detection which employs Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL). With DeIM, the mutual information (MI) between input data and the corresponding latent representations can be well estimated and maximized to capture compact and disentangled representational characteristics. Meanwhile, MHMF can fully explore the benefits of using high-order statistics and extract discriminative features of complex distributions in medical imaging.
Finally, MUL creates multiple parallel dropout networks for each CXR image to evaluate uncertainty and thus prevent performance degradation caused by the noise in the data.

Coronavirus disease 2019 (COVID-19) has caused an ongoing pandemic that has significantly impacted everyone's life since it was first reported, with hundreds of thousands of deaths and millions of infections emerging in over 200 countries [1], [2]. As indicated by the World Health Organization (WHO), due to its highly contagious nature and the lack of corresponding vaccines, the most effective methods to control the spread of COVID-19 infection are social distancing and contact tracing. Hence, early and fast diagnosis of COVID-19 has become essential to control further spreading, so that patients can be hospitalized and receive proper treatment in time. Since the emergence of COVID-19, reverse transcription polymerase chain reaction (RT-PCR), a viral nucleic acid detection method based on gene sequencing, has been the accepted standard for COVID-19 detection [3]. However, because of the low accuracy of RT-PCR and the limited supply of medical test kits in many hyper-endemic regions or countries, it is challenging to rapidly detect every individual affected by COVID-19 [4], [5]. Therefore, alternative testing methods, which are faster and more reliable than RT-PCR, are urgently needed to combat the disease. Since most COVID-19 positive patients were diagnosed with pneumonia, radiological examinations could help detect and assess the disease. Recently, chest computed tomography (CT) has been shown to be efficient and reliable for real-time clinical diagnosis of COVID-19, outperforming RT-PCR in terms of accuracy. Moreover, some deep learning based methods have been proposed for COVID-19 detection using chest CT images [6], [7], [8], [9]. For example, an adaptive feature selection approach was proposed in [10] for COVID-19 detection based on a trained deep forest model.
In [11], an uncertainty vertex-weighted hypergraph learning method was designed to identify COVID-19 from community acquired pneumonia (CAP) using CT images. However, the routine use of CT, which requires expensive equipment, takes considerably more time than X-ray imaging and places a massive burden on radiology departments. Compared to CT, X-rays could significantly speed up disease screening, and have hence become a preferred method for disease diagnosis. Accordingly, deep learning based methods for detecting COVID-19 with chest X-ray (CXR) images have been developed and shown to achieve accurate and speedy detection [12], [13]. For instance, a tailored convolutional neural network platform called COVID-Net, trained on an open-source dataset, was proposed in [14] for the detection of COVID-19 cases from CXR images. Oh et al. [15] proposed a novel probabilistic gradient-weighted class activation map to enable infection segmentation and detection of COVID-19 on CXR images. Fig. 1 shows three samples from the COVIDx dataset [14], which contains three different classes: normal, pneumonia and COVID-19. However, due to the similar pathological information between pneumonia and COVID-19 in the early stage, the CXR samples may have latent features distributed near the category boundaries, which can be easily misclassified by the hyper-plane learned from the limited training data. Moreover, to the best of our knowledge, most of the existing methods for COVID-19 detection are designed to extract lower-dimensional latent representations, which may not be able to fully capture the statistical characteristics of complex distributions (i.e., non-Gaussian distributions). Furthermore, quantifying uncertainty in COVID-19 detection is still a major yet challenging task for doctors, especially in the presence of noise in the training samples (i.e., label noise and image noise).
To address the above problems, we propose a novel deep network architecture, referred to as RCoNet$^k_s$, for robust COVID-19 detection which, in particular, contains the following three modules, i.e., Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL):

• The Deformable Mutual Information Maximization (DeIM) module estimates and maximizes the mutual information (MI) between input data and learned high-level representations, which pushes the model to learn discriminative and compact features. We employ deformable convolution layers in this module, which are able to explore disentangled spatial features and mitigate the negative effect of similar samples across different categories.

• The Mixed High-order Moment Feature (MHMF) module, inspired by [16], fully explores the benefits of using a mix of high-order moment statistics to better characterize the feature distributions in medical imaging.

• The Multi-expert Uncertainty-aware Learning (MUL) module creates multiple parallel dropout networks, each of which can be treated as an expert, to derive a multi-expert diagnosis similar to clinical practice, which improves the prediction accuracy. MUL also quantifies the prediction uncertainty by obtaining the variance in prediction across different experts.

• The experimental results show that our proposal achieves state-of-the-art performance in terms of most metrics on both the open-source COVIDx dataset of 15134 original CXR images and its noisy version.

The remainder of this paper is organized as follows. In Section II, we review related works on mutual information estimation and uncertainty learning. In Section III, after an overview of our proposed approach, we discuss the main components of RCoNet$^k_s$.
In Section IV, we compare our proposed architecture with the existing deep learning based methods, evaluated on a publicly available dataset of CXR images and on the same dataset under noisy conditions. We also conduct extensive experiments to demonstrate the benefits of DeIM, MHMF and MUL on the performance of the system. Finally, we conclude this paper in Section V.

In this section, we introduce related works on mutual information estimation and uncertainty learning that lay the foundation of this paper.

Mutual information (MI), a fundamental concept in information theory, is widely applied to unsupervised feature learning for quantifying the correlation between random variables. MI has been exploited in a wide range of domains and tasks, including biomedical sciences [17], blind source separation (BSS, e.g., independent component analysis [18]), feature selection [19], [20] and causal inference [21]. For example, the object tracking task considered in [22] was treated as a problem of optimizing the mutual information between features extracted from a video with most color information removed and those from the original full-color video. Closely related work presented in [23] considered learning representations to predict cross-modal correspondence by maximizing MI between features from the multi-view encoders and the content of the held-out view. Moreover, Mutual Information Neural Estimation (MINE), proposed in [24], was designed to learn a general-purpose estimator of the MI between continuous variables based on dual representations of the KL-divergence; the resulting estimators are scalable, flexible and, most crucially, trainable via back-propagation. Based on MINE, our proposal estimates and maximizes the MI between the CXR image inputs and the corresponding latent representations to improve diagnosis performance.
Aiming at combating the significant negative effects of uncertainty in deep neural networks, uncertainty learning has attracted considerable research attention, as it facilitates reliability assessment and helps solve risk-based decision-making problems [25], [26], [27]. In recent years, various frameworks have been proposed to characterize the uncertainty in the model parameters of deep neural networks, referred to as model uncertainty, which arises from the limited size of training data [28], [29] and can be reduced by collecting more training data [26], [30], [31]. Meanwhile, another kind of uncertainty in deep learning, referred to as data uncertainty, measures the noise inherent in the given training data, and hence cannot be eliminated by having more training data [32]. To combat these two kinds of uncertainty, many works on various computer vision tasks, e.g., face recognition [25], semantic segmentation [33], object detection [34] and person re-identification [35], have introduced deep uncertainty learning to improve the robustness of deep learning models and the interpretability of discriminants. For the face recognition task in [26], an uncertainty-aware probabilistic face embedding (PFE) was proposed to represent face images as distributions by utilizing data uncertainty. Exploiting the advantage of Bayesian deep neural networks, one recent study [36] leveraged the model uncertainty for analysis and learning of face representations. To our knowledge, our proposal is the first work that utilizes high-order moment statistics and multiple expert networks to estimate uncertainty for COVID-19 detection using CXR images.

In this section, we introduce the novel RCoNet$^k_s$ for robust COVID-19 detection, which incorporates Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL), as illustrated in Fig. 2.
Here, $k$ is the number of levels of moment features that are combined in MHMF, and $s$ is the number of expert networks in MUL, which will be further clarified in the sequel. The CXR images are first processed by DeIM, which consists of a stack of deformable convolution layers, extracting discriminative features. The compact features are then fed into the MHMF module to generate high-order moment latent features, reducing negative effects caused by similar images. The proposed MUL utilizes the learned high-order features to generate final diagnoses.

Due to the similarity between COVID-19 and pneumonia in the latent space, we propose Deformable Mutual Information Maximization (DeIM) to extract discriminative and informative features, reducing the negative influence caused by the lack of distinctiveness in the deep features. In particular, we train the model by maximizing the mutual information between the input and the corresponding latent representation. We use a stack of five convolutional stages, as shown in Fig. 2, to encode inputs into latent representations, which is denoted by a differentiable parametric function

$$E_\psi: \mathcal{X} \rightarrow \mathcal{Z}, \tag{1}$$

where $\psi$ denotes the set of all the trainable parameters in these layers, and $\mathcal{X}$ and $\mathcal{Z}$ denote the input and output spaces, respectively. The detailed architecture of each convolutional stage is presented in Fig. 2; each stage consists of several convolutional layers, each followed by a batch normalization layer. Note that we employ deformable convolutional layers, which can better extract spatial information of the irregular infected area compared to conventional convolutional layers. More specifically, regular convolution operates on a pre-defined rectangular grid from an input image or a set of input feature maps, while deformable convolution operates on a deformable grid in which each grid point is moved by a learnable offset.
For example, the receptive grid $\mathcal{P}$ of a regular convolution with kernel size $3 \times 3$ is fixed and can be given by:

$$\mathcal{P} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}, \tag{2}$$

while, for deformable convolution, the receptive grid is moved by the learned offsets $\Delta p_n \in \mathbb{R}^2$ and the output is given as follows:

$$b(p_0) = \sum_{p_n \in \mathcal{P}} w(p_n) \cdot a(p_0 + p_n + \Delta p_n), \tag{3}$$

where $b(p_0)$ denotes the value at location $p_0$ on the output feature map $b$, $p_n$ enumerates the locations in $\mathcal{P}$, $w(p_n)$ represents the weight at location $p_n$ of the kernel, and $a(\cdot)$ is the value at a given location on the input feature map. We can see that with the introduction of the offsets $\Delta p_n$, the receptive grid is no longer fixed to be a rectangle, and instead is deformable.

We optimize $E_\psi$ by maximizing the mutual information between the input and the output, i.e., $I(X; Z)$, where $Z \triangleq E_\psi(X)$. Computing the precise mutual information requires knowledge of the probability density functions (PDFs) of $X$ and $Z$, which are intractable to obtain in practice. To overcome this issue, Mutual Information Neural Estimation (MINE), proposed in [24], estimates mutual information by using a lower bound based on the Donsker-Varadhan representation [37] of the KL-divergence:

$$I(X; Z) \geq \hat{I}^{(DV)}_\theta(X; Z) := \mathbb{E}_{\mathbb{J}}[T_\theta(x, z)] - \log \mathbb{E}_{\mathbb{M}}[e^{T_\theta(x, z)}], \tag{4}$$

where $\mathbb{J}$ represents the joint probability of $X$ and $Z$, i.e., $\mathbb{J} \triangleq P(X, Z)$, and $\mathbb{M}$ denotes the product of the marginal probabilities of $X$ and $Z$, i.e., $\mathbb{M} \triangleq P(X)P(Z)$. $T_\theta: \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}$ denotes a global discriminator modeled by a neural network with parameters $\theta$, which is trained to maximize $\hat{I}^{(DV)}_\theta(X; Z)$ to approximate the actual mutual information. Hence, we can simultaneously estimate and maximize $I(X; E_\psi(X))$ by maximizing $\hat{I}^{(DV)}_\theta$ jointly over $\theta$ and $\psi$:

$$(\hat{\theta}, \hat{\psi}) = \arg\max_{\theta, \psi} \hat{I}^{(DV)}_\theta(X; E_\psi(X)). \tag{5}$$

Since the encoder $E_\psi$ and the mutual information estimator $T_\theta$ are optimized simultaneously with the same objective function, we can share some layers between them, and replace $T_\theta$ with $T_{\theta,\psi}$ to account for this fact.
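As an illustration of the Donsker-Varadhan lower bound described above, the following is a minimal NumPy sketch that evaluates the bound from discriminator scores on joint pairs and on pairs drawn from the product of marginals. The function name `dv_lower_bound` and the fixed toy critic are hypothetical and are not part of the proposed model; in practice the critic would be a trained network, and marginal pairs are obtained here simply by shuffling one side of the batch.

```python
import numpy as np

def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: E_J[T] - log E_M[exp(T)].

    joint_scores:    critic outputs T(x, z) on pairs drawn from the joint.
    marginal_scores: critic outputs T(x, z) on pairs drawn from the
                     product of marginals.
    """
    joint_scores = np.asarray(joint_scores, dtype=float)
    marginal_scores = np.asarray(marginal_scores, dtype=float)
    # log-mean-exp, computed stably by subtracting the maximum score
    m = marginal_scores.max()
    log_mean_exp = m + np.log(np.mean(np.exp(marginal_scores - m)))
    return joint_scores.mean() - log_mean_exp

# Toy usage (assumption: a fixed, untrained critic T(x, z) = 0.5 * x * z).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
z = x + 0.1 * rng.normal(size=1000)      # z depends strongly on x
critic = lambda u, v: 0.5 * u * v
joint = critic(x, z)                     # matched (x, z) pairs
marginal = critic(x, rng.permutation(z)) # shuffling z breaks the pairing
print(dv_lower_bound(joint, marginal))
```

A constant critic yields a bound of exactly zero, which is why the critic has to be trained to separate joint pairs from shuffled ones before the estimate becomes informative.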
Since we are primarily interested in maximizing the mutual information rather than estimating its precise value, we can alternatively use a Jensen-Shannon MI estimator (JSD) [38], which offers a more interpretable trade-off:

$$\hat{I}^{(JSD)}_{\theta,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}(-T_{\theta,\psi}(x, E_\psi(x)))\big] - \mathbb{E}_{\mathbb{P} \times \tilde{\mathbb{P}}}\big[\mathrm{sp}(T_{\theta,\psi}(x', E_\psi(x)))\big], \tag{6}$$

where $x$ is an input sample from an empirical probability distribution $\mathbb{P}$, $x'$ denotes a fake sample from distribution $\tilde{\mathbb{P}}$, where $\tilde{\mathbb{P}} = \mathbb{P}$, and $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function. This estimator is illustrated by the DeIM block shown in Fig. 2, which takes the latent representation $E_\psi(x)$, the input sample $x$ and the fake sample $x'$ as input, and uses the difference between the outputs of the two softplus operations as the estimate of MI. Another alternative MI estimator is the Noise-Contrastive Estimator (NCE) [39], which is defined as:

$$\hat{I}^{(NCE)}_{\theta,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\Big[T_{\theta,\psi}(x, E_\psi(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\Big[\log \sum_{x'} e^{T_{\theta,\psi}(x', E_\psi(x))}\Big]\Big]. \tag{7}$$

Experiments show that the NCE estimator outperforms the JSD estimator in some cases, but the two perform quite similarly most of the time. The existing works [40] that implement these estimators use some latent representation of $x$, which is then merged with some randomly generated features to obtain "fake" samples that satisfy $\tilde{\mathbb{P}} = \mathbb{P}$. In contrast, we use samples from other categories as the "fake" samples $x'$ instead. For example, if the input is a pneumonia sample, then the fake sample is either a normal or a COVID-19 sample. We note that this can push the learned encoder to derive more distinguishable features for samples from different categories.

The presence of image noise and label noise in CXR datasets may cause the image latent representations generated by deep neural networks to be scattered across the entire feature space. To deal with this issue, [25], [26], [35] represent each image as a Gaussian distribution that is defined by a mean (a standard feature vector) and a variance. However, the deep features of the CXR samples considered in this paper typically follow a complex, non-Gaussian distribution [41], [42], which cannot be fully captured by its first-order (mean) or second-order (variance) statistics.
We seek a better combination of different orders of statistics to more precisely characterize the latent representation of the CXR images. We illustrate the moment features of different orders [16] in Fig. 3, where we plot 350 data points in $\mathbb{R}^2$ sampled from a distribution that combines three different Gaussian distributions. We can observe that the high-order moment features express the statistical characteristics better than the low-order ones. More specifically, they capture the shape of the cloud of samples more accurately. Therefore, we include the Mixed High-order Moment Feature (MHMF) module in the proposed model, as shown in Fig. 2, which takes the latent representation $E_\psi(X)$ as input and outputs a combination of high-order moment features. This will potentially solve the scattering problem, and, more importantly, capture the subtle differences between CXR images of similar categories, i.e., pneumonia and COVID-19 in our case.

We show how to obtain the high-order moment features in the following. Define the $r$-th order moment feature as $\phi_r(a)$, where $a \in \mathbb{R}^{H \times W \times C}$ denotes a latent feature map of dimension $H \times W \times C$. Many recent works adopt the Kronecker product to compute high-order moment features [42]. However, calculating the Kronecker product of high-dimensional feature maps is computationally intensive, and hence infeasible for real-world applications. Inspired by [43], [44], [45], we approximate $\phi_r(a)$ by exploiting $r$ random projectors, which rely on certain factorization schemes, such as Random Maclaurin [46]. We use $1 \times 1$ convolution kernels as the random projectors to estimate the expectations of high-order moment features. That is,

$$\phi_r(a) = K_1(a) \odot K_2(a) \odot \cdots \odot K_r(a), \tag{8}$$

where $\odot$ represents the Hadamard (element-wise) product, and $K_1, K_2, \ldots, K_r$ are $1 \times 1$ convolution kernels with random weights.
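The random-projection approximation above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `moment_feature`, the channel sizes and the Gaussian initialization of the projectors are assumptions, and a $1 \times 1$ convolution is written here as a per-pixel matrix multiplication over the channel axis.

```python
import numpy as np

def moment_feature(a, kernels):
    """Approximate an r-th order moment feature of a feature map.

    a:       array of shape (H, W, C), the latent feature map.
    kernels: list of r weight matrices of shape (C, D); each one plays
             the role of a 1x1 convolution kernel K_i.
    Returns the Hadamard product K_1(a) * ... * K_r(a), shape (H, W, D).
    """
    out = np.ones(a.shape[:2] + (kernels[0].shape[1],))
    for w in kernels:
        # a 1x1 convolution is a linear map applied to every pixel's channels
        out = out * (a @ w)
    return out

# Toy usage with hypothetical sizes: a 7x7 map with 16 channels, order r = 3.
rng = np.random.default_rng(0)
a = rng.normal(size=(7, 7, 16))
kernels = [rng.normal(size=(16, 32)) for _ in range(3)]
phi_3 = moment_feature(a, kernels)
print(phi_3.shape)  # (7, 7, 32)
```

The cost grows linearly with the order r, in contrast to an explicit Kronecker product, whose output dimension grows exponentially with r.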
Note that Random Maclaurin produces an estimator that is independent of the input distribution, which causes the estimated high-order moments to contain non-informative high-order moment components. We eliminate these components by learning the weights of the projectors, i.e., the $1 \times 1$ convolution kernels, from the data. Also note that taking the Hadamard product of a number of random projectors may result in estimated high-order moment features that are similar to the low-order ones. To solve this problem, we estimate the high-order moments recursively instead:

$$\phi_r(a) = \phi_{r-1}(a) \odot K_r(a). \tag{9}$$

Since moments of different orders capture different informative statistics, we design the MHMF module to keep the estimated moments of different orders, as shown in Fig. 2, the output of which is given as:

$$\mathcal{J}(a) = [\phi_1(a), \phi_2(a), \ldots, \phi_k(a)]. \tag{10}$$

Hence, $\mathcal{J}(a)$ is rich enough to capture the complicated statistics, and to produce discriminative features for inputs of different categories.

The MHMF module, as described in Section III-B, generates mixed high-order moment features of each sample in the latent space, which we aim to further exploit to derive compact and disentangled information for COVID-19 detection. Meanwhile, quantifying uncertainty in disease detection is important for understanding the confidence level of computer-based diagnoses. Motivated by clinical practice, we present a novel neural network in this section, referred to as Multi-expert Uncertainty-aware Learning (MUL), which takes in the mixed high-order moment features and outputs the prediction together with a quantification of the diagnostic uncertainty caused by the noise in the data. The structure of the MUL module is shown in Fig. 2. It consists of multiple dropout layers that process the output from MHMF in parallel; each dropout layer, together with the fully connected layers that follow it, can be regarded as an expert for COVID-19 detection.
We note that each dropout layer uses a different mask, which results in different subsets of latent information being kept, while the following fully connected layers share the same weights across different experts. The masks for the dropout layers are generated randomly at each iteration during training, but are fixed at inference time. We denote the input-output function of each expert by $C^j_e(\cdot)$, $j = 1, \ldots, N$, where $N$ is the total number of experts (i.e., $N = s$ in the notation of RCoNet$^k_s$). Hence, we have the classification loss $L^j_e$ of the $j$-th expert given as follows:

$$L^j_e = \frac{1}{n} \sum_{i=1}^{n} L_w\big(y_i, C^j_e(\mathcal{J}(E_\psi(x_i)))\big), \tag{11}$$

where $n$ represents the total number of labeled CXR samples, $y_i$ denotes the one-hot representation of the class label of the $i$-th sample $x_i$, $i = 1, \ldots, n$, and we recall that $\mathcal{J}(\cdot)$ denotes the MHMF operation given in Eq. (10) and $E_\psi(\cdot)$ is the preprocessing step on the CXR samples. Note that the total number of COVID-19 cases is much smaller than that of non-COVID cases, i.e., normal and pneumonia cases. This imbalance in the dataset leads to a high ratio of false-negative classifications. To mitigate this negative effect, we employ a weighted cross-entropy $L_w(\cdot)$ given as follows:

$$L_w(y_i, \hat{y}_i) = -\sum_{c=1}^{C} \lambda_c \, y_{i,c} \log \hat{y}_{i,c}, \tag{12}$$

where $C$ is the total number of classes, $y_{i,c}$ is the $c$-th element of $y_i$, and $\hat{y}_{i,c}$ denotes the corresponding prediction. $\lambda_c$ represents the weight that controls how much the error on class $c$ contributes to the loss, $c = 1, \ldots, C$. Finally, the loss $L_M$ of the whole MUL module is derived by averaging the loss values of all the experts:

$$L_M = \frac{1}{N} \sum_{j=1}^{N} L^j_e. \tag{13}$$

We use the variance of the classification losses $L^j_e$ with respect to the average loss $L_M$ to quantify the uncertainty, denoted by $\sigma$, which is given as:

$$\sigma = \frac{1}{N} \sum_{j=1}^{N} \big(L^j_e - L_M\big)^2. \tag{14}$$

The proposed MUL module improves the diagnostic accuracy, as the final prediction combines the results from multiple experts, and also mitigates the negative effects caused by the noise in the data by introducing the dropout layers. Moreover, the experiments have revealed that the more experts the MUL module has, the faster the system converges during training.
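The expert losses and the uncertainty measure described above can be sketched as follows. This is a simplified NumPy illustration under several assumptions that are not taken from the paper: each expert is reduced to a dropout mask followed by a single shared fully connected layer, the per-expert loss is averaged over samples, the usual rescaling of kept activations in dropout is omitted, and all function and variable names are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def weighted_cross_entropy(y_onehot, y_pred, class_weights, eps=1e-12):
    """Mean over samples of -sum_c lambda_c * y_c * log(y_hat_c)."""
    per_sample = -(class_weights * y_onehot * np.log(y_pred + eps)).sum(axis=1)
    return per_sample.mean()

def mul_loss_and_uncertainty(features, y_onehot, weight, bias, masks, class_weights):
    """Return (average expert loss, variance of the expert losses)."""
    losses = []
    for mask in masks:                                  # one dropout mask per expert
        logits = (features * mask) @ weight + bias      # shared fully connected layer
        losses.append(weighted_cross_entropy(y_onehot, softmax(logits), class_weights))
    losses = np.array(losses)
    return losses.mean(), losses.var()                  # var() divides by the number of experts

# Toy usage: 4 experts, 3 classes, class weights 1, 1, 20 as in the experiments.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))                     # stand-in for MHMF outputs
labels = np.eye(3)[rng.integers(0, 3, size=8)]
weight, bias = rng.normal(size=(16, 3)) * 0.1, np.zeros(3)
rates = [0.1, 0.3, 0.5, 0.3]                            # drop rates chosen from {0.1, 0.3, 0.5}
masks = [(rng.random(16) >= p).astype(float) for p in rates]
loss_m, sigma = mul_loss_and_uncertainty(
    features, labels, weight, bias, masks, np.array([1.0, 1.0, 20.0]))
print(loss_m, sigma)
```

When all experts see the same mask the variance is zero, so any non-zero value reflects disagreement induced by the different subsets of features each expert keeps.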
The whole architecture of RCoNet$^k_s$ is presented in Fig. 2, where the CXR images are first processed by a stack of deformable convolution layers, then transformed into high-order moment latent features by the MHMF module, which are then fed to the MUL module to generate final diagnoses. The loss $L_{total}$ used to optimize RCoNet$^k_s$, given in Eq. (15), combines two terms: $L_M$, the prediction loss given by Eq. (13), and $L_I$, the mutual information between the input $X$ and the latent representation $E_\psi(X)$ estimated by either Eq. (6) or Eq. (7). $\alpha$ is a positive hyper-parameter that governs how much $L_M$ and $L_I$ contribute to the total loss. During training, the trainable parameters of the whole system are updated iteratively to minimize $L_{total}$, which jointly minimizes the prediction loss $L_M$, thus improving the accuracy, and maximizes the mutual information $L_I$.

We use a public chest X-ray dataset, referred to as COVIDx, to evaluate the proposed model, which is published by the authors of COVID-Net [14]. This dataset contains a total of 13975 CXR images from 13870 patients of 3 classes: normal, pneumonia and COVID-19. It incorporates CXR datasets from https://www.kaggle.com/c/rsna-pneumoniadetection-challenge/data. Following [14], [47], the dataset is finally divided into 13624 training and 1510 test samples. The numbers of samples from different categories used for training and testing are summarized in Table I.

Table I. Numbers of samples from different categories used for training and testing.

        Normal   Pneumonia   COVID-19   Total
Train   7966     5451        207        13624
Test    885      594         31         1510

Moreover, we also adopt various data augmentation techniques to generate more COVID-19 training samples, such as flipping, translation and rotation by five random angles, to tackle the data imbalance issue so that the proposed model can learn an effective mechanism for detecting COVID-19.
In our experiments, we use the following six metrics to evaluate the COVID-19 detection performance of different approaches: accuracy (ACC), sensitivity (SEN), specificity (SPE), balanced accuracy (BAC), positive predictive value (PPV) and F1 score.

We compare the proposed RCoNet$^k_s$ with the following five existing deep learning methods for COVID-19 detection:

• PbCNN [15]: A patch-based convolutional neural network with a relatively small number of trainable parameters.

• COVID-Net [14]: A tailored deep convolutional neural network that uses a projection-expansion-projection design pattern.

• DenseNet-121 [48]: A densely connected convolutional network that connects each layer to every other layer in a feed-forward fashion.

We implement our RCoNet$^k_s$ using the PyTorch library and apply ResNeXt [50] as the backbone network. We train the model with the Adam optimizer with an initial learning rate of $2 \times 10^{-4}$ and a weight decay factor of $1 \times 10^{-4}$. All the experiments are run on an NVIDIA GeForce GTX 1080Ti GPU. We set the batch size to 8, and resize all images to $224 \times 224$ pixels. The hyper-parameter $\alpha$ in the loss function given in Eq. (15) is set within the range of [0, 0.4]. The drop rate of each dropout layer in the MUL module is randomly chosen from {0.1, 0.3, 0.5}. The loss weight $\lambda_c$ for each category, which is used to calculate the weighted sum of the loss as given in Eq. (12), is set to 1, 1 and 20 for the normal, pneumonia and COVID-19 samples, respectively, reflecting the number of training samples in each category. We adopt 5-fold cross-validation: we randomly divide the training set into five equal-size subsets and train the model five times, each time using four of the subsets for training and the remaining one for validation. We also evaluate our proposed model with different numbers of moment orders $k$ for the MHMF module, and different numbers of experts $s$.
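The 5-fold protocol described above can be sketched as an index split. This is a generic NumPy illustration rather than the authors' code; the function name `five_fold_splits` and the fixed seed are assumptions, and since 13624 is not divisible by five the folds here differ in size by at most one sample.

```python
import numpy as np

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) for each of the n_folds rounds."""
    rng = np.random.default_rng(seed)
    # shuffle all sample indices once, then cut them into nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

# Usage on the 13624 training samples: each round trains on four folds
# and validates on the remaining one.
for round_id, (train_idx, val_idx) in enumerate(five_fold_splits(13624)):
    print(round_id, len(train_idx), len(val_idx))
```

Every sample is used for validation exactly once across the five rounds, so the five validation scores together cover the whole training set.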
To evaluate the performance of the proposed model in the presence of label noise, we derive a noisy dataset from the given dataset in the following way: we randomly select a given percentage of training samples in each category, and assign wrong labels to these samples. In particular, to ensure that the fake COVID-19 samples are fewer than the real ones, we assign the COVID-19 label to selected normal and pneumonia samples such that the number of normal and pneumonia samples assigned the COVID-19 label equals the number of COVID-19 samples assigned either the normal or the pneumonia label. We show a realization of the derived noisy dataset when the percentage of fake samples is set to 10% in Table II.

Performance on Clean Data: The numerical results on the clean dataset, without any artificial noise added, are shown in Table III. The results are presented in the form of $a \pm b$, where $a$ and $b$ denote the average and variance values of each metric over five independent experiments, respectively. We can see that RCoNet$^4_4$, i.e., the proposed model with $k = 4$ levels of mixed moment features and $s = 4$ experts, achieves notable performance improvement over the comparison methods in terms of most metrics considered, including ACC, SPE, BAC, PPV and F1 score. We note that the performance of RCoNet$^k_s$ can be further improved with a different set of $k$ and $s$. For instance, RCoNet$^5_4$ achieves better SEN and F1 score than RCoNet$^4_4$. The higher ACC and F1 score validate that RCoNet$^k_s$ is able to obtain latent features, i.e., the mixed moment features of different orders, that maintain inter-class separability and intra-class compactness better than other models. Note that RCoNet$^5_4$ leads to a higher SEN than all other methods, which is particularly important for COVID-19 detection, since successfully detecting COVID-19 positive cases is the key to controlling the spread of this highly contagious disease.
Moreover, it can be observed that RCoNet$^k_s$ has smaller variance compared to the others, which demonstrates the robustness and stability of our model. We also evaluate the complexity of the proposed model in terms of the number of parameters and the computational cost, i.e., floating-point operations (FLOPs), which is presented in Table III. It can be observed that the proposed model has far fewer parameters than the existing methods, except ReCoNet. However, we note that the FLOPs of RCoNet$^k_s$ are quite close to those of ReCoNet, which means the two models take a similar amount of time to diagnose COVID-19 from CXR images. We can also observe that increasing $k$ and $s$, i.e., the number of mixed moment features and the number of experts in MUL, causes only a small, or even negligible, increase in the number of parameters and FLOPs, which suggests that we can improve the performance of the proposed model by optimizing $k$ and $s$ without significantly increasing its complexity.

Performance on Noisy Data: We further compare the proposed model to the existing ones when noise is present in the training dataset. We generate three noisy training datasets from the clean dataset in the aforementioned way, with 10%, 20% and 30% of samples carrying wrong labels, respectively. The results, averaged over five independent experiments, are presented in Table IV. It can be easily seen that the more fake samples we add, the more the performance of all the methods degrades. Note that the proposed RCoNet$^4_4$ still achieves state-of-the-art results in all considered cases with different percentages of noisy samples in the training dataset. Moreover, the performance gain over the existing methods slightly increases with the ratio of noisy samples, verifying that our model is more robust to the noise. Note that the extreme case of 30% noisy samples leads to great performance degradation for all the models.
In practice, the percentage of label noise is usually around 10% to 20%. We present the confusion matrices in Fig. 4 to summarize the prediction accuracy of different categories. We can observe that, although with very limited number of COVID-19, our model still maintains high accuracy of detecting COVID-19 cases, even with the presence of noisy samples. Uncertainty Estimation: One remarkable advantage of our model is the ability to quantify the uncertainty in the final prediction, which is significantly crucial for COVID-19 detection. This is done by obtaining the variance in the output of different experts in MUL as described in Section III-C. The larger the variance is, the more different experts disagree with each other, and, hence, the more uncertain the model is about the final prediction. We present two CXR samples in Fig. 6 , including the predictions and the corresponding uncertainty level by RCoNet k s . We can see that the correctly classified CXR image has a low uncertainty level about its prediction, i.e., 0.0094, and the misclassified CXR sample with a high uncertainty level, i.e., 0.4792, suggests that an alternative way of diagnosis should be sought to correct this prediction. This greatly improves the reliability of the prediction by RCoNet k s , and reduces the chance of misdiagnosis. We also show in Fig. 7 the average uncertainty levels of RCoNet k s trained on clean and noisy datasets with different ratios of noisy samples. It can be observed that the uncertainty level increases almost linearly with the percentage of noisy samples in the dataset, which highlights the negative impact of noise on model training. We further numerically analyse the benefits of the three key modules of RCoNet k s , i.e., the DeIM, MHMF and MUL modules in this section. Effectiveness of DeIM: We utilize t-SNE method [51] to visualize the latent features, presented in Fig. 5 , which are Fig. 5(a) , and that by RCoNet-D presented in Fig. 
5(b), we can tell that the introduction of DeIM leads to better class separation in the latent space.
Effectiveness of MHMF: We can observe in Fig. 5(a) to Fig. 5(d) that the latent features of the COVID-19 samples generated by the models without MHMF always lie around the category boundary and are not clearly separable from those of some pneumonia samples. Meanwhile, the latent feature distributions in Fig. 5(e) to Fig. 5(h), derived by the models with MHMF, show significant separability between different categories, which implies that MHMF can extract discriminative features. We also report in Table V the accuracy of RCoNet$^k_s$, trained and tested on the COVIDx dataset, for different values of k, i.e., the number of levels of moment features to be mixed, and s, i.e., the number of experts. We can observe that, for a given value of s, the accuracy first increases with k but decreases once k is larger than 4. This demonstrates that including more levels of moment features can improve the model performance. However, overly high-order moments may lead to performance degradation, possibly because these features are not useful for COVID-19 detection.
Effectiveness of MUL: From Table V, we observe that, for a given value of k, the accuracy first increases with s but saturates around s = 5. This implies that having more experts in MUL can increase the prediction accuracy, but it is not necessary to have too many.
Parameter Sensitivity and Convergence: We evaluate how sensitive the model accuracy is to the value of α. We show in Fig. 8 the average accuracy over five independent experiments of RCoNet$^4_4$ trained on datasets with different ratios of noisy samples. As we can see, a larger α, which means that the prediction loss, i.e., $L_M$, contributes less to the total loss, does not necessarily lead to a degradation in accuracy.
This means that maximizing the mutual information between the input and the latent features can retain useful information within the latent features, thus improving the prediction accuracy. We also show the learning curves of different models in Fig. 9, which indicate that RCoNet$^4_4$ converges slightly faster than the others, including COVID-Net, ReCoNet and CoroNet.
In this paper, we proposed a novel deep network model, named RCoNet$^k_s$, for robust COVID-19 detection, which contains three key components, i.e., Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL). DeIM estimates and simultaneously maximizes the mutual information between the input data and the latent representations to obtain category separability in the latent space. We proposed MHMF to overcome the limited expressive capability of low-order statistics by using a combination of both low- and high-order moment features to extract more informative and discriminative features. MUL generates the final diagnosis and the uncertainty estimate by combining the outputs of multiple parallel dropout networks, each acting as an expert. We numerically validated that the proposed RCoNet, trained on either the public COVIDx dataset or a noisy version of it, outperforms the existing methods in terms of all the metrics considered. We note that these three modules can be easily integrated into other frameworks for different tasks.
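As a rough illustration of how MUL combines its experts, the sketch below averages the class probabilities produced by s experts and uses the variance of the experts' outputs as the uncertainty level. The exact uncertainty measure is the one defined in Section III-C; taking the variance of the predicted-class probability is an assumption made here, so the resulting scale need not match the values reported in Fig. 6. The function name `aggregate_experts` and the example probabilities are hypothetical.

```python
import numpy as np

def aggregate_experts(expert_probs):
    """Combine expert outputs into a prediction and an uncertainty level.

    expert_probs: array of shape (s, num_classes), one probability
    vector per expert (e.g., softmax outputs of parallel dropout networks).
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    # Final diagnosis: class with the highest average probability.
    mean_probs = expert_probs.mean(axis=0)
    pred = int(mean_probs.argmax())
    # Assumed uncertainty measure: variance across experts of the
    # probability assigned to the predicted class.
    uncertainty = float(expert_probs[:, pred].var())
    return pred, mean_probs, uncertainty

# Hypothetical outputs of s = 4 experts for two CXR images.
agreeing = [[0.90, 0.05, 0.05], [0.92, 0.04, 0.04],
            [0.88, 0.06, 0.06], [0.90, 0.05, 0.05]]
disagreeing = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10],
               [0.85, 0.10, 0.05], [0.20, 0.70, 0.10]]

pred_a, _, unc_a = aggregate_experts(agreeing)
pred_d, _, unc_d = aggregate_experts(disagreeing)
print(pred_a, unc_a)
print(pred_d, unc_d)
```

When the experts agree, the variance is close to zero; when they split between classes, it grows, which is the signal used to flag a prediction for an alternative means of diagnosis.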
Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography
Accurate screening of COVID-19 using attention based deep 3D multiple instance learning
Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans
Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
AI augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other etiology on chest CT
Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks
Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning
Inf-Net: Automatic COVID-19 lung infection segmentation from CT images
Adaptive feature selection guided deep forest for COVID-19 classification with chest CT
Hypergraph learning for identification of COVID-19 with CT imaging
Coronavirus disease 2019 (COVID-19): a perspective from China
Covidlite: A depth-wise separable deep neural network with white balance and CLAHE for detection of COVID-19
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
Deep learning COVID-19 features on CXR using limited training data sets
Sorting out typicality with the inverse moment matrix SOS polynomial
Multimodality image registration by maximization of mutual information
Independent component analysis: algorithms and applications
Input feature selection by mutual information based on Parzen window
Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy
Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements
Tracking emerges by colorizing videos
Look, listen and learn
Mutual information neural estimation
Data uncertainty learning in face recognition
Probabilistic face embeddings
Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding
Weight uncertainty in neural networks
Uncertainty in deep learning
A practical Bayesian framework for backpropagation networks
Bayesian learning for neural networks
What uncertainties do we need in Bayesian deep learning for computer vision
Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation
Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving
Robust person re-identification by modelling feature uncertainty
Face recognition with Bayesian convolutional networks for robust surveillance systems
Asymptotic evaluation of certain Markov process expectations for large time, I
f-GAN: Training generative neural samplers using variational divergence minimization
Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics
Learning representations by maximizing mutual information across views
Blind image quality assessment based on high order statistics aggregation
HoMM: Higher-order moment matching for unsupervised domain adaptation
Metric learning with HORDE: High-order regularizer for deep embeddings
Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening
BIER-boosting independent embeddings robustly
Random feature maps for dot product kernels
ReCoNet: Multi-level preprocessing of chest X-rays for COVID-19 detection using convolutional neural networks, medRxiv
Densely connected convolutional networks
CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images
Aggregated residual transformations for deep neural networks
DeCAF: A deep convolutional activation feature for generic visual recognition