key: cord-0474575-43ki7eii authors: Li, Xin; Li, Chengyin; Zhu, Dongxiao title: COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray date: 2020-04-06 journal: nan DOI: nan sha: c9b729b5595bed455f2eb646c035dfd8b0eaf120 doc_id: 474575 cord_uid: 43ki7eii With the increasing demand for millions of COVID-19 screenings, Computed Tomography (CT) based test has emerged as a promising alternative to the gold standard RT-PCR test. However, it is primarily provided in hospital setting due to the need for expensive equipment and experienced radiologists. An accurate, rapid yet inexpensive test that is suitable for COVID-19 population screenings at mobile, urgent and primary care clinics is urgently needed. We present COVID-MobileXpert: a lightweight deep neural network (DNN) based mobile app that can use noisy snapshots of chest X-ray (CXR) for point-of-care COVID-19 screening. We design and implement a novel three-player knowledge transfer and distillation (KTD) framework including a pre-trained attending physician (AP) network that extracts CXR imaging features from large scale of lung disease CXR images, a fine-tuned resident fellow (RF) network that learns the essential CXR imaging features to discriminate COVID-19 from pneumonia and/or normal cases using a small amount of COVID-19 cases, and a trained lightweight medical student (MS) network that performs on-device COVID-19 screening. To accommodate the need for screening using noisy snapshots of CXR images, we employ novel loss functions and training schemes for the MS network to learn the robust imaging features for accurate on-device COVID-19 screening. We demonstrate the strong potential of COVID-MobileXpert for rapid deployment via extensive experiments with diverse MS network architecture, CXR imaging quality, and tuning parameter settings. The source code of cloud and mobile based models are available from https://github.com/xinli0928/COVID-Xray. 
The rapid worldwide spread of the SARS-CoV-2 virus and the exponential increase in the size of the susceptible population demand accurate, rapid yet inexpensive point-of-care COVID-19 screening. The gold standard screening approach based on RT-PCR demonstrates good accuracy but is subject to the significant limitations of high cost and slow turnaround time, so it does not scale to the ever-increasing population at risk [1]. Thanks to high-volume testing machines and new rapid tests, the total number of tests topped 1.4 million as of early April [2]. However, millions of tests are still urgently needed as the virus keeps communities across the country in lockdown and hospitals are overwhelmed with patients. As an alternative to nucleic acid and serology based tests, Computed Tomography (CT) [3] [4] [5] [6] [7] based approaches have also been widely adopted for testing COVID-19 cases, and have been shown to have better sensitivity and specificity than nucleic acid based tests [8], although mixed results exist [9]. To date, most medical imaging based diagnostic tools are based on CT and deployed in hospitals where expensive CT equipment and experienced radiologists are available. For example, Alibaba's model [10] and the Infervision system [11] are both trained on more than 5,000 confirmed cases and deployed at dozens of hospitals in China. The wide availability of Chest X-Ray (CXR) in diverse health care settings makes it an attractive option for rapid, accurate yet inexpensive point-of-care screening in mobile, urgent and primary care clinics. At present, the bottleneck lies in the shortage of board certified radiologists who are capable of differentiating the massive number of COVID-19 positive cases from other lung diseases and normal conditions directly from CXR images, either those from a PACS system or noisy snapshots.
The intensive development of deep neural network (DNN) powered CXR image analysis has seen the unprecedented success in automatic classification and segmentation of lung diseases [12, 13] . Using the cloud solutions such as Amazon AWS, Google Cloud Platform, and Microsoft Azure, or even on-premise computing clusters to train a sophisticated DNN (e.g., DenseNet-121 [14] ) with dozens of millions of parameters and hundreds of layers via billions of operations for both training and inference, these large scale Artificial Intelligence (AI) models achieve amazing performance that even outperforms board certified radiologists in some well-defined tasks [15] . With the increasing number of smart devices and improved hardware, there is a growing interest to deploy machine learning models on the device to minimize latency and maximize the protection of privacy. However, up to date on-device medical imaging applications are very limited to basic functions, such as DICOM image view, which allows mobile access to PACS system outside a clinic via a network connection. As medical resources in hospitals are falling short under the unprecedented crisis, CXR based population screening emerges as a cost-effective approach to battle the COVID-19 pandemic, particularly in under resourced care facilities. The mobile AI screening approach is expected to not only protect patient privacy, but also assist the first-responder or the caregiver to quickly determine the acuity level with the absence of a board certified radiologist. However, a major challenge that prevents wide adoption of the mobile AI screening approach is lack of lightweight yet accurate and robust neural networks for on-device COVID-19 screening using noisy CXR images often shot with a mobile device. Adequate knowledge has been accumulated from training the large scale DNN systems to accurately discern the subtle difference among the different lung diseases by learning the discriminative CXR imaging features [16, 17] . 
Leveraging these results, we design and implement a novel three-player knowledge transfer and distillation (KTD) framework composed of an Attending Physician (AP) network, a Resident Fellow (RF) network, and a Medical Student (MS) network for on-device COVID-19 screening. In a nutshell, we pre-train a full AP network using a large scale of lung disease CXR images [12, 16] , followed by fine-tuning a RF network via knowledge transfer using labeled COVID-19, pneumonia and normal CXR images, then we train a lightweight MS network for on-device COVID-19 screening using either CXR images or noisy snapshots. The unique features of the KTD framework are knowledge transfer from a large-scale existing lung disease images to enhance the discrimination between COVID-19 and non-COVID pneumonia and novel loss functions to generate refined soft labels (predicted probabilities) to improve knowledge distillation to the MS network, enabling accurate on-device screening. To the best of our knowledge, there is no mobile AI system for on-device COVID-19 screening using CXR images, either in the native format (e.g., DICOM) or recaptured using mobile device. Moreover, the existing cloud based models do not exploit the lung disease imaging features from prior studies and do not give explanations on the screening results right on the CXR images. Here we present COVID-MobileXpert, a novel mobile AI approach for CXR based COVID-19 screening to be reliably deployed at mobile devices for point-of-care testing. It enjoys the following advantages: 1) accurately detecting positive COVID-19 cases particularly from closely related pneumonia cases; 2) identifying the important regions on CXR images that correspond to (hopefully responsible for) the positive screening results; and 3) robust performance on noisy CXR snapshots recaptured using mobile devices. 
Deep learning techniques have also been widely applied to medical image classification and computer-aided diagnosis for early detection of human diseases [15, 18, 19, 20]. Using labeled medical images to train sophisticated convolutional neural networks (CNNs), often pre-trained on a large number of natural images, researchers have achieved disease classification performance that is comparable to or even exceeds that of board certified human radiologists [16, 21, 22]. Wang et al. [12] created the ChestX-ray8 data set of 108,948 frontal-view X-ray images from 32,717 unique patients, labeled with eight diseases mined from the corresponding radiological reports, and trained a unified deep CNN model by using weight parameters from AlexNet, GoogLeNet, VGGNet-16 and ResNet-50, pre-trained using ImageNet, followed by re-training the weights in the penultimate layer. Using this data set, Rajpurkar et al. [16] trained a DenseNet-121 based CheXNet model, which can detect pneumonia from CXRs at a level exceeding practicing radiologists. In [22], the authors trained a 169-layer DenseNet model using a large labeled data set of musculoskeletal radiographs containing 40,561 bone X-rays from 14,863 studies to detect and localize abnormalities, and demonstrated performance comparable to that of the best radiologist. To overcome the issue of label scarcity in medical images, semi-supervised [23], multiple-instance [24] and transfer learning [25] techniques are widely applied to alleviate the need for radiologist-labeled images without compromising performance [26]. In the past few weeks, CNNs have been successfully employed to distinguish COVID-19 from other community acquired pneumonia [6, 7, 27]. Using a collected data set of 4,356 chest CT exams from 3,322 patients, Li et al. [4] trained COVNET, a ResNet-50 based CNN model, to achieve an impressive Area Under the ROC curve (AUROC) value over 0.95.
Huang et al. [28] used deep learning based segmentation and classification approaches to quantify the stages of lung burden change in patients with COVID-19 using serial CT scans. Although CXR images are generally considered less sensitive than 3D chest CT scans, recent CXR based studies using publicly available data sets demonstrate a strong potential for CXR as a point-of-care testing approach for COVID-19 screening [29]. Ghoshal et al. [30] investigated how dropweight based Bayesian CNNs can tackle the uncertainties associated with the small number of labeled images and found that the estimated uncertainty is strongly correlated with the accuracy of prediction. Narin et al. [31] experimented with CNN based ResNet-50, InceptionV3 and Inception-ResNetV2 architectures to classify COVID-19 and normal classes of CXR images. Similar to [6] in CT related studies, they pre-trained the models using ImageNet to alleviate the need for labeled COVID images. Zhang et al. [32] adopted a similar approach in collecting the public data, yet employed an unsupervised anomaly detection approach that detects COVID-19 images as outliers. These studies have demonstrated the strong potential of the CXR based AI approach for point-of-care testing. However, to date, all the AI models trained for COVID screening, either using CT scans or CXR images, are full DNNs that are not suitable for deployment on resource-constrained mobile devices. As there is no existing on-device medical image classification research, the vast majority of the existing work focuses on comparing the performance of different lightweight neural networks such as MobileNetV2 [33], SqueezeNet [34], CondenseNet [35], ShuffleNetV2 [36], MnasNet [37] and MobileNetV3 [38] using small benchmark natural image data sets such as CIFAR 10/100. MnasNet and MobileNetV3 are representative models generated via automatic neural architecture search (NAS) whereas all other networks are manually designed [39].
Due to the practical hardware resource constraints of mobile devices, natural image classification and segmentation performance has been compared in terms of accuracy, energy consumption, runtime and memory complexity, and no single network has demonstrated superior performance in all tasks [40]. Besides tailor-made network architectures for mobile devices, compressing the full DNN at different stages of training also stands as a promising alternative. For in-training model compression, for example, Chen et al. [41] designed a novel convolution operation that factorizes the mixed feature maps by their frequencies so that feature maps that vary spatially slower are stored and processed at a lower spatial resolution, reducing both the memory and computation cost of image classification. Post-training or fine-tuning model compression techniques such as quantization [42] and/or pruning [43] are often used to reduce the model size at the expense of reduced prediction accuracy. Wang et al. [44] demonstrated that 8-bit floating point numbers can represent weight parameters without compromising the model's accuracy. Lou et al. [45] automatically searched for a suitable precision for each weight kernel and chose another precision for each activation layer, and demonstrated reduced inference latency and energy consumption while achieving the same inference accuracy. Tung and Mori [46] combined network pruning and weight quantization in a single learning framework to compress several DNNs without sacrificing accuracy. In order to improve the performance of lightweight on-device models, knowledge distillation [47] is also used, where a full teacher model is trained in the cloud or on an on-premise GPU cluster, and a student model for the mobile device is trained with the 'knowledge' distilled via the soft labels from the teacher model.
Thus the student model is trained to mimic the outputs of the teacher model as well as to minimize the cross-entropy loss between the true labels and its predictive probabilities (soft labels). Knowledge distillation yields compact student models that outperform the compact models trained from scratch without a teacher model [48]. Goldblum et al. [49] attempted to encourage the student network to output correct labels using training cases crafted with a moderate adversarial attack budget to demonstrate the robustness of knowledge distillation methods. Unlike that of natural images, on-device classification of medical images remains largely uncharted territory due to the following unique challenges: 1) label scarcity in medical images significantly limits the generalizability of the machine learning system; 2) vastly similar and dominant fore- and backgrounds make medical images hard samples for learning the discriminating features between different disease classes; and 3) excessive noise, particularly that added when an image is recaptured as a snapshot, can make the difference between CXR images and noisy snapshots larger than that between different disease classes. To tackle these unique challenges we propose a novel three-player framework for training a lightweight network towards accurate and hardware friendly on-device CXR image classification. In Section 3, we describe the architectures for COVID-MobileXpert, the training data set, the three-player knowledge transfer and distillation (KTD) training scheme, and performance evaluation. We employ the DenseNet-121 [14] architecture as the template to pre-train and fine-tune the AP and RF networks, and we use the lightweight MobileNetV2, ShuffleNetV2 and SqueezeNet as the candidate MS networks for on-device COVID-19 screening. Table 1 summarizes the key model complexity parameters [40].
Figure 1 illustrates the three-player KTD training framework, where the knowledge of abnormal CXR images is transferred from the AP network to the RF network and the knowledge of discriminating COVID-19, pneumonia and normal cases is distilled from the RF network to the MS network. In real-world scenarios, a caregiver can either directly use mobile access to the PACS system to view the DICOM images or simply use a mobile device with a camera to capture a snapshot of the screen showing the CXR image. Importantly, the snapshot has its unique noise patterns, such as the Moiré effect and pixel noise, that differ from the 'clean' DICOM image. As a result, the difference between noisy snapshots and CXR images can be even larger than that between different disease classes. It is thus necessary to compile one CXR image data set and another noisy snapshot data set for evaluating the performance of on-device COVID screening. The CXR image data set is composed of 179 CXR images from the normal class [50], 179 from the pneumonia class [50] and 179 from the COVID-19 class containing both PA (posterior anterior) and AP (anterior posterior) positions [51], and we split it into train/validation/test sets with 125/18/36 cases (7:1:2) in each class. Since some patients have multiple CXR images in the COVID-19 class, we sample images by patient for each split so that images from the same patient are not included in both the training and test sets. To create the noisy snapshot data set, we first display the original CXR image on a PC screen and then use Microsoft Office Lens to take snapshots centered on the screen. Using the 'scan to document' function to open the rear camera of the mobile device, we gradually zoom in/out to detect edges and vertices, then take and save the snapshot. A noisy snapshot is an RGB image saved in JPEG format, which we pre-process by converting it to an 8-bit gray-scale image, removing the artificial effects of color and brightness.
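The patient-level sampling described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `split_by_patient`, the `(patient_id, image_path)` record layout, and the choice to apply the 7:1:2 ratio to patients rather than to images are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(records, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split (patient_id, image_path) records into train/val/test.

    All images of one patient land in exactly one split, so no patient
    appears in both the training and the test set. The ratios are applied
    to the number of patients (an assumption of this sketch).
    """
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    patients = sorted(by_patient)            # fixed order before shuffling
    random.Random(seed).shuffle(patients)    # reproducible shuffle

    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group back into its (patient_id, image_path) records.
    return {
        name: [(pid, img) for pid in ids for img in by_patient[pid]]
        for name, ids in groups.items()
    }
```

Because the split is made over patients, the per-split image counts only approximate 7:1:2 when patients contribute different numbers of images.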
As a result, each clean CXR image has a noisy snapshot counterpart, e.g., Figure 2. We pre-train the AP network on the source task, i.e., lung disease classification, and fine-tune, validate and test the RF network on the destination task. Different from recent studies [29, 52] that pre-train the models with natural image data sets such as ImageNet [53], we pre-train the DenseNet-121 based AP network using the more closely related ChestX-ray8 data set [12] of 108,948 lung disease cases to extract the CXR imaging features of lung diseases instead of generic natural imaging features. Specifically, beyond the dense block, we employ a shared fully connected layer for extracting the general CXR imaging features and 8 fully connected disease-specific layers (including pneumonia as one disease layer) to extract disease-specific features (Figure 1). After pre-training with the large ChestX-ray8 data set, the weights defining the general CXR imaging features and the pneumonia disease features are transferred to fine-tune the DenseNet-121 based RF network using a smaller compiled data set of 3 classes of CXR images or noisy snapshots, i.e., COVID-19, normal and pneumonia. Collectively, a total of 537 CXR images are used for fine-tuning, validation and testing of the RF network. In the RF network, the two sets of weight parameters corresponding to the normal and COVID-19 classes are randomly initialized, while the initial values of the other weight parameters are transferred from the pre-trained source model. The network is trained with the Adam optimizer for 50 epochs with a mini-batch size of 32. The parameter values that give rise to the best performance on the validation data set are used for testing. The RF network is then used to train the lightweight MS network, e.g., MobileNetV2, ShuffleNetV2 or SqueezeNet, via knowledge distillation.
In order to accommodate the real-world needs of diverse healthcare settings, for example, CXR images from the on-premise PACS system at hospitals or noisy snapshots recaptured with mobile devices at the bedside, we train the MS network using CXR images and noisy snapshots respectively and assess their individual performance. In Section 4, we design and conduct extensive experiments to evaluate the performance of the compact MS networks in screening COVID-19 CXR images and compare it with the cloud based screening approach based on the large-scale RF network. In order to gain a holistic view of the model behavior, we investigate the performance with regard to multiple choices of loss functions and multiple values of tuning parameters. As stated before, a unique challenge in medical imaging classification is the so-called hard sample problem [54], i.e., subtle differences in the Region Of Interest (ROI) across images with a large amount of shared fore- and background. Motivated by this, we use an in-house developed loss function, i.e., Probabilistically Compact (PC) loss, for generating the soft labels from the RF model and compare it with ArcFace [55], the additive angular margin loss for deep face recognition, using the classical softmax loss as the baseline. Both PC and ArcFace losses are designed for improving classification performance on hard samples. PC loss encourages a maximized margin between the most probable soft label (predicted probability) and the next several most probable labels, whereas ArcFace loss encourages widening the geodesic distance gap between the closest soft labels. In terms of predicted probabilities, DNN robustness benefits from a large gap between f_y(x) and f_k(x) (k ≠ y), where f_y(x) is the predicted probability of the true class and f_k(x) (k ≠ y) is that of the most probable other class.
Indeed, a theoretical study [56] in deep learning shows that the gap f_y(x) − max_{k ≠ y} f_k(x) can be used to measure the generalizability of deep neural networks. The PC loss to improve a CNN's robustness (Eq. 1) is defined over N training samples, with ξ > 0 the probability margin treated as a hyperparameter. Here, we include all non-target classes in the formulation and penalize any classes for each training sample that violate the margin requirement for two reasons: (1) maintaining the margin requirement for all classes is convenient to implement, as the first several most probable classes can change during the training process; and (2) if one of the most probable classes satisfies the margin requirement, all less probable classes will automatically satisfy this requirement and hence have no effect on the PC loss. Compared with previous works that explicitly learn features with large inter-class separability and intra-class compactness, the PC loss avoids assumptions on the feature space; instead, it only encourages feature learning that leads to probabilistic intra-class compactness by imposing a probability margin ξ.
ξ: in the PC loss formula (Eq. 1), a large value encourages probabilistic intra-class compactness.
α: in the knowledge distillation framework [47, 49] (Eq. 2), it regulates the 'strength' of knowledge distillation by specifying the relative contributions of the distillation loss, i.e., KL(S^t_θ(X), T^t(X)), which measures how well the MS model mimics the RF model's behavior using KL divergence, and the classification loss of the MS model, i.e., ℓ(S^t_θ(X), y). S_θ(·) and T(·) represent the MS (student) model and the RF (teacher) model, respectively. Recapturing noise is added to the CXR image X to generate the noisy snapshot. The larger the value of α, the more strongly knowledge distillation is enforced from the RF model to the MS model.
T: in Eq. 2, it represents the temperature, where T = 1 corresponds to the standard softmax loss.
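The two objectives described above can be sketched in plain Python. This is an illustrative reading of the text, not the authors' implementation. The hinge form max(0, f_k(x) + ξ − f_y(x)) summed over non-target classes k, the T² scaling of the KL term, and the (1 − α) weight on the classification loss are assumptions that are consistent with the description and with common knowledge distillation practice. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature=1 gives the standard softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pc_loss(probs_batch, labels, xi):
    """Probability-margin penalty averaged over the N samples (assumed form).

    Every non-target class k contributes max(0, f_k(x) + xi - f_y(x)),
    so classes that already satisfy the margin add nothing.
    """
    total = 0.0
    for probs, y in zip(probs_batch, labels):
        total += sum(max(0.0, p + xi - probs[y])
                     for k, p in enumerate(probs) if k != y)
    return total / len(labels)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, y, alpha, temperature):
    """alpha * T^2 * KL(S^t, T^t) + (1 - alpha) * cross-entropy(S^t, y).

    The argument order of the KL term follows the notation in the text;
    the T^2 factor and the (1 - alpha) weight are assumptions.
    """
    s_t = softmax(student_logits, temperature)
    t_t = softmax(teacher_logits, temperature)
    cross_entropy = -math.log(s_t[y] + 1e-12)
    distill = temperature ** 2 * kl_divergence(s_t, t_t)
    return alpha * distill + (1 - alpha) * cross_entropy
```

With α = 1 the student is trained only to match the teacher's softened outputs, and with α = 0 the objective reduces to ordinary cross-entropy on the true labels.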
As the value of T increases, the probability distribution generated by the softmax function becomes softer, providing more information about which classes the RF model found more similar to the predicted class.
AUROC: it represents the Area Under the Receiver Operating Characteristic (ROC) curve. We plot the ROC curve for assessing the performance of each model, where sensitivity is plotted against 1-specificity calculated using multiple decision thresholds. It demonstrates the trade-off between the true positive rate and the false positive rate using different thresholds. A large value of AUROC represents good performance, where the model achieves both high sensitivity and specificity.
We first report the classification accuracy to select the best MS model under different values of hyperparameters, followed by a systematic evaluation of the model's power to discriminate COVID-19 from non-COVID pneumonia and normal cases using AUROC values. With the knowledge transfer from the AP network pre-trained with a large set of abnormal lung disease cases, and fine-tuned with the new PC loss, the RF network demonstrates a remarkably high accuracy of 93.5% in the classification of CXR images and 89.7% in the classification of noisy snapshots. We then employ dimension reduction techniques, e.g., t-SNE [57], to visualize the three classes of CXR images in a low-dimensional space. As observed in Figure 3, the three classes, either CXR images or noisy snapshots, demonstrate good separability in the manifold learned by ShuffleNetV2 and MobileNetV2 (the left and middle columns) but not by SqueezeNet (the right column). This is consistent with the overall lower classification performance of SqueezeNet than that of ShuffleNetV2 and MobileNetV2 (Tables 2, 3, 4). Importantly, the small intra-class variance and large inter-class separation of both CXR and noisy snapshot images in the feature space learned by ShuffleNetV2 ensure robust on-device COVID-19 screening performance.
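The AUROC metric used in this evaluation can be computed as in the following minimal sketch. The function names are hypothetical. Labels are assumed to be binarized beforehand, for example 1 for COVID-19 and 0 for the comparison class, and `scores` is the predicted probability of the positive class. The first function traces the ROC curve over decision thresholds, and the second uses the equivalent pairwise-ranking formulation.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc_trapezoid(scores, labels):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def auroc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions return the same value on the same inputs. The pairwise form makes explicit that AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.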
Distilling knowledge from the RF network to the lightweight MS network, we observe impressive performance: a vast majority of accuracy values are well above 0.85 for CXR image classification and above 0.80 for noisy snapshot classification. Table 2 shows classification accuracy results of the ShuffleNetV2 architecture with different loss functions and values of tuning parameters using both CXR images and noisy snapshots. It is clear that knowledge distillation is essential to train the lightweight MS network without compromising much accuracy, since the MS network alone, without knowledge distillation, achieves a baseline classification accuracy of 0.843 for CXR images and 0.694 for noisy snapshots, which are much lower than the average performance achieved with knowledge distillation, i.e., 0.890 and 0.782, respectively (Table 2). Overall it is also observed that ShuffleNetV2 performs better on CXR images than on the noisy snapshots, evident from a uniform drop of the AUROC values in a vast majority of comparisons. Looking at Table 2 in more detail, we note that the performance of ShuffleNetV2 is not sensitive to the choice of temperature (T) and strength of distillation (α); however, it is very sensitive to the choice of loss function. Overall, the PC loss developed in-house, which flattens other probable class predictions, performs the best across diverse settings of the tuning parameters, indicating that the quality of knowledge distilled from the RF network to the MS network plays a pivotal role in training the lightweight MS network to ensure accurate on-device COVID-19 screening. The classification performance of MobileNetV2 (Table 3) follows a similar trend to that of ShuffleNetV2 (Table 2) with a similar overall accuracy whereas SqueezeNet (Table 4) the noisy snapshots, which are remarkably lower than those achieved with knowledge distillation shown in Table 3.
Similarly, the MS network trained on SqueezeNet architecture alone without knowledge distillation from the RF model achieves a baseline classification accuracy of 0.732 for the CXR images and 0.769 for the noisy snapshots, which are much lower than those with knowledge distillation shown in Table 4 . Collectively these results further demonstrate that the knowledge distillation is essential to train the lightweight MS network without trading too much accuracy for model compactness. In order to systematically evaluate the performance of the MS networks under the different decision thresholds, we use AUROC value to assess how well the model is capable of discriminating COVID-19 cases from normal cases, pneumonia cases as well as normal plus pneumonia cases. In Figure 4 , both compact MS networks, i.e., ShuffleNetV2 and MobileNetV2, demonstrate a remarkable performance on all discrimination tasks that are comparable to that of the large scale cloud based RF network, i.e., DenseNet-121, either using CXR images or noisy snapshots. Importantly, both ShuffleNetV2 and MobileNetV2 achieve high AUROC values of 0.940 and 0.943 when discriminating COVID-19 cases against mixed pneumonia and normal cases demonstrating a strong potential for on-device screening using noisy snapshots. Besides accurately screening COVID-19 CXR images and noisy snapshots from other lung disease and normal conditions, the model has to explain how and why the prediction result is generated before it is ready to be adopted for on-device screening. We use GRAD-CAM [58] to interpret the COVID-19 screening results, which uses the gradient information and flows it back to the final convolutional layer to decipher the importance of each neuron in classifying an image to each disease class. Figure 5 shows the COVID-19 disease progression in a patient over the four time points, i.e., day 10, day 13, day 17 and day 25 with the worst status on the day 17 then recovered afterwards. 
In Figure 5 , the heatmap starts from right side then spreads to the entire lung and finally migrates back to the right side upon recovery. For on-device COVID-19 screening with resource constraints, resource consumption is also an important consideration for performance evaluation in addition to accuracy. In order to systemically assess the performance of our COVID-19 on-device screening app, we select six mobile systems released following a chronic order, i.e., Nexus One / Nexus S (low-end); Pixel/ Pixel 2 (mid-range) and Pixel 2 XL/ Pixel 3 XL (high-end). Using Pytorch Mobile framework, we deploy the three MS networks to the six Android based mobile systems and compare the resource consumption with regard to CPU, Memory and Energy usages. Figure 6 describes a workflow to build an Android App based on the MS networks for on-device screening. In Table 5 , it is clear that the MobileNetV2 based COVID screening app is the most resource-hungry one, followed by ShuffleNetV2, demonstrated by a much higher resource consumption than SqueezeNet. Thus, the high accuracy achieved by MobileNetV2 and ShuffleNetV2 is at the cost of high resource consumption. Within each app, we observe a downward trend in resource consumption following the chronic order, reflecting a continuous improvement of mobile device hardware. Overall, MobileNetV2 and ShuffleNetV2 based COVID screening apps are more suitable for high-performing mobile device whereas the latter is a good choice for COVID-MobileXpert deployment due to its high accuracy achieved by lower resource consumption. SqueezeNet is more suitable for low-end mobile device with both lower accuracy and resource consumption. The classical two-player knowledge distillation framework [47] has been widely used to train a compact network that is explainable [59] and/or hardware friendly [60] with ample applications such as Electronic Health Record (EHR) based decision support [61] and on-device machine learning [40] . 
In the related task of on-device natural image classification, the teacher network is pre-trained with ImageNet and distills the knowledge to a lightweight student network (e.g., MobileNetV2). This two-player framework, although seemingly successful, can be problematic for the on-device medical imaging based screening described herein. The large gap between natural images and the medical images of a specific disease such as COVID-19 makes knowledge distillation less effective than it is supposed to be. The small number of labeled COVID images for training further aggravates the situation. In our three-player KTD framework, knowledge transfer from the AP network to the RF network can be viewed as a more effective regularization as they are built on the same network architecture, which in turn makes the knowledge distillation more effective since the RF network and the MS network share the same training set. Different from prior work that has extensively investigated the impact of distillation strength and temperature, we uncover a pivotal role of novel loss functions in refining the quality of the knowledge to be distilled. Hence our three-player framework provides a more effective way to train the compact on-device model using a smaller labeled data set while preserving the performance. When tested on an array of mobile devices, ShuffleNetV2 and MobileNetV2 demonstrate better performance at the cost of demanding more system resources. We expect the performance of the MS network to keep improving as more COVID CXR images become available. From a broader perspective, the three-player KTD framework is generally applicable to training other on-device medical imaging classification and segmentation apps for point-of-care screening of other human diseases such as lung [12] and musculoskeletal [22] abnormalities.
Detection of SARS-CoV-2 in different types of clinical specimens
No COVID-19 testing at home yet but quicker options coming
Coronavirus disease 2019 (COVID-19): a systematic review of imaging findings in 919 patients
COVID-19 pneumonia: what has CT taught us?
Performance of radiologists in differentiating COVID-19 from viral pneumonia on chest CT
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
Large-scale screening of COVID-19 from community acquired pneumonia using infection size-aware classification
Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
A role for CT in COVID-19? What data really tell us so far. The Lancet
CT image analytics for COVID-19
China uses AI in medical imaging to speed up COVID-19 diagnosis
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health
Radiologist-level pneumonia detection on chest X-rays with deep learning
Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs
Deep learning of static and dynamic brain functional networks for early MCI detection
Deeply-supervised networks with threshold loss for cancer detection in automated breast ultrasound
A multi-task self-normalizing 3D-CNN to infer tuberculosis radiological manifestations
Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks
Large dataset for abnormality detection in musculoskeletal radiographs
Realistic evaluation of deep semi-supervised learning algorithms
Attention-based deep multiple instance learning
Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning
Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis
Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19
Serial quantitative chest CT assessment of COVID-19: Deep-learning approach
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
COVID-19 screening on chest X-ray images using deep learning based anomaly detection
MobileNetV2: Inverted residuals and linear bottlenecks
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size
CondenseNet: An efficient DenseNet using learned group convolutions
ShuffleNet: An extremely efficient convolutional neural network for mobile devices
MnasNet: Platform-aware neural architecture search for mobile
Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3
Neural architecture search: A survey
On-device machine learning: An algorithms and learning theory perspective
Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution
XNOR-Net: ImageNet classification using binary convolutional neural networks
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding
Training deep neural networks with 8-bit floating point numbers
AutoQ: Automated kernel-wise neural network quantization
CLIP-Q: Deep network compression learning by in-parallel pruning-quantization
Distilling the knowledge in a neural network
Towards understanding knowledge distillation
RSNA pneumonia detection challenge
COVID-19 image data collection
COVID-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
ImageNet: A large-scale hierarchical image database
On the learning property of logistic and softmax losses for deep neural networks
ArcFace: Additive angular margin loss for deep face recognition
Exploring generalization in deep learning
Visualizing data using t-SNE
Grad-CAM: Visual explanations from deep networks via gradient-based localization
Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)
Similarity-preserving knowledge distillation
Distilling knowledge from deep networks with applications to healthcare domain