key: cord-0195018-zvgj2pua authors: Giuste, Felipe; Shi, Wenqi; Zhu, Yuanda; Naren, Tarun; Isgut, Monica; Sha, Ying; Tong, Li; Gupte, Mitali; Wang, May D. title: Explainable Artificial Intelligence Methods in Combating Pandemics: A Systematic Review date: 2021-12-23 journal: nan DOI: nan sha: 9c00c48986a75f88093b8f1ed584458486d3dedc doc_id: 195018 cord_uid: zvgj2pua

Despite the myriad peer-reviewed papers demonstrating novel Artificial Intelligence (AI)-based solutions to COVID-19 challenges during the pandemic, few have made significant clinical impact. The impact of artificial intelligence during the COVID-19 pandemic was greatly limited by lack of model transparency. This systematic review examines the use of Explainable Artificial Intelligence (XAI) during the pandemic and how its use could overcome barriers to real-world success. We find that successful use of XAI can improve model performance, instill trust in the end-user, and provide the value needed to affect user decision-making. We introduce the reader to common XAI techniques, their utility, and specific examples of their application. Evaluation of XAI results is also discussed as an important step to maximize the value of AI-based clinical decision support systems. We illustrate the classical, modern, and potential future trends of XAI to elucidate the evolution of novel XAI techniques. Finally, we provide a checklist of suggestions during the experimental design process supported by recent publications. Common challenges during the implementation of AI solutions are also addressed with specific examples of potential solutions. We hope this review may serve as a guide to improve the clinical impact of future AI-based solutions.

CORONAVIRUS disease 2019 (COVID-19) has become a worldwide phenomenon, with over 272 million cases and over four million lives claimed [1]. Medical imaging and clinical data have been explored as potential supplements to molecular test screening of patients potentially infected with COVID-19 [2]-[4]. The need for fast COVID-19 detection has led to a massive number of Artificial Intelligence (AI) solutions to alleviate this clinical burden. Unfortunately, very few have succeeded in making a real impact [5]. As the world slowly transitions from disease detection and containment to maximizing patient outcomes, so too must AI solutions meet this urgent requirement. To face these new challenges, we must look back to identify areas of improvement in AI. For example, how many of the wildly successful models published in the last year actually made a meaningful clinical impact? A major barrier limiting the real-world utility of AI approaches in clinical decision support is the lack of interpretability in their results. A wide range of imaging and non-imaging clinical data sources are used to extract meaningful patterns for automated COVID-19 detection tasks. The most common data sources used for elucidating COVID-19 pathology include X-ray, computed tomography (CT), and electronic health records. Using imaging and clinical features (including past medical history, current medications, and demographic data), state-of-the-art AI models have achieved impressive performance [3], [6]-[8]. However, this performance often comes at the cost of interpretability, due to the complex data transformations applied to the input data and the large number of parameters optimized by the model without human intervention. This approach can lead to unexpected or inaccurate results [9].
This is especially true if the model learns features unrelated to the challenge being solved, as can be the case when significant confounders, such as patient demographics or underlying disease epidemiology effects, are present within training datasets. Thus, physicians and other healthcare practitioners are often reluctant to trust many high-performing yet unintelligible AI systems. Additionally, lack of interpretability in AI algorithms limits the ability of researchers and model developers to identify potential pitfalls and avenues for improvement. State-of-the-art approaches may be finding non-optimal solutions, or worse, basing their solutions on irrelevant input features [10], [11].

Explainable artificial intelligence (XAI) is a collection of processes and methods that enables human users to comprehend and trust machine learning algorithms' results [12]. XAI techniques improve the transparency of AI models, thus facilitating confident clinical decision-making and increasing the real-world utility of AI approaches. Clinicians benefit from XAI by gaining insight into how the AI models reach solutions from clinical data, as shown in Fig. 1.

The term "interpretability" refers to a property of AI systems in which the process by which they arrive at a conclusion is easily understood. K-nearest neighbors, decision trees, logistic regression, linear models, and rule-based models are all popular interpretable machine learning methods. Explainable AI is frequently used to refer to methods (usually post-hoc) for enhancing comprehension of black-box models such as neural networks and ensemble models. Explainable AI methods attempt to summarize the rationale for a model's behavior or to generate insights into the underlying cause of a model's decision. Both interpretability and explainability are frequently used interchangeably, and both seek to shed light on the model's credibility. In this review, we will focus on XAI methods used in clinical settings.

Fig. 1. AI-based clinical solutions should meet three criteria: achieve high performance (the model needs to perform well), instill user trust (the clinician needs to believe the model results, e.g., "I see the evidence for the prediction", "I see potential novel biomarkers", "I understand it and trust it", "I can provide clinical feedback"), and generate user response (the clinician needs to act based on model results), all of which demonstrate the importance of XAI in clinical applications.

AI-based clinical solutions should meet three criteria: achieve high performance, instill user trust, and generate user response. Specifically, the model should have achieved sufficient performance at its task on a real-world dataset not used during the training process in order to be considered for real-world use. Guidelines for establishing and reporting real-world clinical trial performance can be found in the SPIRIT-AI [13] and CONSORT-AI [14] guidelines. Trust is provided to the user through the important metrics used to obtain the model prediction. Finally, no solution is effective if it does not result in a change in user response. This response may include a change in treatment plan, patient prioritization, or diagnosis. This response must be consistent with clinical expertise and evidence-based protocols. This review primarily focuses on the XAI solutions affecting user trust, but model performance and user interfaces are also mentioned where appropriate.
XAI can allow for validation of extracted features, confirm heuristics, identify subgroups, and generate novel biomarkers [15]. In addition, XAI can also support research conclusions and guide research field advancement by identifying avenues of model performance improvement. In this systematic review, we describe the current usage of XAI techniques to solve COVID-19 clinical challenges. Upon review of the current literature leveraging AI for COVID-19 detection and risk assessment, we provide strong support for their increased use if clinical integration is the goal.

The remainder of this paper is structured as follows (see Fig. 2): Section II illustrates how the XAI-based studies applicable to COVID-19 were selected using the Preferred Reporting Items on Systematic Reviews and Meta-analysis (PRISMA) model and exclusivity criteria; Section III provides a comprehensive overview of the XAI approaches used to support AI-enabled clinical decision support systems during the COVID-19 pandemic; Sections IV and V describe the representation of explanations and the evaluation of XAI methods; Sections VI and VII summarize the contribution of this paper, provide a schema of the integration of an explainable AI module in both model development and clinical practice, and discuss potential challenges and future work of XAI.

Fig. 2. Overview of paper structure. XAI techniques increase the transparency of AI models, enabling more confident clinical decision-making and increasing the practical utility of AI approaches. In this paper, we present a comprehensive review of explanation generation, representation, and evaluation methods used to support AI-enabled clinical decision support systems during the COVID-19 pandemic. (Representation examples shown in the figure include a system of color-coding to highlight important features, scores of input features based on their influence on targets, collection of user feedback, and display of the decision-making process.)

This systematic review paper follows the PRISMA guidelines [16], as shown in Fig. 3(a). PRISMA is an evidence-based minimum set of items for systematic reviews and meta-analyses with the goal of assisting authors in improving their literature review. We first conducted a systematic search of papers using three large public databases: PubMed, IEEE Xplore, and Google Scholar. The keywords for our paper identification were: (deep learning OR machine learning OR artificial intelligence) AND (interpretable OR explainable OR interpretable artificial intelligence OR explainable artificial intelligence) AND (COVID-19 OR SARS-CoV-2). We limited the date of publication between January 2020 and October 2021 to reflect the most recent progress. We selected this search approach with the help of Emory University librarians to ensure adequate breadth and specificity of search results. The word cloud generated by the titles of selected articles also indicates our focus on XAI and COVID-19, as shown in Fig. 3(b). Our initial search matched 45 papers in PubMed, 46 in IEEE Xplore, 107 in Google Scholar, and 77 from external resources. After identifying initial papers, we then conducted a manual screening by reading the titles and abstracts and eliminating 67 papers that did not discuss XAI or COVID-19. Among the 187 remaining papers, we further excluded 36 papers for other reasons such as quality. In total, we included 151 papers from PRISMA in this study.
These studies were thoroughly analyzed in order to gain a better understanding of XAI approaches, which is critical for model interpretation and has become a necessary component for any AI-based approach seeking to make a clinical impact. In this section, we introduce XAI approaches used to support AI-enabled clinical decision support systems during the COVID-19 pandemic. We categorize them as follows: data augmentation, outcome prediction, unsupervised clustering, and image segmentation. Moreover, we organized XAI methods according to the underlying theory within each task, as shown in Fig. 4. Additional technical details and clinical applications will be discussed below. An overview of XAI methods implemented in clinical applications is presented in Fig. 2. In addition, we also summarized publicly available COVID-19-related imaging (Table I) and non-imaging datasets (Table II) that were used in these clinical applications.

The need for labeled data for model training was highlighted in the early stages of the COVID-19 pandemic. This was also a point in time where AI-based solutions could have made the most impact by supplementing scarce public datasets. Future pandemics will likely result in the same urgency for labeled data, and AI solutions would greatly benefit from synthetic data augmentation. Generative Adversarial Networks (GANs) are used to supplement available labeled COVID-19 radiology data with synthetic images and labels. This allows for improved model training with limited labeled datasets by increasing the number of labeled images available for training. An example of classical and modern data augmentation approaches with model interpretation is shown in Fig. 5. Singh et al. tested a wide variety of GAN models to generate synthetic X-ray images while training a COVID-19 detection deep learning model named COVIDscreen [17]. They compared the quality of four different GAN-based X-ray image generators, including Wasserstein GAN (WGAN), least squares GAN (LSGAN), auxiliary classifier GAN (ACGAN), and deep convolutional GAN. They visualized the resulting synthetic X-ray images and showed that WGAN produces visibly higher quality images than the tested alternatives. To the best of our knowledge, this was the first publication to show successful X-ray image generation for COVID-19 data augmentation. A significant limitation of this study was that, although they generated realistic X-ray images using WGAN, they did not leverage this additional data to improve their classifier performance. This is likely due to the lack of labels for the generated images. A subsequent approach addressed this limitation with auxiliary classifier GANs [18]: ACGANs take both a label and noise as input to generate new images with known labels. Using COVID-19 status as the label, the proposed model CovidGAN is able to generate normal and COVID-19 images. The authors train a convolutional neural network (CNN) COVID-19 classifier and compare its performance when trained on a real labeled dataset and a dataset augmented with synthetic images from CovidGAN. They demonstrate that augmentation of their labeled dataset with synthetic images improves classifier performance from 85% to 95% classification accuracy. Loey et al. [19] trained four CNN classifiers to detect COVID-19 within chest CT images. Synthetic CT images were generated with a conditional GAN (CGAN). They compared the performance of each classifier when trained with four different datasets.
The training datasets include: the original dataset alone, the original with morphological augmentation, the original with synthetic images, and the original with morphological augmentation combined with synthetic images. They showed that the best classifier (ResNet50) was trained with the original dataset with morphological augmentation (82.64% balanced accuracy). Although GANs are widely used for clinical image generation, XAI techniques are not commonly used to understand how they generate the final images from the latent space. Without XAI, it is difficult to detect potential biases in generated images. This is especially important when models are trained on small clinical datasets and subject to a wide range of confounding variables (e.g., hospital-specific signal properties associated with COVID-19 diagnosis). The following novel XAI techniques allow for the interpretation of the GAN latent space in order to understand how sampling of the latent space affects the final image. Voynov and Babenko [20] created a GAN learning scheme to maximize the interpretability of the GAN latent space. This approach allows the latent space to describe a set of independent image transformations. They showed that this latent space can be visually interpreted and manipulated to generate synthetic images with specific properties (e.g., object rotation, background blur, zoom, etc.). Their method produced synthetic images with interpretable latent space sampling effects across a wide range of datasets, including MNIST [21], AnimeFaces [22], CelebA-HQ [23], and BigGAN-generated images [24]. They show that their interpretation of the latent space can be used to create images with specific properties including zoom, background blur, hair type, skin type, glasses, and many others. These properties were specific to the dataset the GANs were trained on. Härkönen et al. [25] also sought to utilize the GAN latent space for image synthesis with specific properties. Instead of re-training models to isolate the most interpretable latent space axes, they take existing GANs and identify explainable latent space axes. This allowed them to take an image and change its properties (e.g., converting concrete to grass, or changing object color). The interpretable latent space axes were extracted using principal components analysis (PCA), which requires no additional model training. This technique could also be used to alter image properties (e.g., adding wrinkles and white hair to a person) while inheriting the label of the original image. This approach allows synthesis of additional labeled images with known object properties. GANs for generating additional radiology images can be interpreted to identify directions of greatest interpretability. This would allow users to understand which image properties the generator is trained to reproduce. There is also the potential to identify latent space directions which significantly correlate with the presence of COVID-19 infection. Examining these vectors could allow for a better understanding of COVID-19 disease pathology. Non-COVID-19 directions can be used to alter labeled images without affecting class labels, which would allow for dataset augmentation with interpretable noise. This image augmentation could improve classifier performance by training the classifier on a wider range of images, reducing the potential for overfitting. The latent spaces of COVID-19 GANs are not being examined enough for interpretable features. This is a missed opportunity to identify novel COVID-specific image properties.
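To make the PCA-based latent-direction idea of [25] concrete, the following minimal sketch fits PCA to sampled latent codes and walks along one principal direction while observing the generated output. The generator here is a small hypothetical stand-in rather than any published model, and, as noted in the comments, GANSpace-style analyses typically apply PCA to intermediate latent or feature activations rather than the isotropic input noise.

```python
# A minimal, illustrative sketch of PCA-based latent-direction discovery in the spirit of [25].
# `generator` is a hypothetical stand-in for a trained GAN generator (e.g., one trained on chest X-rays).
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

latent_dim = 128
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 64 * 64))

# 1) Sample many latent codes and fit PCA to identify dominant directions.
#    Note: in practice PCA is applied to intermediate latent/feature activations,
#    since the isotropic input noise itself has no preferred directions.
z = torch.randn(10_000, latent_dim)
pca = PCA(n_components=10).fit(z.numpy())
directions = torch.tensor(pca.components_, dtype=torch.float32)  # (10, latent_dim)

# 2) Walk along one principal direction and inspect how the generated image changes.
z0 = torch.randn(1, latent_dim)
with torch.no_grad():
    for alpha in (-3.0, 0.0, 3.0):
        img = generator(z0 + alpha * directions[0]).reshape(64, 64)
        print(f"alpha={alpha:+.1f}  mean intensity={img.mean().item():.3f}")
```

In a COVID-19 setting, such directions could then be inspected, or shown to radiologists, to check whether they encode clinically meaningful properties or confounders such as scanner-specific artifacts.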
Using XAI to understand latent space effects on image generation would also allow generation of images with desired properties. XAI also allows examination of image transformation "directions" (e.g., object rotation, zoom) to ensure that they are independent, and not correlated with potential sources of confounding (e.g., scanner model, hospital source, and technician bias). In future pandemics, reliable and explainable synthetic data augmentation approaches may facilitate the training of high-performing AI models to help in the clinical arena. In addition to data augmentation, synthetic examples can be used to improve model robustness to outliers. Rahman et al. showed that many COVID-19 diagnostic models are vulnerable to attacks by adversarial examples [26]. Palatnik de Sousa et al. [27] also demonstrated the utility of adding random "color artifact" perturbations to CT images to identify the model architectures most robust to such perturbations. This illustrates the importance of robust validation of models prior to their integration within clinical settings. XAI may also be used to verify the validity of a model's approach to guard against such unexpected, and potentially harmful, results.

Due to their rapid acquisition times and accessibility, imaging modalities such as X-rays and CT scans have aided clinicians tremendously in diagnosing COVID-19. Radiographic signs, such as airspace opacity, ground-glass opacity, and subsequent consolidation, aid in the diagnosis of COVID-19. However, medical imaging studies can contain hundreds of slices, making diagnosis difficult for clinicians. COVID-19 also exhibits similarities to a variety of other types of pneumonia, posing an additional challenge for clinicians. Although AI-based clinical decision support systems outperform conventional shallow models that have been adapted for clinical use, clinicians frequently lack trust in or understanding of them due to unknown risks, posing a significant barrier to widespread adoption. Thus, XAI-assisted diagnosis via radiological imaging is highly desirable, as it can be viewed as an explainable image classification task for distinguishing COVID-19 from other pneumonia and healthy subjects, as shown in Fig. 6. Another important clinical application in the outcome prediction task is risk prediction. Clinicians and researchers use Electronic Health Records (EHRs) to predict the risk of adverse clinical events, such as mortality or ICU readmission, and to identify top-ranking clinical features to mitigate negative consequences (see Fig. 7). Interpretation by feature scoring, also known as saliency, relevance, or feature attribution, is the most common XAI strategy in outcome prediction. Interpretation by feature scoring finds evidence supporting individual predictions by calculating importance scores associated with each feature of the input. Specifically, given an input, we need to find a vector of importance scores that is the same size as the input. In general, feature scoring can be grouped into five categories: perturbation-based, activation-based, gradient-based, mixed-based (combination of activation-based and gradient-based), and attention-based approaches.

1) Perturbation-based Approach: Perturbation is the simplest way to analyze the effect of changing the input features on the output of an AI model. This can be implemented by removing, masking, or modifying certain input features, running the forward pass, and then measuring the difference from the original output.
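As a concrete illustration of this perturbation procedure, the following minimal sketch slides a white patch over a toy image and records the drop in a classifier's confidence score; the classifier here is a hypothetical stand-in, not a model from any of the cited studies.

```python
# Minimal occlusion-map sketch of the perturbation idea described above.
# `model` is a hypothetical placeholder for a trained COVID-19 image classifier.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2), nn.Softmax(dim=1))  # stand-in classifier
image = torch.rand(1, 1, 64, 64)   # one grayscale image with toy values
target_class, patch, stride = 0, 8, 8

with torch.no_grad():
    baseline = model(image)[0, target_class].item()
    heatmap = torch.zeros(64 // stride, 64 // stride)
    for i in range(0, 64, stride):
        for j in range(0, 64, stride):
            occluded = image.clone()
            occluded[..., i:i + patch, j:j + patch] = 1.0        # replace region with a white patch
            score = model(occluded)[0, target_class].item()
            heatmap[i // stride, j // stride] = baseline - score  # large drop => important region
print(heatmap)
```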
The input features affecting the output the most are ranked as the most important features. Permutation- or occlusion-based methods measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature occlusion study [28] was performed to show the influence of occluding regions of the input image on the confidence score predicted by the CNN model. The occlusion map was computed by replacing a small area of the image with a pure white patch and generating a prediction on the occluded image. While systematically sliding the white patch across the whole image, the prediction score on the occluded image was recorded as an individual pixel of the corresponding occlusion map. In biomedical applications, Tang et al. [29] utilized occlusion mapping to demonstrate that networks learn patterns agreeing with accepted pathological features in Alzheimer's disease. In COVID-19 imaging applications, Gomes et al. [30] presented an interpretable method for extracting semantic features from X-ray images that correlate to severity from a data set with patient ICU admission labels. The interpretable results pointed out that a few features explain most of the variance between patients admitted to the ICU. To properly handle the limited data sets, a state-of-the-art lung segmentation network was also trained and presented, together with the use of low-complexity and interpretable models to avoid overfitting. Casiraghi et al. [31] calculated COVID-19 patient risk for significant complications from radiographic features extracted using deep learning and non-imaging features. Random Forest (RF) and Boruta were used for feature selection. The most important features were then used to train a final RF model to predict risk. In order to maximize final model interpretability, they devised a sequence of steps to generate an association decision tree from the final RF model. The final association tree is easily interpretable by experts. Another perturbation-based approach is Shapley value sampling [32], which estimates input feature importance via sampling and re-running the model. Calculating these Shapley feature importance values is computationally expensive, as the network has to be run (number of samples × number of features) times. Lundberg et al. [33] proposed a fast implementation for tree-based models, named SHapley Additive exPlanations (SHAP), to speed up this calculation. Shapley values can be calculated to discover how to divide the payoff equitably by treating the input features as participants in a coalition game. SHAP has been shown to be helpful in explaining clinical decision-making in the medical field from both image [34] and non-image [35] inputs and has also been well explored in COVID-19 applications [36]-[38]. Similarly, Local Interpretable Model-Agnostic Explanations (LIME) [39] is a procedure that enables an understanding of how the input features of a deep learning model affect its predictions. For instance, when used for image classification, LIME determines the set of super-pixels (patches of pixels) that have the strongest relationship with a prediction label. LIME generates explanations by creating a new dataset of random perturbations (each with its own prediction) around the instance of interest and then fitting a weighted local surrogate model. Typically, this surrogate is a simpler model with natural interpretability, such as a linear regression model.
LIME generates perturbations by turning on and off a subset of the super-pixels in the image. To generate a human-readable representation, LIME attempts to determine the importance of contiguous super-pixels in a source image relative to the output class. It has been widely implemented in COVID-19 diagnosis tasks [38], [40]-[42] to further explain the process of feature extraction, which contributes to a better understanding of what features in CT/X-ray images characterize the onset of COVID-19. Ahsan et al. [40] implemented LIME to interpret top features in COVID-19 X-ray imaging and build trust in an AI framework to distinguish patients with COVID-19 symptoms from other patients. Similarly, Ong et al. [38] implemented both SHAP and LIME to explain how SqueezeNet performs COVID-19 classification and to highlight the areas of interest, helping to increase the transparency and interpretability of the deep model.

2) Activation-based Approach: Activation-based approaches can identify important regions in a forward pass by obtaining or approximating the activations of intermediate variables in a DL model. Because extracted features within deep layers are closer to the classification layer, they capture more class-discriminative information than those in bottom layers. Erhan et al. [43] concentrated on input patterns that maximize the activation of a particular hidden unit, called Activation Maximization, to illustrate the relevance of features in DL models. Zhou et al. [44] proposed Class Activation Maps (CAM), which used global average pooling to calculate the spatial average of feature maps in the last convolutional layer of a CNN. Han et al. [45] proposed attention-based deep 3D multiple instance learning (AD3D-MIL) to semantically generate deep 3D instances following the potential infection regions. Additionally, AD3D-MIL used attention-based pooling to gain insight into each instance's contribution over a broader spectrum, allowing for more in-depth analysis. In comparison to conventional CAM, AD3D-MIL was capable of precisely detecting COVID-19 infection regions via key instances in 3D models. It achieved accurate and interpretable COVID-19 screening that has the potential to be generalized to large-scale screening in clinical practice.

3) Gradient-based Approach: Gradient-based approaches identify important features by evaluating gradients of an input through back-propagation. The intuition behind this idea is that input features with large gradients have the largest effects on predictions. Simonyan et al. [47] constructed the importance map of input features by calculating the absolute value of the partial derivatives of the class score with respect to the input through back-propagation. However, the feature importance calculated in this way can be noisy because of saturation problems caused by non-linear operations such as rectified linear units (ReLU). That is, changes in gradients can be removed in the backward pass if the inputs to a ReLU are negative. To address this issue, several modifications to the way ReLU is handled have been proposed, as illustrated in Fig. 8. Springenberg et al. [46] proposed guided back-propagation by combining standard back-propagation and the "deconvnet" approach: keep gradients only when both the bottom input and the top gradients are positive. Thus, guided back-propagation can sharpen the feature importance scores compared to vanilla back-propagated gradients.
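A minimal sketch of vanilla gradient saliency is shown below; the small fully-connected network is a hypothetical stand-in for a real classifier. Guided back-propagation would additionally modify the ReLU backward pass so that only positive gradients are propagated.

```python
# Minimal gradient-based saliency sketch (vanilla gradients in the spirit of Simonyan et al. [47]).
# `model` is a hypothetical stand-in, not a model from the cited studies.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32), nn.ReLU(), nn.Linear(32, 2))
image = torch.rand(1, 1, 64, 64, requires_grad=True)   # toy grayscale input

score = model(image)[0, 1]              # class score of interest (e.g., a hypothesized "COVID-19" class)
score.backward()                        # back-propagate the class score to the input pixels
saliency = image.grad.abs().squeeze()   # pixel-wise importance map (64 x 64)
print(saliency.shape, float(saliency.max()))
```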
Layer-wise Relevance Propagation (LRP), proposed by Bach et al. [48], is also used to find relevance scores for individual features in the input data by decomposing the output predictions of DL models. The relevance score for each input feature is derived by back-propagating the class score of the output node towards the input layer. The propagation is governed by a stringent conservation property, which requires an equal redistribution of the relevance received by a neuron. LRP was used in COVID-19 X-ray imaging to offer reasons for diagnostic predictions and to pinpoint crucial spots on the patients' chests [49], [50]. Saliency map generation and analysis were first introduced by Simonyan et al. [47] to calculate pixel importance. The importance score of each pixel is generated using the gradient of the output class category relative to the input image, and a meaningful summary of pixel importance can be obtained by examining which positive gradients had the most effect on the output. Shamout et al. [51] proposed a data-driven approach for automatic prediction of deterioration risk using a deep neural network that learns from chest X-ray images and a gradient boosting model that learns from routine clinical variables. To illustrate the interpretability of the proposed model, they generated saliency maps for all time windows (24, 48, 72, and 96 h) to highlight regions that contain visual patterns such as airspace opacities and consolidation, which are correlated with clinical deterioration. These saliency maps could be used to guide the extraction of six region-of-interest patches from the entire image, each of which is then assigned a score indicating its relevance to the prediction task. Similarly, [52]-[56] also include saliency maps as an explainable deliverable to interpret deep models and find potential infection regions in COVID-19 diagnosis and detection.

4) Mixed-based Approach: Both activation-based and gradient-based methods have their own sets of benefits and drawbacks. Specifically, activation-based methods generate feature scores that are more class-discriminative, but they suffer from the coarse resolution of importance scores. On the other hand, although gradient-based methods produce fine-resolution feature scores, they tend to lack the ability to differentiate between classes. Gradient-based and activation-based approaches can be combined to produce feature importance scores that are both fine-grained and class-discriminative. Gradient-weighted Class Activation Mapping (Grad-CAM) [57], proposed by Selvaraju et al., uses the gradients flowing into the last convolutional layer to weight the activation maps from the forward pass. The resolution is enhanced by multiplying Grad-CAM with guided-backpropagated gradients. Class-specific queries and counterfactual explanations supported by Grad-CAM enable the visualization of portions of an image that have a detrimental impact on model output, as shown in Fig. 9. Grad-CAM++ [58] replaces the globally averaged gradients in Grad-CAM with a weighted average of the pixel-wise gradients, since the weights of pixels contribute to the final prediction, which leads to better visual explanations of CNN model predictions. It addresses the shortcomings of Grad-CAM, especially multiple occurrences of a class in an image and poor object localization. Because vanilla CAM requires a specific architecture in which global average pooling feeds directly into a linear classification layer, it is often unsuitable for interpreting deep learning models in COVID-19 image classification tasks.
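A minimal sketch of the Grad-CAM computation just described is given below; the two-stage CNN is a hypothetical stand-in for a real COVID-19 classifier, and the coarse map is upsampled to the input resolution at the end.

```python
# Minimal Grad-CAM sketch following the procedure of Selvaraju et al. [57].
# The network below is a hypothetical stand-in for a trained COVID-19 classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

features = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                         nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())      # ends at the "last conv layer"
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))

image = torch.rand(1, 1, 64, 64)
fmap = features(image)                        # (1, 16, 32, 32) activations of the last conv layer
score = classifier(fmap)[0, 1]                # class score of interest
grads = torch.autograd.grad(score, fmap)[0]   # gradients flowing back into the feature maps

weights = grads.mean(dim=(2, 3), keepdim=True)             # global-average-pooled gradients
cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))    # weighted sum of maps, then ReLU
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(cam.squeeze().shape)   # coarse heat-map upsampled to the input resolution (64 x 64)
```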
Grad-CAM and Grad-CAM++ both enhanced the CAM procedure to enable better visualizations of deeper CNN models and are usually considered the most popular interpretation strategy in COVID-19 automatic diagnosis on radiographic imaging [7], [59]-[64]. Additionally, Oh et al. [65] proposed a patch-wise deep learning architecture to investigate potential biomarkers in X-ray images and found globally distributed localized intensity variations, which can be a discriminatory feature for COVID-19. They extended the idea of Grad-CAM to a novel probabilistic Grad-CAM that took patch-wise disease probability into account, resulting in more precise interpretable saliency maps that are strongly correlated with radiological findings.

5) Attention-based Approach: The attention mechanism is a critical component of human perception, as it enables humans to selectively focus on critical portions of an image rather than processing the entire scene. Simulating the human visual system's selective attention mechanism is also critical for comprehending the mechanisms underlying black-box neural networks. Attention mechanisms have been widely applied in computer vision [66], endowing models with two useful characteristics: 1) determining which portion of the inputs to focus on; and 2) allocating limited computing resources to the most critical components. The effectiveness of attention mechanisms has been demonstrated in a variety of medical image analysis tasks. Specifically, several state-of-the-art methods have been proposed to leverage attention mechanisms in order to improve the discriminative capability of classification models for both X-ray [67] and CT image analysis tasks [68], [69]. In the application of COVID-19 diagnosis, Shi et al. [70] proposed an explainable attention-transfer classification model based on a knowledge distillation network structure to address the difficulties associated with automatically differentiating COVID-19 and community-acquired pneumonia from healthy lungs in radiographic imaging. Extensive experiments on public radiographic datasets demonstrated the explainability of the proposed attention module in diagnosing COVID-19. In addition to medical imaging, the attention mechanism is also useful in other feature interpretation settings, such as the analysis of unstructured clinical notes with natural language processing (NLP). Diagnostic coding of clinical notes is a task that aims to provide patients with a coded summary of their disease-related information. Recently, Dong et al. [71] proposed a novel Hierarchical Label-wise Attention Network (HLAN) to automate such a medical coding process and to interpret model prediction results by evaluating the attention weights at the word and sentence levels. The label-wise attention scores in the proposed HLAN model provide comprehensive and robust explanations to support the prediction. Zhang et al. [72] proposed Patient2Vec to learn interpretable deep representations and predict the risk of hospitalization from EHR data. The backbones of the model are gated recurrent units (GRUs) and a hierarchical attention mechanism that learn and interpret the importance of clinical events for individual patients. Recurrent neural network (RNN)-based variants with attention modules have been widely explored for severity assessment in EHRs [73]-[79]. Choi et al. [73] proposed a reversed time attention model (RETAIN) to learn interpretable representations of EHRs.
RETAIN is based on a two-level neural attention model that detects significant clinical variables in influential hospital visits in reverse time order. Kaji et al. [75] combined RNNs with a variable-level attention mechanism to interpret the model and results at the level of input variables. Shickel et al. [76] proposed an interpretable illness severity scoring framework named DeepSOFA that can predict the risk of in-hospital mortality at any time point during the ICU stay. By incorporating a self-attention mechanism, the RNN variant with gated recurrent units (GRUs) analyzes hourly time-series clinical data in the ICU. The self-attention module in the recurrent neural network is designed to highlight the time steps of the input data which the model perceives to be most important in mortality prediction. Yin et al. [77] developed a domain-knowledge-guided recurrent neural network, an interpretable RNN model with a graph-based attention mechanism that incorporates clinical domain knowledge learnt from a public clinical knowledge graph. Li et al. [78] extended the proposed method to enable exploration and interpretation of clinical risk prediction tasks through visualization of, and interaction with, deep learning models. In addition to RNN variants, Lauritsen et al. [80] proposed an explainable AI Early Warning Score (xAI-EWS) to predict acute critical illness from EHRs. The proposed model consists of a Temporal Convolution Network (TCN) prediction module and a Deep Taylor Decomposition (DTD) explanation module. By computing back-propagated relevance scores, the DTD module identifies the relevant clinical parameters at a given time for a prediction generated by the TCN. The xAI-EWS enables explanations in real time and allows physicians to understand which clinical variables or parameters cause high EWS scores or changes in the EWS score. However, the attention mechanism continues to struggle when confronted with missing coding, rare labels, or clinical notes containing subtle errors. Additionally, clinical notes in real-world clinical practice frequently contain multiple sentences, and it is unknown how well the attention mechanism would function when interpreting multiple sentences. Furthermore, external domain knowledge in the medical field is required to verify interpretation results. In general, the attention mechanism has enormous potential for emphasizing critical features and fostering trust in clinical practice.

Development of an AI-based diagnosis system for COVID-19 was different from traditional epidemiological challenges: in the early stage of a new disease there is a limited amount of available data, especially diagnostic information [82]. The major downside of traditional deep learning methods is that they largely rely on the availability of labeled data, while COVID-19 datasets often contain incomplete or inaccurate labels. In biomedical applications, unsupervised learning has the benefit of not needing labeled data to train, extract features, and cluster data, which makes it a great candidate for COVID-19 diagnosis (see Fig. 10).

Fig. 10. Examples of classical and modern XAI approaches in the unsupervised clustering task. Unsupervised clustering has benefited from the use of latent spaces generated by deep learning models to generate sample similarities. This shift from the conventional approach of calculating input feature distances enables the use of custom transformations to optimize the space in which similarity is measured.
This can result in improved sample disentanglement. The histopathology images are adapted from [81].

The application of unsupervised learning approaches, especially clustering techniques, represents a powerful means of data exploration. Discovering underlying data characteristics, grouping similar measurements together, and identifying patterns of interest are some of the applications which can be tackled through clustering. Being unsupervised, clustering does not always provide clear and precise insight into the produced output, especially when the input data structure and distribution are complex and unlabeled. Applying XAI can allow researchers to understand the reasons leading to a particular decision under clinical scenarios and suggest an explanation of the clustering results for the end-users. Recent advances in Auto-Encoders (AEs) have shown their ability to learn strong feature representations for image clustering [83]-[85]. By carefully designing a constraint on the distance between data points and cluster centers, Song et al. [83] artificially re-aligned each point in the latent space of an AE to its nearest class neighbors during training to obtain a stable and compact representation suitable for clustering. Lim et al. [84] generalized Song's approach by introducing a Bayesian Gaussian mixture model for clustering in the latent space and replacing the input points with probability distributions, which can better capture hidden variables and hyperparameters. Prasad et al. [85] introduced a Gaussian mixture prior for Variational Auto-Encoder-based clustering to efficiently learn the data distribution and discriminate between different clusters in the latent space. In addition to the guided feature representations achieved by AEs, King et al. [86] applied chest X-ray images of COVID-19 patients to a Self-Organizing Feature Map (SOFM) and found a distinct classification between COVID-19 and healthy patients. SOFMs were originally proposed to provide data visualization and to reduce the dimensionality of data to a map in order to understand high-dimensional data; here, they were used to cluster unlabeled X-ray images. The SOFM applied competitive learning to selectively tune the output neurons to the classes of the input patterns and then arrange their weights in locations relative to each other based on feature similarities. They demonstrated that image clustering methods, specifically SOFM networks, can cluster COVID-19 chest X-ray images and extract their features successfully to generate explainable results. Yadav [87] proposed a deep unsupervised framework called Lung-GANs to learn interpretable representations of lung disease images using only unlabeled data and to classify COVID-19 from chest CT and X-ray images. They extracted the lung features learned by the model to train a support vector machine and a stacking classifier and demonstrated the performance of the proposed unsupervised models in lung disease classification. They visualized the features learnt by Lung-GANs to interpret the deep models and empirically evaluated their effectiveness in classifying lung diseases. Singh et al. [88] used image embeddings generated from a prototypical part network (ProtoPNet)-inspired network to calculate similarities and differences between X-ray image patches and known examples of pathological and healthy patches. This metric was then used to classify subjects into COVID-19 positive, pneumonia, or healthy classes.
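The common thread of these works, clustering in a learned latent space rather than in raw pixel space, can be sketched minimally as follows; the encoder below is a hypothetical stand-in for a trained auto-encoder's encoder, and the images are random toy data.

```python
# Minimal sketch of clustering in a learned latent space; `encoder` stands in for a trained AE/VAE encoder.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))    # stand-in for a trained encoder
images = torch.rand(200, 1, 64, 64)                              # unlabeled toy images

with torch.no_grad():
    latents = encoder(images).numpy()                            # 32-dimensional embeddings

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latents)
# Cluster assignments can then be reviewed by experts, e.g., to check that clusters
# align with clinical categories or to extrapolate labels within a cluster.
print(clusters[:20])
```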
The task of image clustering in COVID-19 and other clinical scenarios naturally requires a good feature representation to capture the distribution of the data and subsequently differentiate one category from another. In general, unsupervised clustering is an XAI technique which can be implemented to validate that images cluster into meaningful groups and to facilitate expert annotation by extrapolating labels within samples belonging to the same cluster, when labels need to be estimated.

Segmentation algorithms make pixel-level classifications of images, and the overall segmentation produced provides insight into the decisions of the model. In the realm of XAI, image segmentation itself can be considered highly interpretable. Therefore, explanations of the segmentation process are currently not widely explored for medical image analysis. In the current climate, segmentation algorithms function as useful tools for isolating regions significant to COVID-19 diagnosis or for determining infection severity. Application of XAI techniques to segmentation could provide valuable information to improve COVID-19 segmentation approaches, as shown in Fig. 11. Current COVID-19 segmentation approaches often use convolutional neural networks to delineate the regions of interest. One example was developed by Saeedizadeh et al., who segment CT images of COVID-19 patients with a U-Net-based model which they call TV-UNet [89]. The framework was trained to detect ground-glass regions at the pixel level, which are indicative of infected regions, and to segment them from normal tissue. TV-UNet differs from regular U-Net by the addition of an explicit regularization term in the training loss function, which the authors report improves connectivity of the predicted segmentations. Their model was trained on a COVID-19 CT segmentation dataset with three different types of ground truth masks and reported an average Dice coefficient of 0.864 and an average precision of 0.94. However, the results of the segmentation algorithm do not provide any intuition on why the model made the decisions it did. Part of this is due to the black-box nature of U-Net. The skip connections between layers are inherently obscure to human intuition, which makes it difficult to understand how U-Net decided to apply the labels. Application of a technique that explains the model's decision-making process could provide information on possible biases in the model and ways to improve it. Pennisi et al. [90] achieved over 90% sensitivity and specificity for COVID-19 lesion categorization using lung lobe segmentation followed by lesion classification. In addition, they also created a clinician-facing user interface to visualize model predictions. This expert oversight was leveraged to improve future predictions by integrating clinician feedback through the same user interface (expert-in-the-loop). Wang et al. [91] proposed an interpretable model, DeepSC-COVID, designed with three subnets: a cross-task feature subnet for feature extraction, a 3D lesion subnet for lesion segmentation, and a classification subnet for disease diagnosis. Different from the single-scale self-attention constrained mechanism [66], they implemented a multi-scale attention constraint to generate more fine-grained visualization maps for potential infections. Image morphology-based segmentation approaches are not as common within the context of COVID-19 image segmentation, but they do exist.
An example from [92] demonstrates the successful use of a maximum entropy threshold-based segmentation method along with fundamental image processing techniques, such as erosion and dilation, to isolate a final lung-only binary mask. These lung masks can also be used to generate bounding boxes to limit classification to regions surrounding, and including, lung tissue [93]. In addition to lesion segmentation, some approaches first segment lung tissue prior to classification or further segmentation of clinically relevant lesions. Jadhav et al. utilized this approach to allow radiologists to use a user interface to view the two- and three-dimensional CT regions used for the classification task with a saliency map overlay [94]. This combination of XAI approaches sought to increase radiologist trust in classification predictions by providing multiple visual insights into the automated workflow. Natekar et al. described a method of explaining segmentation algorithms known as network dissection [95]. Their focus was on explaining segmentations done on MR images of brain tumors with U-Net, but the techniques could be applicable to COVID-19 segmentation. They explain network dissection as follows: for a single filter in a single layer, collect the activation maps of all input images and determine the pixel-level distribution over the entire dataset. In CNNs, individual filters can focus on learning specific areas or features in an image; however, it is not clear from the outside which filter does what. Dissecting the network would make the purpose of each filter clearer and allow for a better understanding of the decisions made by the model. Application to COVID-19 algorithms such as TV-UNet could allow for visualization of the specific features that the model looks for to make a segmentation decision, thereby increasing user confidence in the model. Another COVID-19 segmentation approach is the joint classification and segmentation diagnosis system developed by Wu et al. [52]. In their framework, they include an explainable classification model and a segmentation model that work together to provide diagnostic predictions for COVID-19. Their segmentation is done via an encoder-decoder architecture based on VGG-16, plus the addition of an Enhanced Feature Module to the encoder, which the authors proposed to improve the extracted feature maps. They trained and tested their model on a private COVID-19 dataset and reported a Dice coefficient of 0.783. Typically, image segmentation tasks are used to help explain classification decisions, but the authors of this paper extend this idea by having the classification also help explain the segmentation. The segmentation algorithm references information from the classifier by merging their feature maps to improve its decisions, which also helps indicate the reasoning behind the decisions made when producing the segmentation. Utilizing classification information to help train and explain segmentation is an avenue which merits further exploration.

Fig. 11. Examples of classical and modern XAI approaches in the image segmentation task. Segmentation models have progressed from being highly interpretable (when simple color thresholds are used) to requiring numerous nonlinear transformations to generate the final segmentation. Although XAI approaches to image segmentation are not widely used, recent techniques have used the model activation maps generated by deep layers to identify significant associations with the final segmentation.
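To illustrate the morphology-based pipeline discussed above (intensity thresholding followed by erosion and dilation), a minimal sketch is given below; it uses Otsu's threshold from scikit-image as a simple stand-in for the maximum entropy threshold of [92], and a random array in place of a real CT slice.

```python
# Minimal sketch of a threshold-plus-morphology lung-mask pipeline in the spirit of [92], [93].
# Otsu's threshold is a stand-in for the maximum entropy threshold; the input is toy data.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_erosion, binary_dilation, disk

ct_slice = np.random.rand(128, 128)                 # placeholder for a normalized CT slice

mask = ct_slice < threshold_otsu(ct_slice)          # low-intensity (air/lung candidate) regions
mask = binary_erosion(mask, disk(2))                # remove small spurious foreground pixels
mask = binary_dilation(mask, disk(2))               # restore the eroded lung boundary

# The binary mask can then be used to crop a bounding box around candidate lung tissue.
ys, xs = np.nonzero(mask)
if ys.size:
    print("bounding box:", ys.min(), ys.max(), xs.min(), xs.max())
```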
Visualizing the network is advantageous for diagnosing model problems, interpreting what the model has learned, or simply illustrating deep learning concepts. For instance, we can visualize decision boundaries, model weights, activations, and gradients for a CNN model and extract feature maps from hidden convolutional layers. Heat-map or saliency-map visualization is the most frequently used method for interpreting convolutional neural network predictions and plays an important role in model interpretation and explanation representation. In computer vision, a heat-map or saliency map is an image that highlights important regions to focus user attention. The purpose of a saliency map is to show how important a pixel is to the human visual system. Visualization provides an interpretable technique to investigate hidden layers in deep models. For example, in a COVID-19 imaging classification task, saliency maps generated by gradient-based XAI methods are widely used to measure the spatial support of a particular class in each image [7], [51]-[56], [59]-[65], [69], [70].

Most existing papers on XAI for COVID-19 EHRs rely on decision tree-based approaches for mortality prediction and ICU readmission prediction [4], [96]-[105]. Before the COVID-19 pandemic, decision tree-based models [106]-[111] had already been widely used for interpreting clinical features in EHRs under different clinical settings. Gradient-boosted tree (XGBoost) models can generate feature importance scores and consequently identify top-ranking features and potential biomarkers. Yet these models in existing COVID-19 papers are often validated on single-center datasets, and their interpretation for clinical practice is limited. Yan et al. [4] presented an interpretable XGBoost classifier to predict patient mortality and identify critical predictive biomarkers from the medical records of 485 infected patients in the city of Wuhan. The machine learning model achieved over 90% accuracy in predicting patient mortality more than 10 days in advance. In addition, the authors modified the XGBoost classifier into an interpretable "single-tree XGBoost" to identify the top three biomarkers: lactic dehydrogenase, lymphocyte count, and high-sensitivity C-reactive protein (hs-CRP), which were consistent with medical domain knowledge. Although the authors reported strong performance in predicting mortality and identifying the top three clinical biomarkers, the model was only validated on a small sample size without external validation. In addition, as 88% of patients survived and 12% of patients died, it is inappropriate to use accuracy as the only evaluation metric on this imbalanced dataset. Subsequently, this paper has been challenged by a few follow-up articles from different perspectives. Three other papers [112]-[114] challenged the original paper [4] regarding its limited mortality prediction performance and its less applicable clinical interpretation on external datasets. Barish et al. [114] demonstrated the ineffectiveness of the original model as an admission triage tool on their internal dataset. Furthermore, the original decision-tree-based model was not applicable to an external dataset collected from 12 acute care hospitals in New York, USA. Therefore, external validation is a critical step in verifying a proposed model before adopting the identified biomarkers in clinical practice. Similarly, Alves et al. also used a post-hoc method of explaining a random forest (RF) classifier result using an instance-based interpretation [102].
This involves minimizing the distance between a decision tree classifier and the RF model for a specific sample. This approach was called decision tree-based explanation. Additional nearby samples are generated for the purposes of optimization by adding noise to the chosen sample features. This approach allows decision trees to be generated for specific patients, which helps clinicians understand the model decision. This approach is limited to explaining the decision for individual samples, and does not offer a global explanation of model decision-making. Shapley Additive Explanations (SHAP) and LIME were also used to generate additional global and local explanations, respectively. During the COVID-19 pandemic, an alternative explanation representation approach has been to calculate feature interpretation scores after predicting mortality and other critical health events [96], [98], [104], [115]-[128]. Before the pandemic, clinical scoring systems [129]-[132] had already been widely used for interpreting clinical features in EHRs. In clinical applications, SHAP [33] and permutation-based [133] methods are the most popular ways to calculate feature importance. SHAP feature importance [33] is widely used to explain the prediction of an instance by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values based on coalitional game theory. Pan et al. [98] identified important clinical features using SHAP and LIME scores on EHR data from 123 COVID-19 ICU patients. Afterwards, the authors built a reliable XGBoost classifier for mortality prediction and ranked the selected features. The combination of clinical scores and decision tree-based algorithms enables a more generalizable method to identify and interpret clinical features of top importance. In addition to prognostic assessment, SHAP feature importance was also adopted to identify both city-level and national-level contributing factors in curbing the spread of COVID-19. The findings may help researchers and policymakers to implement effective responses in mitigating the consequences of the pandemic. Cao et al. [134] applied an XGBoost model to predict new COVID-19 cases and growth rates using six categories of variables, including travel-related factors, medical factors, socioeconomic factors, environmental factors, and influenza-like illness factors. By calculating SHAP scores, the authors quantitatively evaluated the contribution of these factors to new cases and growth rates. The authors indicated that population movement from both Wuhan and non-Wuhan regions to the target city is a significant factor contributing to new cases. One major concern with SHAP scores is how to interpret negative SHAP values in this task: rather than indicating a negative impact, negative SHAP values might be interpreted as no impact or a compromised effect, so careful consideration is needed. Apart from SHAP, Foieni et al. [116] derived a score to identify the risk of in-hospital mortality and clinical evolution based on a linear regression of eight clinical and laboratory variables from 119 COVID-19 patients. These eight clinical and laboratory variables are significantly associated with the model's predicted outcomes. One limitation of this work is that the score is defined as a linear combination of these eight variables. Consequently, the score cannot be generalized to external datasets with different clinical and laboratory variables.
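As a minimal illustration of how SHAP scores are typically obtained for a tree-based clinical risk model, the sketch below trains a toy XGBoost classifier on synthetic data (assuming the xgboost and shap packages are available); the feature names and data are illustrative only and not taken from any of the cited studies.

```python
# Minimal SHAP feature-attribution sketch for a tree-based mortality model.
# The data are synthetic; columns loosely mimic clinical features (not real values).
import numpy as np
import xgboost
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # e.g., LDH, lymphocyte count, hs-CRP, age (toy)
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)              # per-patient, per-feature contributions
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```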
Meanwhile, DeepCOVID-Net [135] was proposed to forecast new COVID-19 cases using county-level features, such as census data, mobility data, and past infection data. Feature importance was estimated in two steps: 1) evaluate model accuracy on a small subset of the training data; 2) loop through all features, independently randomize the values of one feature at a time, and re-calculate the model accuracy on the same training set. Higher importance was assigned to features whose randomization produced a larger drop in accuracy during the evaluation process. As such, the authors identified the top three features: past rise in infected cases, cumulative cases in all counties, and incoming county-wise mobility. Although these three features are intuitively reasonable, interpreting interactions at the individual feature level might still be difficult, as the model accounts for higher-order interactions between these feature groups.

User interfaces are a highly valuable way for users without significant computational experience to visualize the results of XAI approaches. These interfaces are frequently used to demonstrate model performance and to instill confidence in AI solutions. The following works [90], [94], [101], [119], [120], [136] all successfully leveraged XAI and user interfaces to deliver value to their clinician end users. Such interfaces also enable the future collection of user feedback and clinical responses. Makridis et al. [119] used XGBoost to predict the probability of patient mortality from a Veterans Affairs hospital dataset. In addition to the feature importance derived directly from the XGBoost model, they also used SHAP to establish the direction of each feature's effect on the prediction. These insights were presented with a user interface for clinicians to examine individual patients and understand individual feature effects on the final model prediction. Brinati et al. [101] used 13 clinical features (e.g., diagnostic blood lab values and patient gender) to compare seven machine learning models, including random forest and conventional decision tree approaches, using nested cross-validation. Random forest was found to produce the best average accuracy across the external folds, and was also used to generate feature importance scores. The decision tree model, despite not performing as well, was also used to illustrate the contribution of features to the final classification label. The final workflow was made available to clinicians via a web-based user interface. Haimovich et al. [120] established their outcome variable of patient respiratory compensation based on clinical insights and designed a web-based platform to be utilized in real-world clinical settings. They generated feature importance by comparing values obtained from multiple interpretable machine learning approaches. The ranked list of feature importance was also supplemented by SHAP to elucidate the direction of contribution of each feature across patients for clinical use.

Qualitative visualization plays an important role in evaluating XAI methods. For biomedical applications, qualitative evaluation focuses on whether the visualization aligns with established knowledge. For instance, expert radiologists can assess how well the generated attention map identifies image regions of high diagnostic relevance [137]. A standard of evaluation is illustrated in Fig. 12. Although qualitative evaluation is important, quantitative evaluation of interpretation is still desirable.
Quantitative evaluation can be obtained through either user studies or automatic approaches. In a user study for feature scoring, target users (e.g., physicians for medical applications) perform certain tasks with, and without, the assistance of visual interpretation. For example, when using histopathology images for clinical diagnosis, clinicians are asked to diagnose cases from the original images versus images overlaid with visual interpretation; the improvement in performance attributable to the visual interpretation is then measured. User studies can be considered the most reliable approach for evaluating interpretability, provided they are designed to resemble real application scenarios. However, conducting such user studies is expensive and time-consuming, especially for biomedical applications. An alternative is automatic evaluation, which serves as a proxy for a user study without involving real users. Zeiler and Fergus first introduced the idea of the occlusion experiment [28], in which portions of the input image are systematically occluded with a grey square while the output of the deep learning model is monitored. Samek et al. [138] further formalized occlusion experiments by introducing a procedure called "pixel flipping", which destroys data points ordered by their feature importance scores and compares the decrease in classification metrics among multiple interpretation methods; a larger decrease in the metrics suggests a better interpretation method (a minimal sketch of this procedure is given at the end of this discussion). Because occlusion experiments are model agnostic, they can be used as an objective measure for interpretation methods. On the other hand, occlusion experiments cannot serve as an objective evaluation for perturbation-based feature scoring methods, such as Randomized Input Sampling for Explanation (RISE) [139], which perturb the input directly to identify important features. Quantitative evaluation of data synthesis is still in its infancy. DeVries et al. [140] designed an evaluation metric, named Fréchet Joint Distance (FJD), for the quality of images generated by conditional GANs based on visual quality, intra-conditioning diversity, and conditional consistency. Assuming the joint distribution of the hidden space and labels is Gaussian, they used FJD to compare the mean and variance between real and generated images. Recently, Yang et al. [141] created a ground-truth dataset consisting of mosaic natural images for interpretation methods and attempted to unify the evaluation of both feature scoring and data synthesis methods. These methods are still at an early developmental stage, even for natural images, and adapting them to biomedical images and other biomedical data modalities remains an ongoing challenge. In addition, Lin et al. [142] proposed an adversarial attack to evaluate the robustness of interpretability in XAI methods by checking whether they can detect backdoor triggers present in the input. The authors employed data poisoning to create trojaned models and then evaluated whether the generated saliency maps highlighted the trigger, using three quantitative evaluation metrics (IoU, recovery rate, and recovering difference).
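As a concrete illustration of the pixel-flipping idea discussed above, here is a minimal sketch under stated assumptions: `model` is a PyTorch image classifier, `image` is a (C, H, W) tensor, and `saliency` is an (H, W) importance map produced by any XAI method; none of these names come from a cited implementation.

```python
# Minimal pixel-flipping sketch (assumptions: PyTorch classifier `model`,
# image tensor of shape (C, H, W), saliency map of shape (H, W)).
import torch

def pixel_flipping_curve(model, image, saliency, target_class, steps=20, fill=0.0):
    model.eval()
    H, W = saliency.shape
    order = torch.argsort(saliency.flatten(), descending=True)  # most important first
    per_step = max(1, order.numel() // steps)
    flipped = image.clone()
    probs = []
    with torch.no_grad():
        for s in range(steps):
            idx = order[s * per_step:(s + 1) * per_step]
            rows = torch.div(idx, W, rounding_mode="floor")
            cols = idx % W
            flipped[:, rows, cols] = fill  # destroy the highest-ranked pixels
            p = torch.softmax(model(flipped.unsqueeze(0)), dim=1)[0, target_class]
            probs.append(p.item())
    # A faster-decaying probability curve indicates a more faithful saliency map.
    return probs
```

Comparing the resulting curves across interpretation methods gives an automatic, user-free proxy for explanation quality.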
Upon reviewing the existing works that leverage XAI to facilitate the interpretation of AI-based COVID-19 solutions to clinical challenges, we have identified key features of the papers that have made substantial impacts in the field. Table III summarizes the XAI techniques used in the COVID-19-related clinical applications covered in this review.

TABLE III: XAI techniques used in COVID-19-related clinical applications.
Intrinsic: latent space guidance [17]-[20], [26], [27]
Post-hoc: PCA-based [25]
Outcome prediction:
  Feature occlusion and ablation [28]-[31], [143], [144]
  SHAP feature importance [32]-[38], [96], [98], [102], [115]-[122], [134]
  Local interpretable model-agnostic explanations (LIME) [38]-[42]
  Activation-based: activation maximization [43]; class activation maps (CAM) [44], [45]
  Gradient-based: gradient-based class score [47]; Deconvnet [46]; layer-wise relevance propagation (LRP) [48]-[50]; gradient-based saliency analysis [47], [51]-[56]
  Mixed-based: Grad-CAM and Grad-CAM++ [7], [57]-[65]
  Attention-based: self-attention mechanism [66]-[70], [76], [145]; hierarchical attention mechanism [71], [72]; RNN-based temporal attention mechanism [73]-[79]; temporal convolution network [80]
Unsupervised clustering:
  Guided embedding: VAE-feature clustering [83]-[85]
  Feature extraction: self-organizing feature map (SOFM) clustering [86]; similarity calculation [81], [88]; latent space interpretation [87]
Image segmentation:
  Morphology-based: maximum entropy threshold [92]-[94]
  Context-based: multi-scale attention [91]; saliency analysis [52]; network dissection [95]

Furthermore, we distill these findings, together with references to example implementations, into a table of important considerations during AI-based experimental design (Table IV). Using the framework values of performance, user trust, and user response, we noticed the need for incorporating clinical insights throughout the study design process. This includes understanding the factors influencing response variables in the real world, as illustrated by Haimovich et al. [120], who noted that ICU admission was not an ideal outcome variable due to site-specific and time-dependent patient admission requirements. Clinical input may also be obtained during and after model optimization via real-time expert feedback [146] and during implementation via expert-facing user interfaces [136], [147]-[149]. In addition to web-based applications, visualizing sample clusters [61], [100] and feature importance metrics [31], [65], [71], [98], [117], [121], [122], [134], [135], [150] can offer users without expertise in data analysis a way to understand the decision-making process of otherwise obscure models. A very common approach to generating easily interpretable models is to optimize a decision tree that can describe the decision-making process using the available features [31], [96], [98], [100]-[102], [115]-[122]. This approach also resembles commonly used clinical guidelines, which generate fast and consistent metrics for patient triage and management [151], [152]. Validating a feature importance ranking with multiple methods, such as tree-based importance metrics and Shapley values, can establish feature lists that are consistent across approaches and prevent spurious rankings [101], [102], [119], [120], [122], [134]; a minimal sketch of such a consistency check follows below. This may be especially important if the list is to be used for feature selection or simplified feature visualizations (e.g., visualizing odds ratios only for the most important features). Comparison of multiple competing models is often necessary to generate high-performance solutions, and we noticed the widespread use of cross validation when authors sought to conduct these comparisons [98], [101]-[103], [120], [121].
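To make the consistency check referenced above concrete, the following is a minimal sketch (illustrative, not taken from any cited work) that compares two importance vectors, for example tree-based importances and mean absolute SHAP values, by rank correlation and top-k overlap.

```python
# Minimal sketch: agreement between two feature-importance rankings
# (e.g., tree-based importances vs. mean |SHAP| values). Illustrative only.
import numpy as np
from scipy.stats import spearmanr

def ranking_agreement(importance_a, importance_b, feature_names, top_k=10):
    rho, p_value = spearmanr(importance_a, importance_b)
    top_a = {feature_names[i] for i in np.argsort(importance_a)[::-1][:top_k]}
    top_b = {feature_names[i] for i in np.argsort(importance_b)[::-1][:top_k]}
    overlap = len(top_a & top_b) / top_k  # fraction of shared top-k features
    return {"spearman_rho": rho, "p_value": p_value, "top_k_overlap": overlap}
```

Low agreement between methods is a warning sign that the ranking should not yet be used for feature selection or simplified clinician-facing visualizations.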
Cross validation is easier to implement when models are quickly trained and tested, but this approach may also be used with more complex models to ensure robust comparisons. In the quest for easily interpretable results, it is common to see accuracy reported as a model performance metric. Although accuracy is understood by model developers and end users alike, it should be avoided when significant data imbalance is present. Examples of works using appropriate performance metrics include [98], [103], [117], [118], [121]. Common metrics include the Area Under the Receiver Operating Characteristic curve (AUROC) and the Matthews Correlation Coefficient (MCC), the latter being appropriate even in imbalanced binary classification tasks [153]; a minimal computation sketch is given after Table IV below. An often overlooked aspect of model development is the potential for adversarial attack within the final implementation context. Cyber attacks on hospital systems are depressingly common, with a notable rise in frequency over time [154]. As research tools make their way into the hospital, it may become important to understand the vulnerability of models to potential future attacks; we therefore included this component in our checklist, alongside a recent illustration of adversarial testing approaches [26]. With any clinical informatics work there will be challenges. Often these arise from issues with the dataset being used, especially if it was derived from real-world data. After our review of the literature, we summarized common challenges and potential solutions, including example works that successfully address each problem (Table IV). Early in the pandemic, there was a scarcity of reliable data available to the general scientific community. This resulted in a significant need for data imputation in order to fill in missing values and maximize the utility of existing data [53]. Poor data quality also affected model performance, and artifact correction techniques were implemented [7]. Imbalanced classes were frequently found within COVID-19 datasets due to the greater accessibility of normal samples relative to COVID-19-positive samples. Data augmentation was found to alleviate this problem in some cases by generating additional samples of the underrepresented class [59].

TABLE IV: Checklist of suggestions for AI-based experimental design, and common challenges with potential solutions.
Checklist of suggestions:
  Incorporate clinical insights [120]
  Interactive user interface [90], [94], [101], [119], [120], [136]
  Clinical feedback [90]
  Visualization of feature importance [65], [71], [98], [117], [121], [122], [134]
  Clustering analysis [61], [87], [100]
  Decision tree [31], [96], [98], [100]-[102], [115]-[122]
  Use multiple feature importance approaches [15], [38], [69], [101], [102], [122]
  Cross validation when comparing models [41], [98], [101]-[103], [120], [121]
  Use appropriate and robust performance metrics (AUROC, MCC) [103], [118]
  Adversarial example testing [26], [27]
Common challenges and potential solutions:
  Small sample size: data imputation [30], [53]
  Bad data quality: artifact correction [7]
  Imbalanced classes: data augmentation [56], [59]
  Complex disease phenotype: multi-modality data [51]
  Data heterogeneity: data normalization [53]
  Lack of expert annotation: weakly supervised learning [45], [65], [87]
  Unknown sources of signal: key feature extraction [34], [52], [63]
  Unclear explanations: pre-processing changes [61], [62]
  Data leakage: patient-level split [40], [54], [60], [91]
  Inefficient training: transfer learning [50], [56], [92]

Fig. 13. Summary of insights gained for designing an AI development workflow.
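As referenced above, the following is a minimal sketch of reporting imbalance-robust metrics with scikit-learn; `y_true` and `y_score` are assumed to be held-out test labels and predicted probabilities, and the 0.5 threshold is an illustrative choice.

```python
# Minimal sketch: imbalance-robust performance reporting with scikit-learn.
# `y_true` are test labels (NumPy array) and `y_score` predicted probabilities;
# the 0.5 decision threshold is an illustrative assumption.
from sklearn.metrics import matthews_corrcoef, roc_auc_score

def imbalance_aware_report(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    return {
        "auroc": roc_auc_score(y_true, y_score),   # threshold-free ranking quality
        "mcc": matthews_corrcoef(y_true, y_pred),  # robust to class imbalance
    }
```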
We provide a checklist of considerations to make early in the experimental design process in order to avoid common problems. Additionally, we provide a list of common issues encountered when working with clinical data and discuss several common solutions that may assist the reader when working with these data. Lack of expert annotation of key regions of pathology in imaging data created the need for weakly supervised learning models capable of generalizing from small ground-truth datasets [45], [65]. Without expert insight, it was often necessary to identify features capable of differentiating between similar phenotypes (e.g., bacterial versus viral pneumonia); this problem was frequently addressed via key feature extraction [7], [45], [52], [54], [55], [60], [63], [65]. In the case of complex disease phenotypes, multi-modality data were integrated to leverage data obtained from consistent or complementary sources [51]. Ensuring model generalizability requires robust external validation. Data leakage occurs when information from the testing/external dataset is used during the model design or optimization process. The likelihood of this occurring can be reduced by isolating the test dataset during hyperparameter selection and model training. Special care must be taken to avoid including data derived from the same patient in both the test and training datasets: there is significant within-patient correlation of features, even across samples. This form of data leakage may allow models to learn patient-specific patterns that are not generalizable to other patients, resulting in poor performance in the real world [7], [40], [54], [60] (a patient-level splitting sketch is given at the end of this discussion). XAI may lead to unclear results, either due to inconsistent feature importance rankings or nonspecific image highlighting; in these cases, it is often a good idea to re-establish the quality of the preprocessing pipeline [61], [62]. When training is inefficient, transfer learning may be used to take advantage of prior parameter optimization on similar problems [50], [64]. Ultimately, we designed this checklist to help both academic researchers in general and clinical data scientists specifically. We summarize the integration of XAI in both settings, along with its benefits, in Fig. 13.
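Relating to the data-leakage discussion above, the following is a minimal sketch of a patient-level split; `patient_ids`, which maps each sample to its patient, and the other variable names are illustrative assumptions.

```python
# Minimal sketch: patient-level splitting to avoid within-patient leakage.
# `patient_ids` maps each sample to its patient; names are illustrative.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(X, y, patient_ids, test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
    ids = np.asarray(patient_ids)
    # No patient appears in both sets, so patient-specific patterns learned
    # during training cannot inflate test performance.
    assert set(ids[train_idx]).isdisjoint(ids[test_idx])
    return train_idx, test_idx
```

GroupKFold can be used in the same way when cross validation rather than a single hold-out split is required.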
XAI techniques have developed quickly in recent years to meet the evolving needs of AI researchers and the end users of their models. Although it is easy to fall into the trap of believing that more recent models are objectively better than their classic counterparts, it is important to understand that each model was designed to improve our understanding of different facets of AI solutions. For example, in the task of data augmentation, it is common to see K-Nearest Neighbors interpolation and other classic approaches used instead of more complex modern solutions. This is in part because the classic approach has been around longer and its pitfalls are well established, whereas a modern approach may introduce data bias that is difficult to understand due to the lack of real-world examples of its successes and failures. The trend for data augmentation has been to increase the number of considered factors and the complexity of data transformations to better model the underlying data distribution of samples. Clinical decision support is a very common setting in which to find AI solutions in need of explanation. For the task of disease diagnosis, the trend has been to generate input importance visualizations that can be used across a wide range of common deep learning models. This is in contrast to early XAI approaches, which relied on model-specific solutions to improve interpretability. XAI in risk prediction for clinical decision support has trended towards generating sample-specific explanations. These can provide the end user with a custom answer to the question "why did this sample receive the score that it did?". This is especially useful in the clinical setting, where precision medicine is becoming the standard and patient-specific explanations for risk scores are vital. Additionally, we summarized in Table V a list of the most popular XAI tools, libraries, and packages that can be used directly in clinical applications. Unsupervised clustering has benefited from the use of deep learning model latent spaces for the generation of sample similarities. This shift away from classical input-feature distance approaches allows custom transformations to optimize the space within which similarity is measured, which can result in better disentanglement of samples [155] (a minimal sketch is given at the end of this discussion). Image segmentation approaches have increased in complexity in recent years due to models such as U-Net and its variants. Models have gone from highly interpretable (e.g., simple color thresholds) to involving many nonlinear transformations to produce the final segmentations. XAI approaches for image segmentation are still not commonly used, but recent techniques have leveraged the activation maps produced by deep layers to identify significant associations with the final deep learning model output [156]. XAI approaches will continue to adapt as models become better optimized for different tasks, and XAI will likely cover a much wider range of approaches to meet the needs of end users and regulatory agencies. In future work, with the decrease in COVID-19 incidence and the increase in vaccine supply, risk stratification will become vital to determine optimal treatment plans. We also hope that our focus on XAI within the ongoing COVID-19 pandemic may increase the relevance of our insights to future disease outbreaks. The framework we provided can be used across common AI tasks and may improve the clinical implementation of these solutions, especially in the early stages of infection.
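Returning to the latent-space clustering trend noted above, the following is a minimal sketch; `encoder` is an assumed, hypothetical trained autoencoder or VAE encoder that returns one embedding vector per image, not a component of any cited system.

```python
# Minimal sketch of latent-space clustering (illustrative; `encoder` is
# assumed to be a trained autoencoder/VAE encoder returning one embedding
# vector per image).
import torch
from sklearn.cluster import KMeans

def cluster_in_latent_space(encoder, images, n_clusters=3):
    encoder.eval()
    with torch.no_grad():
        z = encoder(images)  # (N, latent_dim) embeddings
    # Similarity is measured in the learned embedding space rather than on raw
    # pixel distances, which can better disentangle sample groups.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        z.cpu().numpy())
```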
The recent confluence of large-scale public healthcare datasets, the rapid increase in computing capacity, and the popularity of open-source code has resulted in a noteworthy increase in AI-based clinical decision support systems. This trend has increased the need to understand the criteria that make AI solutions successful in practice. In this work, we reviewed XAI approaches used to solve challenges that arose during the COVID-19 pandemic to identify trends and insights useful for meeting future challenges. First, we provided a general overview of common XAI techniques used during the pandemic and gave specific examples to demonstrate the value that these techniques provide. We then illustrated classical, recent, and potential future trends in XAI approaches to clarify the evolving goals of these tools. Evaluation approaches were also discussed to provide the reader with an understanding of the qualitative and quantitative methods used to check the performance of XAI results. After covering the different aspects of implementing XAI, we summarized the insights that we have gained for the design of an AI development workflow. We provided a checklist of suggestions to consider early during the experimental design process to avoid prevalent issues. We also provided a list of common problems seen when working with clinical data and discussed some common solutions that may aid the reader when working with these data. Finally, we discussed the potential challenges and future directions of XAI in COVID-19 and biomedical applications, with an ideal workflow meeting the requirements of performance, trustworthiness, and impact on user response. Clinical informatics is generally risk-averse, creating the need for AI developers in the field to understand how AI-based decisions are reached. This understanding would provide two benefits: (i) increasing confidence that a deep learning model is unbiased and relies on relevant features to accomplish the desired tasks, and (ii) detecting biases or discovering new knowledge if the generated explanations do not fit with established science. Ultimately, we hope that the implementation of XAI techniques will accelerate the translation of data-driven analytic solutions to improve the quality of patient care.

The authors would like to thank the Emory Science Librarian, Ms. Kristan Majors, for her support and guidance on search optimization for the PRISMA chart. We would like to thank Dr. Siva Bhavani from Emory University for his insights on leveraging artificial intelligence in clinical practice. We would like to thank Mr. Benoit Marteau from Bio-MIBLab for his help on reviewing the manuscript.

REFERENCES
[1] Characteristics of sars-cov-2 and covid-19
[2] The role of chest imaging in patient management during the covid-19 pandemic: a multinational consensus statement from the fleischner society
[3] Rapid triage for covid-19 using routine clinical data for patients attending hospital: development and prospective validation of an artificial intelligence screening test
[4] An interpretable mortality prediction model for covid-19 patients
[5] Hundreds of AI tools have been built to catch covid. None of them helped
[6] Clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid-19 pneumonia using computed tomography
[7] Using artificial intelligence to detect covid-19 and community-acquired pneumonia based on pulmonary ct: evaluation of the diagnostic accuracy
[8] Deepcovid: Predicting covid-19 from chest x-ray images using deep transfer learning
[9] Adversarial examples-security threats to covid-19 deep learning systems in medical iot devices
[10] One pixel attack for fooling deep neural networks
[11] Robust physical-world attacks on deep learning visual classification
[12] Towards robust interpretability with self-explaining neural networks
[13] Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension
[14] Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
[15] Differences between human and machine perception in medical diagnosis
[16] Preferred reporting items for systematic reviews and meta-analyses: the prisma statement
[17] COVIDScreen: explainable deep learning framework for differential diagnosis of COVID-19 using chest X-rays
[18] CovidGAN: Data augmentation using auxiliary classifier GAN for improved covid-19 detection
[19] A deep transfer learning model with classical data augmentation and CGAN to detect COVID-19 from chest CT radiography digital images
[20] Unsupervised discovery of interpretable directions in the gan latent space
[21] The mnist database of handwritten digit images for machine learning research
[22] Towards the high-quality anime characters generation with generative adversarial networks
[23] Deep learning face attributes in the wild
[24] Large scale GAN training for high fidelity natural image synthesis
[25] Ganspace: Discovering interpretable gan controls
[26] Adversarial examples-security threats to covid-19 deep learning systems in medical iot devices
[27] Explainable artificial intelligence for bias detection in covid ct-scan classifiers
[28] Visualizing and understanding convolutional networks
[29] Interpretable classification of alzheimer's disease pathologies with a convolutional neural network pipeline
[30] Features of icu admission in x-ray images of covid-19 patients
[31] Explainable machine learning for early assessment of covid-19 risk prediction in emergency departments
[32] Explaining prediction models and individual predictions with feature contributions
[33] A unified approach to interpreting model predictions
[34] Interpretation of deep learning using attributions: application to ophthalmic diagnosis
[35] Explainable machine-learning predictions for the prevention of hypoxaemia during surgery
[36] Machine learning-based prediction of covid-19 diagnosis based on symptoms
[37] Is ai model interpretable to combat with covid? an empirical study on severity prediction task
[38] Comparative analysis of explainable artificial intelligence for covid-19 diagnosis on cxr image
[39] "Why should I trust you?" Explaining the predictions of any classifier
[40] Study of different deep learning approach with explainable ai for screening patients with covid-19 symptoms: Using ct scan and chest x-ray image dataset
[41] Detection of covid-19 patients from ct scan and chest x-ray data using modified mobilenetv2 and lime
[42] Explainable AI for COVID-19 CT classifiers: An initial comparison study
[43] Understanding representations learned in deep architectures
[44] Learning deep features for discriminative localization
[45] Accurate screening of covid-19 using attention-based deep 3d multiple instance learning
[46] Striving for simplicity: The all convolutional net
[47] Deep inside convolutional networks: Visualising image classification models and saliency maps
[48] On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
[49] Deepcovidexplainer: Explainable covid-19 diagnosis from chest x-ray images
[50] A deep convolutional neural network for covid-19 detection using chest x-rays
[51] An artificial intelligence system for predicting the deterioration of covid-19 patients in the emergency department
[52] Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation
[53] Machine learning automatically detects covid-19 using chest cts in a large multicenter cohort
[54] M3 lung-sys: A deep learning system for multi-class lung pneumonia screening from ct imaging
[55] An interpretable deep learning model for covid-19 detection with chest x-ray images
[56] Screening of covid-19 suspected subjects using multi-crossover genetic algorithm based dense convolutional neural network
[57] Grad-cam: Visual explanations from deep networks via gradient-based localization
[58] Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks
[59] Psspnn: Patchshuffle stochastic pooling neural network for an explainable diagnosis of covid-19 with multiple-way data augmentation
[60] Deep learning enables accurate diagnosis of novel coronavirus (covid-19) with ct images
[61] Artificial intelligence applied to chest x-ray images for the automatic detection of covid-19. A thoughtful evaluation approach
[62] Explainable deep learning for pulmonary disease and coronavirus covid-19 detection from x-rays
[63] A deep learning and grad-cam based color visualization approach for fast detection of covid-19 cases using chest x-ray and ct-scan images
[64] Explainable covid-19 detection using chest ct scans and deep learning
[65] Deep learning covid-19 features on cxr using limited training data sets
[66] Non-local neural networks
[67] Lesion location attention guided network for multi-label thoracic disease classification in chest x-rays
[68] Pulmonary textures classification via a multi-scale attention network
[69] Dual attention multiple instance learning with unsupervised complementary loss for covid-19 screening
[70] Covid-19 automatic diagnosis with radiographic imaging: Explainable attention-transfer deep neural networks
[71] Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation
[72] Patient2vec: A personalized interpretable deep representation of the longitudinal electronic health record
[73] Retain: An interpretable predictive model for healthcare using reverse time attention mechanism
[74] Retainvis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records
[75] An attention based deep learning model of clinical events in the intensive care unit
[76] Deepsofa: a continuous acuity score for critically ill patients using clinically interpretable deep learning
[77] Domain knowledge guided deep learning with electronic health records
[78] Marrying medical domain knowledge with deep learning on electronic health records: A deep visual analytics approach
[79] Interpretable clinical prediction via attention-based neural network
[80] Explainable artificial intelligence model to predict acute critical illness from electronic health records
[81] Similar image search for histopathology: Smily
[82] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19
[83] Auto-encoder based data clustering
[84] Deep clustering with variational autoencoder
[85] Variational clustering: Leveraging variational autoencoders for image clustering
[86] Unsupervised clustering of covid-19 chest x-ray images with a self-organizing feature map
[87] Lung-gans: Unsupervised representation learning for lung disease classification using chest ct and x-ray images
[88] These do not look like those: An interpretable deep learning model for image recognition
[89] Covid tv-unet: Segmenting covid-19 chest ct images using connectivity imposed unet
[90] An explainable ai system for automated covid-19 assessment and lesion categorization from ct-scans
[91] Joint learning of 3d lesion segmentation and classification for explainable covid-19 diagnosis
[92] Csgbbnet: An explainable deep learning framework for covid-19 detection
[93] Explainable deep neural models for covid-19 prediction from chest x-rays with region of interest visualization
[94] Covid-view: Diagnosis of covid-19 using chest ct
[95] Demystifying brain tumor segmentation networks: interpretability and uncertainty analysis
[96] Machine learning to predict mortality and critical events in a cohort of patients with covid-19 in new york city: Model development and validation
[97] Physiological and socioeconomic characteristics predict covid-19 mortality and resource utilization in brazil
[98] Prognostic assessment of covid-19 in the intensive care unit by machine learning methods: Model development and validation
[99] An explainable system for diagnosis and prognosis of covid-19
[100] Machine learning for mortality analysis in patients with covid-19
[101] Detection of covid-19 infection from routine blood exams with machine learning: a feasibility study
[102] Explaining machine learning based diagnosis of covid-19 from routine blood tests with decision trees and criteria graphs
[103] Individualized prediction of covid-19 adverse outcomes with mlho
[104] A tree-based mortality prediction model of covid-19 from routine blood samples
[105] Prediction of icu admission for covid-19 patients: a machine learning approach based on complete blood count data
[106] Rupture risk assessment for cerebral aneurysm using interpretable machine learning on multidimensional data
[107] An interpretable boosting model to predict side effects of analgesics for osteoarthritis
[108] Cross-site transportability of an explainable artificial intelligence model for acute kidney injury prediction
[109] Predicting 30-day all-cause readmissions from hospital inpatient discharge data
[110] Early detection of septic shock onset using interpretable machine learners
[111] Interpretable deep models for icu outcome prediction
[112] Limited applicability of a covid-19 specific mortality prediction rule to the intensive care setting
[113] Replication of a mortality prediction model in dutch patients with covid-19
[114] External validation demonstrates limited clinical utility of the interpretable mortality prediction model for patients with covid-19
[115] The pandemyc score: an easily applicable and interpretable model for predicting mortality associated with covid-19
[116] Derivation and validation of the clinical prediction model for covid-19
[117] Prediction of sepsis in covid-19 using laboratory indicators
[118] An interpretable model-based prediction of severity and crucial factors in patients with covid-19
[119] Designing covid-19 mortality predictions to advance clinical outcomes: Evidence from the department of veterans affairs
[120] Development and validation of the quick covid-19 severity index: a prognostic tool for early clinical decompensation
[121] Prognostic modeling of covid-19 using artificial intelligence in the united kingdom: model development and validation
[122] Contrasting factors associated with covid-19-related icu admission and death outcomes in hospitalised patients by means of shapley values
[123] Natural history, trajectory, and management of mechanically ventilated covid-19 patients in the united kingdom
[124] Machine learning identifies icu outcome predictors in a multicenter covid-19 cohort
[125] eARDS: A multicenter validation of an interpretable machine learning algorithm of early onset acute respiratory distress syndrome (ards) among critically ill adults with covid-19
[126] Budget constrained machine learning for early prediction of adverse outcomes for covid-19 patients
[127] Development and validation of prediction models for mechanical ventilation, renal replacement therapy, and readmission in covid-19 patients
[128] Interpretable machine learning for covid-19: an empirical study on severity prediction task
[129] Drivers of prolonged hospitalization following spine surgery: A game-theory-based approach to explaining machine learning models
[130] Interpretation of compound activity predictions from complex machine learning models using local approximations and shapley values
[131] Development and interpretation of multiple machine learning models for predicting postoperative delayed remission of acromegaly patients during long-term follow-up
[132] Iseeu: Visually interpretable deep learning for mortality prediction inside the icu
[133] All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously
[134] Impact of systematic factors on the outbreak outcomes of the novel covid-19 disease in china: factor analysis study
[135] Deepcovidnet: An interpretable deep learning model for predictive surveillance of covid-19 using heterogeneous features and their interactions
[136] Natural history, trajectory, and management of mechanically ventilated covid-19 patients in the united kingdom
[137] Interpretable artificial intelligence framework for covid-19 screening on chest x-rays
[138] Evaluating the visualization of what a deep neural network has learned
[139] Rise: Randomized input sampling for explanation of black-box models
[140] On the evaluation of conditional gans
[141] Bim: Towards quantitative evaluation of interpretability methods with ground truth
[142] What do you see? evaluation of explainable artificial intelligence (xai) interpretability through neural backdoors
[143] Deep learning with convolutional neural networks for eeg decoding and visualization
[144] Applying deep learning for epilepsy seizure detection and brain mapping visualization
[145] Attention is all you need
[146] An explainable ai system for automated covid-19 assessment and lesion categorization from ct-scans
[147] Detection of covid-19 infection from routine blood exams with machine learning: a feasibility study
[148] Designing covid-19 mortality predictions to advance clinical outcomes: Evidence from the department of veterans affairs
[149] Explainable dcnn based chest x-ray image analysis and classification for covid-19 pneumonia detection
[150] Explainable cardiac pathology classification on cine mri with motion characterization by semisupervised learning of apparent flow
[151] The sequential organ failure assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation
[152] The curb65 pneumonia severity score outperforms generic sepsis and early warning scores in predicting mortality in community-acquired pneumonia
[153] The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation
[154] Cybersecurity of hospitals: discussing the challenges and working towards mitigating the risks
[155] Quantitative assessment of chest ct patterns in covid-19 and bacterial pneumonia patients: a deep learning perspective
[156] Understanding the role of individual units in a deep neural network