authors: Nguyen, Hoang T.N.; Nie, Dong; Badamdorj, Taivanbat; Liu, Yujie; Zhu, Yingying; Truong, Jason; Cheng, Li
title: Automated Generation of Accurate & Fluent Medical X-ray Reports
date: 2021-08-27

Our paper focuses on automating the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists. Unlike existing medical report generation efforts that tend to produce human-readable reports, we aim to generate medical reports that are both fluent and clinically accurate. This is achieved by our fully differentiable and end-to-end paradigm containing three complementary modules: taking the chest X-ray images and the clinical history document of a patient as inputs, our classification module produces an internal checklist of disease-related topics, referred to as enriched disease embedding; the embedding representation is then passed to our transformer-based generator, giving rise to the medical report; meanwhile, our generator also produces a weighted embedding representation, which is fed to our interpreter to ensure consistency with respect to disease-related topics. Our approach achieves promising results on commonly-used metrics concerning language fluency and clinical accuracy. Moreover, noticeable performance gains are consistently observed when additional input information is available, such as the clinical document and extra scans of different views.

Medical reports are the primary medium through which physicians communicate findings and diagnoses from the medical scans of patients. The process is usually laborious: typing out a medical report takes on average five to ten minutes (Jing et al., 2018); it can also be error-prone. This has led to a surging need for the automated generation of medical reports, to assist radiologists and physicians in making rapid and meaningful diagnoses. The potential efficiency gains and benefits could be enormous, especially during critical situations such as COVID or a similar pandemic. (Code is available at https://github.com/ginobilinie/xray_report_generation.) Clearly, a successful medical report generation process is expected to possess two key properties: 1) clinical accuracy, to properly and correctly describe the diseases and related symptoms; 2) language fluency, to produce realistic and human-readable text. Fueled by recent progress in the closely related computer vision problem of image-based captioning (Vinyals et al., 2015; Tran et al., 2020), there have been a number of research efforts in medical report generation in recent years (Jing et al., 2018, 2019; Li et al., 2018; Xue et al., 2018; Yuan et al., 2019; Wang et al., 2018; Lovelace and Mortazavi, 2020; Srinivasan et al., 2020). These methods often perform reasonably well in addressing the language fluency aspect; on the other hand, as is also evidenced in our empirical evaluation, their results are notably less satisfactory in terms of clinical accuracy. We attribute this to two reasons. One is closely tied to the textual characteristics of medical reports, which typically consist of many long sentences describing various disease-related symptoms and topics in precise and domain-specific terms.
This clearly sets the medical report generation task apart from a typical image-to-text problem such as image-based captioning. Another reason is the lack of full use of the rich contextual information that encodes prior knowledge. This information includes, for example, the clinical document of the patient describing key clinical history and the indication from doctors, as well as multiple scans from different views, information that typically exists in abundance in practical scenarios, as in the standard X-ray benchmarks of Open-I (Demner-Fushman et al., 2016) and MIMIC-CXR (Johnson et al., 2019).

The aforementioned observations motivate us to propose a categorize-generate-interpret framework that places specific emphasis on clinical accuracy while maintaining adequate language fluency of the generated reports: a classifier module reads chest X-ray images (either single-view or multi-view) and related documents to detect diseases and output enriched disease embeddings; a transformer-based medical report generator produces the report; and a differentiable interpreter evaluates and fine-tunes the generated reports for factual correctness.

Figure 1: Our approach consists of three modules: a classifier that reads chest X-ray images and clinical history to produce an internal checklist of disease-related topics; a transformer-based generator to generate fluent text; an interpreter to examine and fine-tune the generated text to be consistent with the disease-related topics.

The main contributions are two-fold:
• We propose a differentiable end-to-end approach consisting of three modules (classifier-generator-interpreter), where the classifier module learns the disease feature representation via context modeling (section 3.1.3) and a disease-state aware mechanism (section 3.1.4); the generator module transforms the disease embedding into a medical report; and the interpreter module reads and fine-tunes the generated reports, enhancing the consistency between the generated reports and the classifier's outputs.
• Empirically, our approach is demonstrated to be competitive against many strong baselines over two widely-used benchmarks on an equal footing (i.e., without access to additional information). We also find that the clinical history of patients (prior knowledge) plays a vital role in improving the quality of the generated reports.

2 Related Work

2.1 Image-based Captioning and Medical Report Generation

Apart from some familiar topics such as disease detection (Oh et al., 2020; Luo et al., 2020; Lu et al., 2020b; Rajpurkar et al., 2017; Lu et al., 2020a; Ranjan et al., 2018) and lung segmentation (Eslami et al., 2020), the most related computer vision task is the emerging topic of image-based captioning, which aims at generating realistic sentences or topic-related paragraphs to summarize visual contents from images or videos (Vinyals et al., 2015; Xu et al., 2015; Goyal et al., 2017; Rennie et al., 2017; Huang et al., 2019; Feng et al., 2019; Pei et al., 2019; Tran et al., 2020). Not surprisingly, the recent progress in medical report generation (Jing et al., 2018, 2019; Li et al., 2018; Xue et al., 2018; Yuan et al., 2019; Wang et al., 2018; Lovelace and Mortazavi, 2020; Srinivasan et al., 2020; Zhang et al., 2020; Huang et al., 2021; Gasimova et al., 2020; Singh et al., 2019; Nishino et al., 2020) has been particularly influenced by the successes in image-based captioning.
The approach of (Vinyals et al., 2015; Xu et al., 2015) is among the early ones adopted for medical report generation, where visual features are extracted by convolutional neural networks (CNNs) and subsequently fed into recurrent neural networks (RNNs) to generate textual descriptions. To remedy the issue of inaccurate textual descriptions, a secondary task is explicitly adopted by (Jing et al., 2018; Srinivasan et al., 2020) to select the top-k most likely diseases to guide report generation. The methods of (Jing et al., 2019; Li et al., 2018), on the other hand, consider a reinforcement learning process to promote generating reports with correct contents. It has been noted (Jing et al., 2018, 2019; Li et al., 2018) that traditional RNNs are not well suited to generating long sentences and paragraphs (Vaswani et al., 2017; Krause et al., 2017), which renders them insufficient for the medical report generation task (Jing et al., 2018). This issue is alleviated either by conceiving hierarchical RNN architectures (Krause et al., 2017; Jing et al., 2018, 2019; Li et al., 2018; Xue et al., 2018; Yuan et al., 2019; Wang et al., 2018), or by resorting to alternative techniques, in particular the recently developed transformer architectures (Vaswani et al., 2017; Srinivasan et al., 2020; Lovelace and Mortazavi, 2020). It is worth noting that most existing methods concentrate on the image-to-fluent-text aspect of the medical report generation problem; on the other hand, they are considerably less adept at uncovering the intended disease- and symptom-related topics in the generated texts, the true gems upon which physicians would base their decisions. To alleviate this issue, a graph-based approach has been considered: it starts by compiling a list of common abnormalities, then transforms them into correlated disease graphs, and categorizes medical reports into templates for paraphrasing. Its practical performance is, however, less stellar, which may be credited to the fact that it is fundamentally based on detecting abnormalities from medical images and thus may overlook other important information.

The transformer technique (Vaswani et al., 2017) was first introduced in the context of machine translation with the purpose of expediting training and improving long-range dependency modeling. This is achieved by processing sequential data in parallel with an attention mechanism, consisting of a multi-head self-attention module and a feed-forward layer. By building on multi-head self-attention mechanisms, including, e.g., graph attention networks (Veličković et al., 2017), recent transformer-based models have shown considerable advancement on many difficult tasks, such as image generation, story generation (Radford et al., 2018), question answering, and language inference (Devlin et al., 2018).

The CheXpert labeler (Irvin et al., 2019) is a rule-based system that extracts and classifies medical reports into 14 common diseases. Each disease label is either positive, negative, uncertain, or unmentioned. This is a crucial component in building large-scale chest X-ray datasets, such as (Irvin et al., 2019; Johnson et al., 2019), where an alternative manual labeling process might take years of effort. It can also be used to evaluate the clinical accuracy of a generated medical report. Another important use of the CheXpert labeler is to facilitate the generation of medical reports.
Since the rule-based CheXpert labeler is not differentiable, it has been used as a score-function estimator in reinforcement learning models to fine-tune the generated texts. However, reinforcement learning methods are often computationally expensive and practically difficult to converge. As an alternative, Lovelace and Mortazavi (2020) propose an attention LSTM model and fine-tune the generated report via a differentiable Gumbel random sampling trick, with promising results.

Our framework consists of a classification module, a generation module, and an interpretation module, as illustrated in Fig. 1. The classification module reads multiple chest X-ray images and extracts a global visual feature representation via a multi-view image encoder. This representation is then disentangled into multiple low-dimensional visual embeddings. Meanwhile, the text encoder reads clinical documents, including, e.g., the doctor's indication, and summarizes the content into a text-summarized embedding. The visual and text-summarized embeddings are entangled via an "add & layerNorm" operation to form contextualized embeddings in terms of disease-related topics. The generation module takes our enriched disease embedding as initial input and generates text word-by-word, as shown in Fig. 2. Finally, the generated text is fed to the interpretation module for fine-tuning, to align it with the checklist of disease-related topics from the classification module. In what follows, we elaborate on these three modules in detail.

Figure 2: An example of our approach in action. The enriched disease embedding produced by the classification module is fed into the generation module as initial input. Then, at each time step, the hidden state h_i is obtained and predicts the next output word. Finally, the interpretation module takes as input all predicted outputs Ŵ to predict a checklist of disease-related topics, which is gauged against the same topics output by the classification module for consistency verification.

For each medical study, which consists of m chest X-ray images, a set of latent features {x_i}_{i=1}^m with x_i ∈ R^c is extracted via a shared DenseNet-121 image encoder (Huang et al., 2017), where c is the number of features. The multi-view latent features x ∈ R^c are then obtained by max-pooling across the set of m latent features {x_i}_{i=1}^m, as proposed in (Su et al., 2015). When m = 1, the multi-view encoder boils down to a single-image encoder.

Let T be a text document of length l consisting of word embeddings {w_1, w_2, ..., w_l}, where w_i ∈ R^e embodies the i-th word in the text and e is the embedding dimension. We use the transformer encoder (Vaswani et al., 2017) as our text feature extractor to retrieve a set of hidden states H = {h_1, h_2, ..., h_l}, where h_i ∈ R^e is the feature of the i-th word attended to the other words in the text,

{h_1, h_2, ..., h_l} = Encoder(w_1, w_2, ..., w_l). (1)

The entire document T is then summarized with respect to Q = {q_1, q_2, ..., q_n}, representing n disease-related topics (e.g., pneumonia or atelectasis) to be queried from the document. We refer to the result of this retrieval process as the text-summarized embedding D_txt ∈ R^{n×e},

D_txt = Softmax(Q H^T) H. (2)

Here the matrix Q ∈ R^{n×e} is formed by stacking the set of vectors {q_1, q_2, ..., q_n}, where each q_i ∈ R^e is randomly initialized and then learned via the attention process. Similarly, the matrix H ∈ R^{l×e} is formed by {h_1, h_2, ..., h_l} from Eq. (1). The term Softmax(Q H^T) is the word attention heat-map for the n queried diseases in the document. The intuition is that, for each disease (e.g., pneumonia) queried from the text document T, we only pay attention to the most relevant words (e.g., cough or shortness of breath) in the text associated with that disease, via a dot-product vector similarity. This way, the weighted sum of these word features in Eq. (2) gives the feature that summarizes the document with respect to the queried disease.
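To make the two encoders concrete, below is a minimal PyTorch-style sketch of the multi-view max-pooling image encoder and the disease-query attention of Eqs. (1)-(2). The class names, layer sizes, and the use of torchvision's DenseNet-121 with average pooling are our own illustrative assumptions rather than the authors' released implementation; positional encodings and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision


class MultiViewImageEncoder(nn.Module):
    """Shared DenseNet-121 backbone; max-pool the per-view features across the m views."""

    def __init__(self):
        super().__init__()
        densenet = torchvision.models.densenet121(weights=None)
        self.backbone = densenet.features           # convolutional features, c = 1024 channels
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, images):                      # images: (batch, m, 3, H, W)
        b, m = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1)) # (batch*m, c, h', w')
        feats = self.pool(feats).flatten(1)         # per-view latent features x_i, (batch*m, c)
        feats = feats.view(b, m, -1)
        return feats.max(dim=1).values              # multi-view feature x in R^c


class DiseaseQueryTextEncoder(nn.Module):
    """Transformer text encoder plus attention of n disease queries over word states, Eqs. (1)-(2)."""

    def __init__(self, vocab_size, n_queries=14, e=256, n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, e)
        layer = nn.TransformerEncoderLayer(d_model=e, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.queries = nn.Parameter(torch.randn(n_queries, e))  # Q, randomly initialized, learned

    def forward(self, tokens):                      # tokens: (batch, l) word indices
        h = self.encoder(self.embed(tokens))        # Eq. (1): hidden states H, (batch, l, e)
        attn = torch.softmax(self.queries @ h.transpose(1, 2), dim=-1)  # Softmax(Q H^T), (batch, n, l)
        return attn @ h                             # Eq. (2): D_txt, (batch, n, e)
```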
The latent visual features x ∈ R^c are subsequently decoupled into low-dimensional disease representations, as illustrated in Fig. 1. They are regarded as the visual embedding D_img ∈ R^{n×e}, where each row is a vector φ_j(x) ∈ R^e, j = 1, ..., n, defined as follows:

φ_j(x) = x^T A_j + b_j. (3)

Here A_j ∈ R^{c×e} and b_j ∈ R^e are learnable parameters of the j-th disease representation, n is the number of disease representations, and e is the embedding dimension. Now, together with the available clinical documents, the visual embedding D_img and the text-summarized embedding D_txt are entangled to form the contextualized disease representations D_fused ∈ R^{n×e} as

D_fused = LayerNorm(D_img + D_txt). (4)

Intuitively, the entanglement of visual and textual information allows our model to mimic the hospital workflow, screening the disease's visual representations conditioned on the patient's clinical history or the doctor's indication. For example, the doctor's indication in Fig. 1 shows cough and shortness-of-breath symptoms. It is reasonable for a medical doctor to request a follow-up check for pneumonia. The radiologists receiving the doctor's indication may then prioritize diagnosing the presence of pneumonia and related diseases based on the X-ray scans and look for specific abnormalities. As empirically shown in Table 3, the proposed contextualized disease representations bring a significant performance boost in the medical report generation task. Meanwhile, this embedding is basically a plain mingling of heterogeneous sources of information such as disease type (i.e., disease name) and disease state (e.g., positive or negative). As shown by the ablation study in Table 3, this embedding by itself is insufficient for generating accurate medical reports. This leads us to conceive the follow-up enriched representation below.

The main idea behind the enriched disease embedding is to further encode informative attributes about disease states, such as positive, negative, uncertain, or unmentioned. Formally, let k be the number of states and S ∈ R^{k×e} the state embedding. The confidence of classifying each disease into one of the k disease states is then

p = Softmax(D_fused S^T) ∈ [0, 1]^{n×k}. (5)

S ∈ R^{k×e} is randomly initialized and then learned via the classification of D_fused. D_fused acts as the features for the multi-label classification, and the classification loss is computed as

L_cls = − Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log p_ij, (6)

where y_ij ∈ {0, 1} and p_ij ∈ [0, 1] are the ground-truth and predicted values of the j-th state for the i-th disease, respectively. The state-aware embedding D_states ∈ R^{n×e} is then computed as

D_states = y S (training, teacher forcing), D_states = p S (testing), (7)

where y ∈ {0, 1}^{n×k} is the one-hot ground-truth label matrix of the disease-related topics, and p ∈ [0, 1]^{n×k} is the matrix of predicted values. During training, the ground-truth disease states facilitate our generator in describing the diseases and related symptoms based on accurate information (teacher forcing). At test time, our generator then furnishes its recount based on the predicted states.
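A hedged PyTorch-style sketch of Eqs. (3)-(7) follows: per-disease linear maps produce D_img, an "add & layerNorm" step fuses it with D_txt, a learned state embedding S scores the k disease states, and teacher forcing selects ground-truth states during training. All names and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DiseaseEmbeddingModule(nn.Module):
    """Eqs. (3)-(7): decouple visual features into n per-disease embeddings,
    fuse them with the text-summarized embedding, and predict disease states."""

    def __init__(self, c=1024, e=256, n=14, k=4):
        super().__init__()
        # One (A_j, b_j) pair per disease representation, Eq. (3).
        self.proj = nn.ModuleList([nn.Linear(c, e) for _ in range(n)])
        self.norm = nn.LayerNorm(e)
        self.state_embed = nn.Parameter(torch.randn(k, e))   # S, learned

    def forward(self, x, d_txt, y_states=None):
        # Eq. (3): visual embedding D_img, shape (batch, n, e)
        d_img = torch.stack([proj(x) for proj in self.proj], dim=1)
        # Eq. (4): "add & layerNorm" entanglement with D_txt
        d_fused = self.norm(d_img + d_txt)
        # Eq. (5): confidence over the k disease states
        p = torch.softmax(d_fused @ self.state_embed.t(), dim=-1)      # (batch, n, k)
        # Eq. (7): ground-truth states during training (teacher forcing), predictions at test time
        weights = y_states if y_states is not None else p
        d_states = weights @ self.state_embed                          # (batch, n, e)
        return d_fused, d_states, p
```

The classification loss of Eq. (6) would then be the cross-entropy between p and the one-hot state labels, e.g., loss_cls = -(y_states * p.clamp_min(1e-8).log()).sum(-1).mean().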
Finally, the enriched disease embedding D_enriched ∈ R^{n×e} is the composition of the state-aware disease embedding D_states (i.e., good or bad), the disease names D_topics (i.e., which disease/topic), and the disease representations D_fused (i.e., severity and details of the diseases),

D_enriched = D_fused + D_topics + D_states. (8)

Like the disease queries Q, D_topics ∈ R^{n×e} is randomly initialized, representing the diseases or topics to be generated; it is then learned during training through the medical report generation pipeline. The enriched disease embedding provides explicit and precise disease descriptions, and endows our follow-up generation module with a powerful data representation.

Our report generator is derived from the transformer encoder of (Vaswani et al., 2017). The network is formed by stacking a masked multi-head self-attention component and a feed-forward layer on top of each other N times, as illustrated in Fig. 2. The hidden state h_i ∈ R^e for each word position in the medical report is then computed based on the previous words and the disease embeddings D_enriched = {d_i}_{i=1}^n,

h_i = Encoder(w_i | w_1, w_2, ..., w_{i−1}, d_1, d_2, ..., d_n). (9)

This is followed by predicting future words based on the hidden states H = {h_i}_{i=1}^l ∈ R^{l×e}, as

p_word = Softmax(H W^T). (10)

Here W ∈ R^{v×e} is the entire vocabulary embedding, v the vocabulary size, and l the document length. Let p_word,ij denote the confidence of selecting the j-th word in the vocabulary W for the i-th position in the generated medical report. The generator loss is defined as the cross-entropy between the ground-truth words y_word and the predicted words p_word,

L_gen = − Σ_{i=1}^{l} Σ_{j=1}^{v} y_word,ij log p_word,ij. (11)

Finally, the weighted word embeddings Ŵ ∈ R^{l×e}, also known as the generated report, are

Ŵ = p_word W. (12)

It is worth noting that this setup facilitates the backpropagation of errors from the follow-up interpretation module.

It is observed from empirical evaluations that the generated reports are often distorted in the process, such that they become inconsistent with the original output of the classification module, namely the enriched disease embedding that encodes the disease- and symptom-related topics. Inspired by the CycleGAN idea, we consider a fully differentiable network module to estimate the checklist of disease-related topics based on the generator's output, and to compare it with the original output of the classification module. This provides a meaningful feedback loop to regulate the generated reports, which is used to fine-tune them through the word representation outputs Ŵ. Specifically, we build on top of the proposed text encoder (described in section 3.1.2) a classification network that classifies disease-related topics, as follows. First, the text encoder summarizes the current medical report Ŵ and outputs the report-summarized embedding of the queried diseases Q,

D̂_txt = Softmax(Q Ĥ^T) Ĥ. (13)

Here Ĥ is computed from the generated medical report Ŵ using Eq. (1). Second, each report-summarized embedding d̂_i ∈ R^e (i.e., each row of the matrix D̂_txt ∈ R^{n×e}) is classified into one of the k disease-related states (i.e., positive or negative), as

p_int = Softmax(D̂_txt S^T). (14)

Finally, the interpreter is trained to minimize the following multi-label classification loss,

L_int = − Σ_{i=1}^{n} Σ_{j=1}^{k} y_ij log p_int,ij, (15)

where y_ij ∈ {0, 1} is the ground-truth disease label and p_int,ij ∈ [0, 1] is the predicted disease label of the interpreter. When fine-tuning the generated medical reports Ŵ, all interpreter parameters are frozen; the interpreter then acts as a guide that forces the word representations Ŵ to stay close to what it has learned from the ground-truth medical reports. If the weighted word embedding Ŵ deviates from the learned representation, leading to an incorrect classification, a large loss value is imposed by the interpretation module, which in turn pushes the generator toward producing a correct word representation.
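To illustrate how the generator and the interpreter could fit together, here is a hedged sketch of the masked transformer generator, the soft (weighted) word embeddings Ŵ of Eq. (12), and a frozen interpreter that re-classifies the soft report (Eqs. (13)-(15)). The classes, masking scheme, and loss normalization are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReportGenerator(nn.Module):
    """Masked transformer over [d_1..d_n, w_1..w_{l-1}] (teacher-forced previous words);
    predicts the next words and returns the differentiable weighted word embeddings W_hat,
    Eqs. (9)-(12)."""

    def __init__(self, vocab_size, e=256, n_layers=4, n_heads=8):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, e)
        layer = nn.TransformerEncoderLayer(d_model=e, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, d_enriched, words):            # d_enriched: (b, n, e), words: (b, l)
        n = d_enriched.size(1)
        seq = torch.cat([d_enriched, self.word_embed(words)], dim=1)
        total = seq.size(1)
        mask = torch.triu(torch.full((total, total), float("-inf"), device=seq.device), 1)
        mask[:, :n] = 0.0                             # every position sees all disease embeddings
        h = self.encoder(seq, mask=mask)[:, n:]       # Eq. (9): hidden states for word positions
        logits = h @ self.word_embed.weight.t()       # Eq. (10) before the softmax
        p_word = logits.softmax(dim=-1)
        w_hat = p_word @ self.word_embed.weight       # Eq. (12): weighted word embeddings
        return logits, w_hat


class Interpreter(nn.Module):
    """Re-reads the soft report W_hat and predicts the disease-state checklist, Eqs. (13)-(14)."""

    def __init__(self, n=14, k=4, e=256, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=e, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.queries = nn.Parameter(torch.randn(n, e))
        self.state_embed = nn.Parameter(torch.randn(k, e))

    def forward(self, w_hat):                         # w_hat: (b, l, e)
        h = self.encoder(w_hat)
        attn = torch.softmax(self.queries @ h.transpose(1, 2), dim=-1)
        d_txt_hat = attn @ h                          # Eq. (13)
        return torch.softmax(d_txt_hat @ self.state_embed.t(), dim=-1)  # Eq. (14)


if __name__ == "__main__":
    gen, interp = ReportGenerator(vocab_size=5000), Interpreter()
    d_enriched = torch.randn(2, 14, 256)
    words = torch.randint(0, 5000, (2, 30))
    y_states = F.one_hot(torch.randint(0, 4, (2, 14)), num_classes=4).float()
    logits, w_hat = gen(d_enriched, words)
    for p in interp.parameters():                     # interpreter frozen while fine-tuning the generator
        p.requires_grad_(False)
    p_int = interp(w_hat)
    loss_int = -(y_states * p_int.clamp_min(1e-8).log()).sum(-1).mean()  # Eq. (15)
```

Because Ŵ is a differentiable mixture of vocabulary embeddings, the interpreter loss can back-propagate into the generator even though the interpreter itself is frozen.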
Collectively, our model is trained in an end-to-end manner by jointly minimizing the total loss

L_total = L_cls + L_gen + L_int.

This section evaluates the medical report generation task on two fronts: language performance and clinical accuracy. Empirical evaluations are carried out on two widely-used chest X-ray datasets; following (Lovelace and Mortazavi, 2020), we focus on generating the text in the "findings" section as the corresponding medical report. The Open-I dataset (Demner-Fushman et al., 2016), collected by the Indiana University hospital network, contains 3,955 radiology studies that correspond to 7,470 frontal and lateral chest X-rays. Some radiology studies are associated with more than one chest X-ray image. Each study typically consists of impression, findings, comparison, and indication sections. Similar to the MIMIC-CXR dataset, we utilize both the multi-view chest X-ray images (frontal and lateral) and the indication section as our contextual inputs. For generating medical reports, we follow the existing literature (Jing et al., 2018; Srinivasan et al., 2020) by concatenating the impression and the findings sections as the target output. The implementation details, dataset splits, preprocessing steps, generated examples, and qualitative analysis are described in the supplementary materials.

A comprehensive quantitative comparison of our approach and many baselines on the two benchmarks is shown in Table 1, using the widely-used language evaluation metrics BLEU-1 to BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). Since all comparison methods have their own experimental setups, for a fair comparison we further categorize these methods along four aspects: single-view (SV), multi-view (MV), access to additional information (AI) such as the clinical document, and fine-tuning (FT) of the generated medical reports. The experiments in Table 1 show that our models outperform the baselines on most language metrics. With a single X-ray image as the sole input, Ours (SV) outperforms by a noticeable margin the best SOTA methods, CoAtt on Open-I and Transformer on MIMIC, respectively. This we mainly attribute to the utilization of the enriched disease embedding that explicitly incorporates the disease-related topics. With multiple X-ray images as input, Ours (MV) again outperforms the best comparison method, HRG-Transformer, on Open-I. With multiple X-ray images and additional clinical document information as input, Ours (MV+T) outperforms the comparison method KERP on Open-I. Finally, with the complete contextual information available as input, Ours (MV+T+I) outperforms all the comparison methods available on both the Open-I and MIMIC datasets.

To evaluate the clinical accuracy of the generated reports, we use the LSTM CheXpert labeler (Lovelace and Mortazavi, 2020) as a universal measurement. We compare different methods based on accuracy, F-1, precision (prec.), and recall (rec.) on 14 common diseases. Since there are 14 independent diseases, we also report the macro and micro scores. Intuitively, a high macro score means the detection of all 14 diseases is improved, whereas a high micro score implies that the dominant diseases are improved (i.e., some diseases appear more frequently than others).
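As a small illustration of the macro versus micro distinction used in the clinical-accuracy evaluation (the toy data and scikit-learn calls below are our own, not part of the paper's pipeline):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy binary presence labels for the 14 diseases over 100 studies.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 14))
y_pred = rng.integers(0, 2, size=(100, 14))

# Macro: F1 is computed per disease and then averaged, so every disease,
# rare or common, contributes equally.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Micro: all (study, disease) decisions are pooled before computing F1,
# so the frequently occurring (dominant) diseases drive the score.
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```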
As observed in Table 2, our clinical performance increases significantly compared to the baselines in both macro and micro scores. Among our ablation models, the precision and accuracy scores of our contextualized variant (MV+T) tend to be higher, whereas its other scores are lower than those of the variant with the interpreter (MV+T+I). This opposite behavior is due to the interpreter, which encourages detecting diseases and thus increases false positives (FP). Note that in the medical context it is usually critically important to lower the false negative (FN) rate, so a high recall score with a slight decrease in precision is preferred.

We observe that the latent features D_fused extracted from the classifier are insufficient to generate robust medical reports, as shown in Table 3. In human language, a meaningful account needs three factors: the topic (i.e., which disease), the tone (i.e., whether it is negative or positive), and the details (i.e., the severity). However, there is no guarantee that the learned latent features D_fused carry all three required elements. On the other hand, with the explicit representations (i.e., D_fused, D_topics, and D_states), all three factors are preserved. Therefore, the enriched disease embedding D_enriched can generate precise and complete medical reports, leading to the substantial improvement in the language metrics. Table 3 also shows that our proposed "contextualized" version improves the language scores over the "regular" version, which reads only images. Notably, the contextualized version is the entanglement of the chest X-ray images and the clinical history, which is crucial for improving the generated report's quality and accommodating doctors' practical needs. It mimics how radiologists receive requests from medical doctors and write reports to answer their questions. Hence, the generated reports are believed to be more "on point" and receive higher language scores than under the regular "image-to-text" setting.

This paper introduces a novel three-module approach for generating medical reports from X-ray scans. Empirical findings demonstrate the superior performance of our approach over state-of-the-art methods on widely-used benchmarks under a range of evaluation metrics. Moreover, our approach is flexible and can work with additional input information, where consistent performance gains are observed. For future work, we plan to apply our approach to related medical report generation tasks that go beyond X-rays.

References

Meteor: An automatic metric for MT evaluation with improved correlation with human judgments
Baselines for chest X-ray report generation
Generative pretraining from pixels
Preparing a collection of radiology examinations for distribution and retrieval
BERT: Pre-training of deep bidirectional transformers for language understanding
Long-term recurrent convolutional networks for visual recognition and description
Image-to-images translation for multitask organ segmentation and bone suppression in chest X-ray radiography
Unsupervised image captioning
Spatial semantic-preserving latent space learning for accelerated DWI diagnostic report generation
Making the V in VQA matter: Elevating the role of image understanding in visual question answering
Densely connected convolutional networks
DeepOpht: Medical report generation for retinal images via deep models and visual explanation
Attention on attention for image captioning
CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
Show, describe and conclude: On exploiting the structure information of chest X-ray reports
On the automatic generation of medical imaging reports
MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports
A hierarchical approach for generating descriptive image paragraphs
Hybrid retrieval-generation reinforced agent for medical image report generation
Knowledge-driven encode, retrieve, paraphrase for medical image report generation
ROUGE: A package for automatic evaluation of summaries
Clinically accurate chest X-ray report generation
Learning to generate clinically coherent chest X-ray reports
Knowing when to look: Adaptive attention via a visual sentinel for image captioning
MuxConv: Information multiplexing in convolutional neural networks
Multi-objective evolutionary design of deep convolutional neural networks for image classification
Deep mining external imperfect data for chest X-ray disease screening
Reinforcement learning with imbalanced dataset for data-to-text medical report generation
Deep learning COVID-19 features on CXR using limited training data sets
BLEU: A method for automatic evaluation of machine translation
Memory-attended recurrent network for video captioning
Improving language understanding by generative pre-training
CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning
Jointly learning convolutional representations to compress radiological images and classify thoracic diseases in the compressed domain
Self-critical sequence training for image captioning
From chest X-rays to radiology reports: A multimodal machine learning approach
Hierarchical X-ray report generation via pathology tags and multi-head attention
Multi-view convolutional neural networks for 3D shape recognition
Transform and tell: Entity-aware news image captioning
Attention is all you need
Graph attention networks
Show and tell: A neural image caption generator
TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays
Reinforced transformer for medical image captioning
Show, attend and tell: Neural image caption generation with visual attention
Multimodal recurrent model with attention for automated radiology report generation
Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network
Image captioning with semantic attention
Automatic radiology report generation based on multi-view image fusion and medical concept enrichment
When radiology report generation meets knowledge graph
Unpaired image-to-image translation using cycle-consistent adversarial networks