title: CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans
authors: Heidarian, Shahin; Afshar, Parnian; Oikonomou, Anastasia; Plataniotis, Konstantinos N.; Mohammadi, Arash
date: 2021-10-17

This work was partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada through the Create Grant RGPIN-2016-04988.

Abstract: Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which Lung Adenocarcinoma (LUAC) has recently been the most prevalent one. The current approach to determine the invasiveness of LUACs is surgical resection, which is not a viable solution to fight lung cancer in a timely fashion. An alternative approach is to analyze chest Computed Tomography (CT) scans. The radiologists' analysis based on CT images, however, is subjective and might result in low accuracy. In this paper, a transformer-based framework, referred to as the "CAE-Transformer", is developed to efficiently classify LUACs using whole CT images instead of finely annotated nodules. The proposed CAE-Transformer achieves high accuracy on a small dataset and requires only minor supervision from radiologists. The CAE-Transformer utilizes an encoder to automatically extract informative features from CT slices, which are then fed to a modified transformer to capture global inter-slice relations and provide classification labels. Experimental results on our in-house dataset of 114 pathologically proven Sub-Solid Nodules (SSNs) demonstrate the superiority of the CAE-Transformer over its counterparts, achieving an accuracy of 87.73%, sensitivity of 88.67%, specificity of 86.33%, and AUC of 0.913, using 10-fold cross-validation.

Clinical relevance: The proposed framework provides timely and accurate information about the invasiveness of lung cancer with minor supervision, which can lead to a proper treatment plan and reduce the risk of unnecessary or late surgeries.

Lung Cancer (LC) is the deadliest and least funded cancer worldwide [1], [2]. Non-small-cell LC is the major type of LC, and Lung Adenocarcinoma (LUAC) is the most prevalent histologic sub-type [3]. Lung nodules manifesting as Ground Glass (GG) or Subsolid Nodules (SSNs) on CT have a higher risk of malignancy than other incidentally detected small solid nodules. SSNs are often diagnosed as LUACs, which are categorized according to their histology into three categories: pre-invasive lesions, including atypical adenomatous hyperplasia (AAH) and adenocarcinoma in situ (AIS); minimally invasive adenocarcinoma (MIA); and invasive pulmonary adenocarcinoma (IPA) [4]. A timely and accurate attempt to differentiate LUACs is of utmost importance to guide a proper treatment plan, as in some cases a pre-invasive or minimally invasive SSN can be monitored with regular follow-up CTs, whereas invasive lesions should undergo immediate surgical resection if deemed eligible. Most often, SSN types are diagnosed from pathological findings obtained after surgical resection, which comes too late to inform treatment planning. Currently, radiologists use chest Computed Tomography (CT) scans to assess the invasiveness of SSNs based on their imaging findings and patterns. Such visual approaches, however, are time-consuming, subjective, and error-prone.
In this regard, many studies [5], [6] have used high-resolution and thin-slice (<1.5 mm) CT images (slices). In practice, however, lung nodules are mostly identified from CT scans performed for various clinical purposes, acquired using routine standard or low-dose scanning protocols with non-thin slice thicknesses (up to 5 mm) [7]. Moreover, recent lung cancer screening recommendations suggest using Low Dose CT scans (LDCT) with thicker slice thicknesses (up to 2.5 mm) [8]. Capitalizing on the above discussion, there is a pressing need, recognized by the research community, for an automated classification framework that performs well regardless of the technical acquisition settings.

Related Works: In general, existing publications on SSN classification and invasiveness assessment can be categorized into two main groups: (1) radiomics-based and (2) deep learning-based frameworks [9]. In the former, a set of histogram-based, morphological, and clinical features is extracted from the CT images and then analyzed using statistical or machine learning techniques, such as the study conducted in Reference [10]. As another example of such frameworks, in Reference [7], a set of radiomics features is extracted from manually annotated nodules and used, along with additional features obtained via Functional Principal Component Analysis (FPCA), to train a linear logistic regression model, achieving an accuracy of 81.0% on a dataset of 109 pathologically labeled SSNs from non-thin CT scans of primary LUACs. Deep learning-based frameworks, on the other hand, extract informative features in an automated fashion. Existing deep models working with volumetric CT scans can be classified into two main categories: (i) the first group includes 3D models (e.g., 3D CNNs), which are supplied with the whole volume of images (i.e., all 2D slices) or a stack of all nodule patches (cropped images including nodules) [11]; processing a large 3D volume at once, however, demands more complex models, more computational resources, and larger labeled datasets; (ii) the second group, on the other hand, analyzes individual 2D CT slices or Regions of Interest (ROIs) in the first step and utilizes an aggregation mechanism to represent characteristics of the whole volume [12], [13]. Due to the nature of 3D CT scans, which are essentially sequences of 2D images, sequential deep models can be adopted to analyze them. Conventional sequential models such as RNNs and LSTMs, however, are incapable of capturing global context and dependencies between instances in sequential data and require considerable computing resources. The transformer architecture [14], on the other hand, is a recently proposed alternative sequential model based on the self-attention mechanism, which is capable of capturing global relations between instances while requiring far fewer computational resources than conventional LSTM and RNN networks. Transformers are also superior to their counterparts in terms of parallelization and dynamic attention.

Challenges and Contributions: Existing transformer-based models applicable to image processing tasks, such as the Vision Transformer (ViT) [15] and the Convolutional Vision Transformer (CvT) [16], apply the self-attention function to small patches within single 2D images and model the relations between them. Analyzing a series of CT slices, however, requires a framework capable of capturing inter-slice relations.
In this study, we have developed an automated predictive framework based on the self-attention mechanism and the transformer encoder, referred to as the "CAE-Transformer". Unlike ViT and CvT, our proposed framework uses a Convolutional Auto-Encoder (CAE) [17] to extract informative features from CT slices in an unsupervised fashion and stacks them to form a sequential feature map (a minimal sketch of this slice-to-sequence step is given after the dataset description below). The CAE is first pre-trained on the public LIDC-IDRI dataset and then fine-tuned on our in-house dataset. The obtained sequential feature maps are then used to provide the final predictions. As previously mentioned, in the case of a volumetric CT scan, besides 2D patterns, capturing inter-slice relations in the axial direction is of utmost importance, and this is addressed in our proposed framework. To the best of our knowledge, this manuscript is the first to target the lung cancer invasiveness prediction task using a transformer-based model. It is also worth mentioning that most studies on lung cancer classification are based on the public LIDC-IDRI dataset [18], in which nodule patches are manually annotated and used to train the models; such annotation is a challenging and time-consuming task even for expert radiologists. In this study, however, we used a relatively small dataset without using nodule annotations. In fact, the need for exact tumor annotation is completely eliminated, and the model is supplied with the whole images containing evidence of the tumor, which are much easier to identify. Another challenge is that the transformer architecture requires large training datasets to perform well, which are hard to obtain in the medical domain. As such, particular modifications have been made to the transformer encoder's architecture as well as to the pre-training and fine-tuning steps, making the model suitable for a small dataset. More specifically, the class token is removed, and the commonly used Global Average Pooling (GAP) layer at the top of the network is replaced by a Global Max Pooling (GMP) layer. Furthermore, different training and pre-training approaches have been used in this study. In particular, label smoothing [19] is applied during training, and only the auto-encoder part of the framework is pre-trained, not the transformer itself. Besides, only a few middle layers of the encoder-decoder network are pre-trained instead of the entire network, and CT images are used for this purpose, in contrast to other models, which utilize natural images from the ImageNet dataset [20].

In this study, we have used the dataset initially introduced in Reference [7] and added five additional cases acquired from the same institution to further balance the dataset. This dataset contains volumetric CT scans of 114 pathologically proven SSNs, identified and reviewed by two experienced thoracic radiologists. All SSN labels were obtained after surgical resection. SSNs are initially classified into three categories: pre-invasive lesions, minimally invasive adenocarcinoma, and invasive pulmonary adenocarcinoma. Following the original study [7], we have grouped the first two categories to represent the benign class, with 58 cases, and kept the invasive nodules as the malignant class, with 56 cases. In addition to the nodule labels, the CT slices with evidence of a nodule are also specified by the radiologists. Fig. 1 shows two sample LUACs from this dataset.
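To make the slice-to-sequence step referenced above concrete, the following is a minimal sketch of how per-slice CAE features could be stacked and zero-padded into a fixed-length sequence. The 25-slice cap and the 256-dimensional feature size follow the paper; the `encoder` callable and the function name are hypothetical, and the remaining details are assumptions rather than the authors' exact implementation.

```python
import numpy as np

MAX_SLICES, FEAT_DIM = 25, 256  # per the paper: at most 25 nodule slices, 256-d CAE features

def volume_to_sequence(encoder, slices):
    """Encode each nodule-bearing CT slice and zero-pad the result to a fixed length.

    `encoder` is assumed to be the trained CAE encoder, mapping a pre-processed
    (256, 256, 1) slice to a 256-d feature vector; `slices` is the list of 2D
    slices annotated by the radiologists as showing evidence of the nodule.
    """
    feats = np.stack([encoder(s) for s in slices])             # (num_slices, 256)
    padded = np.zeros((MAX_SLICES, FEAT_DIM), dtype=feats.dtype)
    padded[: len(feats)] = feats                               # zero-pad to (25, 256)
    return padded
```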
As the pre-processing step, we have utilized a well-trained U-Net-based lung segmentation model, introduced in Reference [21], to extract the lung parenchyma from the CT scans. This approach has demonstrated a remarkable capability in enhancing the performance of models in previous studies [12], [22] by removing distracting components from the CT images. The extracted lung areas are then resized from (512, 512) to (256, 256) to reduce the complexity and memory allocation without significant loss of information.

In order to represent CT images with compressed and informative feature maps, to be used as the input to the subsequent modules, we initially pre-trained a CAE on the public LIDC-IDRI dataset, which contains 244,527 CT images with or without evidence of a nodule. The designed CAE model consists of an encoder and a decoder. The encoder is responsible for generating a compressed representation of the input image through a stack of 5 convolution and 5 max-pooling layers followed by a fully-connected layer of size 256, while the decoder attempts to reconstruct the original image from the feature representation generated by the encoder (a sketch is given at the end of this subsection). By minimizing the Mean Squared Error (MSE) between the original and the reconstructed image, the CAE learns to produce highly informative feature representations of the input images. Finally, the pre-trained model is fine-tuned on our in-house dataset.

The transformer model is the building block of the CAE-Transformer framework and uses the self-attention mechanism to capture global dependencies among instances in the input sequence with a high parallelization capability. The self-attention mechanism is based on a Scaled Dot-Product Attention function, mapping a query and a set of key-value pairs to an output, where the query ($Q$), keys ($K$), and values ($V$) are learnable representative vectors for the instances in the input sequence, with dimensions $d_k$, $d_k$, and $d_v$, respectively. The output of a self-attention module is computed as a weighted average of the values, where the weight assigned to each value is computed by a similarity function of the query and the corresponding key after applying a softmax function [14]. More specifically, the attention values for a set of queries are computed simultaneously as
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$
where $K^T$ is the transpose of the matrix $K$. It is also beneficial to linearly project the queries, keys, and values $h$ times with different learnable linear projections to vectors of $d_k$, $d_k$, and $d_v$ dimensions, respectively, before applying the attention function. On each of the projected versions of the queries, keys, and values, the attention function is performed in parallel, resulting in several $d_v$-dimensional output values. These values are then concatenated and once again linearly projected via a fully-connected layer. This process is called "Multi-Head Attention (MHA)", which helps the model jointly attend to information from different representation subspaces at different positions [14]. The output of the MHA module is calculated as
$$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \cdots, \text{head}_h)W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V),$$
where the projections are achieved by the parameter matrices $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$.

Fig. 2 illustrates the pipeline of the CAE-Transformer framework, along with the architecture of a transformer encoder, in which LN represents layer normalization and MLP stands for Multi-Layer Perceptron. The transformer model used in the CAE-Transformer framework is adopted from the transformer encoder proposed in References [14], [15] and modified for the task at hand.
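For concreteness, the encoder-decoder described at the start of this subsection could be realized along the following lines. This is a minimal Keras sketch assuming (256, 256, 1) inputs normalized to [0, 1]; the paper only specifies 5 convolution layers, 5 max-pooling layers, and a fully-connected bottleneck of size 256, so the filter counts, activations, and decoder layout here are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cae(input_shape=(256, 256, 1), code_size=256):
    # Encoder: 5 convolution + 5 max-pooling layers, then a 256-d dense bottleneck
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (16, 32, 64, 64, 128):          # filter counts are assumptions
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)              # 256 -> 8 after five poolings
    x = layers.Flatten()(x)
    code = layers.Dense(code_size, name="code")(x)  # per-slice feature representation

    # Decoder: mirror the encoder to reconstruct the input slice
    y = layers.Dense(8 * 8 * 128, activation="relu")(code)
    y = layers.Reshape((8, 8, 128))(y)
    for filters in (128, 64, 64, 32, 16):
        y = layers.UpSampling2D(2)(y)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(y)

    cae = tf.keras.Model(inp, out)
    # MSE reconstruction loss and the paper's pre-training learning rate of 1e-4
    cae.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return cae
```

After training, only the encoder half (input through the "code" layer) would be kept to produce the 256-d slice features consumed by the transformer.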
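The Scaled Dot-Product Attention equation above translates directly into code. Below is a small NumPy sketch of the single-head case for intuition; the shapes echo the paper's setting of 25 instances with key/query dimension 128 and value dimension 256, while the random inputs and the function name are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for all queries at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted average of the values

# Toy example: 25 slice-level instances attending to each other
rng = np.random.default_rng(0)
Q = rng.standard_normal((25, 128))
K = rng.standard_normal((25, 128))
V = rng.standard_normal((25, 256))
out = scaled_dot_product_attention(Q, K, V)           # shape (25, 256)
```

MHA simply runs this function in parallel over h independently projected copies of Q, K, and V and concatenates the h outputs before the final linear projection W^O.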
The CAE-Transformer is constructed by stacking 3 transformer encoder blocks on top of each other, with a projection dimension of 256, key and query dimensions of 128, and 5 heads in each MHA module. Finally, the features obtained by the stack of transformer encoders from all input instances (slices) are passed to a GMP layer, followed by two Fully-Connected (FC) layers with 32 and 2 neurons, respectively, to provide the final predictions. The last fully-connected layer uses a softmax activation function to produce probability scores. Dropout layers are also incorporated to prevent over-fitting. In addition, following the literature [14], [15], a Positional Embedding (PE) layer is incorporated into the model to add information about the position of instances in the input sequence to the CAE-generated features. It is worth noting that, as the number of slices with evidence of a nodule varies across subjects (from 2 to 25 slices per nodule), we have taken the maximum number of slices in our dataset (i.e., 25 slices) and zero-padded the input sequences accordingly, so that all of them have the same dimension of (25, 256). A minimal sketch of this architecture is given below, after the experimental setup.

We evaluated the performance of the proposed CAE-Transformer framework using 10-fold cross-validation. The CAE model was pre-trained using a batch size of 128, a learning rate of 1e-4, and 200 epochs; the checkpoint that performed best on a randomly sampled 20% of the dataset was retained. The model was then fine-tuned on the in-house dataset using a lower learning rate of 1e-6 for 50 epochs. To fine-tune the final CAE, only the middle fully-connected layer and its adjacent (previous and next) convolution layers were trained, while the other layers were kept frozen. The CAE-generated features were then used to train the transformer encoder. The transformer was trained using a learning rate of 1e-4, a batch size of 64, and 200 epochs. We also employed label smoothing with α = 0.05 [19].

The results obtained by the CAE-Transformer are presented in Table I. We have compared the performance of the CAE-Transformer with the results obtained by the original model proposed in Reference [7]. We have further compared the CAE-Transformer with non-transformer-based alternatives, referred to as GMP-FC and GAP-FC, which aggregate the CAE-generated feature maps using GMP and GAP, respectively, followed by a stack of fully-connected and batch-normalization layers. The best experimental results for such models were obtained with 4 fully-connected layers of 128, 128, 32, and 2 neurons, respectively. We have also compared the performance of the CAE-Transformer with its deep learning-based counterparts. First, a CAE-LSTM was developed by replacing the transformer blocks with a stack of LSTM layers, while keeping the hyper-parameters and complexity the same. Then, we trained a custom 3D-CNN model containing 4 convolution, 4 max-pooling, 1 batch-normalization, and 1 dropout layers, followed by 2 FC layers. We also modified the last layers of ResNet18 [23], ResNet34 [23], SE-ResNet18 [24], and SE-ResNet34 [24] to be compatible with the classification task at hand and trained them on the same dataset. Such 3D CNN-based models are the building blocks of many frameworks developed in the field of medical image processing [9], [22]. It is worth noting that larger and deeper networks did not perform well on the small dataset used in this study.
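To illustrate how the pieces described above fit together, here is a minimal Keras sketch of the classification model: three stacked transformer encoder blocks with 5-head MHA (key/query dimension 128, projection dimension 256), a learnable positional embedding, a GMP layer in place of a class token, and the FC-32/FC-2 softmax head trained with label smoothing of 0.05. The pre-norm block layout follows the encoder of [15]; the dropout rates, MLP width, and activation choices are assumptions, not the authors' exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder_block(x, num_heads=5, key_dim=128, proj_dim=256, mlp_dim=256, rate=0.1):
    # Pre-norm multi-head self-attention with a residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim, dropout=rate)(h, h)
    x = layers.Add()([x, h])
    # Pre-norm MLP with a residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dropout(rate)(h)
    h = layers.Dense(proj_dim)(h)
    return layers.Add()([x, h])

SEQ_LEN, FEAT_DIM = 25, 256  # zero-padded sequences of CAE slice features

inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
# Learnable positional embedding added to the CAE-generated features
positions = tf.range(start=0, limit=SEQ_LEN, delta=1)
x = inputs + layers.Embedding(input_dim=SEQ_LEN, output_dim=FEAT_DIM)(positions)
for _ in range(3):                       # three stacked transformer encoder blocks
    x = transformer_encoder_block(x)
x = layers.GlobalMaxPooling1D()(x)       # GMP head instead of a class token or GAP
x = layers.Dense(32, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05),
    metrics=["accuracy"],
)
```

Note that Keras's MultiHeadAttention allows the per-head key dimension (128) to differ from the model's projection dimension (256), which matches the hyper-parameters stated above.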
As the last experiment, we investigated the effect of different pooling layers at the top of the network. More specifically, we replaced the GMP layer with a GAP layer and with a concatenation function in separate experiments. It is worth mentioning that other models proposed in the literature are developed based on annotated tumor patches from different datasets, which makes re-training those models on our dataset infeasible; as such, a comparison with those studies is not included. The experimental results provided in Table I demonstrate that deep learning-based models outperform the original radiomics and machine learning-based model, while the proposed CAE-Transformer achieves the best performance among the developed frameworks.

In conclusion, we have proposed an automated transformer-based framework, referred to as the "CAE-Transformer", to enhance the performance of existing models aiming to predict the invasiveness of lung adenocarcinoma subsolid nodules from 3D CT scans regardless of technical acquisition settings. We would like to note that achieving a clinically applicable deep learning-based solution requires further experiments and research in this field, and we believe that the proposed framework is one step towards such clinically applicable frameworks. As shown in the comparison results, the proposed CAE-Transformer performs far better than the original models developed in Reference [7], a study on the same dataset based on machine learning approaches and radiomics features extracted from manually annotated tumors. The experimental results of this study further encourage researchers in the medical signal/image processing community to adopt more deep learning-based models, in particular transformer-based models, to target similar analytic and predictive tasks. In future work, we will collaborate with our partners in medical centers to increase the size and diversity of the dataset and target the three-way SSN classification task. Furthermore, we would like to investigate the effects of incorporating radiomics and morphological features into the CAE-Transformer.
References:
[1] Comparison of Cancer Burden and Nonprofit Organization Funding Reveals Disparities in Funding Across Cancer Types
[2] GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries
[3] The Biology and Management of Non-Small Cell Lung Cancer
[4] Subsolid Lung Adenocarcinomas: Radiological, Clinical and Pathological Features and Outcomes
[5] A Subsolid Nodules Imaging Reporting System (SSN-IRS) for Classifying 3 Subtypes of Pulmonary Adenocarcinoma
[6] Role of PET/CT in Management of Early Lung Adenocarcinoma
[7] Histogram-Based Models on Non-Thin Section Chest CT Predict Invasiveness of Primary Lung Adenocarcinoma Subsolid Nodules
[8] ACR-STR Practice Parameter for the Performance and Reporting of Lung Cancer Screening Thoracic Computed Tomography (CT)
[9] On the Performance of Lung Nodule Detection, Segmentation and Classification
[10] Machine Learning Approach for Distinguishing Malignant and Benign Lung Nodules Utilizing Standardized Perinodular Parenchymal Features from CT
[11] Pulmonary Nodule Classification in Lung Cancer Screening with Three-Dimensional Convolutional Neural Networks
[12] COVID-FACT: A Fully-Automated Capsule Network-Based Framework for Identification of COVID-19 Cases from Chest CT Scans
[13] CT-CAPS: Feature Extraction-Based Automated Framework for COVID-19 Disease Identification from Chest CT Scans Using Capsule Networks
[14] Attention Is All You Need
[15] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[16] CvT: Introducing Convolutions to Vision Transformers
[17] Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction
[18] The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans
[19] Rethinking the Inception Architecture for Computer Vision
[20] ImageNet: A Large-Scale Hierarchical Image Database
[21] Automatic Lung Segmentation in Routine Imaging is Primarily a Data Diversity Problem, Not a Methodology Problem
[22] Diagnosis/Prognosis of COVID-19 Chest Images via Machine Learning and Hypersignal Processing: Challenges, Opportunities, and Applications
[23] Deep Residual Learning for Image Recognition
[24] Squeeze-and-Excitation Networks