key: cord-1034663-lln78gqh authors: Zhao, Chen; Shuai, Renjun; Ma, Li; Liu, Wenjia; Wu, Menglin title: Improving cervical cancer classification with imbalanced datasets combining taming transformers with T2T-ViT date: 2022-03-19 journal: Multimed Tools Appl DOI: 10.1007/s11042-022-12670-0 sha: d041e133378072e1bb56e7e3f7d503af3d198b57 doc_id: 1034663 cord_uid: lln78gqh

Cervical cell classification has important clinical significance in cervical cancer screening at early stages. However, few public cervical cancer smear cell datasets are available, the class weights of their samples are unbalanced, the image quality is uneven, and CNN-based classification studies tend to overfit. To solve these problems, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) to provide high-quality cervical cancer datasets with sufficient samples and balanced weights. We improve the encoder structure by introducing SE-block and MultiRes-block to strengthen the extraction of information from cervical cancer cell images, and we introduce Layer Normalization to standardize the data, which facilitates the subsequent nonlinear processing by the ReLU activation function in the feed-forward module. We also introduce SMOTE-Tomek Links to balance the number of samples and the weights in the source dataset and in the images we generate. We then use Tokens-to-Token Vision Transformers (T2T-ViT) combined with transfer learning to classify the cervical cancer smear cell images and improve classification performance. Classification experiments with the proposed model are performed on three public cervical cancer datasets; the classification accuracies on the liquid-based cytology Pap smear dataset (4-class), SIPaKMeD (5-class), and Herlev (7-class) are 98.79%, 99.58%, and 99.88%, respectively. The quality of the images we generate on these three datasets is very close to that of the source datasets; the final averaged inception score (IS), Fréchet inception distance (FID), Recall, and Precision are 3.75, 0.71, 0.32, and 0.65, respectively. Our method improves the accuracy of cervical cancer smear cell classification, provides more cervical cell sample images for cervical cancer research, and assists gynecologists in judging and diagnosing different types of cervical cancer cells and in analyzing cervical cancer cells at different stages, which are difficult to distinguish. This paper applies the transformer to the generation and recognition of cervical cancer cell images for the first time. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11042-022-12670-0.

Cervical cancer is the fourth most common cause of death from cancer in females [48]. An estimated 604,127 cases and 341,831 deaths occurred worldwide in 2020, and it is the second most common cancer in women worldwide [28, 52]. At early stages of cervical cancer, the cure rate is nearly 100% [42], so the prevention, early detection and classification of cervical cancer are essential [64]. At present, cervical cancer screening methods mainly include human papillomavirus detection, cervical smears and acetic acid testing under colposcopy [72]. After the introduction of the Papanicolaou (Pap) smear [40], cervical cytology became the standard screening test for cervical cancer and premalignant lesions. As the most common screening test, cervical cytology has been used extensively and effectively reduces incidence and mortality.
At present, manual screening of abnormal cells from a cervical cytology slide is still common practice. However, it is usually tedious, inefficient and expensive. Consequently, automatic screening methods have attracted increasing attention [3, 5]. Additionally, research on cervical cell analysis shows that each independent cervical cell has intrinsic similarity. For example, superficial and intermediate cells generally have relatively small nuclei and clear cytoplasmic and nuclear margins, while dyskeratotic and metaplastic cells have overlapping cytoplasmic and nuclear margins. In addition, koilocytotic cells present a perinuclear cavity, while other cells have a relatively thick cytoplasm [18, 34]. These observations indicate that there exists a potential relationship between cervical cell images. Therefore, accurate cervical cell classification is crucial to automatic screening. The analysis of Pap smear images requires low error tolerance and skilled pathologists, and the screening process is expensive and time-consuming. An automated classification process can therefore assist gynecologists in diagnosis and provide more objective test explanations.

Recently, deep learning has brought considerable improvements in accuracy in many applications [46]. Due to its high accuracy in many fields, deep learning has become the most advanced machine learning technology. Deep learning and CNNs have been successfully used in breast cancer detection [7], skin cancer recognition [73], and COVID-19 recognition and analysis [26]. Many of these studies are based on convolutional neural networks (CNNs), which have become the standard for 3D medical image classification and segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling long-range dependencies due to their inductive bias of locality and weight sharing [69]. Transformer-based architectures have been introduced to address these limitations. Beginning at the end of 2020, transformer-based research has increased steadily, and some transformer-based work has surpassed CNN-based approaches in image classification, image detection, and image segmentation [33]. Dosovitskiy et al. [11] proposed the vision transformer (ViT), the first transformer applied to image classification. Instead of operating on individual pixels, ViT attends to small patches of the image; the authors show that dependence on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform image classification tasks well. Carion et al. [4] proposed a new method (DEtection TRansformer, or DETR) that views object detection as a direct set prediction problem. Their approach streamlines the detection pipeline, effectively removing the need for many hand-designed components, such as non-maximum suppression or anchor generation, that explicitly encode prior knowledge about the task. Xie et al. [69] proposed a novel framework that efficiently bridges a convolutional neural network and a transformer (CoTr) for accurate 3D medical image segmentation. Compared with a CNN, the self-attention in a transformer yields a more interpretable model: the attention distribution can be inspected, and each attention head can learn to perform a different task. Moreover, the number of operations required to relate two positions does not increase with their distance.
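To make this contrast concrete, the following minimal PyTorch sketch (illustrative only, not code from the paper; the patch size, embedding dimension and head count are assumptions) splits an image into patch tokens, as in ViT-style classifiers, and applies one self-attention layer, which relates every pair of patches in a single step regardless of their spatial distance.

```python
# Minimal ViT-style sketch: patch tokenization + one self-attention layer.
import torch
import torch.nn as nn

patch_size, embed_dim, num_heads = 16, 192, 3   # illustrative assumptions

# Patch embedding: a strided convolution turns a 224x224 RGB image into
# (224/16)^2 = 196 patch tokens of dimension embed_dim.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 3, 224, 224)                      # dummy cell image
tokens = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, embed_dim)

# Self-attention relates every token to every other token in one step,
# so the cost of associating two patches does not grow with their distance.
out, weights = attention(tokens, tokens, tokens)
print(out.shape, weights.shape)                      # (1, 196, 192), (1, 196, 196)
```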
Research on the classification of cervical cancer cells has mostly been carried out on two public datasets. Herlev [27] consists of 917 Pap smear cell images classified carefully by cytotechnicians and doctors; each cell is described by 20 numerical features, and the cells fall into 7 classes. SIPaKMeD [43] consists of 4049 annotated cell images classified by expert cytopathologists into five classes. In addition to the public datasets, there is research on nonpublic datasets [68]. Currently, the number of samples in the two public datasets is limited, so classification studies, whose reported accuracies are above 90%, tend to overfit; moreover, the cell types in the datasets are inconsistent, the class weights are not balanced, and the clarity and quality of the samples are uneven. As for nonpublic datasets, cervical cancer researchers cannot obtain the data and can only conduct research on the limited public datasets, which limits research progress. To solve these problems, this paper proposes, for the first time, a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on taming transformers and Tokens-to-Token Vision Transformers (T2T-ViT). The proposed method expands the dataset sample size, balances the number and weight of each type of cervical cell, and generates high-quality sample images. A new dataset (the liquid-based cytology Pap smear dataset [24]) is introduced to provide more objective material for the cervical cancer cell generation model, and T2T-ViT is used to further improve classification accuracy, provide more objective information for gynecologists, improve the efficiency of clinical-pathological diagnosis, detect the patient's condition in time, and improve the survival rate of cervical cancer patients. Figure 1 is an overview of the framework of the model in this paper. The main contributions and novelty of this model are as follows:

- We propose a cervical cancer cell sample image generation model based on taming transformers (CCG-taming transformers). To the best of our knowledge, this is the first model that combines CNNs and transformers and is applied to cervical cancer research.
- In CCG-taming transformers, we adjust the encoder structure of VQGAN in the taming transformers: 1) we introduce a new convolutional structure, the MultiRes-block, so that the encoder better extracts and analyzes the key information in cervical cancer cell images; 2) we introduce SE-block to alleviate the vanishing gradient problem in the neural network; 3) we introduce Layer Normalization to enhance the discrimination ability of the feature representations. This model mainly addresses the problems that there are few public datasets in current cervical cancer research, that the samples of each class differ greatly, and that the weights and quantities of the various samples in the datasets are not balanced. The cervical cancer cell images generated by this model can provide a useful reference for cervical cancer research.
- We introduce SMOTE-Tomek Links to balance the number of samples and the weights in the source dataset and in the images we generate (a usage sketch is given after this list).
- We introduce T2T-ViT combined with transfer learning to classify the cervical cancer cell image dataset, which addresses the problem that CNN-based classification models may lose details of the feature map.
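As a rough illustration of the balancing step, the sketch below applies SMOTE-Tomek Links with the imbalanced-learn library to a toy two-class feature matrix. The feature vectors and class sizes are assumptions made for illustration; they are not the cervical-cell data or feature extraction used in the paper.

```python
# Toy SMOTE-Tomek Links balancing with imbalanced-learn.
import numpy as np
from imblearn.combine import SMOTETomek

# X: per-sample feature vectors, y: integer class labels with unbalanced counts
# (400 vs. 100 here, standing in for unbalanced cervical-cell classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = np.concatenate([np.zeros(400, dtype=int), np.ones(100, dtype=int)])

# SMOTE oversamples the minority class, then Tomek Links removes borderline
# majority/minority pairs to clean up the class boundary.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```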
Artificial intelligence and deep learning play an important role in cell classification and in medical image classification, generation and analysis [17, 59, 64]. As these technologies develop, they become cost-effective and less time-consuming, and they are now more popular than traditional methods (such as Pap smears, colposcopy, and cervicography) [8]. These methods do not depend on the examiner's experience. Although they cannot replace gynecologists for pathological evaluation, they can provide considerable assistance for clinical diagnosis, improve the diagnostic efficiency of gynecologists, and reduce the subjective component of diagnosis.

Several works in the literature classify and detect cervical cancer. Convolutional neural networks (CNNs) [35] have been proposed to automatically learn multilevel features through a hierarchical deep architecture. Wieslander et al. [63] used ResNet for binary classification of benign and malignant cervical cells. Plissiti et al. [43] proposed the annotated cervical cell image dataset SIPaKMeD and applied a CNN to classify five types of cervical cells. Gautam et al. [14] proposed a patch-based approach using a CNN combined with transfer learning for the segmentation of nuclei in single-cell images. Ghoneim et al. proposed a cell detection and classification system based on convolutional neural networks (CNNs) and achieved 91.2% accuracy on the 7-class classification problem. William et al. [65] proposed contrast local adaptive histogram equalization for image enhancement; cell segmentation was achieved with a trainable Weka segmentation classifier, and a sequential elimination approach was used for debris rejection. Wang et al. [58] proposed a multiscale representation for scene classification, realized by a global-local two-stream architecture. They also [60] explored the attention mechanism and proposed a novel end-to-end attention recurrent convolutional network (ARCNet) for scene classification; their research has made outstanding contributions to the classification field. Although CNN-based frameworks have achieved good accuracy in the classification of cervical cancer smear cell images, training them requires substantial computing resources, and the network must reach a certain depth to capture deep-level information in the sample images. In a CNN-based model, the number of operations required to relate two positions through convolution increases with their distance, whereas with self-attention in a transformer it is independent of the distance.

Generative adversarial networks (GANs) [1] have been successfully used to synthesize human faces, landscapes and even medical images; they are mainly used to expand the size of datasets and to balance the weight and number of samples in each category. Pollastri et al. [44] presented a novel strategy that employs a DC-GAN to augment data in the skin lesion segmentation task; the proposed framework generates both skin lesion images and their segmentation masks, making the data augmentation process extremely straightforward. Karras et al. [31] proposed an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature.
The new architecture leads to an automatically learned, unsupervised separation of high-level attributes and stochastic variation in the generated images, and it enables intuitive, scale-specific control of the synthesis. Thuy et al. [54] proposed combining deep learning, transfer learning and generative adversarial networks to improve classification performance; fine-tuning of the VGG16 and VGG19 networks was used to extract well-discriminated cancer features from histopathological images before feeding them into a neural network for classification. Han et al. [20] proposed a 3D multiconditional GAN (MCGAN) to generate realistic and diverse nodules placed naturally on lung computed tomography images to boost sensitivity in 3D object detection. However, these methods ignore the potential relationships among cervical cell images during feature learning and thus may limit the representation ability of CNN features. Besides, GANs usually require large training datasets, which are often scarce in the medical field, and GANs have only been applied to medical image synthesis at relatively low resolution. In this paper, we use CCG-taming transformers and SMOTE-Tomek Links to balance the weights and numbers of samples across categories; through this method, we can generate more samples of cervical cancer cells in each category, providing a more objective reference for cervical cancer research.

Impressive results from transformer models in natural language processing have encouraged the vision community to apply them to computer vision problems [19]. Vaswani et al. [57] proposed the transformer network architecture for the first time, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. Valanarasu et al. [56] proposed a gated axial attention model that extends existing architectures by introducing an additional control mechanism in the self-attention module; they also proposed a local-global training strategy (LoGo) to train the model effectively on medical images with significant performance gains. Zhu et al. [74] proposed Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference; Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs. He et al. [22] introduced DropBlock as a regularization technique for accurate HSI classification to mitigate the overfitting problem in CNN-based HSI classification. Chen et al. [6] studied low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and developed a new pretrained model, the image processing transformer (IPT). Wang et al. [62] investigated a simple backbone network useful for many dense prediction tasks without convolutions and proposed the pyramid vision transformer (PVT), which overcomes the difficulties of applying the transformer to various dense prediction tasks. Transformer-based models have achieved high accuracy in image classification, target detection and other fields. Park et al. [41] proposed a novel vision transformer that uses a low-level CXR feature corpus to extract abnormal CXR features. Wang et al. [61] combined transfer learning with a transformer model to predict the small-dataset Heck reaction.
In addition, the simple tokenization of input images in ViT fails to model important local structure (e.g., edges and lines) among neighboring pixels, leading to low training-sample efficiency, and its redundant attention backbone design leads to limited feature richness under fixed computational budgets and limited training samples. To solve these problems, we introduce T2T-ViT, which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study.

In this paper, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on Tokens-to-Token Vision Transformers (T2T-ViT) with transfer learning. The overall framework of the proposed model is illustrated in Fig. 2. First, we pretrain T2T-ViT on ImageNet and preprocess the cervical cancer Pap smear cell dataset; we then use CCG-taming transformers to expand the dataset and balance the weights and numbers of the different classes of images. The newly obtained dataset is passed to the ImageNet-pretrained T2T-ViT, which completes transfer learning for classification, and the classification result is obtained. Next, we elaborate on each part of the model in Fig. 2.

Taming transformers [13] are used to generate a variety of high-quality images; in contrast to CNNs, they contain no inductive bias that prioritizes local interactions. The approach uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures, and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method brings the efficiency of convolutional approaches to transformer-based high-resolution image synthesis.

An image $x \in \mathbb{R}^{H \times W \times 3}$ is represented by a spatial collection of codebook entries $z_q \in \mathbb{R}^{h \times w \times n_z}$, where $n_z$ is the dimensionality of the codes. We first learn a convolutional model composed of an encoder $E$ and a decoder $G$, which learns to represent images with codes from a discrete codebook $Z = \{z_k\}_{k=1}^{K} \subset \mathbb{R}^{n_z}$. The image $x$ passes through the encoder $E$ to give the encoding $\hat{z} = E(x) \in \mathbb{R}^{h \times w \times n_z}$, and we obtain $z_q$ by an elementwise quantization $q(\cdot)$ of each spatial code $\hat{z}_{ij} \in \mathbb{R}^{n_z}$ onto its closest codebook entry $z_k$:

$$z_q = q(\hat{z}) := \left( \arg\min_{z_k \in Z} \| \hat{z}_{ij} - z_k \| \right) \in \mathbb{R}^{h \times w \times n_z}.$$

With the reconstruction $\hat{x} = G(z_q)$, the encoder, decoder and codebook are trained with

$$\mathcal{L}_{VQ}(E, G, Z) = \| x - \hat{x} \|_2^2 + \| \mathrm{sg}[E(x)] - z_q \|_2^2 + \beta \, \| \mathrm{sg}[z_q] - E(x) \|_2^2,$$

where $\| x - \hat{x} \|_2^2$ is a reconstruction loss, $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, and $\beta \| \mathrm{sg}[z_q] - E(x) \|_2^2$ is the so-called "commitment loss" with weighting factor $\beta$. VQGAN uses a discriminator and a perceptual loss to maintain good perceptual quality at an increased compression rate: the $L_2$ reconstruction loss in $\mathcal{L}_{VQ}$ is replaced by a perceptual loss, and a patch-based discriminator $D$ is introduced to distinguish between real and reconstructed images, with adversarial loss

$$\mathcal{L}_{GAN}(\{E, G, Z\}, D) = \log D(x) + \log\bigl(1 - D(\hat{x})\bigr).$$

The complete objective for finding the optimal compression model $Q^* = \{E^*, G^*, Z^*\}$ then reads

$$Q^* = \arg\min_{E, G, Z} \max_{D} \; \mathbb{E}_{x \sim p(x)} \left[ \mathcal{L}_{VQ}(E, G, Z) + \lambda \, \mathcal{L}_{GAN}(\{E, G, Z\}, D) \right],$$

where we compute the adaptive weight $\lambda$ according to

$$\lambda = \frac{\nabla_{G_L}[\mathcal{L}_{rec}]}{\nabla_{G_L}[\mathcal{L}_{GAN}] + \delta},$$

where $\mathcal{L}_{rec}$ is the perceptual reconstruction loss [71], $\nabla_{G_L}[\cdot]$ denotes the gradient of its input with respect to the last layer $L$ of the decoder, and $\delta = 10^{-6}$ is used for numerical stability.
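To make the quantization step concrete, the following minimal PyTorch sketch (an illustrative assumption of how $q(\cdot)$, the stop-gradient terms and the straight-through gradient can be realized, not the authors' implementation) quantizes encoder features onto their nearest codebook entries and returns the codebook and commitment losses.

```python
# Vector quantization sketch: nearest-codebook lookup, VQ losses, straight-through.
import torch
import torch.nn.functional as F

K, n_z = 1024, 256                      # codebook size |Z| and code dimension (assumed)
codebook = torch.randn(K, n_z)          # entries z_k; learnable parameters in practice

def quantize(z_hat):
    """z_hat: encoder output of shape (h*w, n_z); returns z_q, indices and VQ losses."""
    # Elementwise quantization q(.): nearest codebook entry for each spatial code.
    d = torch.cdist(z_hat, codebook)            # (h*w, K) pairwise distances
    idx = d.argmin(dim=1)                       # sequence s of codebook indices
    z_q = codebook[idx]

    # sg[.] is implemented with .detach(); the second term is the commitment loss.
    codebook_loss = F.mse_loss(z_q, z_hat.detach())
    commit_loss = F.mse_loss(z_hat, z_q.detach())

    # Straight-through estimator: copy gradients from z_q back to the encoder output.
    z_q = z_hat + (z_q - z_hat).detach()
    return z_q, idx, codebook_loss, commit_loss

z_hat = torch.randn(16 * 16, n_z)               # dummy encoding for a 16x16 grid
z_q, idx, cb_loss, c_loss = quantize(z_hat)
```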
With $E$ and $G$ available, the cervical cell images can be represented in terms of the codebook indices of their encodings. The quantized encoding of an image $x$ is given by $z_q = q(E(x)) \in \mathbb{R}^{h \times w \times n_z}$ and is equivalent to a sequence $s \in \{0, \dots, |Z| - 1\}^{h \times w}$ of indices from the codebook, obtained by replacing each code by its index in the codebook $Z$:

$$s_{ij} = k \quad \text{such that} \quad (z_q)_{ij} = z_k.$$

By mapping the indices of a sequence $s$ back to their corresponding codebook entries, $z_q = (z_{s_{ij}})$ is readily recovered and decoded to an image $\hat{x} = G(z_q)$. Thus, after choosing some ordering of the indices in $s$, image generation can be formulated as autoregressive next-index prediction: given indices $s_{<i}$, the transformer learns to predict the distribution of possible next indices, $p(s_i \mid s_{<i})$, so that the likelihood of the full sequence is $p(s) = \prod_i p(s_i \mid s_{<i})$.
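The sketch below illustrates this next-index prediction as a simple sampling loop; `transformer` is a hypothetical placeholder for any causal model that returns logits over the codebook entries, and the start-token handling is an assumption for illustration, not the configuration used in the paper.

```python
# Autoregressive sampling of a codebook-index sequence s, one index at a time.
import torch

@torch.no_grad()
def sample_indices(transformer, seq_len, device="cpu"):
    s = torch.zeros(1, 1, dtype=torch.long, device=device)   # assumed start token
    for _ in range(seq_len):
        logits = transformer(s)[:, -1, :]          # logits for p(s_i | s_<i)
        probs = torch.softmax(logits, dim=-1)
        next_idx = torch.multinomial(probs, 1)     # sample the next codebook index
        s = torch.cat([s, next_idx], dim=1)
    return s[:, 1:]                                # drop the start token

# The sampled indices are then mapped back to codebook entries z_q = (z_{s_ij})
# and decoded by G to obtain a generated cervical-cell image.
```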