key: cord-1034663-lln78gqh authors: Zhao, Chen; Shuai, Renjun; Ma, Li; Liu, Wenjia; Wu, Menglin title: Improving cervical cancer classification with imbalanced datasets combining taming transformers with T2T-ViT date: 2022-03-19 journal: Multimed Tools Appl DOI: 10.1007/s11042-022-12670-0 sha: d041e133378072e1bb56e7e3f7d503af3d198b57 doc_id: 1034663 cord_uid: lln78gqh

Cervical cell classification has important clinical significance in cervical cancer screening at early stages. However, few public cervical cancer smear cell datasets are available, the class weights of their samples are unbalanced, the image quality is uneven, and CNN-based classification studies tend to overfit. To solve these problems, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) to provide high-quality cervical cancer datasets with sufficient samples and balanced weights. We improve the encoder structure by introducing SE-block and MultiRes-block to strengthen the extraction of information from cervical cancer cell images, and we introduce Layer Normalization to standardize the data, which facilitates the subsequent nonlinear processing by the ReLU activation function in the feed-forward module. We also introduce SMOTE-Tomek Links to balance the number of samples and the weights in the source dataset and in the images we generate. We then use Tokens-to-Token Vision Transformers (T2T-ViT) combined with transfer learning to classify the cervical cancer smear cell images and improve classification performance. Classification experiments with the proposed model are performed on three public cervical cancer datasets; the classification accuracies on the liquid-based cytology Pap smear dataset (4-class), SIPaKMeD (5-class), and Herlev (7-class) are 98.79%, 99.58%, and 99.88%, respectively. The quality of the images we generate on these three datasets is very close to that of the source datasets; the final averaged inception score (IS), Fréchet inception distance (FID), Recall, and Precision are 3.75, 0.71, 0.32, and 0.65, respectively. Our method improves the accuracy of cervical cancer smear cell classification, provides more cervical cell sample images for cervical cancer research, and assists gynecologists in judging and diagnosing different types of cervical cancer cells and in analyzing cervical cancer cells at different stages, which are difficult to distinguish. This paper applies the transformer to the generation and recognition of cervical cancer cell images for the first time. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11042-022-12670-0.

Cervical cancer is the fourth most common cause of death from cancer in females [48]. An estimated 604,127 cases and 341,831 deaths occurred worldwide in 2020, and it is the second most common cancer in women worldwide [28, 52]. At early stages of cervical cancer, the cure rate is nearly 100% [42], so the prevention, early detection and classification of cervical cancer are essential [64]. At present, cervical cancer screening methods mainly include human papillomavirus detection, cervical smears and acetic acid testing under colposcopy [72]. After the introduction of the Papanicolaou (Pap) smear [40], cervical cytology became the standard screening test for cervical cancer and premalignant lesions. As the most common screening test, cervical cytology has been used extensively and effectively reduces incidence and mortality.
At present, manual screening of abnormal cells from a cervical cytology slide is still common practice. However, it is usually tedious, inefficient and expensive. Consequently, automatic screening methods have attracted increasing attention [3, 5]. Additionally, research on cervical cell analysis shows that each independent cervical cell has intrinsic similarity. For example, superficial and intermediate cells generally have relatively small nuclei and clear cytoplasmic and nuclear margins, while dyskeratotic and metaplastic cells have overlapping cytoplasmic and nuclear margins. In addition, koilocytotic cells present a perinuclear cavity, while other cells have a relatively thick cytoplasm [18, 34]. These observations indicate that there exists a potential relationship between cervical cell images. Therefore, accurate cervical cell classification is crucial to automatic screening. The analysis of Pap smear images requires low error tolerance and skilled pathologists, and the screening process is expensive and time-consuming. An automated classification process can therefore assist gynecologists in diagnosis and provide more objective test explanations.

Recently, deep learning has brought considerable improvements in accuracy in many applications [46]. Due to its high accuracy in many fields, deep learning has become the most advanced machine learning technology. Deep learning and CNNs have been successfully used in breast cancer detection [7], skin cancer recognition [73], and COVID-19 recognition and analysis [26]. Many of these studies are based on convolutional neural networks (CNNs), which have become the standard for 3D medical image classification and segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling long-range dependencies due to their inductive bias of locality and weight sharing [69]. Transformer-based architectures have been introduced to address these limitations. Beginning at the end of 2020, transformer-based research has increased steadily, and some transformer-based work has surpassed CNN-based approaches in image classification, image detection, and image segmentation [33]. Dosovitskiy et al. [11] proposed the vision transformer (ViT), the first transformer applied to image classification. Instead of operating on individual pixels, ViT attends to small patches of the image; the authors show that dependence on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform image classification tasks well. Carion et al. [4] proposed a new method (DEtection TRansformer, or DETR) that views object detection as a direct set prediction problem. Their approach streamlines the detection pipeline, effectively removing the need for many hand-designed components, such as non-maximum suppression or anchor generation, that explicitly encode prior knowledge about the task. Xie et al. [69] proposed a novel framework that efficiently bridges a convolutional neural network and a transformer (CoTr) for accurate 3D medical image segmentation. Compared with a CNN, the self-attention in a transformer yields a more interpretable model: the attention distribution can be inspected, and each attention head can learn to perform a different task. Moreover, the number of operations required to relate two positions does not increase with their distance.
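To make this contrast concrete, the following minimal PyTorch sketch (illustrative only, not code from the paper; the patch size, embedding dimension and head count are assumptions) splits an image into patch tokens, as in ViT-style classifiers, and applies one self-attention layer, which relates every pair of patches in a single step regardless of their spatial distance.

```python
# Minimal ViT-style sketch: patch tokenization + one self-attention layer.
import torch
import torch.nn as nn

patch_size, embed_dim, num_heads = 16, 192, 3   # illustrative assumptions

# Patch embedding: a strided convolution turns a 224x224 RGB image into
# (224/16)^2 = 196 patch tokens of dimension embed_dim.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 3, 224, 224)                      # dummy cell image
tokens = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, embed_dim)

# Self-attention relates every token to every other token in one step,
# so the cost of associating two patches does not grow with their distance.
out, weights = attention(tokens, tokens, tokens)
print(out.shape, weights.shape)                      # (1, 196, 192), (1, 196, 196)
```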
Research on the classification of cervical cancer cells has mostly been carried out on two public datasets. Herlev [27] consists of 917 Pap smear cell images classified carefully by cytotechnicians and doctors; each cell is described by 20 numerical features, and the cells fall into 7 classes. SIPaKMeD [43] consists of 4049 annotated cell images classified by expert cytopathologists into five classes. In addition to the public datasets, there is research on nonpublic datasets [68]. Currently, the number of samples in the two public datasets is limited, so classification studies, whose reported accuracies are above 90%, tend to overfit; moreover, the cell types in the datasets are inconsistent, the class weights are not balanced, and the clarity and quality of the samples are uneven. As for nonpublic datasets, cervical cancer researchers cannot obtain the data and can only conduct research on the limited public datasets, which limits research progress. To solve these problems, this paper proposes, for the first time, a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on taming transformers and Tokens-to-Token Vision Transformers (T2T-ViT). The proposed method expands the dataset sample size, balances the number and weight of each type of cervical cell, and generates high-quality sample images. A new dataset (the liquid-based cytology Pap smear dataset [24]) is introduced to provide more objective material for the cervical cancer cell generation model, and T2T-ViT is used to further improve classification accuracy, provide more objective information for gynecologists, improve the efficiency of clinical-pathological diagnosis, detect the patient's condition in time, and improve the survival rate of cervical cancer patients. Figure 1 is an overview of the framework of the model in this paper. The main contributions and novelty of this model are as follows:

- We propose a cervical cancer cell sample image generation model based on taming transformers (CCG-taming transformers). To the best of our knowledge, this is the first model that combines CNNs and transformers and is applied to cervical cancer research.
- In CCG-taming transformers, we adjust the encoder structure of VQGAN in the taming transformers: 1) we introduce a new convolutional structure, the MultiRes-block, so that the encoder better extracts and analyzes the key information in cervical cancer cell images; 2) we introduce SE-block to alleviate the vanishing gradient problem in the neural network; 3) we introduce Layer Normalization to enhance the discrimination ability of the feature representations. This model mainly addresses the problems that there are few public datasets in current cervical cancer research, that the samples of each class differ greatly, and that the weights and quantities of the various samples in the datasets are not balanced. The cervical cancer cell images generated by this model can provide a useful reference for cervical cancer research.
- We introduce SMOTE-Tomek Links to balance the number of samples and the weights in the source dataset and in the images we generate (a usage sketch is given after this list).
- We introduce T2T-ViT combined with transfer learning to classify the cervical cancer cell image dataset, which addresses the problem that CNN-based classification models may lose details of the feature map.
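As a rough illustration of the balancing step, the sketch below applies SMOTE-Tomek Links with the imbalanced-learn library to a toy two-class feature matrix. The feature vectors and class sizes are assumptions made for illustration; they are not the cervical-cell data or feature extraction used in the paper.

```python
# Toy SMOTE-Tomek Links balancing with imbalanced-learn.
import numpy as np
from imblearn.combine import SMOTETomek

# X: per-sample feature vectors, y: integer class labels with unbalanced counts
# (400 vs. 100 here, standing in for unbalanced cervical-cell classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = np.concatenate([np.zeros(400, dtype=int), np.ones(100, dtype=int)])

# SMOTE oversamples the minority class, then Tomek Links removes borderline
# majority/minority pairs to clean up the class boundary.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```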
Artificial intelligence and deep learning play an important role in cell classification and in medical image classification, generation and analysis [17, 59, 64]. As these technologies develop, they become cost-effective and less time-consuming, and they are now more popular than traditional methods (such as Pap smears, colposcopy, and cervicography) [8]. These methods do not depend on the examiner's experience. Although they cannot replace gynecologists for pathological evaluation, they can provide considerable assistance for clinical diagnosis, improve the diagnostic efficiency of gynecologists, and reduce the subjective component of diagnosis.

Several works in the literature classify and detect cervical cancer. Convolutional neural networks (CNNs) [35] have been proposed to automatically learn multilevel features through a hierarchical deep architecture. Wieslander et al. [63] used ResNet for binary classification of benign and malignant cervical cells. Plissiti et al. [43] proposed the annotated cervical cell image dataset SIPaKMeD and applied a CNN to classify five types of cervical cells. Gautam et al. [14] proposed a patch-based approach using a CNN combined with transfer learning for the segmentation of nuclei in single-cell images. Ghoneim et al. proposed a cell detection and classification system based on convolutional neural networks (CNNs) and achieved 91.2% accuracy on the 7-class classification problem. William et al. [65] proposed contrast local adaptive histogram equalization for image enhancement; cell segmentation was achieved with a trainable Weka segmentation classifier, and a sequential elimination approach was used for debris rejection. Wang et al. [58] proposed a multiscale representation for scene classification, realized by a global-local two-stream architecture. They also [60] explored the attention mechanism and proposed a novel end-to-end attention recurrent convolutional network (ARCNet) for scene classification; their research has made outstanding contributions to the classification field. Although CNN-based frameworks have achieved good accuracy in the classification of cervical cancer smear cell images, training them requires substantial computing resources, and the network must reach a certain depth to capture deep-level information in the sample images. In a CNN-based model, the number of operations required to relate two positions through convolution increases with their distance, whereas with self-attention in a transformer it is independent of the distance.

Generative adversarial networks (GANs) [1] have been successfully used to synthesize human faces, landscapes and even medical images; they are mainly used to expand the size of datasets and to balance the weight and number of samples in each category. Pollastri et al. [44] presented a novel strategy that employs a DC-GAN to augment data in the skin lesion segmentation task; the proposed framework generates both skin lesion images and their segmentation masks, making the data augmentation process extremely straightforward. Karras et al. [31] proposed an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature.
The new architecture leads to an automatically learned, unsupervised separation of high-level attributes and stochastic variation in the generated images, and it enables intuitive, scale-specific control of the synthesis. Thuy et al. [54] proposed combining deep learning, transfer learning and generative adversarial networks to improve classification performance; fine-tuning of the VGG16 and VGG19 networks was used to extract well-discriminated cancer features from histopathological images before feeding them into a neural network for classification. Han et al. [20] proposed a 3D multiconditional GAN (MCGAN) to generate realistic and diverse nodules placed naturally on lung computed tomography images to boost sensitivity in 3D object detection. However, these methods ignore the potential relationships among cervical cell images during feature learning and thus may limit the representation ability of CNN features. Besides, GANs usually require large training datasets, which are often scarce in the medical field, and GANs have only been applied to medical image synthesis at relatively low resolution. In this paper, we use CCG-taming transformers and SMOTE-Tomek Links to balance the weights and numbers of samples across categories; through this method, we can generate more samples of cervical cancer cells in each category, providing a more objective reference for cervical cancer research.

Impressive results from transformer models in natural language processing have encouraged the vision community to apply them to computer vision problems [19]. Vaswani et al. [57] proposed the transformer network architecture for the first time, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. Valanarasu et al. [56] proposed a gated axial attention model that extends existing architectures by introducing an additional control mechanism in the self-attention module; they also proposed a local-global training strategy (LoGo) to train the model effectively on medical images with significant performance gains. Zhu et al. [74] proposed Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference; Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs. He et al. [22] introduced DropBlock as a regularization technique for accurate HSI classification to mitigate the overfitting problem in CNN-based HSI classification. Chen et al. [6] studied low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and developed a new pretrained model, the image processing transformer (IPT). Wang et al. [62] investigated a simple backbone network useful for many dense prediction tasks without convolutions and proposed the pyramid vision transformer (PVT), which overcomes the difficulties of applying the transformer to various dense prediction tasks. Transformer-based models have achieved high accuracy in image classification, target detection and other fields. Park et al. [41] proposed a novel vision transformer that uses a low-level CXR feature corpus to extract abnormal CXR features. Wang et al. [61] combined transfer learning with a transformer model to predict the small-dataset Heck reaction.
In addition, the simple tokenization of input images in ViT fails to model important local structure (e.g., edges and lines) among neighboring pixels, leading to low training-sample efficiency, and its redundant attention backbone design leads to limited feature richness under fixed computational budgets and limited training samples. To solve these problems, we introduce T2T-ViT, which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; and 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after extensive study.

In this paper, we propose a cervical cell image generation model based on taming transformers (CCG-taming transformers) and a classification model based on Tokens-to-Token Vision Transformers (T2T-ViT) with transfer learning. The overall framework of the proposed model is illustrated in Fig. 2. First, we pretrain T2T-ViT on ImageNet and preprocess the cervical cancer Pap smear cell dataset; we then use CCG-taming transformers to expand the dataset and balance the weights and numbers of the different classes of images. The newly obtained dataset is passed to the ImageNet-pretrained T2T-ViT, which completes transfer learning for classification, and the classification result is obtained. Next, we elaborate on each part of the model in Fig. 2.

Taming transformers [13] are used to generate a variety of high-quality images; in contrast to CNNs, they contain no inductive bias that prioritizes local interactions. The approach uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures, and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method brings the efficiency of convolutional approaches to transformer-based high-resolution image synthesis.

An image $x \in \mathbb{R}^{H \times W \times 3}$ is represented by a spatial collection of codebook entries $z_q \in \mathbb{R}^{h \times w \times n_z}$, where $n_z$ is the dimensionality of the codes. We first learn a convolutional model composed of an encoder $E$ and a decoder $G$, which learns to represent images with codes from a discrete codebook $Z = \{z_k\}_{k=1}^{K} \subset \mathbb{R}^{n_z}$. The image $x$ passes through the encoder $E$ to give the encoding $\hat{z} = E(x) \in \mathbb{R}^{h \times w \times n_z}$, and we obtain $z_q$ by an elementwise quantization $q(\cdot)$ of each spatial code $\hat{z}_{ij} \in \mathbb{R}^{n_z}$ onto its closest codebook entry $z_k$:

$$z_q = q(\hat{z}) := \left( \arg\min_{z_k \in Z} \| \hat{z}_{ij} - z_k \| \right) \in \mathbb{R}^{h \times w \times n_z}.$$

With the reconstruction $\hat{x} = G(z_q)$, the encoder, decoder and codebook are trained with

$$\mathcal{L}_{VQ}(E, G, Z) = \| x - \hat{x} \|_2^2 + \| \mathrm{sg}[E(x)] - z_q \|_2^2 + \beta \, \| \mathrm{sg}[z_q] - E(x) \|_2^2,$$

where $\| x - \hat{x} \|_2^2$ is a reconstruction loss, $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, and $\beta \| \mathrm{sg}[z_q] - E(x) \|_2^2$ is the so-called "commitment loss" with weighting factor $\beta$. VQGAN uses a discriminator and a perceptual loss to maintain good perceptual quality at an increased compression rate: the $L_2$ reconstruction loss in $\mathcal{L}_{VQ}$ is replaced by a perceptual loss, and a patch-based discriminator $D$ is introduced to distinguish between real and reconstructed images, with adversarial loss

$$\mathcal{L}_{GAN}(\{E, G, Z\}, D) = \log D(x) + \log\bigl(1 - D(\hat{x})\bigr).$$

The complete objective for finding the optimal compression model $Q^* = \{E^*, G^*, Z^*\}$ then reads

$$Q^* = \arg\min_{E, G, Z} \max_{D} \; \mathbb{E}_{x \sim p(x)} \left[ \mathcal{L}_{VQ}(E, G, Z) + \lambda \, \mathcal{L}_{GAN}(\{E, G, Z\}, D) \right],$$

where we compute the adaptive weight $\lambda$ according to

$$\lambda = \frac{\nabla_{G_L}[\mathcal{L}_{rec}]}{\nabla_{G_L}[\mathcal{L}_{GAN}] + \delta},$$

where $\mathcal{L}_{rec}$ is the perceptual reconstruction loss [71], $\nabla_{G_L}[\cdot]$ denotes the gradient of its input with respect to the last layer $L$ of the decoder, and $\delta = 10^{-6}$ is used for numerical stability.
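To make the quantization step concrete, the following minimal PyTorch sketch (an illustrative assumption of how $q(\cdot)$, the stop-gradient terms and the straight-through gradient can be realized, not the authors' implementation) quantizes encoder features onto their nearest codebook entries and returns the codebook and commitment losses.

```python
# Vector quantization sketch: nearest-codebook lookup, VQ losses, straight-through.
import torch
import torch.nn.functional as F

K, n_z = 1024, 256                      # codebook size |Z| and code dimension (assumed)
codebook = torch.randn(K, n_z)          # entries z_k; learnable parameters in practice

def quantize(z_hat):
    """z_hat: encoder output of shape (h*w, n_z); returns z_q, indices and VQ losses."""
    # Elementwise quantization q(.): nearest codebook entry for each spatial code.
    d = torch.cdist(z_hat, codebook)            # (h*w, K) pairwise distances
    idx = d.argmin(dim=1)                       # sequence s of codebook indices
    z_q = codebook[idx]

    # sg[.] is implemented with .detach(); the second term is the commitment loss.
    codebook_loss = F.mse_loss(z_q, z_hat.detach())
    commit_loss = F.mse_loss(z_hat, z_q.detach())

    # Straight-through estimator: copy gradients from z_q back to the encoder output.
    z_q = z_hat + (z_q - z_hat).detach()
    return z_q, idx, codebook_loss, commit_loss

z_hat = torch.randn(16 * 16, n_z)               # dummy encoding for a 16x16 grid
z_q, idx, cb_loss, c_loss = quantize(z_hat)
```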
With $E$ and $G$ available, the cervical cell images can be represented in terms of the codebook indices of their encodings. The quantized encoding of an image $x$ is given by $z_q = q(E(x)) \in \mathbb{R}^{h \times w \times n_z}$ and is equivalent to a sequence $s \in \{0, \dots, |Z| - 1\}^{h \times w}$ of indices from the codebook, obtained by replacing each code by its index in the codebook $Z$:

$$s_{ij} = k \quad \text{such that} \quad (z_q)_{ij} = z_k.$$

By mapping the indices of a sequence $s$ back to their corresponding codebook entries, $z_q = (z_{s_{ij}})$ is readily recovered and decoded to an image $\hat{x} = G(z_q)$. Thus, after choosing some ordering of the indices in $s$, image generation can be formulated as autoregressive next-index prediction: given indices $s_{<i}$, the transformer learns to predict the distribution of possible next indices, $p(s_i \mid s_{<i})$, so that the likelihood of the full sequence is $p(s) = \prod_i p(s_i \mid s_{<i})$.
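The sketch below illustrates this next-index prediction as a simple sampling loop; `transformer` is a hypothetical placeholder for any causal model that returns logits over the codebook entries, and the start-token handling is an assumption for illustration, not the configuration used in the paper.

```python
# Autoregressive sampling of a codebook-index sequence s, one index at a time.
import torch

@torch.no_grad()
def sample_indices(transformer, seq_len, device="cpu"):
    s = torch.zeros(1, 1, dtype=torch.long, device=device)   # assumed start token
    for _ in range(seq_len):
        logits = transformer(s)[:, -1, :]          # logits for p(s_i | s_<i)
        probs = torch.softmax(logits, dim=-1)
        next_idx = torch.multinomial(probs, 1)     # sample the next codebook index
        s = torch.cat([s, next_idx], dim=1)
    return s[:, 1:]                                # drop the start token

# The sampled indices are then mapped back to codebook entries z_q = (z_{s_ij})
# and decoded by G to obtain a generated cervical-cell image.
```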