key: cord-0562141-tn5kdv22
authors: Xu, Fanxin; Li, Xiangkui; Yang, Hang; Wang, Yali; Xiang, Wei
title: TE-YOLOF: Tiny and efficient YOLOF for blood cell detection
date: 2021-08-27
journal: nan
DOI: nan
sha: d8122a4d7933907bbb980f26e5aaf530c8c5c87b
doc_id: 562141
cord_uid: tn5kdv22

Blood cell detection in microscopic images is an essential branch of medical image processing research. Since disease detection based on manual inspection of blood cells is time-consuming and error-prone, detecting blood cells with object detectors built on deep convolutional neural networks is a feasible alternative. In this work, an object detector based on YOLOF is proposed to detect blood cell objects such as red blood cells, white blood cells and platelets. This detector, called TE-YOLOF (Tiny and Efficient YOLOF), is a one-stage detector that uses a dilated encoder to extract information from a single-level feature map. To increase efficiency and flexibility, the EfficientNet convolutional neural network is used as the backbone of the proposed detector. Furthermore, Depthwise Separable Convolution is applied to enhance performance and reduce the number of network parameters. In addition, the Mish activation function is employed to increase precision. Extensive experiments on the BCCD dataset demonstrate the effectiveness of the proposed model, which is more efficient than existing approaches to blood cell detection.

The analysis of blood cells in microscopic images plays a vital role in disease recognition by identifying the different cellular objects. Blood has three important cellular components: White Blood Cells (WBC), Red Blood Cells (RBC), and Platelets [1]. The proportion and number of these blood cells strongly affect the doctor's judgment of the illness [2, 3]. An automated algorithm based on deep convolutional neural networks that detects blood cells accurately and efficiently can improve the effectiveness of the medical system [4].

Object detection, which is the study of finding the coordinates of objects in an image and classifying them, is widely used in computer vision tasks such as machine vision, pedestrian identification and anomaly detection. Fast, accurate algorithms for object detection would allow computers to replace part of the manual checking work and free people for other purposes.

Current object detectors are based on either a two-stage or a one-stage mechanism [5]. Most two-stage networks are based on the R-CNN framework, which generates a sparse set of candidate object locations in the first stage and classifies each candidate location as one of the foreground classes or as background in the second stage [6, 7, 8, 9]. Through a sequence of advances, such two-stage models reach the highest accuracy, but operate slowly. One-stage detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), reframe object detection as a single regression problem by learning the class probabilities and bounding box coordinates in a single pass [10, 11, 12, 13]. With the help of focal loss, RetinaNet [14], a one-stage detector, outperforms the high-accuracy two-stage alternatives. YOLOF (You Only Look One-level Feature) [15] cascades dilated convolutions to obtain an effect on dense, small-object detection comparable to that of feature pyramid networks (FPN) [16].

Inspired by [17], in this research we propose a new one-stage detector that provides both high accuracy and high efficiency, addressing the low accuracy of blood cell detection. In a nutshell, the contributions of this research are as follows:

• YOLOF is applied to the blood cell detection field for the first time, with the EfficientNet CNN as the backbone to increase efficiency and flexibility.

• A Depthwise Separable Convolution module is applied in the decoder to enhance performance and reduce the number of parameters.

• The Mish activation function is shown to increase precision more than the other functions compared, such as Swish, ReLU and MetaAconC.

• Extensive experiments on the BCCD dataset indicate the importance of each component. Moreover, we conduct comparisons with YOLOv3 and Deformable DETR, and our detector achieves a higher mAP.

Before the advance of one-stage detectors, two-stage detectors used region proposals instead of sliding windows to locate and classify objects in images [18, 6]. With Faster R-CNN, region proposals could be generated automatically by a Region Proposal Network (RPN) [8]. Object detection was thereby unified into a pure neural network framework without any hand-designed parts for extracting features and predicting objects. These complex pipelines help two-stage detectors achieve high accuracy at the expense of speed.

In the one-stage field, multiple versions of the YOLO series have been published that reframe object detection as a single regression problem [10, 11, 12]; they forward an image only once to predict where and what objects are present. A Feature Pyramid Network (FPN) is used in YOLOv3 to detect objects of different sizes. These methods are much faster than two-stage detectors with comparable accuracy. With the help of focal loss, which enables training a highly accurate detector by addressing the class imbalance problem [14], one-stage detectors achieve an excellent balance of precision and speed. In [15], YOLOF uses a dilated encoder to obtain feature maps with an effect comparable to FPN, bridging the performance gap between detectors with and without FPN in small object detection. In the following, the approaches that we use for blood cell detection are described in detail, together with extensive experiments.
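To make the role of focal loss concrete, the following is a minimal sketch of the binary focal loss of [14] for a single prediction. The values alpha = 0.25 and gamma = 2.0 are the defaults from the focal loss paper and are an assumption here; the values used in TE-YOLOF are not stated in this text, and the function names are illustrative only.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for one prediction.

    p: predicted foreground probability in (0, 1); y: label, 1 or 0.
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss: cross entropy down-weighted by (1 - p_t) ** gamma.

    alpha and gamma are assumed defaults, not values taken from this paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A well-classified positive contributes far less than a misclassified one,
# so the many easy samples do not dominate the total loss.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.10, 1)
print(f"easy positive: {easy:.6f}, hard positive: {hard:.6f}")
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross entropy, which is a convenient sanity check for any implementation.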

In a blood cell image, the distributions of WBC and RBC are relatively unbalanced: RBCs are relatively dense and vary in size, while WBCs and platelets are relatively sparse. In [19], YOLOv3 is used to detect discrete RBCs and WBCs in order to count them in images, and an image density estimation algorithm [20] is used for the RBC counting task. The maximum mAP achieved in [19] is 88.26%, using YOLOv3 with approximately 62 million parameters. In [17], the FED detector raises the mAP to 88.33% with approximately 14 million parameters; with Swish and DIoU, the mAP is further improved to 89.86%.

Depthwise separable convolution was proposed to reduce computation and model size. It has been used in neural network design by MobileNet [21] and has become more popular since its inclusion in the TensorFlow framework. In the method proposed in this paper, depthwise separable convolution is used to reduce the number of parameters and improve the results.

The compound scaling method balances network width, depth, and resolution according to a fixed ratio. Before this method was introduced, increasing the depth of the network was the most common way to increase its performance [22, 23]. Additionally, although input images with higher resolution help feature extraction and performance [21, 24], they increase the number of parameters. EfficientNet uses the compound scaling method to achieve state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPs on both ImageNet and other transfer learning datasets [25]. Therefore, EfficientNet is used in this study to increase efficiency.

The goal of object detection is to find the coordinates of objects in an image and to classify their categories. In this research, the proposed detector treats object detection as a regression problem, following the methodology of the YOLO detector. The proposed detector, which uses only the last output feature of the backbone, converges faster and achieves promising performance compared to YOLOv3, which uses FPN to enhance performance. The proposed detector is a modification of YOLOF, which is described in detail in [15].

The flow chart of the proposed detector has been shown in Figure 1 . The entire operation of the proposed model includes 8 steps. The details of these steps are as follows.

• Step 1: The images of the BCCD dataset, with data augmentation, are loaded into a dataset.

• Step 2: The loaded images are divided into train, val and test sets with COCO-style annotations.

• Step 3: The YOLOF model is created as the basis for the proposed model.

• Step 4: Overall architecture modification process of this model is detailed in this step. These modifications involve the backbone of the model, the 

Based on the solutions above, we propose a lightweight framework called TE-YOLOF. A sketch of TE-YOLOF is shown in Figure 2. The proposed framework consists of three parts: the backbone, the encoder, and the decoder. A brief introduction to the main components of the proposed detector follows.

Backbone. The EfficientNet series is adopted as the backbone in all models. For efficiency and flexibility, EfficientNet-B0 to EfficientNet-B3 are chosen for analysis, balancing precision against the number of parameters. All models are pre-trained on ImageNet. The output of the backbone is the C6 feature map, which has 1280 channels by default (scaled by the width factor of each backbone) and a downsample rate of 32.

Encoder. The architecture of the encoder is shown in Figure 3. A projector with two projection layers (one 1 × 1 and one 3 × 3 convolution) is added after the backbone, resulting in a feature map with 512 channels. It is followed by residual blocks, each consisting of three components: a 1 × 1 convolution that reduces the number of channels by a factor of 4, a 3 × 3 convolution that enlarges the receptive field with a dilation factor that differs from block to block, and a final 1 × 1 convolution that restores the number of channels.

Decoder. The decoder, shown in Figure 2, consists of two parallel task-specific heads: the classification head and the regression head. There are four Depthwise Separable Convolution modules on the regression head and only two on the classification head. The architecture of the Depthwise Separable Convolution module is shown in Figure 4. Each convolution layer in the module is followed by a batch normalization layer and a Mish layer. Following AutoAssign [26], an objectness prediction for each anchor is added on the regression head to indicate whether the anchor contains an object. The final classification scores are obtained by multiplying the classification output by the objectness prediction.

Other Details. Non-Maximum Suppression (NMS) is used to ensure that the detector detects each object only once: detected boxes whose overlap with the highest-scoring box exceeds a threshold are removed. Focal loss, which addresses the imbalance between positive and negative samples, is used as the classification loss. GIoU [27] is used as the regression loss.
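The box-level operations named above can be sketched in a few lines. The following is a minimal, plain-Python illustration of IoU, the GIoU loss of [27], and greedy NMS; it is not the authors' implementation. Boxes are assumed to be (x1, y1, x2, y2) corner tuples, the NMS shown is class-agnostic for brevity, and the default threshold of 0.6 is the value reported in the implementation details.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou(a, b):
    """Return (IoU, GIoU) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest box enclosing both a and b.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    return iou, giou


def giou_loss(pred, target):
    """GIoU regression loss: 0 for a perfect match, up to 2 for distant boxes."""
    return 1.0 - iou_and_giou(pred, target)[1]


def nms(boxes, scores, iou_threshold=0.6):
    """Greedy NMS: visit boxes by descending score and keep a box only if its
    IoU with every already-kept box does not exceed the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_and_giou(boxes[i], boxes[k])[0] <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Made-up boxes: the second overlaps the first heavily and is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes because the enclosing-box term keeps changing as the boxes move apart, which is why it is usable as a regression loss.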

Compound scaling is an effective method for achieving higher performance by uniformly scaling network depth, width and resolution in a principled way. Although increasing only one of these factors can improve accuracy, the network then becomes too large, with an enormous number of parameters and operations. A compound coefficient φ is used to balance the different factors under a resource constraint and reach higher precision. The scaling rule, following [25], is given in Equation (1):

$$ d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \quad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \ \alpha \geq 1, \ \beta \geq 1, \ \gamma \geq 1 \tag{1} $$

Intuitively, φ is a user-specified parameter that controls how many more resources are available for model scaling, while α, β, γ determine the expansion of network depth d, width w, and resolution r respectively. The best values for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15.

EfficientNet-B1 to B7 are obtained with different φ under Equation (1).
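As a small worked example of the compound scaling rule of [25], the sketch below raises the three base coefficients to the power φ, with α scaling depth, β width and γ resolution as in the EfficientNet paper. The function names are illustrative, and the raw multipliers are an approximation: the published B1 to B7 configurations may round these values.

```python
# Base coefficients reported for EfficientNet-B0.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate FLOPs growth: depth contributes linearly, while width and
    resolution contribute quadratically, giving (alpha * beta^2 * gamma^2)^phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    m = compound_scale(phi)
    print(phi, {k: round(v, 3) for k, v in m.items()}, round(flops_multiplier(phi), 3))
```

Because α · β² · γ² is about 1.92, each unit increase of φ roughly doubles the FLOPs, which is the constraint the coefficients were chosen to satisfy.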

In this study, for efficiency and flexibility, EfficientNet-B0 to B3, with 5.3, 7.8, 9.2 and 12 million parameters respectively, are chosen for analysis as backbones. Using different backbones also shows how the model trades off accuracy against resources.

Depthwise Separable Convolution is an effective method for reducing the number of parameters by factorizing a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution [21, 28]. In the proposed detector, the depthwise convolution applies a single filter to each input channel and the pointwise convolution applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. The architecture is shown in Figure 4. Each convolution layer is followed by a batch normalization layer and the Mish activation function.

In more detail, the standard convolution is presented in Figure 5 and the corresponding depthwise separable convolution is shown in Figure 6. Suppose the number of input channels is N, the number of output channels is M, and the size of the convolution kernel is N_k × N_k. According to Figure 5, the standard convolution has N × N_k × N_k × M kernel parameters. As shown in Figure 6, the depthwise convolution has N × N_k × N_k parameters, while the pointwise convolution has N × 1 × 1 × M. The ratio of the separable to the standard parameter count is therefore 1/M + 1/N_k^2, so with 3 × 3 kernels depthwise separable convolution uses 8 to 9 times fewer parameters than standard convolution.
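The parameter counts above can be checked with a few lines of arithmetic. This sketch counts convolution weights only (no biases or batch normalization), and the 512-channel, 3 × 3 configuration is chosen purely for illustration.

```python
def standard_conv_params(n_in, n_out, k):
    """Weights of a standard k x k convolution: N * k * k * M."""
    return n_in * k * k * n_out


def depthwise_separable_params(n_in, n_out, k):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = n_in * k * k   # one k x k filter per input channel
    pointwise = n_in * n_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


def reduction_factor(n_in, n_out, k):
    """How many times fewer weights the separable form needs."""
    return standard_conv_params(n_in, n_out, k) / depthwise_separable_params(n_in, n_out, k)


# Illustrative configuration: 512 input and output channels, 3 x 3 kernel.
print(standard_conv_params(512, 512, 3))
print(depthwise_separable_params(512, 512, 3))
print(round(reduction_factor(512, 512, 3), 2))
```

The factor approaches N_k^2 = 9 from below as the number of output channels grows, which is where the "8 to 9 times" figure for 3 × 3 kernels comes from.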

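Mish is a smooth, continuous, self-regularized, non-monotonic activation function mathematically defined as:

$$ f(x) = x \cdot \tanh(\operatorname{softplus}(x)) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right) \tag{2} $$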

According to Equation (2), Mish uses the self-gating property, in which the non-modulated input is multiplied by the output of a non-linear function of the input. Swish, by comparison, is defined as f(x) = x · (1 + exp(−βx))^(−1). Figure 7 shows the graphs of Swish for β = 1 and of Mish versus ReLU [29, 30, 31]. As illustrated in Figure 7 (a), Mish, similar to Swish, is bounded below and unbounded above, with a range of [≈ −0.31, ∞). As Figure 7 (b) shows, the Mish function converges to one faster than Swish for positive inputs. The benefits of Mish are as follows.

• Mish eliminates the Dying ReLU phenomenon by design, because it preserves a small amount of negative information.

• Mish is bounded below and avoids saturation caused by near-zero gradients. It is unbounded above, so the outputs do not saturate at a maximum value.

• Compared to ReLU and Swish, Mish is continuously differentiable, which contributes to better gradient flow.
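The three activation functions compared here are easy to evaluate directly. The following plain-Python sketch implements ReLU, Swish and Mish for scalar inputs and locates their minima numerically on a grid; the grid range and step are arbitrary choices made for illustration.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(x):
    return max(0.0, x)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Scan [-5, 5] in steps of 0.001 to estimate each lower bound.
grid = [i / 1000.0 for i in range(-5000, 5001)]
mish_min = min(mish(x) for x in grid)
swish_min = min(swish(x) for x in grid)
print(f"min Mish  ~ {mish_min:.4f}")
print(f"min Swish ~ {swish_min:.4f}")
print(f"ReLU(-1) = {relu(-1.0)}, Mish(-1) = {mish(-1.0):.4f}")
```

The scan reproduces the lower bound of about −0.31 quoted for Mish, and shows that a negative input which ReLU maps to zero still yields a small negative output under Mish.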

In this section, we focus on the implementation details and experimental analysis. We evaluate our detector on the BCCD dataset and compare it with Deformable DETR and YOLOv3. Then, we provide a detailed ablation study of each component's design with quantitative results and analysis. The details are as follows.

Implementation Details. The proposed detector is implemented in Python with PyTorch and run on an Nvidia GeForce 1080Ti. The implementation tools are detailed in Table 1. The proposed model is trained with SGD on 1 GPU with 4 images per mini-batch. The ablation study is based on the '1x' schedule setting with an initial learning rate of 0.015. To stabilize training at the beginning, the number of warmup iterations is set to 1500. NMS with a threshold of 0.6 is used to post-process the results.
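The warmup mentioned above can be illustrated with a short schedule function. The text gives only the base learning rate (0.015) and the number of warmup iterations (1500); the linear shape and the starting factor of 0.001 used below are assumptions for illustration, and the later decay steps of the '1x' schedule are not modeled.

```python
def warmup_lr(iteration, base_lr=0.015, warmup_iters=1500, start_factor=0.001):
    """Learning rate at a given iteration under linear warmup.

    start_factor (the fraction of base_lr used at iteration 0) and the linear
    ramp are assumed; only base_lr and warmup_iters come from the text.
    """
    if iteration >= warmup_iters:
        return base_lr
    frac = iteration / warmup_iters
    return base_lr * (start_factor + (1.0 - start_factor) * frac)


for it in (0, 375, 750, 1125, 1500, 3000):
    print(it, f"{warmup_lr(it):.6f}")
```

Starting from a small learning rate limits the size of the first updates, when the freshly initialized encoder and decoder still produce large gradients on top of the pre-trained backbone.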

Dataset. The BCCD dataset, which includes 364 images, each with dimensions of 640 × 480 × 3 and containing different proportions of Platelets, WBC and RBC, is used to evaluate the proposed detector. All images are resized to 416 × 416 × 3. The dataset is divided into training, val and test sets with a ratio of 7:2:1, the detailed information of objects per category is shown in [19] with batchsize of 8. YOLOF utilizes the setting of the method in [17] with AdamW [33]. The + indicates that YOLOF uses the same setting as TE-YOLOF shown in Implementation Details.

In the bottom section, models marked with the suffix 'B(N)' adopt EfficientNet-B(N) as backbone by default.

Table 4: Comparison with FED on the BCCD validation set. Platelets, RBC and WBC represent the AP of each category. The top section shows the results with the setting of FED in [17]. The '*' indicates that AdamW is used as the optimizer with 12 epochs. The mAP is evaluated at IoU = 0.4.

flip, vertical flip, random crop between 0 and 15 percent of the image, random brightness adjustment, and random exposure adjustment are applied for augmentation. All cells are annotated in COCO format.
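Resizing and flipping an image also changes its box annotations. The following sketch shows the corresponding arithmetic for the 640 × 480 to 416 × 416 resize and for a vertical flip, assuming COCO-style boxes given as (x, y, width, height) with the origin at the top-left corner. The function names and the sample box are made up for illustration.

```python
def resize_box(box, src_size=(640, 480), dst_size=(416, 416)):
    """Scale an (x, y, w, h) box when the image is resized.

    Sizes are (width, height). The two axes scale independently because the
    resize changes the aspect ratio from 4:3 to 1:1.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)


def vertical_flip_box(box, image_height):
    """Mirror an (x, y, w, h) box top-to-bottom; only the y origin changes."""
    x, y, w, h = box
    return (x, image_height - y - h, w, h)


box = (100.0, 120.0, 64.0, 48.0)  # made-up annotation in a 640 x 480 image
resized = resize_box(box)
flipped = vertical_flip_box(resized, 416)
print(resized)
print(flipped)
```

Note that a box that is wider than tall in the original image can become square after the resize, since the horizontal scale (0.65) differs from the vertical one (about 0.867).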

Comparison with Deformable DETR. Deformable DETR [32] is a recently proposed detector that introduces a transformer with deformable attention to object detection, targeting the small-object detection problem. Although it achieves surprisingly good results on the BCCD dataset with a high mAP (evaluated as AP50), it needs more epochs to converge and requires dense operations per epoch. The results are shown in Table 3. The proposed TE-YOLOF converges much faster (4×) than Deformable DETR and also achieves a higher mAP.

Comparison with YOLOv3. YOLOv3 was first applied to blood cell detection in [19]. It achieves an mAP of 86.79, while we obtain 86.7 with a batch size of 8 under the same setting on our device. YOLOF [15] performs better in the same setting with AdamW [33]. Table 3 shows that TE-YOLOF achieves better performance with fewer epochs, parameters and operations.

Comparison with FED. FED [17] is the most efficient model, with the fewest parameters, on the BCCD dataset. It uses EfficientNet-B3 as backbone and the YOLOv3 head as detection head, with components modified from YOLOv3 to obtain better performance. As the comparison in Table 4 shows, TE-YOLOF-B0 achieves a better mAP than FED with fewer parameters. With the same backbone, TE-YOLOF reaches an mAP of 91.9, mainly by increasing the accuracy on RBC.

We run a number of ablations to analyze TE-YOLOF. We first provide an overall analysis of the three proposed components. Then we show the ablation experiments on the detailed designs of each component. Results are shown in Table 5 and discussed in detail next. For credibility, all experimental data are the average of 3 experiments.

Backbone of EfficientNet: Table 5 shows that EfficientNet as backbone is better than the original ResNet-50 backbone, whose results are shown in Table 3.

Mish Activation function: The default activation is ReLU. By comparing three different activation functions, we demonstrate the effectiveness of the Mish activation function. MetaAconC [34] uses 1 × 1 convolutions to learn whether or not to activate the neurons. Table 5 shows that the improvement from MetaAconC is unstable across backbones, while Swish and Mish improve the results consistently. Mish contributes more to the result than Swish.

The detection results of the proposed model are visualized in Figure 8. The model used is TE-YOLOF-B0, which has the smallest number of parameters and operations. Figure 8 shows that the model detects all the categories well. Platelets, RBC and WBC correspond to small, medium and large targets respectively. Our model accurately identifies categories of different sizes and detects dense RBCs with high precision.

A traditional automated count of blood cells is usually performed via flow cytometry instrumentation [35]. For medical diagnostics in real situations, blood cell detection combined with computer vision technology can address the challenges of detection speed and accuracy [36]. One of the difficulties in blood cell detection is that the sparsity of platelets and WBCs and the density of RBCs make the detection of the various categories unbalanced. The FED model in [17] can accurately detect platelets and WBCs, but its detection accuracy for RBCs reaches only 80.4 because of the density of RBCs.

In this research, the proposed model makes use of EfficientNet's excellent feature extraction capabilities.

Thanks to the development of transfer learning in the field of computer vision, the proposed detector does not need time-consuming re-training. Weights pre-trained on a large dataset can be transferred to the new model, and with a small number of training epochs the final result meets expectations.

In addition, dilated convolution [37] enlarges the receptive field of the features, which can be exploited to generate multi-scale features with different dilation factors. Stacking dilated convolutions one after another without weight sharing can achieve the feature fusion effect of FPN. With these advantages, our model achieves balanced and accurate results across categories of different sizes and object densities. In conclusion, the proposed model, TE-YOLOF, overcomes the problem of low RBC accuracy in FED, as shown in Table 4. Because the RBC detection problem is solved, the final result is significantly improved.
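How quickly stacked dilated convolutions enlarge the receptive field can be worked out with simple arithmetic. The sketch below does this for an encoder laid out as described earlier (a 1 × 1 and a 3 × 3 projection layer, then residual blocks of 1 × 1, dilated 3 × 3 and 1 × 1 convolutions, all with stride 1). The dilation rates (2, 4, 6, 8) are hypothetical, since the text does not list them, and the result is measured in cells of the backbone's output feature map.

```python
def receptive_field(layers):
    """Receptive field side length of a stack of stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs. Each layer widens the
    field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def encoder_receptive_fields(dilations=(2, 4, 6, 8)):
    """Receptive field after the projector and after each residual block.

    The default dilation rates are an assumption made for illustration.
    """
    layers = [(1, 1), (3, 1)]  # projector: 1x1 then 3x3 convolution
    fields = [receptive_field(layers)]
    for d in dilations:
        layers += [(1, 1), (3, d), (1, 1)]  # reduce, dilated 3x3, restore
        fields.append(receptive_field(layers))
    return fields


print(encoder_receptive_fields())
```

These figures follow the convolutional path only. Because each block also has an identity shortcut, its output combines features with the smaller receptive fields of the earlier stages, which is what allows a single-level feature map to cover objects of several sizes.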

In this work, TE-YOLOF is proposed to address the blood cell detection problem. Building on YOLOF's excellent characteristics in dense object detection, the compound scaling method, Depthwise Separable Convolution, and the Mish activation function are used to make the detector lightweight while improving performance. The results of experiments and comparisons demonstrate that TE-YOLOF is more efficient than other existing methods for blood cell detection, with the fewest parameters. In addition, TE-YOLOF is a flexible detector, because different backbone versions can be chosen to balance precision and parameters.

Given the COVID-19 pandemic, the danger of late diagnosis, and the role of blood cell detection in clinical diagnosis, developing practical equipment based on the proposed method could be vital.

As future work, this solution could also be transferred to other medical detection fields.

[1] Raman spectroscopy of blood and blood components

[2] Atomic force microscopy and Raman spectra profile of blood components associated with exposure to cigarette smoking

[3] Nanoscale changes on RBC membrane induced by storage and ionizing radiation: A minireview

[4] Detection of subtype blood cells using deep learning

[5] Recent advances in deep learning for object detection

[6] Rich feature hierarchies for accurate object detection and semantic segmentation

[7] Fast R-CNN

[8] Faster R-CNN: Towards real-time object detection with region proposal networks

[9] Mask R-CNN

[10] You only look once: Unified, real-time object detection

[11] YOLO9000: Better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

[12] YOLOv3: An incremental improvement

[13] SSD: Single shot multibox detector

[14] Focal loss for dense object detection

[15] You only look one-level feature

[16] Feature pyramid networks for object detection

[17] A fast and yet efficient YOLOv3 for blood cell detection

[18] Selective search for object recognition

[19] Cell counting algorithm based on YOLOv3 and image density estimation

[20] Learning to count objects in images

[21] MobileNets: Efficient convolutional neural networks for mobile vision applications

[22] Deep residual learning for image recognition

[23] Identity mappings in deep residual networks

[24] MobileNetV2: Inverted residuals and linear bottlenecks. IEEE/CVF Conference on Computer Vision and Pattern Recognition

[25] Rethinking model scaling for convolutional neural networks. 36th International Conference on Machine Learning, ICML 2019

[26] AutoAssign: Differentiable label assignment for dense object detection. ArXiv

[27] Generalized intersection over union: A metric and a loss for bounding box regression

[28] YOLOv3-Lite: A lightweight crack detection network for aircraft structure based on depthwise separable convolutions

[29] Mish: A self regularized non-monotonic activation function

[30] ImageNet classification with deep convolutional neural networks

[31] Searching for activation functions

[32] Deformable DETR: Deformable transformers for end-to-end object detection

[33] Decoupled weight decay regularization

[34] Activate or not: Learning customized activation

[35] Flow cytometry instrumentation - an overview

[36] Deep cytometry: Deep learning with real-time inference in cell sorting and flow cytometry

[37] Multi-scale context aggregation by dilated convolutions

This research was supported in part by Southwest Minzu University for Nationalities Excellent Student Cultivation Project (2021NYYXS78).