key: cord-0741545-jw729ps2
authors: Li, Zhongwen; Jiang, Jiewei; Qiang, Wei; Guo, Liufei; Liu, Xiaotian; Weng, Hongfei; Wu, Shanjun; Zheng, Qinxiang; Chen, Wei
title: Comparison of deep learning systems and cornea specialists in detecting corneal diseases from low-quality images
date: 2021-10-22
journal: iScience
DOI: 10.1016/j.isci.2021.103317
sha: 2a39003791382baa18397e13a8a430bbb5b12780
doc_id: 741545
cord_uid: jw729ps2

The performance of deep learning in disease detection from high-quality clinical images is comparable to, and sometimes greater than, that of human doctors. However, deep learning performs poorly on low-quality images. Whether human doctors also perform poorly on low-quality images is unknown. Here, we compared the performance of deep learning systems with that of cornea specialists in detecting corneal diseases from low-quality slit lamp images. The results showed that the cornea specialists performed better than our previously established deep learning system (PEDLS), which was trained on only high-quality images. The performance of a system trained on both high- and low-quality images was superior to that of the PEDLS but still inferior to that of a senior cornea specialist. This study highlights that cornea specialists perform better on low-quality images than a system trained on high-quality images. Adding low-quality images with sufficient diagnostic certainty to the training set can reduce this performance gap.

Recently, deep learning has attained remarkable performance in disease screening and diagnosis (Cheung et al., 2021; Hosny and Aerts, 2019; Li et al., 2020a, 2020b, 2020c, 2020d; Matheny et al., 2019; Zhou et al., 2021). The performance of deep learning is comparable with, and sometimes even superior to, that of human doctors in many clinical image analyses (Li et al., 2021a, 2021b, 2021c, 2021d; Li et al., 2020a, 2020b, 2020c, 2020d; Li et al., 2019; Ting et al., 2017; Xie et al., 2020; Zhang et al., 2020). For example, the accuracy of a deep learning system in detecting coronavirus pneumonia in computed tomography images reached the level of senior radiologists (87.5% versus 84.5%; p > 0.05) and exceeded the level of junior radiologists (87.5% versus 65.6%; p < 0.05) (Zhang et al., 2020). In discerning corneas with contraindications for refractive surgery from corneal tomographic images, comparable accuracy was observed between a deep learning system and refractive surgeons (95% versus 92.8%; p = 0.72) (Xie et al., 2020). Our previous study also demonstrated that a senior cornea specialist and a deep learning system had similar performance (accuracy: 96.7% versus 97.3%; p = 0.50) in screening for keratitis from slit lamp images (Li et al., 2021a, 2021b, 2021c, 2021d). To facilitate feature extraction, most studies use only high-quality images to establish deep learning systems (Cheung et al., 2021; Esteva et al., 2017; Li et al., 2020a, 2020b, 2020c, 2020d; Li et al., 2021a, 2021b, 2021c, 2021d; Luo et al., 2019; Xie et al., 2020; Zhang et al., 2020).
Although deep learning achieves good performance on high-quality images, its performance is poor on low-quality images, which are inevitable in real clinical scenarios owing to factors such as patient noncompliance, hardware imperfections, and operator errors (Li et al., 2020a, 2020b, 2020c, 2020d; Li et al., 2021a, 2021b, 2021c, 2021d; Trucco et al., 2013). For instance, in screening for lattice degeneration/retinal breaks, glaucomatous optic neuropathy, and retinal exudation/drusen, deep learning systems achieved areas under the receiver operating characteristic curve (AUCs) of 0.990, 0.995, and 0.982, respectively, on high-quality fundus images, but only 0.635, 0.853, and 0.779, respectively, on low-quality fundus images (Li et al., 2020a, 2020b, 2020c, 2020d). To date, whether human doctors also perform poorly on low-quality images has not been well investigated. If human doctors perform better than deep learning on low-quality images, this result exposes a vulnerability of deep learning systems, and further studies are needed to build more robust systems. If their performance is similar to that of deep learning systems on low-quality images, the detection of diseases from low-quality images may be inherently difficult. To explore this issue, our study aimed to compare the performance of a previously established deep learning system (PEDLS) (Li et al., 2021a, 2021b, 2021c, 2021d) with that of cornea specialists on low-quality slit lamp images for classifying keratitis, other corneal abnormalities, and normal cornea. In addition, this study investigated whether the performance of the deep learning system on low-quality images could be improved by training a deep learning network with both high- and low-quality images.

In total, 12,411 high-quality images (keratitis = 5,586; other corneal abnormalities = 2,293; and normal cornea = 4,532) and 1,705 low-quality images (keratitis = 628; other corneal abnormalities = 516; and normal cornea = 561) were used in this study. Representative examples of low- and high-quality images are shown in Figure 1. Detailed information on the development dataset and the external test dataset is described in Table 1.

For the classification of keratitis, other corneal abnormalities, and normal cornea in low-quality images from the external test dataset, the cornea specialist with 3 years of experience achieved accuracies of 82.8% (95% CI, 78.5-87.0), 69.9% (95% CI, 64.7-75.0), and 81.8% (95% CI, 77.4-86.1), respectively, and the senior cornea specialist with 7 years of experience achieved accuracies of 93.7% (95% CI, 91.0-96.4), 93.0% (95% CI, 90.2-95.9), and 94.7% (95% CI, 92.2-97.2), respectively, whereas the PEDLS achieved accuracies of 69.9% (95% CI, 64.7-75.0), 60.6% (95% CI, 55.1-66.1), and 70.9% (95% CI, 65.7-76.0), respectively. The overall performance of the PEDLS was lower than that of the cornea specialists (p < 0.05) (Table 2).
In the internal test dataset, the NDLS achieved AUCs of 0.854 (95% confidence interval [CI], 0.801 to 0.904), 0.872 (95% CI, 0.822 to 0.913), and 0.941 (95% CI, 0.908 to 0.967), respectively, for classifying keratitis, other corneal abnormalities, and normal cornea from low-quality images, and AUCs of 0.997 (95% CI, 0.996 to 0.999), 0.993 (95% CI, 0.990 to 0.997), and 1.000 (95% CI, 1.000 to 1.000), respectively, for the same three categories from high-quality images (Figure 2). In the external test dataset, the AUCs of the NDLS were 0.860 (95% CI, 0.809 to 0.908), 0.886 (95% CI, 0.838 to 0.927), and 0.894 (95% CI, 0.856 to 0.926), respectively, for the three categories in low-quality images (Figure 3), and 0.988 (95% CI, 0.986 to 0.991), 0.976 (95% CI, 0.972 to 0.979), and 0.990 (95% CI, 0.988 to 0.992), respectively, in high-quality images (Figure S1). Further information, including the accuracies, sensitivities, and specificities of the NDLS in the internal and external test datasets, is displayed in Table 3.

In low-quality images from the external test dataset, the overall performance of the NDLS was greater than that of the PEDLS for detecting keratitis, other corneal abnormalities, and normal cornea (p < 0.05) (Table 2). The corresponding ROC curves and confusion matrices of the NDLS and PEDLS are also shown. The accuracies of the NDLS in detecting keratitis and normal cornea were comparable to those of the cornea specialist with 3 years of experience (p > 0.05), whereas the accuracy of the NDLS in detecting other corneal abnormalities was greater than that of this specialist (p < 0.05) in low-quality images from the external test dataset (Table 2). The overall performance of the NDLS was lower than that of the cornea specialist with 7 years of experience (p < 0.05) for classifying keratitis, other corneal abnormalities, and normal cornea in low-quality images from the external test dataset (Table 2).

In high-quality images from the external test dataset, the overall performance of the NDLS was lower than that of the PEDLS for detecting keratitis, other corneal abnormalities, and normal cornea (p < 0.05) (Table S1). The corresponding ROC curves and confusion matrices of the NDLS and PEDLS are presented in Figure S1. The t-SNE technique showed that the features of each category learned by the NDLS from high-quality images were less separable than those of the PEDLS (Figure S2B).

In this study, we compared the performance of a deep learning system with that of cornea specialists for classifying keratitis, other corneal abnormalities, and normal cornea in low-quality images. We found that the performance of the cornea specialists on low-quality images greatly exceeded that of the PEDLS trained on high-quality images. This result indicates that the PEDLS is not as robust as the cornea specialists in detecting abnormal corneal findings from low-quality slit lamp images. One reason may be that a deep learning system perceives the noise in low-quality images as part of the object and its texture, whereas human experts often treat the noise as a layer in front of the image (Geirhos et al., 2017).
In addition, human experts, through experience and evolution, have been exposed to degraded visual input and thus have an advantage over the PEDLS (Geirhos et al., 2017).

To improve the resilience of a deep learning system in real-world settings, we trained the NDLS using both high- and low-quality images. The NDLS achieved higher accuracies than the PEDLS in identifying keratitis (86.1% versus 69.9%), other corneal abnormalities (85.4% versus 60.6%), and normal cornea (82.8% versus 70.9%) from low-quality images. This demonstrates that the performance of a deep learning network on low-quality images can be improved if low-quality images with sufficient diagnostic certainty are added to the training set. In addition, heatmaps were generated to interpret the decision-making rationales of the PEDLS and NDLS on low-quality images. As shown in Figure 4, the heatmaps of the NDLS are more interpretable than those of the PEDLS, which further substantiates the effectiveness of the NDLS.

We also compared the performance of the NDLS with that of the cornea specialists in identifying keratitis, other corneal abnormalities, and normal cornea from low-quality images. The results showed that the performance of the NDLS on low-quality images was comparable with that of the cornea specialist with 3 years of clinical experience. However, the performance of the cornea specialist with 7 years of clinical experience still exceeded that of the NDLS on low-quality images. A possible explanation is that the senior cornea specialist has more prior experience in analyzing low-quality images and therefore performs better. Increasing the number of low-quality images in the training set could potentially reduce the performance gap between the deep learning system and the senior cornea specialist on low-quality images.

Although the NDLS performed better than the PEDLS on low-quality images, its performance on high-quality images was slightly lower than that of the PEDLS. This suggests that adding low-quality images to the training set introduces some noise, which negatively affects the system when screening for abnormal corneal findings in high-quality images. Further research is required to find an approach that increases the performance of a deep learning system on low-quality images without decreasing its performance on high-quality images.

In summary, this study shows that cornea specialists achieve higher accuracies than the PEDLS in detecting keratitis, other corneal abnormalities, and normal cornea from low-quality slit lamp images. The performance of the new system (NDLS) on low-quality images is improved by adding low-quality images with sufficient diagnostic certainty to the training set; however, its performance is still below that of the senior cornea specialist. Further studies are needed to close this performance gap and to develop a robust deep learning system that performs well on both high- and low-quality images. A potential limitation of this study is that we only confirmed that the cornea specialists had greater performance than the deep learning system on low-quality slit lamp images; whether this phenomenon also appears in other types of clinical images was not investigated and is left for future work.

The authors report no declarations of interest.
PEDLS vs. corneal specialists in low-quality slit lamp images

Two cornea specialists with 3 and 7 years of clinical experience were recruited to investigate their performance on low-quality slit lamp images in classifying keratitis, other corneal abnormalities, and normal cornea. The cornea specialists were not given any clinical information related to these images. The low-quality images from the external test datasets were used to compare the performance of the PEDLS with that of the corneal specialists.

Developing a deep learning system with images of both high and low quality

Both high- and low-quality images with clear diagnoses from NEH were used to develop a new deep learning system (NDLS) for the classification of keratitis, other corneal abnormalities, and normal cornea. The images were randomly divided into training (70%), validation (15%), and internal test (15%) datasets at the subject level, with no overlap allowed between these datasets. In the image preprocessing phase, the pixel values of the slit lamp images were normalized to the range 0-1, and the images were resampled to a resolution of 224 × 224 pixels. Data augmentation was applied to increase the heterogeneity of the training dataset, avoiding overfitting and bias during the training process (Bloice et al., 2019). The training dataset was increased to 6-fold of its original size (from 5,505 to 33,030 images) using horizontal and vertical flips, random cropping, and random rotations around the image center.

The NDLS was trained using the DenseNet121 architecture, which exhibited the optimal performance in detecting keratitis, other corneal abnormalities, and normal cornea in our previous study (Li et al., 2021a, 2021b, 2021c, 2021d). DenseNet121 has 8.1 × 10^6 trainable parameters and contains 121 layers in which all preceding layers are densely connected to subsequent layers, strengthening feature propagation and alleviating the vanishing-gradient problem (Huang et al., 2017). Weights pre-trained on the ImageNet database of 1.4 million images were used to initialize the DenseNet121 architecture (Russakovsky et al., 2015). Transfer learning was performed because it can improve the accuracy of image-based deep learning (Kermany et al., 2018). In this study, we set the learning rate of the parameters of the Softmax classification layer to 10 times that of the other layers' parameters. This ensured that the parameters of the Softmax classification layer were fully trained while the parameters of the other layers were only fine-tuned on slit lamp images.

The NDLS was built in Python using PyTorch (version 1.6.0, https://pytorch.org/docs/stable/) as a backend. The adaptive moment estimation (Adam) optimizer was used for training, with the hyperparameters set as follows: learning rate = 0.001, β1 = 0.9, β2 = 0.999, and weight decay = 1 × 10^-4. During the training process, the cross-entropy loss and accuracy were calculated on the validation dataset after each epoch to monitor performance. After 80 epochs, training was stopped because of the absence of further improvement in both cross-entropy loss and accuracy. The model with the lowest loss on the validation dataset was saved as the optimal model.
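As a concrete illustration of the preprocessing and augmentation steps described above, the following is a minimal sketch using torchvision transforms. The paper performed augmentation with the Augmentor library (Bloice et al., 2019); the torchvision equivalents shown here, the rotation and crop ranges, the batch size, and the folder layout are all assumptions made for illustration.

```python
# Sketch of the preprocessing/augmentation pipeline (assumptions noted above).
import torch
from torchvision import datasets, transforms

# Training images: random flips, rotations around the image center, and
# random crops, then resizing to 224 x 224. ToTensor scales pixels to [0, 1].
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=30),                # rotation range assumed
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # crop scale assumed
    transforms.ToTensor(),
])

# Validation/test images: deterministic resize only, no augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical folder layout with one subdirectory per class
# (keratitis / other_abnormalities / normal_cornea).
train_set = datasets.ImageFolder("data/train", transform=train_transform)
val_set = datasets.ImageFolder("data/val", transform=eval_transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32,
                                         shuffle=False, num_workers=4)
```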
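The transfer-learning setup and training schedule described above can be sketched as follows: DenseNet121 initialized with ImageNet weights, a new three-class classification layer trained at 10 times the base learning rate, and Adam with the stated hyperparameters, keeping the weights with the lowest validation loss. The loaders come from the previous sketch; the text does not state whether the 0.001 learning rate applies to the backbone or the classification layer, so the split below is an assumption.

```python
# Sketch of DenseNet121 transfer learning with a 10x learning rate on the
# classification layer (assumptions noted above).
import copy
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.densenet121(pretrained=True)                   # ImageNet weights
model.classifier = nn.Linear(model.classifier.in_features, 3)  # 3 classes
model = model.to(device)

# Backbone fine-tuned at the base rate; new classification layer at 10x.
base_lr = 0.001
optimizer = torch.optim.Adam(
    [{"params": model.features.parameters(), "lr": base_lr},
     {"params": model.classifier.parameters(), "lr": base_lr * 10}],
    betas=(0.9, 0.999), weight_decay=1e-4)

criterion = nn.CrossEntropyLoss()
best_loss, best_weights = float("inf"), None

for epoch in range(80):  # training was stopped after 80 epochs
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Monitor cross-entropy loss on the validation set after each epoch and
    # keep the weights with the lowest validation loss as the final model.
    model.eval()
    val_loss, n = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            val_loss += criterion(model(images), labels).item() * images.size(0)
            n += images.size(0)
    val_loss /= n
    if val_loss < best_loss:
        best_loss, best_weights = val_loss, copy.deepcopy(model.state_dict())

model.load_state_dict(best_weights)
```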
The images of the external test dataset used to evaluate the performance of the deep learning model in this study were assembled from Zhejiang Eye Hospital (ZEH), Jiangdong Eye Hospital (JEH), and Ningbo Ophthalmic Center (NOC). On an NVIDIA RTX 2080Ti GPU, the whole training process of DenseNet121 took 2.30 hours, and the model needed an average of 0.18 seconds to test each image. The process of the development and assessment of the NDLS is displayed in Figure S3.

Gradient-weighted Class Activation Mapping (Grad-CAM) was utilized to create "visual explanations" for the decisions of the deep learning system by superimposing a visualization layer at the end of the CNN model (Selvaraju et al., 2017). This technique uses the gradients of a target concept, flowing into the last convolutional layer, to create a localization map that highlights the regions in the image that are crucial for predicting the concept. Redder regions denote features that contributed more strongly to the system's classification.

The low-quality images from the external test dataset were used to compare the performance of the NDLS with that of the PEDLS, to investigate whether training a deep learning network with both high- and low-quality images could improve its performance on low-quality images for classifying keratitis, other corneal abnormalities, and normal cornea. The t-distributed stochastic neighbor embedding (t-SNE) technique was employed to show the embedding features of each category learned by the deep learning system in a two-dimensional space (van der Maaten and Hinton, 2008). A performance comparison was also conducted between the NDLS and the corneal specialists on low-quality images. The high-quality images from the external test dataset were used to compare the performance of the NDLS with that of the PEDLS, to investigate whether training with both high- and low-quality images would improve or degrade its performance on high-quality images for the classification of keratitis, other corneal abnormalities, and normal cornea.

The one-versus-rest strategy was employed to evaluate the performance of the deep learning systems and cornea specialists. Receiver operating characteristic (ROC) curves were plotted using the matplotlib (version 3.3.1, https://matplotlib.org/3.3.1/) and scikit-learn (version 0.23.2, https://scikit-learn.org/stable/whats_new/v0.23) packages. The 2-sided 95% confidence intervals (CIs) were Wilson score intervals for sensitivity, specificity, and accuracy, and DeLong intervals for AUC. Proportion comparisons were conducted using the McNemar test. Statistical analyses were performed using Python 3.7.8 (https://www.python.org/downloads/release/python-378/, Wilmington, Delaware, USA). All statistical tests were 2-sided, and findings were considered statistically significant at p < 0.05. This study did not generate additional data.
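A minimal Grad-CAM sketch in the spirit of the description above, implemented with a forward hook and a tensor gradient hook on the DenseNet121 model from the earlier sketches. The choice of the final dense block as the target layer, and all names, are assumptions; the technique itself follows Selvaraju et al. (2017).

```python
# Sketch of Grad-CAM for the DenseNet121 model above (assumptions noted).
import torch
import torch.nn.functional as F

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    # Cache the feature maps and register a tensor hook so that their
    # gradients are cached when backward() runs.
    activations["value"] = output
    output.register_hook(lambda grad: gradients.update(value=grad))

# Target layer is an assumption: the last dense block of DenseNet121.
model.features.denseblock4.register_forward_hook(save_activation)

def grad_cam(image, class_idx=None):
    """Return a heatmap in [0, 1] for a single 1 x 3 x 224 x 224 image."""
    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    # Global-average-pool the gradients into per-channel weights, take a
    # weighted sum of the activation maps, and keep positive evidence only.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    cam = cam.squeeze().detach().cpu().numpy()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```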
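The t-SNE visualization of the embedding features can be sketched as below, here taking globally pooled DenseNet121 backbone features; the feature-extraction point, the perplexity, and the loader are assumptions made for illustration.

```python
# Sketch of t-SNE over backbone features for the three classes (assumptions noted).
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.manifold import TSNE

feats, labs = [], []
model.eval()
with torch.no_grad():
    for images, labels in val_loader:
        fmap = model.features(images.to(device))            # B x 1024 x 7 x 7
        pooled = F.adaptive_avg_pool2d(fmap, 1).flatten(1)  # B x 1024
        feats.append(pooled.cpu().numpy())
        labs.append(labels.numpy())

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(np.concatenate(feats))
labs = np.concatenate(labs)

for cls, name in enumerate(["keratitis", "other abnormalities", "normal cornea"]):
    pts = embedded[labs == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of learned features")
plt.show()
```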
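The interval estimates and paired proportion comparisons described above can be reproduced roughly as below. The use of statsmodels is an assumption (the paper names only Python, matplotlib, and scikit-learn); the Wilson score intervals and the McNemar test themselves follow the text.

```python
# Sketch of Wilson score intervals and the McNemar test (assumptions noted).
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar

def accuracy_with_ci(correct, total, alpha=0.05):
    """Point estimate plus 2-sided Wilson score interval."""
    lo, hi = proportion_confint(correct, total, alpha=alpha, method="wilson")
    return correct / total, lo, hi

def compare_readers(y_true, pred_a, pred_b):
    """McNemar test on the paired correctness of two classifiers/readers."""
    a_correct = np.asarray(pred_a) == np.asarray(y_true)
    b_correct = np.asarray(pred_b) == np.asarray(y_true)
    # 2 x 2 table of (A correct?, B correct?) counts; the test uses the
    # discordant off-diagonal cells.
    table = [[np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
             [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
    return mcnemar(table, exact=True).pvalue
```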
References

Biomedical image augmentation using Augmentor
A deep learning system for the assessment of cardiovascular disease risk via the measurement of retinal-vessel calibre
Dermatologist-level classification of skin cancer with deep neural networks
Comparing deep neural networks against humans: object recognition when the signal gets weaker
Artificial intelligence for global health
Densely connected convolutional networks
Identifying medical diagnoses and treatable diseases by image-based deep learning
Deep learning for automated glaucomatous optic neuropathy detection from ultra-widefield fundus images
Deep learning for detecting retinal detachment and discerning macular status using ultra-widefield fundus images
Development and evaluation of a deep learning system for screening retinal hemorrhage based on ultra-widefield fundus images
A deep learning system for identifying lattice degeneration and retinal breaks using ultra-widefield fundus images
Deep learning from "passive feeding" to "selective eating" of real-world data
Automated detection of retinal exudates and drusen in ultra-widefield fundus images based on deep learning
Preventing corneal blindness caused by keratitis using artificial intelligence
Development of a deep learning-based image quality control system to detect and filter out ineligible slit-lamp images: a multicenter study
Development of a deep learning-based image eligibility verification system for detecting and filtering out ineligible fundus images: a multicentre study
A deep learning system for differential diagnosis of skin diseases
Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study
Artificial intelligence in health care: a report from the National Academy of Medicine
ImageNet large scale visual recognition challenge
Grad-CAM: visual explanations from deep networks via gradient-based localization
Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes
Validating retinal fundus image analysis algorithms: issues and a proposal
Visualizing data using t-SNE
Screening candidates for refractive surgery with corneal tomographic-based deep learning
Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography
Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images

This study was approved by the Ningbo Eye Hospital (NEH) Ethics Review Committee (protocol number 2020qtky-017) and conducted following the Declaration of Helsinki. Because deidentified images were used, the review committee determined that patient consent was not required for this study. The technical and clinical details and the performance of the PEDLS for classifying keratitis, other corneal abnormalities, and normal cornea have been described previously (Li et al., 2021a, 2021b, 2021c, 2021d). Notably, the PEDLS was developed and evaluated only on high-quality slit lamp images, and a total of 302 low-quality slit lamp images from the external test datasets were excluded (Li et al., 2021a, 2021b, 2021c, 2021d). Image quality was considered "poor" if the cornea was blurred and/or distorted.