key: cord-0791259-dcqb6hmv
authors: Yap, Moi Hoon; Hachiuma, Ryo; Alavi, Azadeh; Brungel, Raphael; Cassidy, Bill; Goyal, Manu; Zhu, Hongtao; Ruckert, Johannes; Olshansky, Moshe; Huang, Xiao; Saito, Hideo; Hassanpour, Saeed; Friedrich, Christoph M.; Ascher, David; Song, Anping; Kajita, Hiroki; Gillespie, David; Reeves, Neil D.; Pappachan, Joseph; O'Shea, Claire; Frank, Eibe
title: Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation
date: 2020-10-07
journal: Computers in Biology and Medicine
DOI: 10.1016/j.compbiomed.2021.104596
sha: 6e2632dd3b577f436994e01f8fef98f4b3707e17
doc_id: 791259
cord_uid: dcqb6hmv

There has been a substantial amount of research involving computer methods and technology for the detection and recognition of diabetic foot ulcers (DFUs), but there is a lack of systematic comparisons of state-of-the-art deep learning object detection frameworks applied to this problem. DFUC2020 provided participants with a comprehensive dataset consisting of 2,000 images for training and 2,000 images for testing. This paper summarises the results of DFUC2020 by comparing the deep learning-based algorithms proposed by the winning teams: Faster R-CNN, three variants of Faster R-CNN and an ensemble method; YOLOv3; YOLOv5; EfficientDet; and a new Cascade Attention Network. For each deep learning method, we provide a detailed description of model architecture, parameter settings for training and additional stages including pre-processing, data augmentation and post-processing. We provide a comprehensive evaluation for each method. All the methods required a data augmentation stage to increase the number of images available for training and a post-processing stage to remove false positives. The best performance was obtained from Deformable Convolution, a variant of Faster R-CNN, with a mean average precision (mAP) of 0.6940 and an F1-Score of 0.7434. Finally, we demonstrate that the ensemble method based on different deep learning methods can enhance the F1-Score but not the mAP.

According to the International Diabetes Federation Saeedi et al. (2019), in 2019 there were approximately 463 million adults with diabetes worldwide. This number is expected to grow to 700 million by 2045. A person with diabetes has a 34% lifetime risk of developing a diabetic foot ulcer (DFU). In other words, 1 in every 3 people with diabetes will develop a DFU in their lifetime Armstrong et al. (2017). Infection of a DFU ... Technologies developed to enhance ulcer diagnostics and care plans have the potential to revolutionise diabetic foot care. Detection tasks can be challenging when taking into account the numerous environmental elements in real-world settings.
Examples of some observations include:
• Newly acquired and subtle early stages of ulceration can easily be missed by care personnel during visual assessment of previously documented conditions, due to the short time designated for standard treatment.
• Low-quality images with bad focus, motion blur, occlusion, poor lighting, and backlight are common in wound documentation due to the limited time available for treatment and documentation, even when performed by trained personnel.
• Malformed toenails, deep rhagades, folded amputation scars, and fresh epithelialization are examples of false positive detections that require manual correction, which can be time-consuming when documenting DFU.
• Very small, very large, and curved ulcers are problematic for certain detectors, but are common in typical wound care documentation.
It is essential to develop technological solutions capable of transforming current screening practices and significantly reducing clinical time burdens. With the emerging growth of deep learning, automated analysis of DFU has become possible. However, deep learning requires large-scale datasets to achieve results comparable with those of human experts. Currently, medical imaging researchers are working in isolation and the majority of their research is not reproducible. To bridge the gap and to motivate data sharing amongst researchers and clinicians, Yap et al. (2020c,b) proposed the diabetic foot ulcer challenges. This paper presents an overview of the state-of-the-art computer methods in DFU detection, provides an overview of the publicly available datasets, presents a comprehensive evaluation of the popular object detection frameworks on DFU detection, proposes an ensemble method and Cascade Attention DetNet for DFU detection, and conducts a comprehensive evaluation of the deep learning algorithms trained on the DFUC2020 dataset.
The growing number of reported cases of diabetes has resulted in a corresponding growth in research interest in DFU. Early attempts at training deep learning models in this domain have shown promising results. Previous research Goyal et al. (2020a, 2017, 2019b) trained models capable of classification, localisation and segmentation. These models reported high levels of mean average precision (mAP), sensitivity and specificity in experimental settings. The existing localisation method was trained using Faster R-CNN with Inception v2 and two-tier transfer learning from the Microsoft Common Objects in Context (MS COCO) dataset. However, despite the high scoring performance measures, these models were trained and evaluated on small datasets (<2,000 images), therefore the results cannot be regarded as conclusive evidence of their efficacy in real-world settings. Brown et al. (2017) created the MyFootCare mobile app, which was designed to encourage patient self-monitoring using diaries, goals and notifications. The app stores a log of patient foot images and is capable of semi-automated segmentation. This novel solution to maintaining foot records utilises a method of automatic photograph capture where the phone is placed on the floor and the patient is guided using voice feedback. However, this particular function of the system was not tested during the actual experiment, so it is not known how well it performed in real-world settings. Wang et al.
(2015, 2017b) devised a method of consistent DFU image capture using a box with a glass surface containing mirrors which reflect the image back to a camera or mobile device. Cascaded two-stage support vector classification was used to ascertain the DFU region, followed by a two-stage super-pixel classification technique used for segmentation and feature extraction. Despite being highly novel, this method exhibited a number of limitations, such as the risk of infection due to physical contact between the wound and the capture box. The design of the capture box also limited monitoring to DFU present on the plantar surface of the foot. The sample size was also very small, with only 35 images from real patients and 30 images of wound moulds.
The DFU datasets provided by Manchester Metropolitan University and Lancashire Teaching Hospitals NHS Trust Goyal et al. (2020a,b); Cassidy et al. (2020) are digital DFU image datasets with expert annotations. The aim of the publication of this data is to encourage more researchers to work in this domain and to conduct reproducible experiments. There are three types of datasets made publicly available for researchers. The first dataset consists of foot skin patches for wound classification Goyal et al. (2020a); the second dataset contains regions of interest for infection and ischaemia classification Goyal et al. (2020b); and the third is the most recently published dataset for DFU detection Cassidy et al. (2020). The third dataset is the largest dataset to date, and increased usage of this data is the driving force for the organisers of the DFU challenges. The researchers involved in organising the yearly DFU challenges Yap et al. (2020c,b), in conjunction with the MICCAI conferences, aim to attract wider participation to improve the diagnosis/monitoring of foot ulcers and to raise awareness of diabetes and DFU.
There are numerous aspects to take into account in the development of accurate detection algorithms. As is the case with other medical imaging research fields, increasing the number of images is only one of them. The Diabetic Foot Ulcers Grand Challenge (DFUC2020) dataset consists of 2,000 training images, 200 validation images and 2,000 testing images Cassidy et al. (2020); Goyal et al. (2019b). The data consists of 2,496 ulcers in the training set and 2,097 ulcers in the testing set. In an attempt to promote model robustness, some of the images in the testing set do not exhibit DFUs. The details of the dataset are described in Cassidy et al. (2020). To improve the performance of the deep learning methods and to reduce computational costs, all images were resized to 640 × 480 pixels. Since the release of the DFUC2020 training dataset on 27th April 2020, we received requests from 39 international institutions, as shown in Fig. 1. There were a total of 31 submissions to the challenge from 11 teams. We report the top scores from each team and discuss their methods according to the object detection approaches they implemented.
This section presents a comprehensive description of the DFU detection methods used, grouped according to the popular deep learning object detection algorithms they apply, i.e. Faster R-CNN, YOLOv3, YOLOv5 and EfficientDet. We also include descriptions of an ensemble method and a new Cascade Attention DetNet (CA-DetNet). Faster R-CNN Ren et al.
(2017) is one of the two-stage object detection models; it generates a sparse set of candidate object locations using a Region Proposal Network (RPN) based on shared feature maps, and then classifies each candidate proposal as the foreground or background class. After extracting shared feature maps with a CNN, the first-stage RPN takes the shared feature maps as input and generates a set of bounding box candidate object locations, each with an "objectness" score. The size of each anchor is configured using hyperparameters. The proposals are then used in the region of interest pooling layer (RoI pooling) to generate subfeature maps. The subfeature maps are converted to 4,096-dimensional vectors and fed forward into fully connected layers. These layers are then used as a regression network to predict bounding box offsets, with a classification network used to predict the class label of each bounding box proposal. The RoI pooling layer quantizes a floating-number RoI to the discrete granularity of the feature map. This quantization introduces misalignments between the RoI and the extracted features. Therefore, the model evaluated in this paper employs a RoIAlign layer, which was introduced in Mask R-CNN He et al. (2017), instead of the RoI pooling layer. This removes the harsh quantization of the RoI pooling layer, properly aligning the extracted features with the input. Additionally, the Feature Pyramid Network (FPN) Lin et al. (2017) is employed as the backbone of the network. FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, with the remainder of the approach being similar to ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. Specifically, we employ ResNeXt101 Xie et al. (2017) with the FPN feature extraction backbone to extract the features.
In this challenge, the images in the dataset were captured from different viewpoint angles, with cameras of different focal lengths and varying levels of blur. Also, the training dataset contains only 2,000 images, which could be considered small for training deep learning models. Therefore, we employ various data augmentation techniques for robust prediction. Specifically, we employ the following augmentations (a code sketch of this pipeline is given after the list):
• HSV and RGB: As the lighting conditions vary between dataset images, we apply random RGB and HSV shifts to the images. Specifically, we randomly add/subtract between 0 and 10 from the RGB values and between 0 and 20 from the HSV values.
• Blurring: As the dataset contains images captured at different focal lengths, some images are blurred and contain camera noise. Therefore, we apply Gaussian and median blur filters with the filter size set to 3. The filters are applied with a probability of 0.1.
• Affine transformation: As the images are captured from different camera angles, we apply random affine transformations. Specifically, we apply random shift, scaling (0.1) and rotation (90 degrees).
• Brightness: As the images are captured in various environments, we employ brightness and contrast data augmentation. More specifically, we randomly change the brightness and contrast on a scale from 0.1 to 0.3, with the probability set to 0.2.
In this paper, we fine-tune a model pretrained on MS COCO Lin et al. (2014).
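The following is a minimal sketch of the augmentation pipeline described in the list above, written with the albumentations library; the library choice, the exact argument names and the application probabilities of the colour and affine transforms are our assumptions, as the text only specifies the value ranges and the blur and brightness probabilities.

```python
import albumentations as A

# Sketch of the augmentations listed above (RGB/HSV shift, blur, affine
# transform, brightness/contrast), keeping bounding boxes in sync with the image.
train_transform = A.Compose(
    [
        # Random RGB shift of up to +/-10 and HSV shift of up to +/-20.
        A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10, p=0.5),
        A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=20,
                             val_shift_limit=20, p=0.5),
        # Gaussian or median blur with kernel size 3, applied with probability 0.1.
        A.OneOf([A.GaussianBlur(blur_limit=3), A.MedianBlur(blur_limit=3)], p=0.1),
        # Random shift (library default), scaling (0.1) and rotation (90 degrees).
        A.ShiftScaleRotate(scale_limit=0.1, rotate_limit=90, p=0.5),
        # Brightness/contrast jitter in the 0.1-0.3 range, with probability 0.2.
        A.RandomBrightnessContrast(brightness_limit=(0.1, 0.3),
                                   contrast_limit=(0.1, 0.3), p=0.2),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Usage: augmented = train_transform(image=image, bboxes=boxes, labels=labels)
```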
We employ the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and weight decay set to 0.0001. During training, we employ a warm-up learning rate scheduling strategy, using lower learning rates in the early stages of training to overcome optimization difficulties. More specifically, we linearly increase the learning rate to 0.01 in the first 500 iterations, then multiply it by 0.1 at epochs 6, 12 and 30. We implemented the methods based on the mmdetection repository.
Several papers have proposed variants of Faster R-CNN. In this paper, we implement Faster R-CNN and three variants of Faster R-CNN, and ensemble the results. The three variants of Faster R-CNN are as follows:
• Cascade R-CNN Cai and Vasconcelos (2021): this variant implements a different architecture for the RoI head (the module that predicts the bounding boxes and the category label). Cascade R-CNN builds a cascade head on top of Faster R-CNN Ren et al. (2017) to refine detection progressively. Since the proposal boxes are refined by multiple box regression heads, Cascade R-CNN is optimal for more precise localization of objects.
• Deformable Convolution Zhu et al. (2019): in this variant, the basic architecture of the network is the same as Faster R-CNN. However, we replace the convolution layer with a deformable convolution layer Zhu et al. (2018) at the second, third and fourth ResNeXt blocks of the feature extractor. The deformable convolution adds 2D offsets to the regular grid sampling locations in the standard convolution, enabling free-form deformation of the sampling grid. The offsets are learned from the feature maps via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense and adaptive manner.
• Prime Sample Attention (PISA) Cao et al. (2020): PISA is motivated by two considerations: samples should not be treated as independent and equally important, and classification and localization are correlated. Thus, it employs a ranking strategy that places the positive samples with the highest IoUs around each object, and the negative samples with the highest scores in each cluster, at the top of the ranked list. This directs the focus of the training process via a simple re-weighting scheme. It also employs a classification-aware regression loss to jointly optimize the classification and regression branches.
At test time, we employ a test-time augmentation scheme: we augment the test image by applying two resolutions, and we also flip the image. As a result, we augment a single image to four images and merge the predictions obtained for the four images. We employ soft NMS (non-maximum suppression) Bodla et al. (2017) with a confidence threshold of 0.5 as the post-processing of predicted bounding boxes.
Combining predictions from different models can improve generalization and usually yields more accurate results compared to a single model. During the post-processing stage for the Faster R-CNNs, we employ soft NMS Bodla et al. (2017) to select the predicted bounding boxes for each method. Such methods work well on a single model, but they only select boxes and cannot effectively produce averaged localizations from predictions combined from various models. Therefore, after predicting the bounding boxes with each method, we ensemble these predicted bounding boxes using Weighted Boxes Fusion Solovyev et al. (2021).
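As an illustration of this fusion step, a minimal sketch using the ensemble-boxes package that accompanies Solovyev et al. (2021) is given below; the wrapper function, the coordinate normalisation convention and the two thresholds are our assumptions, while the equal weights follow the setting described in this work.

```python
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes

def fuse_predictions(per_model_boxes, per_model_scores, per_model_labels):
    """Fuse per-model predictions for one image with Weighted Boxes Fusion.

    per_model_boxes[i] is a list of [x_min, y_min, x_max, y_max] boxes from the
    i-th model (e.g. Faster R-CNN, Cascade R-CNN, Deformable Convolution, PISA),
    with coordinates normalised to [0, 1].
    """
    boxes, scores, labels = weighted_boxes_fusion(
        per_model_boxes,
        per_model_scores,
        per_model_labels,
        weights=[1] * len(per_model_boxes),  # equal weights, as used in this work
        iou_thr=0.5,        # assumed matching threshold
        skip_box_thr=0.05,  # assumed minimum confidence for a box to be considered
    )
    return boxes, scores, labels
```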
Unlike NMS-based methods, which simply exclude part of the predicted bounding boxes, the Weighted Boxes Fusion algorithm uses the confidence scores of all proposed bounding boxes to form the averaged boxes. The reader is referred to Solovyev et al. (2021) for further details of the algorithm. We ensemble four models (pure Faster R-CNN, Cascade R-CNN, Faster R-CNN with Deformable Convolution and Faster R-CNN with Prime Sample Attention). We set equal weights when fusing the predicted bounding boxes of each model.
You-Only-Look-Once (YOLO) Redmon et al. (2016) is a unified, real-time object detection algorithm that reformulates the object detection task as a single regression problem. YOLO employs a single neural network architecture to predict bounding boxes and class probabilities directly from full images. Hence, when compared to Faster R-CNN Ren et al. (2017), YOLO provides faster detection. Over time, improvements of YOLO were implemented and released as distinct and independent software packages by the originators Redmon et al. (2016); Redmon and Farhadi (2017, 2018). As a result of increased publicity and popularity, a model zoo containing further YOLO adaptations emerged. Subsequently, further maintainers continued to improve the DarkNet-based versions, and Bochkovskiy et al. (2020) created ports for other machine learning libraries such as PyTorch Paszke et al. (2019). In this paper, two approaches are selected for DFU detection using the DFUC2020 dataset: YOLOv3 and YOLOv5. We discuss the networks and present descriptions of our implementation in the following subsections.
YOLOv3 Redmon and Farhadi (2018) was developed as an improved version of YOLOv2 Redmon and Farhadi (2017). It employs a multi-scale scheme, predicting bounding boxes at different scales. This allows YOLOv3 to be more effective for detecting smaller targets when compared to YOLOv2. YOLOv3 uses dimension clusters as anchor boxes in order to predict bounding boxes around the desired objects in given images. Logistic regression is used to predict the objectness score for a given bounding box. Specifically, as illustrated in Fig. 2, the network predicts four coordinates (t_x, t_y, t_w, t_h) for each bounding box, which are converted to the final box as in Equation 1:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w exp(t_w),  b_h = p_h exp(t_h),   (1)
where (c_x, c_y) are offsets from the top left corner of the image, and (p_w, p_h) are the bounding box prior width and height. The k-means clustering algorithm is used to determine bounding box priors, while the sum of squared errors is used for training the network. Let t̂* be the ground truth for some coordinate prediction and t* the corresponding network prediction during training. The gradient is then t̂* − t*, which can be computed easily by inverting Equation 1.
The backbone of YOLOv3 is a hybrid model called DarkNet-53 (as shown in Table 1), which is used for feature extraction. As the name indicates, DarkNet-53 is made of 53 convolutional layers that also take advantage of shortcut connections. As the detection algorithm is required to detect only one type of object, the complexity of the problem is reduced from multi-class detection to single-object detection. Hence, for the purpose of detecting diabetic foot ulcers, we have employed a simplified version of YOLOv3. We employ transfer learning by using the pre-trained DarkNet weights provided by Redmon and Farhadi (2018). We then train our detector in two steps, using the following settings: Adam optimizer with a learning rate of 1e-3, 100 epochs, a batch size of 32, and 20% of the data used for validation.
First, we start by freezing the top DarkNet-53 layers and train the algorithm with the above settings. Then, we retrain the entire network to improve performance. Similar to the original YOLOv3, our trained network extracts features at three different pre-defined scales, a concept similar to feature pyramid networks Lin et al. (2017). We then use the trained network for detecting diabetic foot ulcers in blind test images.
Post-processing: As observed in Fig. 3, in rare cases the resulting algorithm may produce double detections or false positives. To reduce such examples, we include a post-processing stage consisting of two steps. First, we identify double detections by flagging detected bounding boxes with more than 80% overlap; among the overlapping boxes, we keep only the box with the highest confidence. Finally, we further post-process the results by removing any detection with a confidence score < 0.3, with the aim of reducing the rate of false positive detections.
YOLOv5 was first published on GitHub in May 2020 as v1.0 Jocher et al. (2020a). The maintainer is already well known for a YOLOv3 Redmon and Farhadi (2018) port for PyTorch Jocher et al. (2020b). The maintainer named the network YOLOv5 to avoid naming conflicts due to the prior release of YOLOv4 Bochkovskiy et al. (2020). However, YOLOv5 is not to be confused with a descendant of the original DarkNet-based YOLO series. A scientific paper reporting on the improvements in YOLOv5 has not yet been published, but is currently pending. YOLOv5 is currently under active development, with the latest version being v5.0 Jocher et al. (2021a) at the time of writing. New features and improvements in YOLOv5 are mainly focused on the incorporation of the state-of-the-art for deep learning networks, such as activation functions and data augmentation. These were partly adopted from YOLOv4, such as the CSPNet backbone Wang et al. (2020), and partly had their origin in prior YOLOv4 contributions by the YOLOv5 maintainer. One of the most notable data augmentation aspects is the mosaic loader, in which four images are altered and combined to form a new image. This allows detection of objects outside of their normal context and at smaller sizes, which reduces the need for large mini-batch sizes. YOLOv5 reports high inference speed and small model sizes, allowing a convenient translation to mobile use cases via model export. The approach to DFU detection via YOLOv5 described in the following is based on the early version v1.0 Jocher et al. (2020a), commit a1c8406 from 14 July 2020, which still exhibited several issues.
Initially, image data of the training dataset was analyzed via AntiDupl in version 2.3.10 to identify duplicate images, yielding a set of 39 pair findings. A spatial analysis of duplicate pair annotation data was performed, utilizing the R language R Core Team (2020) in version 4.0.1 and the Simple Features for R (sf) package Pebesma (2018) in version 0.9-2. Originally, none of the duplicate pair images showed bounding box intersections by themselves. After joining duplicate pair annotations, several intersections were detected, with a maximum of two involved bounding boxes. These represented different annotations of the same wound in two duplicate images, now joined in one image. To resolve these, each intersecting pair of bounding boxes BBox_1 and BBox_2 was merged into a single bounding box BBox using their outer boundaries, as shown in Equation 2:
BBox = (min(x1_min, x2_min), min(y1_min, y2_min), max(x1_max, x2_max), max(y1_max, y2_max)).   (2)
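A minimal sketch of this outer-boundary merge, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples in pixel coordinates (the function name and the example values are ours):

```python
def merge_bboxes(bbox1, bbox2):
    """Merge two intersecting boxes into one box spanning their outer boundaries (Equation 2)."""
    x_min = min(bbox1[0], bbox2[0])
    y_min = min(bbox1[1], bbox2[1])
    x_max = max(bbox1[2], bbox2[2])
    y_max = max(bbox1[3], bbox2[3])
    return (x_min, y_min, x_max, y_max)

# Example: two annotations of the same wound from a joined duplicate image pair.
print(merge_bboxes((120, 80, 260, 210), (135, 95, 280, 230)))  # -> (120, 80, 280, 230)
```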
The applied duplicate cleansing and annotation merging strategy resulted in n = 1,961 images with k = 2,453 annotations in the cleansed training dataset. Boundaries of merged bounding boxes were checked for consistency. Finally, annotation data was converted to the resolution-independent format used by YOLO implementations.
Reviewing the image data of all dataset parts (training, validation and test) showed pronounced compression artifacts and color noise due to a high compression rate and downscaling to a low resolution. As both compression artifacts and color noise had detrimental effects on detection performance, images were enhanced using a fast implementation of the non-local means algorithm Buades et al. (2005) for color images, utilizing the Python language in version 3.6.9 with the OpenCV on Wheels (opencv-python) package in version 4.2.0.34. The algorithm parameters were set to h = 1 (luminance component filter strength) and hColor = 1 (color component filter strength), with templateWindowSize = 7 (template patch size in pixels) and searchWindowSize = 21 (search window size in pixels). Resulting images show less definitive compression artifact borders and notably reduced color noise. Some textures are also more pronounced. Examples of results at a macroscopic and a detail level are shown in Fig. 4.
Fig. 4. Effects of the non-local means (NLM) algorithm are shown for two example images (a) and (e) from the training dataset in (b) and (f). At a macroscopic level the changes are not obvious. At a detail level, borders of compression artifacts on homogeneous areas and the color noise of (c) are visibly reduced in (d). Vague textures of (g) are also more pronounced in (h).
YOLOv5 in v1.0 implements three sets of data augmentation techniques. The first set comprises alterations of colorspace components (hue, saturation, value), the second set comprises geometric distortions (random scaling, rotation, translation and shearing), and the third set is represented by the mosaic loading of images. A normalized fraction of 0.014 images received hue augmentation, 0.68 received saturation augmentation and 0.36 received value augmentation. Scaling was applied in a normalized range of ±0.5. Rotation, translation and shearing were disabled. Settings for colorspace component alterations and geometric distortions are definitions for distributions, generated during run-time by a random sampler for the augmentation function (see https://github.com/ultralytics/yolov5/issues/2164, accessed 2021-04-28). Using this approach, no image is presented in exactly the same form more than once during training. Mosaic data augmentation is comparable to CutMix, but takes four images instead of two and does not overlap them. Image parts are placed as quadrants in a new image with random ratios, thereby allowing the model to detect objects in different contexts and at different sizes. This reduces the need for large mini-batch sizes. However, the mosaic loader had to be disabled in the presented approach due to a bug leading to invalid bounding boxes in resulting predictions.
YOLOv5 includes four different models, ranging from the smallest YOLOv5s with 7.5 million parameters (plain 7 MB, COCO pre-trained 14 MB) and 140 layers to the largest YOLOv5x with 89 million parameters and 284 layers (plain 85 MB, COCO pre-trained 170 MB). In the approach considered in this paper, the pre-trained YOLOv5x model is used. The general YOLOv5 v1.0 architecture is displayed in Fig. 5.
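Before moving on to the model configuration, the following is a minimal sketch of the image-enhancement step described above, using OpenCV's fast non-local means implementation with the stated parameters; the helper function and file paths are hypothetical.

```python
import cv2

def denoise_dfu_image(path_in: str, path_out: str) -> None:
    """Fast non-local means denoising for colour images (Buades et al., 2005)."""
    image = cv2.imread(path_in)
    # Arguments: src, dst, h (luminance filter strength), hColor (colour filter
    # strength), templateWindowSize (patch size), searchWindowSize (search window).
    denoised = cv2.fastNlMeansDenoisingColored(image, None, 1, 1, 7, 21)
    cv2.imwrite(path_out, denoised)

# Usage: denoise_dfu_image("train/example.jpg", "train_nlm/example.jpg")
```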
The different model sizes (s, m, l and x) vary in the depth and width factors set for the model and its layer channels; these are 1.33 and 1.25 for the YOLOv5x model. The YOLOv5x model uses a detector that consists of a Cross Stage Partial Network (CSPNet) Wang et al. (2020) backbone trained on MS COCO Lin et al. (2014), and a model head using a Path Aggregation Network (PANet) Liu et al. (2018) for instance segmentation. The backbone further incorporates a Spatial Pyramid Pooling (SPP) network He et al. (2015), which allows for dynamic input image sizes and is robust against object deformations.
The hardware setup used for the experiment comprised a single NVIDIA® V100 tensor core graphics processing unit (GPU) with 16 GB memory (https://www.nvidia.com/en-us/data-center/v100/) as part of an NVIDIA® DGX-1 supercomputer for deep learning (https://www.nvidia.com/en-us/data-center/dgx-1/). YOLOv5 was set up using a provided Docker container (https://hub.docker.com/r/ultralytics/yolov5), executed via Nvidia-Docker (https://github.com/NVIDIA/nvidia-docker) in version 19.03.5 (all URLs accessed 2020-08-30).
Training was organized in two stages: initial training and self-training. The initial training stage uses the originally available training data to train a model. The self-training approach, also called pseudo-labelling, extends the available training data by inferring detections on images for which no annotation data is originally available Koitka and Friedrich (2017). This is realized using the model resulting from the initial training stage; the yielded detections are then used as pseudo-annotation data. Resuming the initial training in the self-training stage with the extended training data generalizes the detection capabilities of the model. A five-fold cross-validation was performed for each training stage to approximate training optima. Both stages used the default set of hyperparameters (including parameters related to the data augmentation procedures), e.g. the SGD optimizer and the default initial learning rate lr0.
During the initial training stage, a base model was trained on the pre-processed training dataset for 60 epochs with a batch size of 30. This base model was initialized with weights from the MS COCO pre-trained YOLOv5x model. For the self-training approach, the base model was then used to create the extended training dataset for self-training. Pseudo-annotation data was inferred for the validation and test datasets, using the best-performing epoch, automatically saved at epoch 58. The resulting extended training dataset contained 4,161 images, of which 3,963 included 4,638 wound annotations. During the self-training stage, the base model training was resumed at its latest epoch, but trained further on the extended training dataset with a batch size of 20. Three final training states were created: (1) after an additional 30 epochs, (2) after an additional 40 epochs, and (3) after an additional 60 epochs of self-training (referred to as E60 SELF90, E60 SELF100 and E60 SELF120). The minimum confidence threshold for detection was set to 0.70, so that only highly certain predictions were exported. This applies to the pseudo-annotation data of the extended training dataset created for self-training as well as to the final predictions.
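A sketch of how such pseudo-annotations could be exported in the resolution-independent YOLO label format, keeping only detections at or above the 0.70 confidence threshold; the function and variable names are hypothetical and not part of the original pipeline.

```python
def write_pseudo_labels(detections, img_w, img_h, out_path, conf_thr=0.70):
    """Write detections [(x_min, y_min, x_max, y_max, confidence), ...] as
    YOLO-format pseudo-labels: 'class x_center y_center width height' (normalised)."""
    lines = []
    for x_min, y_min, x_max, y_max, conf in detections:
        if conf < conf_thr:
            continue  # keep only highly certain predictions
        x_c = (x_min + x_max) / 2.0 / img_w
        y_c = (y_min + y_max) / 2.0 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        lines.append(f"0 {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")  # single class: ulcer
    with open(out_path, "w") as f:
        f.write("\n".join(lines))
```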
Predictions for our experiments were inferred via the final training states E60 SELF90, E60 SELF100 and E60 SELF120, using the best epochs 88, 96 and 118, respectively. An additional experiment was conducted based on the training state E60 SELF100, involving the built-in test-time augmentation and non-maximum suppression (NMS) features of YOLOv5 for inference. Test-time augmentation (TTA) is a data augmentation method in which several augmented instances of an image are presented to the model. Predictions are made for each instance; together, these provide an ensemble of instance predictions for the image. This can enable a model to detect objects it may not be able to detect in a "clean" image. However, TTA may also cause multiple distinct detections for the same object, which can harm evaluation scores. To tackle these, NMS was applied to collapse multiple intersecting detections into a single bounding box. The intersection over union (IoU) threshold was set to a low value of IoU ≥ 0.30, as multiple wounds in an image usually exhibit a distinct spatial demarcation; thus, the risk of merging detections of different wounds was low.
The EfficientDet architecture Tan et al. (2020) is an object detection network created by the Google Brain team, and utilises the EfficientNet ConvNet Tan and Le (2019) classification network as its backbone. EfficientDet uses feature fusion techniques in the form of a bidirectional feature pyramid network (BiFPN), which combines representations of input images at different resolutions. BiFPN adds weights to input features, which enables the network to learn the importance of each feature. The outputs from the BiFPN are then used to predict the class of the detected object and to generate bounding boxes using bounding box regression. The main feature of EfficientDet is its ability to utilise compound scaling, which allows all parts of the network to scale in accordance with the target hardware being used for training and inference Tan et al. (2020). An overview of the EfficientDet architecture is shown in Fig. 6.
The dataset was captured with different types of camera devices under various lighting conditions. To counter variations in noise and lighting found in the dataset images, the Shades of Gray (SoG) color constancy algorithm was used Ng et al. (2019). Examples of pre-processed DFU images using SoG are shown in Fig. 7. Data augmentation techniques have proven to be an important tool in improving the performance of deep learning algorithms for various computer vision tasks Goyal et al. (2019a); Yap et al. (2020a). For the application of EfficientDet, we augmented the training data by applying identical transformations to the images and associated bounding boxes for DFU detection. Random rotation and shear transformations were used to augment the DFUC2020 dataset. Shearing involves the displacement of the image at its corners, resulting in a skewed or deformed output. Examples of these types of data augmentation are shown in Fig. 8.
EfficientDet algorithms achieved state-of-the-art accuracy on the popular MS COCO Lin et al. (2014) object detection dataset. EfficientDet pre-trained weights are classed from D0 to D7, with D0 having the fewest parameters and D7 the most. Tests on the MS COCO dataset indicate that training using weights with more parameters results in better network accuracy. However, this comes at the cost of significantly increased training time.
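For reference, a common NumPy implementation of the Shades of Gray colour constancy pre-processing mentioned above is sketched below; a Minkowski norm of p = 6 is a typical choice in the skin-imaging literature, and the exact parameterisation used for DFUC2020 is our assumption.

```python
import numpy as np

def shades_of_gray(image: np.ndarray, p: int = 6) -> np.ndarray:
    """Shades of Gray colour constancy: estimate the illuminant per channel with a
    Minkowski p-norm and rescale the channels towards a neutral grey."""
    img = image.astype(np.float64)
    illuminant = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illuminant /= np.linalg.norm(illuminant)       # unit-norm illuminant estimate
    scale = 1.0 / (illuminant * np.sqrt(3.0))      # neutral illuminant maps to factor 1
    corrected = img * scale
    return np.clip(corrected, 0, 255).astype(np.uint8)
```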
Given that the DFUC2020 dataset images were resized to 640 × 480, we selected the EfficientDet-D1 pre-trained weights for DFU detection Goyal and Hassanpour (2020). We trained the EfficientDet-D1 method on an NVIDIA Quadro RTX 8000 GPU (48 GB) with a batch size of 16 and the SGD optimizer with a learning rate of 0.00005, momentum of 0.9 and the number of epochs set to 50. We used the validation accuracy with early stopping to select the final model for inference. We further refined the EfficientDet architecture with a score threshold of 0.5 and removed overlapping bounding boxes to minimize the number of false positives. The scores of the overlapping bounding boxes were compared, with the bounding box with the highest score used as the final output.
Given that the DFUC2020 dataset has only 2,000 images for training, we use two data augmentation methods to complement the dataset in order to avoid over-fitting when training models. A more generalized model that can adapt to the complex clinical environment can be obtained through data augmentation. We use common data augmentation methods including horizontal and vertical image flipping, random noise and a central scaling method (which scales with the ground truth as the centre). Additionally, we increase the number of training images by using the visually coherent image mixup method Zhang et al. (2018). The original purpose of this method is to improve robustness against perturbations. Since Zhang et al. (2019) introduced this method into object detection, many researchers have used it for data augmentation to enhance network robustness. The principle of this algorithm involves the random selection of two sample images, which are then used to generate a new sample image according to Equations 3 and 4:
x̂ = λ x_i + (1 − λ) x_j,   (3)
ŷ = λ y_i + (1 − λ) y_j,   (4)
where (x_i, y_i) and (x_j, y_j) are the two sample images and their labels, and λ ∈ [0,1] is randomly generated from a Beta(α, α) distribution. The new sample (x̂, ŷ) is used for training. As shown in Fig. 9, two images of DFU are mixed in a certain ratio. We use Beta(1.5, 1.5) for image synthesis. DFU detection can be challenging in complex environments, such as clinical settings, due to the large number of objects that might be present. To improve detection ability, we use the mobile fuzzy method for data augmentation, as shown in Fig. 10.
The Cascade R-CNN Cai and Vasconcelos (2018) is the first cascaded object detection model. Due to the superior performance of the cascade structure, it is widely used in the field of object detection Zhao et al. (2020). We use the cascade structure in conjunction with DetNet Li et al. (2018), which is designed to address the problems incurred by repeated down-sampling, as such a process reduces the accuracy of positioning. DetNet makes full use of dilated convolutions to enlarge the receptive field instead of down-sampling repeatedly. The overall framework of our method, Cascade Attention DetNet (CA-DetNet), is shown in Fig. 11. The detection of DFU is different from common object detection tasks. For common object detection tasks, objects can appear anywhere in the image. For the detection of DFU, the wounds can only appear on the foot, which is a good fit for applying an attention mechanism; we added this into the DetNet by adopting the mask branch of the Residual Attention Network Wang et al. (2017a). The Attention DetNet (A-DetNet) is composed of 6 stages.
The first stage consists of a 7 × 7 convolution layer (with a stride of 2) and a max-pooling layer. The second, third and fourth stages contain an A-Resbody, with the fifth and sixth stages containing an A-Detbody. The A-Resbody and A-Detbody are similar to those in the original DetNet. The difference between A-DetNet and the original DetNet is the addition of an attention branch into the Resbody and Detbody. The attention branch is similar to the mask branch of the Residual Attention Network, while we take the other parts from the original Resbody or Detbody as the trunk. The attention branch of the Resbody comprises two zoom structures, which consist of a max-pooling layer and an up-sampling layer, followed by two 1 × 1 convolution layers activated by sigmoid functions. Given that down-sampling five times results in a feature map that is too small to be recovered to the original size by up-sampling, we add only one zoom structure to the attention branch of the A-Detbody. The feature map from the trunk is multiplied by the mask from the attention branch. To avoid degrading the feature values and breaking the identity mapping, we follow the Residual Attention Network and add one to the mask. For the cascade structure, we set the total number of cascade stages to 3, with the intersection over union (IoU) threshold set to 0.5, 0.6 and 0.7 for each of the three stages.
Fig. 11. The framework of CA-DetNet. "Image" is an input image. "A-DetNet" is the backbone network. "Pool" represents region-wise feature extraction. "H" is a network head. "B" is a bounding box and "C" represents classification. "B0" is the proposal in all architectures. The structure of the A-DetNet is based on the DetNet. The attention mechanism is applied in the Resbody and Detbody. Different bottleneck blocks in the Detbody or Resbody are similar to those in the DetNet.
During training, we use a DetNet model pre-trained on the ImageNet dataset to accelerate model convergence. We train on one GPU (NVIDIA Tesla P100) for 60 epochs, with a batch size of 4 and a learning rate of 0.001. The learning rate is decreased by a factor of 10 at the 10th epoch, and by another factor of 10 at the 20th epoch. We optimize the model with the Adam optimizer. Noise from the external environment can lead to many low-confidence bounding boxes. These bounding boxes reduce the performance of the detector, so we adopt a threshold suppression method that suppresses bounding boxes with low confidence scores, except when the detector detects only one bounding box. We set the threshold to 0.5.
We report and analyse the results obtained using the methods described above. The evaluation metrics are the number of true positives (TP), the number of false positives (FP), recall, precision, F1-Score and mAP, as described in the diabetic foot ulcer challenge 2020 Cassidy et al. (2020). For the common object detection task, mAP is used as the main evaluation metric. However, in this DFU task, a missed detection (a false negative) has potentially severe implications as it may affect the quality of life of patients. An incorrect detection (a false positive) could increase the financial burden on health services. Therefore, we regard the F1-Score as equally important as mAP for performance evaluation. Table 2 summarizes the quantitative results of pure Faster R-CNN, its variants, and the final ensemble model. From the table, the performance of pure Faster R-CNN is on par with Cascade R-CNN.
In contrast, employing the Deformable Convolution or PISA module significantly improves the performance. After ensembling the models, we reduce the number of FPs substantially, although a reduction in TPs is also observed. Although the ensemble method improves the precision of DFU detection, it does not improve the overall score. Therefore, the best result is achieved by Deformable Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434.
The qualitative results of Faster R-CNN with Deformable Convolution are summarized in Fig. 12. It can be seen that our model successfully detected the wounds in the images, even though the wounds are small (top-left, bottom-left and bottom-right images) or the images are blurred (top-right image). However, we also observed an incorrect detection in the bottom-right image, in which the background texture of blood was detected as a DFU. To improve prediction accuracy, the training data should be captured in various environments so that the network is better able to distinguish between DFU and background objects.
Table 3 shows the final results of the proposed YOLOv3 method on the testing dataset. The results are reported for two different batch sizes, with and without post-processing.
Fig. 12. Qualitative results of Faster R-CNN with Deformable Convolution, which shows the best performance among the Faster R-CNN based methods. It is noted that the network is able to detect small ulcers, as shown in (a), (b) and (c). An example of a FP generated by the network is shown in (d).
As the results indicate, using a batch size of 50 leads to better overall performance than using a batch size of 32. They also demonstrate that removing the overlaps leads to improvements in both F1-Score and Precision, while resulting in slight decreases in both mAP and Recall. As the gain outweighs the loss, we conclude that removing overlaps results in better overall performance. While removing the detections with less than 0.3 confidence results in slightly better precision, it reduces recall, F1-Score and mAP. Therefore, unless precision is the priority, removing the low-confidence detections does not lead to an improvement. Examples of final detections for YOLOv3 are presented in Fig. 13. Additionally, we added 60 copyright-free images of healthy feet to the training set to observe the effect on detection performance. As shown in Table 3, this results in an improvement of the F1-Score, but reduces mAP.
Table 4 summarizes the results of YOLOv5. The method with fewer additional self-training epochs, E60 SELF90, achieved better results than E60 SELF100 and E60 SELF120. However, applying TTA with NMS to E60 SELF100 (E60 SELF100 TTA NMS) achieved the best results. Examples of detections with E60 SELF100 TTA NMS on the test set are shown in Fig. 14, and Fig. 15 shows additional examples of false negative and false positive cases.
Table 3. YOLOv3: results of different settings, post-processing and adding extra copyright-free foot images. B50 and B32 compare the performance of the method with batch sizes of 50 and 32. Overlap-Removed indicates the performance of the method with overlap-removal post-processing. conf0.3 shows the impact of ignoring predictions with < 0.3 confidence. Extra demonstrates the effect on performance of adding extra images of healthy feet.
Table 5 shows the results of the EfficientDet model on the DFUC2020 testing set both with and without post-processing.
The results indicate that the numbers of both TP and FP cases are reduced by the post-processing method. However, with post-processing, the reduction in TP cases (from 1,626 to 1,593) is only 2.02%, compared with a reduction of 17.50% in FP cases (from 720 to 594). Hence, the post-processing method results in an important improvement in both Precision (67.86% to 72.84%) and F1-Score (72.38% to 74.37%), with a slight decrease in both mAP (57.82% to 56.94%) and Recall (77.44% to 75.97%). EfficientDet with post-processing achieved the highest F1-Score and Precision (lowest number of FP cases) in DFUC2020. Examples of final outputs by the refined EfficientDet architecture are shown in Fig. 16.
Table 6 summarizes the results of the Cascade Attention DetNet on the DFUC2020 testing dataset. The results are reported for two different data augmentation methods, two different backbones, and with or without a pre-trained model. From the results, we observe that CA-DetNet with both data augmentation methods and the pre-trained model achieves the best result, with the highest scores of 63.94% on mAP and 70.01% on F1-Score. The C-DetNet achieves the highest score of 74.11% on Recall, while the CA-DetNet with the mobile fuzzy method achieves the highest score of 66.67% on Precision. From the analysis, we observe that the mobile fuzzy data augmentation method has a striking effect, improving mAP by 1.46% and F1-Score by 1.03%. However, we note that using the mixup method alone for data augmentation did not enhance performance. The results suggest that the mobile fuzzy method allows the model to adapt to noise from the external environment, while the mixup method is detrimental. The attention mechanism contributes to the improved detection performance, increasing mAP by 0.02% and F1-Score by 0.03%. Moreover, training with a pre-trained model can accelerate the convergence of the model and improve its ability to detect DFU.
Our approach was effective for the vast majority of the detected cases, as shown in Fig. 17. However, due to the visual complexity of clinical environments, there are also some failure cases in our approach. From our observations, such failures are generally due to the false identification of toenails, interference from the external environment and low image quality. For the false identification of toenails, we believe that the appearance of leuconychia is similar to that of wounds, and some cases of DFU are located on or around the toenail. Background objects may also sometimes interfere with detection results; we use the attention mechanism to deal with this problem to some extent. Regarding image quality, we observe that several images are blurry; we use data augmentation methods such as the mobile fuzzy method to partially address this problem. We speculate that a two-stage architecture with an initial stage to detect and segment the relevant foot area could be used to address this issue. However, additional labeled data may be required to achieve this goal.
The results from the popular deep learning object detection methods and the proposed CA-DetNet are comparable. Table 7 shows the overall results when evaluated on the DFUC2020 testing set, where we present the best mAP from each object detection method. Considering the ranking based on mAP, the best result is achieved by the variant of Faster R-CNN using Deformable Convolution, with 0.6940. This method achieves the highest TP and the best Recall.
It is noted that YOLOv5 achieved the lowest number of FPs, but has a lower mAP and F1-Score. In Table 8, the ranking according to F1-Score shows that the highest F1-Score of 0.7437 is obtained by EfficientDet; however, this network reports the lowest mAP at 0.5694. On the other hand, the Faster R-CNN approach achieves a comparable F1-Score of 0.7434 with the highest mAP of 0.6940. Fig. 18 visually compares the detection results on DFUs with less visible appearances. In Fig. 18(a), the ulcer was detected by all the methods. However, in Fig. 18(b), only Faster R-CNN and EfficientDet detected the ulcer. Fig. 18(c) is another challenging case, which was detected by CA-DetNet and Faster R-CNN. In Fig. 18(d), we demonstrate a case where only Faster R-CNN successfully localised the ulcer.
In Section 5.1, we demonstrate that the ensemble method using Weighted Boxes Fusion did not improve the results of the four Faster R-CNN approaches. This observation suggests that additional experiments based on different deep learning approaches should be investigated. We ran experiments based on combinations of two approaches (Faster R-CNN + (CA-DetNet / EfficientDet / YOLOv3 / YOLOv5)), three approaches, and a combination of all approaches, as summarised in Table 9. From our observations, the ensemble methods reduce the numbers of TPs and FPs, i.e., the more approaches used, the lower the number of TPs and FPs. This approach did not improve mAP, but in the majority of the ensembles there are notable improvements in precision, which led to an improvement in F1-Score. The best F1-Score for the ensemble method is 0.7617, achieved by ensembling Faster R-CNN with Deformable Convolution and EfficientDet. Apart from fine-tuning each deep learning method to achieve maximum performance, the methods are highly dependent on the pre-processing stage, the selection of data augmentation, the post-processing methods and the ensemble method. We address the limitations and future challenges of our work in the following section.
In this section, we discuss the performance of each object detection method and future work to improve DFU detection. Whilst most of the results show an F1-Score >70%, there are many challenges ahead to enable the use of deep learning algorithms in real-world settings. Faster R-CNN based approaches detected DFU in the DFUC2020 testing set with high mAP and F1-Score. In addition, the variants of Faster R-CNN largely improve the performance of the original Faster R-CNN. After ensembling the results of the four models, we reduced the number of false positives; however, the overall performance was lower than that of the individual variants of Faster R-CNN. The reason may be that, even though we fuse the predictions of four models into one, the four models produce similar results because they are all based on Faster R-CNN. Therefore, in future work, a one-stage object detection method such as CenterNet Zhou et al. (2019) could potentially be included in the ensemble method to produce more accurate results.
The YOLOv3 algorithm is able to reliably detect DFU and ranked third in both the mAP and F1-Score rankings. We have observed that post-processing (by removing overlaps), along with the removal of low-confidence detections, leads to an improvement in precision but at the expense of the number of true positives and recall. Additionally, our analysis indicates that adding additional images of healthy feet, along with post-processing, can result in a higher F1-Score.
We aim to further investigate the effects of pre-processing, as well as to study more effective post-processing methods.
The YOLOv5 approach demonstrated reliable detection performance with an overall high precision over the different model configurations. Application of the NLM algorithm for image enhancement and generalization via self-training helped to further increase precision. Improvements via duplicate cleansing and bounding box merging were marginal due to the limited number of cases, but could prove beneficial on larger datasets. Use of TTA with NMS further increased true-positive detections at the cost of increased false-positive cases, yet increased the mAP and F1-Score. For the presented approach, several optimizations may be possible. The least self-trained model performed best, indicating that models with fewer self-training epochs may perform better. Model Ensembling could allow further performance improvements when fusing different specialized models. In addition, investigation of Hyperparameter Evolution allows general hyperparameter optimization, given the required resources. As the presented results were obtained with the initial release v1.0 Jocher et al. (2020a) of YOLOv5, the resulting performance is limited compared to that achievable with matured, up-to-date versions of the network Brüngel and Friedrich (2021). Since its release, YOLOv5 has improved rapidly, and its full potential could not be taken advantage of during DFUC2020. For example, in v1.0 the novel mosaic data augmentation method was not functioning correctly on custom data. At the time of writing, the matured version v5.0 Jocher et al. (2021a) is available, featuring numerous bug fixes, improvements and novelties. For example, the activation function has changed from Leaky ReLU Maas et al. (2013) in v1.0 (used here) to the Sigmoid Linear Unit (SiLU) Hendrycks and Gimpel (2020) since v4.0 Jocher et al. (2021b), further increasing detection performance. Due to its reasonable performance and mobile focus, YOLOv5 will prove helpful when performing DFU detection tasks directly on mobile devices.
The refined EfficientDet algorithm is able to detect DFU with a high recall rate. The pre-processing stage using the Shades of Gray algorithm improved the color consistency of the images in the dataset. We extensively used data augmentation techniques to learn the subtle features of DFUs of various sizes and severity. The post-processing stage we implemented refined the inference of the original EfficientDet method by removing overlapping bounding boxes. Due to the low mAP, further work will focus on investigating the larger EfficientDet network architectures, particularly EfficientDet-D7.
The performance of Cascade Attention DetNet on the DFUC2020 testing set is not entirely satisfactory. We evaluated our model on 10% of the DFUC2020 training set and it achieved an mAP of 0.9. We analyzed the possible reasons and speculate that the model may be over-fitting, to which ensemble learning may provide a possible solution. We further aim to use appropriate data augmentation methods to improve the robustness of the model.
The ensemble methods based on the fusion of different backbones reduced the number of predicted bounding boxes substantially. Faster R-CNN with Deformable Convolution predicted 2,240 bounding boxes. However, after ensembling with EfficientDet, only 1,847 bounding boxes were predicted.
The number of predicted bounding boxes dropped to 1,475 when we ensembled the results from all five networks. Consequently, the ensemble methods reduced the numbers of TPs and FPs. It is crucial for future research to focus on true positives, i.e. correctly locating the DFUs. One of the aspects required to overcome this issue is understanding the IoU threshold setting. Our experiments used IoU ≥ 0.5, which is the guideline set by object detection for natural objects. However, medical imaging studies Drukker et al. (2002); Yap et al. (2008) have used an IoU (or Jaccard Similarity Index) threshold of 0.4. When we evaluated the performance of the best ensemble method with IoU ≥ 0.4, the number of TPs increased to 1,594, and with IoU ≥ 0.3 the number of TPs increased to 1,668. With Faster R-CNN with Deformable Convolution, the number of TPs increased to 1,743 and 1,883 for IoU thresholds of 0.4 and 0.3, respectively.
Currently, clinicians (podiatrists and diabetes consultants) visually assess the diabetic foot for the detection of ulcers, taking photographs at the diagnosis stage and periodically re-evaluating at subsequent patient clinic appointments. This provides accurate assessment of wound healing progress by visually comparing photographs of ulcers at different stages of the disease. We have developed AI algorithms so that, in the near future, patients can use mobile devices in their homes, allowing detection, assessment of wound progress and prognostication to be completed remotely without the need for frequent appointments at foot clinics. Our research in evaluating the performance of different deep learning frameworks in DFU detection is a crucial step to support future development in this field.
We conduct a comprehensive evaluation of the performance of deep learning object detection networks for DFU detection. While the overall results show the potential of automatically localising the ulcers, there are many false positives, and the networks struggle to discriminate ulcers from other skin conditions. A possible solution to address this issue might be to introduce a second classifier, trained on a negative dataset, for future networks. However, in reality, it may prove impossible to gather all possible negative examples for supervised learning algorithms. This approach could also impact network size and complexity, which could negatively affect inference speed. Segmenting the foot from its surroundings might provide another possible solution to this problem, so that trained models do not have to account for objects in complex environments. Other future research challenges include:
• Increasing the size of the existing dataset with clinical annotations, including metadata indicating the development stage of each DFU. However, in the real world, there are still barriers to data sharing, and clinical annotation is expensive and time-consuming. It will be important to encourage the co-creation of such datasets by machine learning and clinical experts to foster a better understanding of the annotated data. While increasing the number of images may benefit the training process, other aspects such as ulcer location and image capture from subjects with various skin tones should be considered.
• Creating self-supervised and unsupervised deep learning algorithms for DFU detection. These methods were developed and implemented for natural object detection tasks and remain under-explored in medical imaging.
• For inspection of DFU, accurate delineation of an ulcer and its surrounding skin can help to measure the progress of the ulcer. Goyal et al. (2017) developed an automated segmentation algorithm for DFU; however, they experimented on a small dataset, and future work will potentially enable experimentation at a larger scale.
• Developing DFU classification systems that can be used by clinicians to analyse ulcer condition. Automated analysis and recognition of DFU can help to improve the diagnosis of DFUs. The next challenge (DFUC2021 Yap et al. (2020b)) will focus on multi-class DFU recognition.
• With the growth in the number of people diagnosed with diabetes, remote detection and monitoring of DFU can reduce the burden on health services. Research on the optimization of deep learning models for remote monitoring is another active research area that has the potential to change the healthcare landscape globally.
Due to accompanying and challenging phenomena, such as malformed toenails, rhagades and hyperkeratosis, the wound area needs to be distinguished sufficiently first. Without an accurate deep learning algorithm for DFU detection, reliable and performant segmentation and accurate wound size estimation are not possible. Reliable detection on typical wound care documentation images, created under uncontrolled (non-laboratory) conditions, remains the first and cardinal problem.

References:
Ultralytics' YOLOv3 GitHub repository
YOLOv4 GitHub repository
Simple Features for R (sf) GitHub repository
Diabetic Foot Ulcers and Their Recurrence
YOLOv4: Optimal Speed and Accuracy of Object Detection
Soft-NMS - Improving Object Detection with One Line of Code
MyFootCare: A Mobile Self-Tracking Tool to Promote Self-Care amongst People with Diabetic Foot Ulcers
Detr and yolov5: Exploring performance and self-training for diabetic foot ulcer detection
A Non-Local Algorithm for Image Denoising
Cascade R-CNN: Delving Into High Quality Object Detection
Cascade R-CNN: High Quality Object Detection and Instance Segmentation
Prime sample attention in object detection
DFUC2020: Analysis Towards Diabetic Foot Ulcer Detection
Computerized lesion detection on breast ultrasound
A Refined Deep Learning Architecture for Diabetic Foot Ulcers Detection
Region of Interest Detection in Dermoscopic Images for Natural Data-augmentation
DFUNet: Convolutional Neural Networks for Diabetic Foot Ulcer Classification
Recognition of ischaemia and infection in diabetic foot ulcers: Dataset and techniques
Robust Methods for Real-Time Diabetic Foot Ulcer Detection and Localization on Mobile Devices
Fully convolutional networks for diabetic foot ulcer segmentation
2017 IEEE International Conference on Computer Vision (ICCV), IEEE
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Gaussian Error Linear Units (GELUs)
ultralytics/yolov5: Initial release (2020a)
ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations
ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging
Optimized Convolutional Neural Network Ensembles for Medical Subfigure Classification
DetNet: Design Backbone for Object Detection
Feature Pyramid Networks for Object Detection
Microsoft COCO: Common Objects in Context
Path Aggregation Network for Instance Segmentation
Proceedings of the 30th International Conference on Machine Learning (ICML) 2013, ICML
The effect of color constancy algorithms on semantic segmentation of skin lesions
Pytorch: An imperative style, high-performance deep learning library
Simple Features for R: Standardized Support for Spatial Vector Data
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
You Only Look Once: Unified, Real-Time Object Detection
YOLO9000: Better, Faster, Stronger
YOLOv3: An Incremental Improvement
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas
Weighted boxes fusion: Ensembling boxes from different object detection models
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
EfficientDet: Scalable and Efficient Object Detection
CSPNet: A New Backbone that can Enhance Learning Capability of CNN
Residual Attention Network for Image Classification
Area Determination of Diabetic Foot Ulcer Images Using a Cascaded Two-Stage SVM-Based Classification
Smartphone-Based Wound Assessment System for Patients With Diabetes
Aggregated Residual Transformations for Deep Neural Networks
A novel algorithm for initial lesion detection in ultrasound breast images
Breast ultrasound region of interest detection and lesion localisation
Diabetic Foot Ulcers Grand Challenge
Diabetic Foot Ulcers Grand Challenge
mixup: Beyond Empirical Risk Minimization
Bag of Freebies for Training Object Detection Neural Networks
Pointer Defect Detection Based on Transfer Learning and Improved Cascade-RCNN
Deformable Convolutional Neural Networks for Hyperspectral Image Classification
Deformable ConvNets V2: More Deformable

Acknowledgments: We gratefully acknowledge the support of NVIDIA Corporation for the use of GPUs for this challenge and sponsoring our event. A.A., D.B.A. and M.O. were supported by the National Health and Medical Research Council [GNT1174405] and the Victorian Government's OIS Program.