Effective Features to Classify Skin Lesions in Dermoscopic images

Zhen Ma (a), João Manuel R. S. Tavares (b)

(a) Instituto de Ciência e Inovação em Engenharia Mecânica e Engenharia Industrial, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal; email: zhen.ma@fe.up.pt

(b) Instituto de Ciência e Inovação em Engenharia Mecânica e Engenharia Industrial, Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal; email: tavares@fe.up.pt

Corresponding Author: Prof. João Manuel R. S. Tavares, Faculdade de Engenharia da Universidade do Porto (FEUP), Departamento de Engenharia Mecânica (DEMec), Rua Dr. Roberto Frias, s/n, 4200-465 PORTO - PORTUGAL. Tel: +315 22 5081487, Fax: +315 22 5081445, Email: tavares@fe.up.pt, Url: www.fe.up.pt/~tavares

Abstract

Features such as shape and color are indispensable to determine whether a skin lesion is a melanoma or not. However, there are no fixed guidelines to define which features are effective and how to combine them for classification. This lack of definition impedes the development of the automatic analysis of dermoscopic images. In this work, a search for effective features was carried out using a support vector machine. Three image databases were used to verify the feasibility and sensitivity of the automatic classification used. The results showed which features had a major influence on the classification performance, and confirmed the need to use various types of features in this process.

Keywords

Skin lesion; melanoma; ABCD rule; shape features; color features; feature selection.

1. Introduction

Dermoscopy has been widely used for the diagnosis of skin lesions.
The additional details provided by a dermoscope, compared to inspection by the naked eye, contribute to a significant improvement in the detection rate of melanoma (Binder et al., 1995; Argenziano et al., 2002; Kittler et al., 2002; Heymann, 2005). However, the experience of the dermatologist can significantly affect the diagnostic accuracy (Binder et al., 1995; Kittler et al., 2002); therefore, an effective automatic computer-aided model to analyze dermoscopic images is urgently needed. A scheme for such a model includes three major steps: first, the border of the skin lesion is identified on the image under study (Ma & Tavares, 2016; Silveira et al., 2009); then, features are extracted based on the segmentation; and finally, with these features, the lesion is classified into an appropriate category. Many mature algorithms have been proposed for the classification step, such as support vector machines (SVMs) and artificial neural networks (ANNs) (Kulkarni et al., 1998; Hastie et al., 2009); however, their performance relies on the features from the second step. Unfortunately, there is no recommended set of features for detecting melanomas in the current literature. Although many novel features have been proposed to improve the classification accuracy of these algorithms, the validation is normally carried out using different methods with different feature sets. The use of redundant or irrelevant features for classification not only decreases the computational efficiency but also affects the performance. Moreover, in many cases, cross-validation is used to test a classification scheme, which can yield over-optimistic results because images in the same database tend to share similar imaging conditions and similar lesion structures. Feature selection algorithms have been proposed to find the significant features, especially in the area of bioinformatics (Guyon et al., 2002; Saeys, Inza, & Larranaga, 2007).
Nevertheless, the key benefit of using such an algorithm is to reduce the dimensionality of features without loss of significant information (Bermingham et al., 2015). Limitations such as the choice of the evaluation metric and image database can affect the decision of whether a feature should be kept or not. In this work, effective features are explored without focusing on factors such as the accuracy of segmentation, the choice of the classification algorithm, and the representativeness of the training set. Three dermoscopic image databases were used in this study, and the feasibility of automatic analysis of dermoscopic images was investigated based on inter-database validation.

This article is structured as follows: in the next section, the features selected for classification are introduced; then, in Section 3, the experiments are described and the results are presented and discussed; in the last section, the conclusions and perspectives for future work are pointed out.

2. Features

The ABCD rule or ABCDE rule (Friedman, Rigel, & Kopf, 1985) is a clinical guideline to determine whether a skin lesion is a melanoma or a common nevus (Abbasi, Shaw, & Rigel, 2004; Friedman et al., 1985). These evaluation criteria comprise the asymmetry of the lesion region (A), the geometric properties of the lesion border (B), the color of the skin lesion (C), the diameter of the lesion region (D), and the elevation/evolution of the skin lesion (E). The E rule requires follow-up inspections, and the other four criteria can be grouped into two categories: measures that reflect the geometric properties of a skin lesion (A, B, and D rules), and measures that are related to the lightness and chromatic information (C rule). The ABCD rule requires a subjective evaluation of the different aspects of a skin lesion; hence, in order to automate the process in an expert system, these rules need to be quantified.
2.1 Shape features

To measure the shape of a skin lesion, the following six features were adopted to reflect the B and D rules: the perimeter of the lesion; the area of the lesion; the perimeter-to-area ratio; the aspect ratio, defined as the width divided by the height of the minimum rectangle that bounds the lesion region; the extent ratio, defined as the area of the lesion region divided by the area of the minimum bounding rectangle; and the solidity ratio, defined as the area of the skin lesion divided by the area of the convex hull of the lesion boundary (Haidekker, 2010; Olson, 2011). The first two measures describe the size of the skin lesion, while the remaining ones indicate its regularity. Besides these commonly used features, the total variation of the lesion border was adopted as the seventh feature to represent its smoothness:

$$V = \int_C |\kappa| \, ds, \tag{1}$$

where the lesion border $C$ is parameterized as $u(s)$, and $\kappa = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right)$ is the mean curvature, with $\nabla \cdot$ denoting the divergence operator.

For the symmetry property (the A rule), an ideal measure should be invariant to the position, size and orientation of the contour. Invariant image moments are well suited to these requirements (Flusser & Suk, 2006), and Hu's invariant set has been shown to be effective in various applications of image analysis (Hu, 1962; Flusser & Suk, 2006). Therefore, the seven image moments in Hu's invariant set were selected as the eighth to the fourteenth shape features for classification.

2.2 Color features

Although the appearance of a skin lesion can vary considerably, its color is always the main visual feature distinguishing it from the neighboring skin. The changes of pigment inside a skin lesion are important for both the segmentation and classification steps. An ideal color feature should capture the differences between a melanoma and a non-melanoma nevus.
Additionally, the imaging conditions can appreciably affect the appearance of a skin lesion in dermoscopic images; therefore, the ideal color features should be able to handle the diverse lighting conditions in different image databases.

The color information of a dermoscopic image is stored as a triplet $(r, g, b)$ for each image pixel. The three color channels in the RGB color space are correlated and are consequently difficult to use for evaluating the differences between colors. To overcome this problem, the CIE L\*a\*b\* and CIE L\*u\*v\* color spaces were adopted (Gonzalez & Woods, 2007). Separating the lightness information from the chromaticity considerably decreases the distortions caused by lightness when defining color features. Accordingly, the first ten color features were chosen as the means and standard deviations of the $L^*$, $a^*$, $b^*$, $u^*$ and $v^*$ values of the skin lesion. In addition, given that $(a^*, b^*)$ and $(u^*, v^*)$ are the coordinates representing the position of a chromaticity in the color coordinate system, the difference between a chromaticity and the mean chromaticity of the skin lesion is calculated as the Euclidean distances:

$$d_1 = \sqrt{(a^* - a_1^*)^2 + (b^* - b_1^*)^2}, \tag{2}$$

$$d_2 = \sqrt{(u^* - u_1^*)^2 + (v^* - v_1^*)^2}, \tag{3}$$

where $a_1^*$, $b_1^*$, $u_1^*$ and $v_1^*$ are the means of $a^*$, $b^*$, $u^*$ and $v^*$ of the skin lesion and, correspondingly, $(a_1^*, b_1^*)$ and $(u_1^*, v_1^*)$ are the chromatic geometric centroids of the skin lesion. Then, the next four color features were defined as the mean and standard deviation of $d_1$ and $d_2$ inside the skin lesion: $d_1$, $d_{1\_\sigma}$, $d_2$, and $d_{2\_\sigma}$.
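As an illustration of Eqs. (2) and (3), the following minimal Python sketch computes the mean and standard deviation of the distance to the chromatic centroid of a region. The function name `chromatic_distance_stats`, the input layout (a list of per-pixel chromaticity pairs), and the use of the population standard deviation are assumptions made for this example; the text does not specify them, and the conversion from RGB to the CIE color spaces is assumed to have been done beforehand.

```python
import math
import statistics


def chromatic_distance_stats(chroma):
    """Return (mean, std) of the distances to the chromatic centroid.

    `chroma` is a non-empty list of (a*, b*) or (u*, v*) pairs, one per
    pixel of the region. Hypothetical helper, not the authors' code.
    """
    # Chromatic centroid: the per-channel means of the region.
    c1 = statistics.fmean(p[0] for p in chroma)
    c2 = statistics.fmean(p[1] for p in chroma)
    # Per-pixel Euclidean distance to the centroid, as in Eqs. (2)-(3).
    dists = [math.hypot(p[0] - c1, p[1] - c2) for p in chroma]
    # Assumption: population standard deviation (divides by N).
    return statistics.fmean(dists), statistics.pstdev(dists)


# Toy (a*, b*) values for four lesion pixels, all equidistant from (12, 4).
lesion_ab = [(10.0, 4.0), (14.0, 4.0), (12.0, 6.0), (12.0, 2.0)]
d1_mean, d1_std = chromatic_distance_stats(lesion_ab)
print(d1_mean, d1_std)
```

Passing the $(u^*, v^*)$ pairs of the same pixels to the same function would give the two features derived from $d_2$.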
Moreover, the same 14 features of the neighboring healthy skin next to the skin lesion (the means and standard deviations of $L^*$, $a^*$, $b^*$, $u^*$, $v^*$ and the means and standard deviations of the two Euclidean distances to the geometric color centroids of the healthy skin) were added to the color features; these features not only provide information on the color transition from normal skin to skin lesion, but also compensate for possible inaccuracy of the lesion border. Also, the Euclidean distances between the geometric centroids of the healthy skin and the skin lesion, together with the lightness difference, were adopted as three additional color features:

$$d_3 = \sqrt{(a_0^* - a_1^*)^2 + (b_0^* - b_1^*)^2}, \tag{4}$$

$$d_4 = \sqrt{(u_0^* - u_1^*)^2 + (v_0^* - v_1^*)^2}, \tag{5}$$

$$d_5 = L_0 - L_1, \tag{6}$$

where $L_0$ and $L_1$ are the mean lightness of the neighboring skin and the skin lesion, and $a_0^*$, $b_0^*$, $u_0^*$, and $v_0^*$ are the means of $a^*$, $b^*$, $u^*$, and $v^*$ of the neighboring skin. Additionally, color saturation was used to correlate the lightness with the chromaticity:

$$S = \begin{cases} 0, & \text{if } R + G + B = 0 \\ 1 - \dfrac{3 \min(R, G, B)}{R + G + B}, & \text{otherwise} \end{cases} \tag{7}$$

and the 32nd to 35th color features were defined as the mean and standard deviation of the saturation of the skin lesion ($s_1$, $s_{1\_\sigma}$) and the neighboring skin ($s_0$, $s_{0\_\sigma}$), plus the 36th feature as the ratio of the means of the two regions:

$$d_6 = \frac{s_1}{s_0}. \tag{8}$$

Table 1 lists all the shape and color features used for selection.

3. Experiments

3.1 Training and testing

Although an unbiased classification requires the skin lesions in the training set to be representative, different types of skin lesions can have large variations in shape and appearance, so it is difficult to define how typical a skin lesion is. Also, because the available image databases are small, a classifier or a new feature is routinely validated with the leave-one-out strategy. Consequently, the evaluation of the performance may be biased, and the high sensitivity and specificity achieved with one image database may not hold true for another.
In order to make the evaluation objective, three databases of dermoscopic images were used. Details of these databases are given in Table 2, with samples of melanoma and non-melanoma skin lesions shown in Figure 1. As can be seen from Figures 1a to 1f, the images were acquired under different conditions and from patients of diverse origins. In these databases, the skin lesions were actually classified into more specific sub-categories; for simplicity, this additional information was ignored and each skin lesion was classified as either melanoma or non-melanoma. Ground-truth lesion borders, manually segmented by qualified technicians, were provided in all the databases. In order to obtain the statistics of the neighboring healthy skin referred to in Section 2.2, a 10-pixel-wide band next to the lesion border was used; this outside band covers a moderate region that is sufficient to reflect the transition from healthy skin to skin lesion in the images.

Given the sizes of the databases, we adopted the strategy of choosing one database as the testing set and combining the other two as the training set. However, if the training set was formed by the 1st and 2nd image databases shown in Table 2, the total number of melanomas (69) would not be large enough to extract the differential information between melanomas and non-melanoma nevi. Consequently, the classification would not be able to achieve a good performance on the 3rd database, particularly because the 3rd database contains Spitz and Reed nevi (Lyon, 2010; Yoradjian et al., 2012). Thus, the other two cases were adopted: the combination (Comb 1) that used the 1st and 3rd databases as the training set and the 2nd database as the testing set; and the combination (Comb 2) that used the 2nd and 3rd databases as the training set and the 1st database as the testing set.
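The inter-database validation described above can be sketched as a leave-one-database-out split in which some databases are never used for testing. This is a minimal illustration only: the function `inter_database_splits`, the database names and the toy samples are hypothetical placeholders, not the actual data listed in Table 2.

```python
def inter_database_splits(databases, skip_as_test=()):
    """Build leave-one-database-out splits.

    `databases` maps a database name to its list of samples. Each database
    in turn becomes the testing set while all the others are merged into
    the training set. Names in `skip_as_test` are never used for testing.
    """
    splits = []
    for test_name in databases:
        if test_name in skip_as_test:
            continue
        train_names = [n for n in databases if n != test_name]
        train = [s for n in train_names for s in databases[n]]
        splits.append((train_names, test_name, train, databases[test_name]))
    return splits


# Toy placeholders: each sample is (sample_id, is_melanoma).
databases = {
    "db1": [("a1", True), ("a2", False)],
    "db2": [("b1", True), ("b2", False), ("b3", False)],
    "db3": [("c1", True), ("c2", True), ("c3", False)],
}
# The 3rd database is not used for testing, which leaves two combinations:
# testing on db1 (as in Comb 2) and testing on db2 (as in Comb 1).
splits = inter_database_splits(databases, skip_as_test=("db3",))
for train_names, test_name, train, test in splits:
    print(train_names, "->", test_name, len(train), len(test))
```

Keeping the split at the database level, rather than shuffling images across databases, is what prevents images with shared imaging conditions from appearing in both the training and the testing set.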
The next problem to solve was related to the different magnitudes of the 50 features chosen for classification; for example, the value of lightness ranges from 0 to 100, while saturation only varies from 0 to 1. A feature scaling step was necessary to balance such differences. Among the features to be selected, 31 were related to the $L^*$, $a^*$, $b^*$, $u^*$, and $v^*$ channels; since their region-based values were already in similar intervals, these features were kept unchanged, and the remaining 19 features were scaled as follows:

$$x = \begin{cases} 50 \cdot \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}, & \text{if } x_{\min} \neq x_{\max} \\ 0, & \text{if } x_{\min} = x_{\max} \end{cases} \tag{9}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of feature $x$ in the training set.

The support vector machine was chosen for this two-class classification due to its effectiveness already shown in diverse areas of study (Chu & Wang, 2005; Filho et al., 2015), and the radial basis function kernel was adopted for its flexibility, stability and general popularity (Celebi et al., 2007). Accordingly, three parameters had to be defined: $C$, $\varepsilon$, and $\gamma$. Parameter $C$ controls the trade-off between the error of a classifier on the training set and the margin between classes; parameter $\varepsilon$ determines the accuracy of the approximation; and parameter $\gamma$ is related to the values of the feature vectors and determines the behavior of the kernel functions used for approximation. The values of these parameters can appreciably affect the performance of classification; therefore, to ensure a stable evaluation of a feature combination, the values of $\gamma$, $C$ and $\varepsilon$ of the SVM were fixed at $10^{-3}$, 500 and 1, respectively, in the experiments.

3.2 Pre-selection criteria

In order to find the best combination among the 14 shape features and 36 color features with a training set and a testing set, an exhaustive search requiring $2^{50}$ classifications would have to be made. This, however, could lead to an unfeasibly long computation time.
To overcome this problem, the standard deviation of a feature would not be used if the mean of that feature was not used; for example, if the mean of $a^*$ of the skin lesion ($a_1^*$) was not used for classification, then the standard deviation of $a^*$ of the skin lesion ($a_{1\_\sigma}^*$) would not be used either. This strategy was adopted because there are 15 measures with both their means and standard deviations chosen as the features for selection. While the standard deviation indicates the variation of a measure, this variation can also be reflected by other features; for example, the variation of the $a^*$ channel inside the skin lesion is partly reflected by the mean of the distance $d_1$. This strategy decreases the total number of classifications to $2^{50} \cdot (3/4)^{15}$. In addition, since chromaticity is described by two-dimensional coordinates in the color spaces, the following strategies were implemented: features of the $a^*$ channel were bound to the features of the $b^*$ channel; for example, if $a_1^*$ was used, $b_1^*$ would be used, and if $a_{1\_\sigma}^*$ was used, then $b_{1\_\sigma}^*$ would be used as well. Likewise, the features of $u^*$ and $v^*$ were bound for use, the features related to $d_1$ and $d_2$ were bound, and the same was done for $d_3$ and $d_4$.

Furthermore, the following procedure was used to implement separate searches among the shape features and the color features. In the first phase, all the color (shape) features were used for classification, and the combinations of shape (color) features were searched to find the optimal set; then, in the second phase, with the optimal set of shape (color) features fixed, the combinations of color (shape) features were searched in turn to find the best match. With the color (shape) features selected in the second phase, the combinations of shape (color) features can be searched again to find the best match. Hence, the two-phase procedure can be iterated, and the best classification performance would improve, or at least not degrade, after each iteration.
Thus, if the search is to find the best color features for a set of shape features, the number of classifications would be $2^{23} \cdot (3/4)^{10}$; and if the search is to find the best shape features for a set of color features, the number would be $2^{14}$.

In the experiments, the performance of classification was evaluated based on three measures: overall accuracy (OA), which is the percentage of the skin lesions that are correctly classified; sensitivity (S1), which is defined as the percentage of correctly classified melanomas among all the melanomas; and specificity (S2), which is defined as the percentage of correctly classified non-melanoma nevi among all the non-melanoma nevi. Higher values of these indices indicate a better classification.

3.3 First phase

Following the procedure defined above, all the shape features were used for classification, and the search was carried out among the combinations of color features. For Comb 1, the highest overall accuracy was 83.5% and was achieved by two sets of color features. Both of them contained the mean lightness of the skin lesion ($L_1$), the lightness difference ($d_5$), the standard deviation of lightness inside the skin lesion ($L_{1\_\sigma}$), the mean saturation of the skin lesion ($s_1$), and the mean of $d_1$ and the mean of $d_2$ inside the skin lesion ($d_1$ and $d_2$); one set included two extra features: the mean lightness of the neighboring skin ($L_0$) and the ratio of mean saturations ($d_6$). With all the shape features and the six common color features, the sensitivity was 87.5% and the specificity was 82.5%; with the additional two color features, these two indices were 77.5% and 85.0%, respectively. The highest sensitivity achieved in this phase was 97.5%, but the corresponding specificity was only 40.0%, indicating that many non-melanoma nevi were wrongly classified as melanomas.
If both sensitivity and specificity were considered, a ranking on the average of these two values gave the best combination, which achieved 92.5% sensitivity and 80.0% specificity; this set of color features had an overall accuracy of 82.5% and was composed of six elements: $L_1$, $s_1$, $d_1$, $d_2$, $d_5$ and $d_6$. However, when we performed the classification for Comb 2 with all the shape features and the aforementioned three sets of color features, the overall accuracy of classification was only 55.0%, 54.0% and 61.0%, respectively. For Comb 2, the best overall accuracy of classification was 72.0%, with 62.1% sensitivity and 76.1% specificity, achieved by a set of three color features: the mean of $a^*$ and the mean of $b^*$ inside the skin lesion ($a_1^*$ and $b_1^*$), and the ratio of mean saturations ($d_6$). Similarly, when this feature set was used for Comb 1, the overall accuracy was only 52.5%, with 45.0% sensitivity and 54.4% specificity.

On the other hand, when all the color features were used for classification, the best results after searching among the combinations of shape features gave 84.5% overall accuracy for Comb 1, with 60.0% sensitivity and 90.6% specificity, and 70.0% overall accuracy for Comb 2, with 34.5% sensitivity and 84.5% specificity. As with the findings above, the set of shape features that achieved good results for Comb 1 did not perform well for Comb 2, and vice versa.

The results show that the sets of features that achieved the highest overall accuracy were not the ones with the highest sensitivity or specificity; consequently, a measure is needed to decide which set of features should be used in the second phase of the search. Given the composition of the training sets of Comb 1 and Comb 2, the total number of wrong classifications (WN) in the two tests was chosen as the index to rank the features.
Accordingly, seven sets of color features (found with all the shape features) and nine sets of shape features (found with all the color features) were chosen for the second phase of the search. Table 3 lists these 16 feature sets and the indices of the corresponding classification; the performance of these feature sets was moderate, but more balanced. When all the shape features were used, the minimum WN was 70, and the corresponding set of color features included: the mean lightness of the skin lesion ($L_1$), the lightness difference ($d_5$), the mean saturation of the neighboring skin and the skin lesion ($s_0$ and $s_1$), and the mean of $d_1$ and the mean of $d_2$ inside the skin lesion ($d_1$ and $d_2$). The minimum WN found with all the color features was 72, achieved by five sets of shape features. All of them contained the following three features: aspect ratio, smoothness of the lesion border, and area of the skin lesion; they differed in the choice among Hu's seven invariant image moments.

The frequency of occurrence (FO) of each feature in the feature sets with $WN \le 80$ and $WN \le 90$ was calculated and is illustrated in Figure 2. For the color features, the mean lightness of the skin lesion ($L_1$), the mean lightness of the neighboring skin ($L_0$), the lightness difference ($d_5$), the standard deviation of lightness inside the skin lesion ($L_{1\_\sigma}$), the mean saturation of the neighboring skin and the skin lesion ($s_0$ and $s_1$), the mean of $d_1$ and the mean of $d_2$ inside the skin lesion ($d_1$ and $d_2$), and the means of $a^*$, $b^*$, $u^*$ and $v^*$ inside the skin lesion ($a_1^*$, $b_1^*$, $u_1^*$, $v_1^*$) had high frequencies. For the shape features, aspect ratio, smoothness of the lesion border, and area of the skin lesion had the highest frequencies. Additionally, the difference found among the frequencies of Hu's seven image moments was small, which implies that the effectiveness of these moments may vary when paired with other features, but generally they are of equal importance.
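The indices used in this section can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the boolean label convention (True for melanoma) and the toy feature sets are hypothetical, and FO is assumed here to be the percentage of retained feature sets that contain a given feature, since the text does not give its exact normalization.

```python
from collections import Counter


def classification_indices(y_true, y_pred):
    """Return (OA, S1, S2, wrong) for binary labels, True = melanoma.

    Assumes both classes are present in `y_true`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # melanomas found
    tn = sum(1 for t, p in pairs if not t and not p)  # nevi found
    n_pos = sum(1 for t in y_true if t)
    n_neg = len(y_true) - n_pos
    oa = 100.0 * (tp + tn) / len(y_true)  # overall accuracy
    s1 = 100.0 * tp / n_pos               # sensitivity
    s2 = 100.0 * tn / n_neg               # specificity
    return oa, s1, s2, len(y_true) - tp - tn


def frequency_of_occurrence(results, max_wn):
    """FO (in %) of each feature among the feature sets with WN <= max_wn.

    `results` is a list of (feature_set, wn) pairs, where wn is the number
    of wrong classifications summed over the two tests.
    """
    kept = [set(fs) for fs, wn in results if wn <= max_wn]
    counts = Counter(f for fs in kept for f in fs)
    return {f: 100.0 * c / len(kept) for f, c in counts.items()}


# Toy test with two melanomas and three nevi.
y_true = [True, True, False, False, False]
y_pred = [True, False, False, False, True]
oa, s1, s2, wrong = classification_indices(y_true, y_pred)

# Toy search results: (feature set, WN over both combinations).
results = [
    ({"L1", "d5", "s1"}, 70),
    ({"L1", "d5"}, 78),
    ({"L1", "a1"}, 85),
    ({"s0"}, 95),
]
fo_80 = frequency_of_occurrence(results, max_wn=80)
print(oa, s1, s2, wrong, fo_80)
```

In this setting, WN for a feature set would be obtained by adding the `wrong` counts returned for the Comb 1 and Comb 2 tests.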
3.4 Second phase

This phase included 16,384 classifications for each of the seven sets of color features indicated in Table 3, and 472,392 classifications for each of the nine sets of shape features. With the color features fixed, the optimal set of shape features found in this phase improved the performance for both Comb 1 and Comb 2. For example, for the first set of color features in Table 3, the optimal set of shape features achieved 78.0% overall accuracy for Comb 2; and for the second set of color features in Table 3, the highest overall accuracy for Comb 1 increased to 90.5%, with 92.5% sensitivity and 90.0% specificity. Table 4 lists the best matches of shape and color features found in this phase. As in the first phase, the effective shape features were different for each of the seven sets of color features.

Figure 3 shows the FO of each shape feature in the feature sets with $WN \le 50, 60, 70, 80$ and $90$, and Figure 4 illustrates the FO of each color feature based on the results for the nine sets of shape features. Aspect ratio, smoothness of the lesion border, and area of the lesion region were the three shape features that always had high frequencies, especially among the feature sets with overall accuracy above 80.0%. This agrees with the finding of the first phase and indicates that these three shape features are effective and have a large influence on the performance of the classification. Figure 3 shows that the ranking of frequencies is generally the same, and that when the WN increases, the difference between frequencies becomes smaller, since more and more combinations of shape features can achieve the same lower performance. For the color features, the mean saturation of the neighboring skin and the skin lesion ($s_0$ and $s_1$), the mean of $d_1$ and the mean of $d_2$ inside the skin lesion ($d_1$ and $d_2$), and the lightness difference ($d_5$) were the ones with the highest frequencies, especially when the overall accuracy was greater than 80.0%.
Similarly, the difference of frequencies among color features decreases when the WN increases. These five color features were also among the ones with high frequencies in the first phase, which confirms their effectiveness; the features that had high frequencies in the first phase but not in this one, such as $L_1$, $a_1^*$ and $b_1^*$, are the ones that directly describe the color information; this phenomenon was probably caused by the large variations in the appearance of skin lesions.

3.5 Further selection

Since iterating the two-phase procedure cannot produce a worse result, a further search was carried out based on the results of the second phase. The sets of features used in this phase are listed in Table 4. After this search, we confirmed that most of the combinations in Table 4 were already the optimal matches; however, we found two extra combinations that achieved an equal or better performance, which are listed in Table 5. The first one is a set of color features that achieved the same WN as the one presented in Table 4, and the second one is a set of shape features with which the classification can achieve an even better performance. Nevertheless, another search with these two sets of features confirmed that the matches of shape features and color features in Table 5 were already the best, and no further sets of features would be able to achieve an equal or better performance. Hence, it is safe to conclude that, with the search procedures performed, the sets of features in Tables 4 and 5 were the optimal ones for classification.

Then, the frequencies of the shape features in the feature sets with $WN \le 50, 60, 70, 80$ and $90$ were calculated. In the calculation, we excluded the feature sets that met one of the following conditions: $S1 \le 60.0\%$ for Comb 1, or $S1 \le 40.0\%$ for Comb 2, or $S2 \le 70.0\%$ for either Comb 1 or Comb 2. These exclusions guaranteed that only the feature sets that had a good performance in all the indices of classification were taken into account.
Similarly, the frequencies of color features were calculated based on the results of the second phase and the extra search; the results are shown in Figures 5 and 6. The figures show that the shape and color features with the highest frequencies in this phase are in line with the findings of the first and second phase searches; hence, the effectiveness of these features was once again demonstrated.

3.6 Discussion

Although the training set of Comb 2 contains more samples of skin lesions, the classification achieved a better performance for Comb 1. One possible reason is the low imaging resolution of the 1st database, which considerably affects the calculation of color features. However, the 2nd database contains images with incomplete profiles of skin lesions, so the lesion borders in these images are inevitably inaccurate and can lead to wrong values of shape features. The inaccuracies of shape features and color features both have negative impacts on the classification; a pertinent question is which type of feature has the greater influence on the classification.

Following up this idea, classification was carried out using features of just one type. The results showed that with only the color features, the best overall accuracy was around 83% for Comb 2, with sensitivity ranging from 48.3% to 72.4% and specificity from 87.3% to 97.2%. On the other hand, with only the shape features, the best overall accuracy was about 59%, with 31.0% sensitivity (also the highest value found for this index) and 70.4% specificity. Table 6 indicates the best sets of shape features and color features found according to the WN. The table shows that the minimum WN with features of only one type was not much different from the minimum found using both types. However, the sensitivity and the specificity with both types were higher, and the performance of classification was more stable across different image databases.
The results also indicated that the color features are more robust for classification, which is consistent with the fact that these features are region-based.

The empirical search with Comb 1 and Comb 2 identified the features that are effective in achieving a satisfactory classification. However, it is worth pointing out that using only these features does not guarantee the best performance. In fact, classification using the five color features and three shape features that have the highest frequencies only achieved a moderate result, with 83.0% overall accuracy for Comb 1 and 73.0% overall accuracy for Comb 2; recall that features with low frequencies also appear in the feature sets with the best performance. Nevertheless, despite this randomness, these eight features are the most likely to achieve a good classification performance, and so should be prioritized in selection.

4. Conclusions

The features used in the automatic analysis of dermoscopic images have a critical influence on the performance of classification. However, a set of features that achieves good results on one database may not have an equal performance on another database, and a "redundant" feature may become indispensable to correctly detect a specific type of skin lesion. In this work, we aimed to find the effective features for detecting melanomas experimentally. The three image databases used in the study were from different origins and acquired under diverse imaging conditions, which provides an objective basis for evaluation. The results obtained confirmed the effective use of both shape and color features and suggested the need to combine them to achieve high classification accuracy and robustness.

A comprehensive study on features for classification has many practical constraints, because image databases from diverse studies are always different and the diagnostic accuracy of dermoscopy itself is not 100% even under optimal exam conditions (Kittler et al., 2002).
Nonetheless, the findings in our work showed that the performance of the automatic analysis of skin lesions in dermoscopic images is comparable to experienced dermatologists. Furthermore, the ABCD rule may not always be able to detect a small-sized melanoma or a melanoma with a regular shape and homogeneous color (Grin et al., 1990). In addition, there are some specific features that can be effective in classifying a particular type of skin lesion; for example, ridges and furrows were shown to be effective in detecting acral lentiginous melanoma (Iyatomi et al., 2008; Bradford et al., 2009; Yang et al., 2017). These subjects are challenges to be explored, and future work will continue to focus on solving these issues and finding more effective features. Acknowledgements This work was funded by European Regional Development Funds (ERDF), through the Operational Program ‘Thematic Factors of Competitiveness (COMPETE), and Portuguese Funds, through “Fundação para a Ciência e a Tecnologia” (FCT), under the 19 project: FCOMP-01-0124-FEDER-028160/PTDC/BBB-BMD/3088/2012. The first author also thanks FCT for the post-doc grant: SFRH/BPD/97844/2013. Authors gratefully acknowledge the funding of Project NORTE-01-0145-FEDER- 000022 - SciTech - Science and Technology for Competitive and Sustainable Industries, co-financed by “Programa Operacional Regional do Norte” (NORTE2020), through “Fundo Europeu de Desenvolvimento Regional” (FEDER). References Abbasi, N., Shaw, H., & Rigel, D. (2004). Early Diagnosis of Cutaneous Melanoma. JAMA: The Journal of Medical Association, 292, 2771–2776. Argenziano G., Soyer, H. P., De Giorgi, V., Piccolo, D., Carli, P., Delfino, M., Ferrari, A., Hofmann-Wellenhof, R., Massi, D., Mazzocchetti, G., Scalvenzi, M., & Wolf, I. H. (2002). Dermoscopy: a tutorial. EDRA Medical Publishing & New Media. Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Wright, A. F., Wilson, J. F., Agakov, F., Navarro, P., & Haley, C. 
S. (2015). Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports, 5, 10312.

Binder, M., Schwarz, M., Winkler, A., Steiner, A., Kaider, A., Wolff, K., & Pehamberger, H. (1995). Epiluminescence microscopy. A useful tool for the diagnosis of pigmented skin lesions for formally trained dermatologists. Archives of Dermatology, 131, 286–291.

Bradford, P. T., Goldstein, A. M., McMaster, M. L., & Tucker, M. A. (2009). Acral lentiginous melanoma: incidence and survival patterns in the United States, 1986-2005. JAMA Dermatology, 145, 427–434.

Celebi, M. E., Kingravi, H. A., Uddin, B., Iyatomi, H., Aslandogan, Y. A., Stoecker, W. V., & Moss, R. H. (2007). A methodological approach to the classification of dermoscopy images. Computerized Medical Imaging and Graphics, 31, 362–373.

Celebi, M. E., Wen, Q., Hwang, S., Iyatomi, H., & Schaefer, G. (2013). Lesion border detection in dermoscopy images using ensembles of thresholding methods. Skin Research and Technology, 19, e252–e258.

Chu, F., & Wang, L. (2005). Applications of support vector machines to cancer classification with microarray data. International Journal of Neural Systems, 15, 475–484.

Filho, M., Ma, Z., & Tavares, J. M. R. S. (2015). A review of the quantification and classification of pigmented skin lesions: from dedicated to hand-held devices. Journal of Medical Systems, 39, 177.

Flusser, J., & Suk, T. (2006). Rotation moment invariants for recognition of symmetric objects. IEEE Transactions on Image Processing, 15, 3784–3790.

Friedman, R. J., Rigel, D. S., & Kopf, A. W. (1985). Early detection of malignant melanoma: the role of physician examination and self-examination of the skin. CA: A Cancer Journal for Clinicians, 35, 130–151.

Gonzalez, R. C., & Woods, R. E. (2007). Digital image processing (3rd ed.). Prentice Hall.

Grin, C. M., Kopf, A. W., Welkovich, B., Bart, R. S., & Levenstein, M. J. (1990).
Accuracy in the clinical diagnosis of malignant melanoma. Archives of Dermatology, 126, 763–766.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.

Haidekker, M. (2010). Advanced biomedical image analysis (1st ed.). New Jersey: John Wiley & Sons (Chapter 9).

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). Springer Science & Business Media.

Heymann, W. R. (2005). Clinical and microscopic diagnosis of melanoma. Journal of the American Academy of Dermatology, 52, 133–134.

Hu, M. K. (1962). Visual Pattern Recognition by Moment Invariants. IRE Transactions on Information Theory, 8, 179–187.

Iyatomi, H., Oka, H., Celebi, M. E., Ogawa, K., Argenziano, G., Soyer, H. P., Koga, H., & Saida, T. (2008). Computer-based classification of dermoscopy images of melanocytic lesions on acral volar skin. Journal of Investigative Dermatology, 128, 2049–2054.

Kittler, H., Pehamberger, H., Wolff, K., & Binder, M. (2002). Diagnostic accuracy of dermoscopy. Lancet Oncology, 3, 159–165.

Kulkarni, S. R., Lugosi, G., & Venkatesh, S. S. (1998). Learning pattern classification - A survey. IEEE Transactions on Information Theory, 44, 2178–2206.

Lyon, V. B. (2010). The Spitz Nevus: Review and Update. Clinics in Plastic Surgery, 37, 21–33.

Ma, Z., & Tavares, J. M. R. S. (2016). A Novel Approach to Segment Skin Lesions in Dermoscopic Images Based on a Deformable Model. IEEE Journal of Biomedical and Health Informatics, 20, 615–623.

Mendonca, T., Ferreira, P. M., Marques, J. S., Marcal, A. R., & Rozeira, J. (2013). PH² - a dermoscopic image database for research and benchmarking. Conf Proc IEEE Eng Med Biol Soc, 5437–5440.

Olson, E. (2011). Shape factors and their use in image analysis–part 1: theory. J GXP Compliance, 15, 85–96.

Saeys, Y., Inza, I., & Larranaga, P. (2007).
A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.

Silveira, M., Nascimento, J. C., Marques, J. S., Marçal, A. R. S., Mendonça, T., Yamauchi, S., Maeda, J., & Rozeira, J. (2009). Comparison of segmentation methods for melanoma diagnosis in dermoscopy images. IEEE Journal on Selected Topics in Signal Processing, 3, 35–45.

Yang, S., Oh, B., Hahm, S., Chung, K. Y., & Lee, B. U. (2017). Ridge and furrow pattern classification for acral lentiginous melanoma using dermoscopic images. Biomedical Signal Processing and Control, 32, 90–96.

Yoradjian, A., Simoes, M. M., Enokihara, S., & Paschoal, F. M. (2012). Nevo de spitz e nevo de reed. Anais Brasileiros de Dermatologia, 87, 349–359.

FIGURE CAPTIONS

Fig. 1 Examples of dermoscopic images from the three databases overlaid with the lesion borders (blue contours): (a) non-melanoma in the 1st database, (b) melanoma in the 1st database, (c) non-melanoma in the 2nd database, (d) melanoma in the 2nd database, (e) non-melanoma in the 3rd database and (f) melanoma in the 3rd database.

Fig. 2 Illustration of the frequencies of features based on the first search phase: (a) color features and (b) shape features.

Fig. 3 Illustration of the frequencies of shape features based on the second search phase.

Fig. 4 Illustration of the frequencies of color features based on the second search phase.

Fig. 5 Illustration of the frequencies of shape features based on the second search phase and the extra searches.

Fig. 6 Illustration of the frequencies of color features based on the second search phase and the extra searches.

TABLES

Table 1 Features for selection

| Type | Number | Details |
|---|---|---|
| Shape features | 14 | Ratio of perimeter to area of the skin lesion; smoothness of the lesion border; the seven Hu’s invariant image moments; perimeter of the lesion border; area of the skin lesion; aspect ratio; extent ratio; solidity ratio. |
| Color features | 36 | 𝐿8, 𝑎8∗, 𝑏8∗, 𝑢8∗, 𝑣8∗, 𝐿C, 𝑎C∗, 𝑏C∗, 𝑢C∗, 𝑣C∗, 𝑑E, 𝑑B, 𝑑D, 𝑠8, 𝑠C, 𝑑S, 𝐿8_𝜎, 𝑎8∗_𝜎, 𝑏8∗_𝜎, 𝑢8∗_𝜎, 𝑣8∗_𝜎, 𝐿C_𝜎, 𝑎C∗_𝜎, 𝑏C∗_𝜎, 𝑢C∗_𝜎, 𝑣C∗_𝜎, 𝑠8_𝜎, 𝑠C_𝜎, 𝑑8, 𝑑:, 𝑑w (the mean distance to the geometric centroid 𝑎C∗,𝑏C∗ of neighboring healthy skin), 𝑑x (the mean distance to the geometric centroid 𝑢C∗,𝑣C∗ of neighboring healthy skin), 𝑑8_𝜎, 𝑑:_𝜎, 𝑑w_𝜎, 𝑑x_𝜎. |

Table 2 Image databases used in the study

| Database | Size | Melanoma | Origin |
|---|---|---|---|
| 1 | 100 | 29 | Private clinics in US and Australia (Ma and Tavares, 2016; Celebi et al., 2013) |
| 2 | 200 | 40 | PH² image database (Mendonca et al., 2013) |
| 3 | 404 | 105 | Interactive atlas of dermoscopy (Argenziano et al., 2002) |

Table 3 Feature sets chosen for the second search phase

| Features | WN³ | Comb 1 OA³ | Comb 1 S1³ | Comb 1 S2³ | Comb 2 OA³ | Comb 2 S1³ | Comb 2 S2³ |
|---|---|---|---|---|---|---|---|
| 100000000010011000000000000011000000¹ | 70 | 82.5% | 70.0% | 85.6% | 65.0% | 41.4% | 74.6% |
| 100000000010010100000000000011000000¹ | 74 | 82.5% | 92.5% | 80.0% | 61.0% | 44.8% | 67.6% |
| 000001000010011100000000000011000000¹ | 75 | 81.5% | 70.0% | 84.4% | 62.0% | 44.8% | 69.0% |
| 100001000010011010000000000011000000¹ | 75 | 81.5% | 60.0% | 86.9% | 62.0% | 51.7% | 66.2% |
| 111110000000011010000000001011000000¹ | 75 | 78.5% | 25.0% | 91.9% | 68.0% | 44.8% | 77.5% |
| 100000000010011010000000001011000000¹ | 76 | 80.0% | 45.0% | 88.8% | 64.0% | 51.7% | 69.0% |
| 100001000000011000000000000011000000¹ | 76 | 80.5% | 60.0% | 85.6% | 63.0% | 41.4% | 71.8% |
| 01010001101100² | 72 | 84.5% | 60.0% | 90.6% | 59.0% | 34.5% | 69.0% |
| 01010010101100² | 72 | 84.5% | 60.0% | 90.6% | 59.0% | 34.5% | 69.0% |
| 01010100101100² | 72 | 84.5% | 60.0% | 90.6% | 59.0% | 34.5% | 69.0% |
| 01010101001100² | 72 | 84.5% | 60.0% | 90.6% | 59.0% | 34.5% | 69.0% |
| 01010101101100² | 72 | 84.5% | 60.0% | 90.6% | 59.0% | 34.5% | 69.0% |
| 11011110011100² | 73 | 81.5% | 57.5% | 87.5% | 64.0% | 31.0% | 77.5% |
| 11011110111100² | 73 | 81.5% | 57.5% | 87.5% | 64.0% | 31.0% | 77.5% |
| 11011111011100² | 73 | 81.5% | 57.5% | 87.5% | 64.0% | 31.0% | 77.5% |
| 11011111111100² | 73 | 81.5% | 57.5% | 87.5% | 64.0% | 31.0% | 77.5% |

¹ Combination of color features represented by a binary string: ‘0’ = not used, ‘1’ = used. From left to right, the features follow the sequence listed in Table 1.
² Combination of shape features represented by a binary string: ‘0’ = not used, ‘1’ = used. From left to right, the features follow the sequence listed in Table 1.
³ WN: wrong number; OA: overall accuracy; S1: sensitivity; S2: specificity.

Table 4 Feature sets selected for another search

| Features | WN | Comb 1 OA | Comb 1 S1 | Comb 1 S2 | Comb 2 OA | Comb 2 S1 | Comb 2 S2 |
|---|---|---|---|---|---|---|---|
| 100000000010011000000000000011000000¹ + 01100100001110² | 45 | 90.0% | 75.0% | 93.8% | 75.0% | 44.8% | 87.3% |
| 100000000010011000000000000011000000¹ + 01100100101110² | 45 | 90.0% | 75.0% | 93.8% | 75.0% | 44.8% | 87.3% |
| 100000000010011000000000000011000000¹ + 11010001101100² | 45 | 90.5% | 72.5% | 95.0% | 74.0% | 44.8% | 85.9% |
| 100000000010011000000000000011000000¹ + 11110001101100² | 46 | 90.0% | 72.5% | 94.4% | 74.0% | 44.8% | 85.9% |
| 100000000010011000000000000011000000¹ + 11110110101100² | 46 | 89.5% | 72.5% | 93.8% | 75.0% | 44.8% | 87.3% |
| 100000000010011000000000000011000000¹ + 11010100001100² | 46 | 90.0% | 67.5% | 95.6% | 74.0% | 44.8% | 85.9% |
| 100000000010010100000000000011000000¹ + 11100100101100² | 45 | 90.0% | 95.0% | 88.8% | 75.0% | 34.5% | 91.5% |
| 100000000010010100000000000011000000¹ + 11100000101100² | 46 | 89.0% | 95.0% | 87.5% | 76.0% | 37.9% | 91.5% |
| 01010001101100² + 000000000000010100000000001000000000¹ | 46 | 88.5% | 67.5% | 93.8% | 77.0% | 41.4% | 91.5% |
| 01010010101100² + 000000000000000000000000000011001100¹ | 45 | 89.5% | 62.5% | 96.3% | 76.0% | 65.5% | 80.3% |
| 01010101001100² + 000000000000000000000000000011110000¹ | 46 | 89.5% | 62.5% | 96.3% | 75.0% | 62.1% | 80.3% |

¹ Combination of color features with the sequence defined in Table 1.
² Combination of shape features with the sequence defined in Table 1.

Table 5 Feature sets selected for an extra search

| Features | WN | Comb 1 OA | Comb 1 S1 | Comb 1 S2 | Comb 2 OA | Comb 2 S1 | Comb 2 S2 |
|---|---|---|---|---|---|---|---|
| 11010100001100² + 100000000010011000000000000011000000¹ | 46 | 90.0% | 67.5% | 95.6% | 74.0% | 44.8% | 85.9% |
| 11010100001100² + 000000000000000100000000000011000000¹ | 46 | 88.5% | 60.0% | 95.6% | 77.0% | 65.5% | 81.7% |
| 000000000000010100000000001000000000¹ + 01000001001100² | 43 | 90.0% | 72.5% | 94.4% | 77.0% | 44.8% | 90.1% |
| 000000000000010100000000001000000000¹ + 01010001101100² | 46 | 88.5% | 67.5% | 93.8% | 77.0% | 41.4% | 91.5% |

¹ Combination of color features with the sequence defined in Table 1.
² Combination of shape features with the sequence defined in Table 1.
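As a reading aid for Tables 3 to 5, the following minimal Python sketch shows how a binary feature string can be decoded and how WN, OA, S1 and S2 relate to each other. It is an illustration under stated assumptions, not the code used in this work. The names `SHAPE_FEATURES`, `decode_mask` and `Counts` are hypothetical, and the labels Hu1 to Hu7 assume the seven Hu moments appear in their usual order. The confusion counts in the example are back-calculated from the percentages in the first row of Table 3, assuming that Comb 1 is evaluated on a 200-image set with 40 melanomas and Comb 2 on a 100-image set with 29 melanomas (the sizes listed in Table 2); under that assumption the reported values are reproduced.

```python
from dataclasses import dataclass
from typing import List

# Shape features in the order listed in Table 1.
# Assumption: the seven Hu moments are ordered Hu1..Hu7.
SHAPE_FEATURES: List[str] = (
    ["perimeter/area ratio", "border smoothness"]
    + [f"Hu{i}" for i in range(1, 8)]
    + ["perimeter", "area", "aspect ratio", "extent ratio", "solidity ratio"]
)


def decode_mask(mask: str, names: List[str]) -> List[str]:
    """Return the feature names whose bit is '1' (left to right)."""
    if len(mask) != len(names) or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a binary string matching the feature list")
    return [name for bit, name in zip(mask, names) if bit == "1"]


@dataclass
class Counts:
    """Confusion counts with melanoma as the positive class."""
    tp: int  # melanomas classified as melanoma
    fn: int  # melanomas classified as non-melanoma
    tn: int  # non-melanomas classified as non-melanoma
    fp: int  # non-melanomas classified as melanoma

    @property
    def wrong(self) -> int:
        return self.fn + self.fp

    @property
    def overall_accuracy(self) -> float:  # OA
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)

    @property
    def sensitivity(self) -> float:  # S1
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:  # S2
        return self.tn / (self.tn + self.fp)


# First shape string of Table 3, without its footnote marker.
print(decode_mask("01010001101100", SHAPE_FEATURES))

# Back-calculated counts for the first row of Table 3 (assumed, see lead-in).
comb1 = Counts(tp=28, fn=12, tn=137, fp=23)  # 200 images, 40 melanomas
comb2 = Counts(tp=12, fn=17, tn=53, fp=18)   # 100 images, 29 melanomas
print(comb1.wrong + comb2.wrong)  # 70, matching WN in that row
```

With these assumed counts, WN equals the sum of the misclassified images over Comb 1 and Comb 2, which is also consistent with the other rows of Tables 3 to 5 when each OA is multiplied back by the assumed set size.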
Table 6 Best feature sets based on the classification with features of one type

| No. | Shape features / Color features | Comb 1 OA | Comb 1 S1 | Comb 1 S2 | Comb 2 OA | Comb 2 S1 | Comb 2 S2 |
|---|---|---|---|---|---|---|---|
| 1 | 00010100011010² | 90.0% | 65.0% | 96.3% | 72.0% | 6.9% | 98.6% |
| 1 | 000001001100000100000100000000000000¹ | 86.5% | 57.5% | 93.8% | 80.0% | 31.0% | 100.0% |
| 2 | 10001110011010² | 89.5% | 60.0% | 96.9% | 72.0% | 3.4% | 100.0% |
| 2 | 000001110000001100000100000000000000¹ | 87.0% | 57.5% | 94.4% | 79.0% | 37.9% | 95.8% |
| 3 | 00010011111010² | 89.5% | 70.0% | 94.4% | 71.0% | 3.4% | 98.6% |
| 3 | 011001000000000100000100000000000000¹ | 90.5% | 70.0% | 95.6% | 72.0% | 10.3% | 97.2% |
| 4 | 00011001011010² | 89.0% | 70.0% | 93.8% | 72.0% | 3.4% | 100.0% |
| 4 | 000001110000001000000100000000000000¹ | 87.5% | 52.5% | 96.3% | 77.0% | 27.6% | 97.2% |
| 5 | 00011100101110² | 89.5% | 55.0% | 98.1% | 71.0% | 6.9% | 97.2% |
| 5 | 000001001100000000000100000000000000¹ | 87.0% | 60.0% | 93.8% | 77.0% | 20.7% | 100.0% |
| 6 | 00101101110110² | 89.5% | 55.0% | 98.1% | 71.0% | 3.4% | 98.6% |
| 6 | 011001000010000000000100000011001100¹ | 88.0% | 52.5% | 96.9% | 74.0% | 37.9% | 88.7% |

¹ Combination of color features with the sequence defined in Table 1.
² Combination of shape features with the sequence defined in Table 1.

FIGURES

[Figures 1a to 1f, 2a, 2b, 3, 4, 5 and 6]