1 Introduction

Since Fisher’s original publication in the context of linear discriminant analysis [4], the iris flower dataset has become one of the most well-known and widely explored datasets in statistical classification and machine learning (ML), with over 18,000 citations at the time of writing.

The dataset consists of 150 observations, equally divided into three classes (Iris setosa, Iris virginica and Iris versicolor), which are described by four features: the length and the width of the sepals and petals, in centimeters. The measurements were taken by Edgar Anderson [1], who was interested in measuring the morphological variation of these species, while Fisher was the first to use the dataset in a statistical learning context.

Iris is generally considered an easy classification problem and is frequently used as ML’s “hello world” (i.e., as the first example one comes into contact with as a beginner in this area). Most classification techniques have no trouble achieving accuracies well above 90% for iris with various hyperparameter configurations, as shown by multiple benchmark results available at OpenML [17].
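As an illustration (not from the original paper), cross-validating a default decision tree on scikit-learn’s bundled copy of the original iris dataset typically lands well above 90% accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Fisher's original 150-sample, 4-feature iris dataset.
X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy of an untuned decision tree.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
```

Even this untuned baseline illustrates why the original dataset is considered easy.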

In this paper, we introduce a new iris flower toy dataset, which breaks away from the original dataset’s simplicity by turning the problem of iris classification into a computer vision (CV) task. Since the rise of deep learning (DL), the CV literature has seen many of its well-known problems, including MNIST [9], ImageNet [3], and the newer Fashion-MNIST [18], become mostly solved, i.e., models have achieved near-human or even better-than-human accuracy on these tasks.

Hence, this work introduces a dataset that features CV challenges such as fine-grained categorization, background noise, real-world environment conditions (e.g., lighting variation), as well as different scales and non-centered objects. The main idea is to make this dataset available so that it can be used to validate machine/deep learning approaches under difficult CV scenarios. The paper also reports classification results of traditional machine learning algorithms, as well as some state-of-the-art deep learning architectures, on the Iris-CV dataset.

The remainder of this paper is organized as follows: Sect. 2 reviews related work using the iNaturalist dataset, Sect. 3 describes our new dataset, Sect. 4 presents benchmark results, including experiments with traditional ML algorithms and state-of-the-art deep neural networks, and finally, Sect. 5 contains our final remarks.

2 Related Works

iNaturalist is a dataset proposed by [16] containing 675,170 images of more than 5,000 different species of animals and plants; the plant category alone has almost 200,000 images. The images were captured all over the world, with different cameras and image qualities, and the dataset has a large class imbalance [16]. It is also constantly being updated.

Although iNaturalist originated as a competition dataset used in Kaggle challenges, it has also been used in published research. Some works focused on detection [16], while others, like this one, targeted classification.

Plant classification was the focus of a study using the iNaturalist dataset [12], where the authors used a convolutional neural network to classify plant subclasses. Data augmentation was used to reduce overfitting and balance the classes, and a transfer learning approach based on ResNet50 was then applied. Unlike our work, that study classifies many different plant species rather than irises alone.

Another study used the entire iNaturalist dataset for both classification and detection [16]. For classification, the following deep network architectures were evaluated: ResNets, Inception V3, Inception ResNetV2, and MobileNet. Among those models, Inception ResNetV2 SE had the best performance. That work also classifies plant species in general rather than focusing on iris flowers, and, unlike ours, its experiments did not include classic algorithms.

While this work uses the iNaturalist dataset as its single source, other researchers have combined it with different datasets. An example is the work proposed by [7]: after selecting images of different plant species from three datasets, the authors applied deep learning techniques for plant classification. Their goal was to achieve at least 50% accuracy as a classification baseline, and ResNet50 was able to classify almost half of the iNaturalist observations correctly. The iNaturalist dataset performed better than the Portuguese Flora dataset, but the Google Image Search observations outperformed both.

As the purpose of this paper is to classify different Iris species, the iNaturalist dataset provided the necessary images. This paper differs from the works described in this section in that it focuses on a benchmark analysis, covering a variety of algorithms and the computational cost of each. In addition, we focus specifically on iris flowers, whereas the other works considered plants in general.

3 A New Iris Dataset

Our new iris dataset, called Iris-CV, consists of 5,139 examples extracted from iNaturalist (September 10th, 2020). Each example is an RGB image associated with a label corresponding to one of five species: Bearded Iris (Be) (Iris x germanica, 928 images), Douglas Iris (Do) (Iris douglasiana, 944 images), Dwarf Crested Iris (Dw) (Iris cristata, 1290 images), Western Blue Iris (We) (Iris missouriensis, 1036 images), and Yellow Iris (Ye) (Iris pseudacorus, 941 images), as shown in Table 1. All images were gathered from iNaturalist [10], a website that provides Creative Commons-licensed pictures of fauna and flora taken by users worldwide.

Table 1. Dataset size per class

After downloading the images, we manually removed photos that had too much noise, i.e., pictures of many different flowers, human hands covering the majority of the frame, and blurry images, which could confuse learning models, resulting in a harder problem.

The original images have different sizes, therefore we resized them to a resolution of 256x256, which maintained the images’ main features while avoiding high memory requirements. We also kept the color information, which is coded in three RGB channels, as it can be important to differentiate the classes. Table 2 shows resized examples of each class. The classes show different color patterns, particularly the Yellow Iris, which is appropriately named, and the Dwarf Crested Iris, with its white and orange crests.

Table 2. Class names and examples from the Iris-CV dataset.

The pictures also show heavy background information, thus models must learn how to separate the flowers from the background to perform well. Images may contain multiple and/or non-centered flowers. Additionally, due to the different image and flower sizes, some individuals may become small compared to others after we resize the pictures, as seen in some examples in Table 2. There can also be very different lighting conditions across pictures.

Finally, successful models will have to learn that each species can show different colors, for instance, Dwarf Crested Irises can be lavender, lilac, pale blue, purple, white, or pink. As a result of all these features, Iris-CV can be a challenging computer vision problem.

4 Benchmark Results

We begin our experimental analysis by validating our dataset with eight classic algorithms, implemented using the scikit-learn and XGBoost libraries, as listed below:

  • Decision tree (DT);

  • Extra tree (ET);

  • Gradient boosting (GB);

  • Extreme gradient boosting (XGB);

  • Multilayer perceptron (MLP);

  • Perceptron;

  • Random forest (RF);

  • K-nearest neighbors (KNN).

Since these algorithms are not originally equipped to receive pixel matrices as input, we flattened the 256x256x3 images, obtaining vectors with 196,608 dimensions which can then be used as input. After obtaining the input vectors, we rescaled the images by dividing each pixel by 255 before training and testing the algorithms.
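The flattening and rescaling step can be sketched with NumPy; the random batch here is a stand-in for the real images:

```python
import numpy as np

# Hypothetical batch of 32 RGB images at the paper's 256x256 resolution,
# with 8-bit pixel values in [0, 255] (a stand-in for the real dataset).
images = np.random.randint(0, 256, size=(32, 256, 256, 3), dtype=np.uint8)

# Flatten each image into a 256*256*3 = 196,608-dimensional vector,
# then rescale pixel intensities to [0, 1] by dividing by 255.
X = images.reshape(len(images), -1).astype(np.float32) / 255.0
```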

Experiments were carried out using 5-fold stratified cross-validation, which maintains class proportions across folds. The code was implemented using Python’s scikit-learn library [11].
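A minimal sketch of this evaluation protocol, using scikit-learn’s StratifiedKFold on synthetic stand-in data (the classifier and features are placeholders, not the paper’s exact setup):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))       # placeholder feature vectors
y = np.repeat([0, 1, 2, 3, 4], 30)   # five classes, as in Iris-CV

# StratifiedKFold preserves each class's proportion in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
```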

The algorithms and their hyperparameter values were chosen based on their performance on the Fashion-MNIST benchmark [18], and we used several machines to streamline the execution of tasks. Although each machine has a different configuration (i.e., processor, graphics card, and memory), it is important to note that this only affects training time, not the algorithms’ overall performance. The best results achieved per classifier can be seen in Table 3.

We tested different values (Table 3) for the following hyperparameters: criterion; objective; splitter; max_depth; loss; n_estimators; activation; hidden_layer_sizes; penalty; n_neighbors; weights; and p. See scikit-learn’s documentation [11] for more details about these parameters. In addition, Table 4 describes the hardware resources for each classifier.

Table 3. Results with standard deviation (Std) of classic algorithms for the Iris-CV dataset using 5-fold cross validation. The time column refers only to training time. Hyperparameter names are shown as they appear in scikit-learn’s documentation.
Table 4. Hardware settings for training and evaluating each classifier

Even though most results surpassed the expected accuracy of a baseline that always predicts the majority class (0.251), the overall performance shows how difficult this problem is for classic methods without any extra preprocessing specifically designed to improve their results. The best results were obtained by XGBoost with 500 estimators, multi:softmax as the objective, and a max depth of 3, reaching a mean accuracy of 0.614. Results were also poor compared to Fashion-MNIST [18] and MNIST [9], where the same algorithms can reach accuracy values over 0.85 and 0.95, respectively.
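For reference, the best configuration reported above can be written as an XGBoost parameter set; only the three stated hyperparameters come from the text, and the rest (e.g., `num_class`) are assumptions for a runnable sketch:

```python
# Hyperparameters of the best classic result reported above.
# Only n_estimators, objective, and max_depth come from the text;
# num_class is an assumption (five iris species in Iris-CV).
best_params = {
    "n_estimators": 500,
    "objective": "multi:softmax",
    "max_depth": 3,
    "num_class": 5,
}

# With the xgboost package installed, the model would be built as:
# from xgboost import XGBClassifier
# clf = XGBClassifier(**best_params)
```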

As mentioned in Sect. 3, our dataset’s difficulty can be explained by some factors, such as somewhat high-dimensional images, RGB channels, non-centered images, noisy backgrounds, and petals with different colors within each species. Thus, this problem cannot be tackled by simple approaches.

4.1 Deep Neural Net Results

We now turn to state-of-the-art Convolutional Neural Network (CNN) architectures, implemented using TensorFlow 2 [5], to see how these techniques perform in the classification of Iris-CV images. We chose the hyperparameters used in the DenseNet [8] paper, since this is a state-of-the-art network. We trained each network for 40 epochs with the stochastic gradient descent (SGD) optimizer, using a Nesterov momentum of 0.9. Regarding the learning rate, we chose a starting value of 0.1, decayed by a factor of \(10^{-\frac{epoch}{20}}\), a schedule we found empirically to improve network performance. In addition, we evaluated other state-of-the-art architectures, such as EfficientNet, MobileNet, and ResNet50, to check for performance differences across architectures.
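The learning-rate schedule can be expressed as a small function; the negative exponent is our reading of “decaying” (a positive exponent would grow the rate instead):

```python
def learning_rate(epoch: int, base_lr: float = 0.1) -> float:
    """Start at base_lr and decay by a factor of 10^(-epoch/20)."""
    return base_lr * 10.0 ** (-epoch / 20.0)

# In Keras this could be attached to training via, e.g.,
# tf.keras.callbacks.LearningRateScheduler(learning_rate).
```

Under this schedule the rate falls from 0.1 at epoch 0 to 0.01 at epoch 20 and 0.001 at the final (40th) epoch.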

We also performed data augmentation to balance the class proportions. We tested several combinations of the parameters of the ImageDataGenerator provided by TensorFlow/Keras, inspected the resulting images, and chose the combination that kept most of the original information. The augmentation parameters are listed below:

  • Rescale: 1/255

  • Rotation Range: 20\(^{\circ }\)

  • Width Shift Range: 0.1

  • Height Shift Range: 0.1

  • Horizontal Flip: True

  • Shear Range: 0.1

  • Zoom Range: 0.4 - 0.5

  • Fill Mode: nearest
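The settings above map directly onto keyword arguments of Keras’s ImageDataGenerator; a sketch follows (TensorFlow itself is not imported here, so the final call is left commented):

```python
# Augmentation settings from the list above, as keyword arguments for
# Keras's ImageDataGenerator.
aug_params = {
    "rescale": 1.0 / 255,
    "rotation_range": 20,        # degrees
    "width_shift_range": 0.1,
    "height_shift_range": 0.1,
    "horizontal_flip": True,
    "shear_range": 0.1,
    "zoom_range": [0.4, 0.5],
    "fill_mode": "nearest",
}

# With TensorFlow installed:
# from tensorflow.keras.preprocessing.image import ImageDataGenerator
# datagen = ImageDataGenerator(**aug_params)
```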

More information about these settings and the process used by ImageDataGenerator is available in the official TensorFlow documentation. The trained architectures are listed below:

DenseNet121. Proposed by Huang et al. [8], this deep network aims to optimize the flow of information between layers by densely connecting them, reducing convergence time through shorter paths.

EfficientNetB0. This architecture, introduced by Tan and Le [15], is obtained through a compound scaling method applied to a ConvNet’s depth, width, and resolution.

InceptionV3. This architecture is a refinement of its predecessors, first through the introduction of batch normalization, and later through additional factorization ideas in the third iteration [14].

MobileNetV2. This network uses lightweight convolution layers to filter features in the intermediate layers and is based on an inverted residual structure [13].

ResNet50. ResNet [6] is an abbreviation of Residual Network. This type of deep convolutional neural network uses residual blocks and can be trained at great depths while avoiding the vanishing gradient problem. Specifically, ResNet50 is a 50-layer residual network.

Xception. Xception [2] is a novel deep convolutional neural network architecture inspired by Inception [14], where Inception modules have been replaced with depthwise separable convolutions.

The models were evaluated using 5-fold stratified cross-validation. Table 5 shows that all CNNs outperformed almost all classic algorithms, as expected. Most architectures achieved accuracies over 0.74, except for EfficientNetB0, which achieved the worst performance among the deep learning architectures. Results also show that MobileNetV2 holds the highest accuracy and can be considered the state of the art for the Iris-CV dataset.

Table 5. Results with standard deviation (Std) of Deep Learning architectures for the Iris-CV dataset using 5-fold cross-validation. Columns B and E refer to batch size and number of epochs, respectively. We used different batch sizes considering hardware resources of each machine. The DenseNet121, EfficientNetB0, and MobileNetV2 architectures were trained using RTX 2060 Super, ResNet 50 was trained with RTX 2070, while the remaining ones were trained with a GTX 1660Ti.

In addition to assessing accuracy, we analyzed the confusion matrix corresponding to the best MobileNetV2 result, so that we can determine which classes the model has the most difficulty predicting. The confusion matrix is shown in Table 6.

Table 6. Confusion matrix corresponding to MobileNetV2’s best test accuracy – each class is represented by the first two letters in its common name.

The MobileNetV2 confusion matrix shows that the network was good at differentiating Western Iris (We), Yellow Iris (Ye), and Dwarf Crested Iris (Dw) from the other classes, as their precision scores corresponded to approximately 88%, 89%, and 89%, respectively, as shown in Table 7. The other two classes – Bearded Iris (Be) and Douglas Iris (Do) – were not as well discriminated, and the algorithm had some trouble distinguishing them correctly since the flowers belonging to these two classes often have the same color, although they have different petal shapes. The similarity between these two classes can be observed in Table 8. Douglas Iris can also sometimes be confused with Dwarf Crested Irises, due to their white petal markings. As a result, Douglas Irises were the hardest flowers to classify correctly, with only 69% precision.
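Per-class precision values like those above can be computed from predictions with scikit-learn; the labels below are hypothetical stand-ins for illustration, not the actual MobileNetV2 outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

classes = ["Be", "Do", "Dw", "We", "Ye"]  # two-letter codes as in Table 6

# Hypothetical true labels and predictions (stand-ins for the real test fold).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 1, 0, 2, 2, 3, 3, 4, 4])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
per_class_precision = precision_score(y_true, y_pred, average=None)
```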

Table 7. Precision, recall and F1-score corresponding to MobileNetV2’s best results on the test dataset – each class is represented by the first two letters in its common name
Table 8. Examples of Bearded Iris and Douglas Iris flowers showing their similar colors, but different petal structures.

5 Conclusion

This paper introduced a new computer vision dataset, called Iris-CV, consisting of five classes of iris flowers. The images show many features that make this dataset a challenging task, such as non-centered flowers, different lighting conditions, multiple flowers per image, and classes that naturally appear with different petal colors. Due to all of these reasons, Iris-CV proved to be too hard for traditional machine learning algorithms, with poorer results than those observed for established benchmark datasets, such as MNIST and Fashion-MNIST.

State-of-the-art deep neural nets also performed worse than current results for MNIST, Fashion-MNIST, and ImageNet, with MobileNetV2 achieving 82% accuracy, which is the best cross-validated result so far. Additionally, an analysis of the best confusion matrix produced by MobileNetV2 showed that three of the classes are more easily classified, namely Dwarf Crested Iris, Western Iris, and Yellow Iris. The remaining two classes – Bearded Iris and Douglas Iris – offered harder challenges, with the latter being the toughest to discriminate.

Since this paper’s main objective is to propose a new iris dataset featuring common computer vision problems (e.g., occlusion, background noise, and fine-grained features), we established a baseline with widely used and state-of-the-art algorithms. Future work includes further exploration of deep neural network architectures and regularization techniques, since the architectures used in this work were designed for learning large datasets and some of them overfitted. In addition, hyperparameter tuning is possible through optimization techniques such as Bayesian optimization, grid search, or random search, especially for the classic machine learning approaches, which performed worse on this dataset. Therefore, our next goals are to improve the current benchmark results, collect more data from different sources, and extract an object detection dataset based on the same images.