1 Introduction

There are currently about 5 million properties in Brazil that subsist on family farming, employing more than 10 million people. According to the Brazilian government [1], smallholder farmers represent 84.4% of rural establishments and are responsible for 74% of rural jobs and 33.2% of the agricultural GDP. In Pernambuco, located in the Northeast region of Brazil, family farming employs ca. 83% of the people living in the countryside. However, smallholder farmers are economically vulnerable, especially in a region that still has one of the highest poverty rates in the country. Moreover, they are strongly affected by a changing climate: even today, Pernambuco suffers from major meteorological disasters, especially droughts.

Despite the great importance of family farming, small-scale producers have to face their challenges alone. They lack the financial resources to hire private agricultural consultants, and governmental support is sparse. According to the Agricultural Censuses of 2006 and 2017, about 80% of agricultural establishments declared not having received the technical support required to adopt modern technology and improve productivity. Furthermore, pests and plant diseases can have severe economic impacts, causing losses that reached up to 43% of the annual production in 2016. Significant yield losses are avoidable when caused by the misuse (or lack of use) of phytosanitary products for disease control. This situation is aggravated by the aging and low education level of small-scale producers: 15.6% of the farmers who use phytosanitary products are illiterate [1]. Correct disease identification is vital for any treatment strategy that explicitly targets the causing agents, avoiding the excessive use of non-specific agrochemicals. Access to technical assistance would inform family farmers about technological and management innovations and empower them to adopt these measures correctly, reducing the risks inherent to agricultural activities.

The present work proposes an assistant system based on deep learning to identify the presence of disease symptoms in images of plant leaves. The assistant was developed in partnership with experts from the Phytosanitary Clinic of Pernambuco (CliFiPe) hosted by the Federal Rural University of Pernambuco (UFRPE). The goal is to provide family farmers with more efficient technical aid in plant disease diagnosis, prevention, and treatment. Furthermore, the efficiency gain shall enable the experts from CliFiPe to reach a larger part of the local farming community. The system can classify the leaf images, achieving a recall value of 95%.

2 Related Work

The application of technology to agriculture has been approached in several studies that employ mobile services to help smallholder farmers or machine learning to detect diseases.

Baumüller [2] recently reviewed the literature on agriculture-related services offered through mobile phones, recognizing that the strategic application of information and communication technology provides the best opportunity for economic growth and poverty reduction. The review describes four service categories offered through mobile phones (information dissemination, financial services, access to suppliers, and access to output markets). It also takes a critical look at experimental works and the expected impact related to these services. Agricultural mobile services for crop disease monitoring and diagnosis belong to the first category.

In a pioneering work, Mohanty et al. [3] suggested using a deep learning approach based on image classification to identify selected plant diseases through leaf images. To this end, the authors created a dataset for disease classification [4] composed of 38 classes, each defined as a pair of crop and disease. Then, they trained a Convolutional Neural Network (CNN) to classify them, achieving an accuracy of over 99%. The dataset includes more than 54,000 images from 14 crop species and 26 diseases. Most importantly, the publicly accessible dataset enabled several follow-up studies. In the application scenario envisioned in the present work, the CNN only assists in disease recognition. A human expert takes the final responsibility, since a false identification can have severe consequences for a small-scale producer. Moreover, the farmers themselves take pictures of the symptoms and provide crop information. In addition, the images taken by the farmers in the field will not follow any predefined pattern, as opposed to the dataset of [4], which was created in a laboratory. Thus, the CNN needs to be trained with images that match reality and are specific to the crops and diseases encountered in the region where advice shall be provided, here, the state of Pernambuco.

Rangarajan et al. [5] also proposed a system to perform disease classification using plant leaf images. In addition, the study assembled a dataset using pictures of five different diseases of the eggplant crop, taken both in a lab and in an actual field. The proposed classifier used VGG16 as a feature extractor and a Multi-Class Support Vector Machine to classify the diseases. Even though these techniques are frequently used in image classification tasks, studies indicate that this method produces worse results than using the same network as both feature extractor and classifier.

Barbedo [6] introduced a system to identify plant diseases using the public dataset Digipathos [7], created by the Brazilian Agricultural Research Corporation Embrapa. The author also used deep learning to perform the image classification but explored individual lesions and spots rather than considering the entire leaf. Since the Embrapa dataset is not balanced, the segmentation into individual lesions acts as data augmentation. The system is specialized to detect and classify diseases based mainly on these spots. In its current form, our digital assistant suggests larger picture segments to a human expert to provide context and aid in the identification. However, lesion-based segmentation is considered for future work.

3 Proposed Approach

3.1 System Overview

As mentioned in Sect. 1, the system’s objective is to help farmers monitor their plantations and identify whether a plant is diseased. For this purpose, the system uses machine learning and computer vision techniques to analyze leaf images and detect whether or not they show disease symptoms. We also built a digital platform where farmers and experts can connect and interact. Such a platform facilitates and improves the provided diagnosis and assistance. Figure 1 shows the system’s general structure.

Fig. 1. General architecture of the proposed system, composed of a mobile app and a digital assistant for a crop clinic

The proposed platform is accessible to farmers through a mobile application. The user selects a crop from a predefined collection and takes a picture of a crop leaf. Images are automatically segmented and submitted to the classification system to detect the probability of revealing disease symptoms. All pictures are stored in a database and are used to fine-tune and continuously optimize the classification system’s performance. The app also displays general information about possible diseases to the user, such as their most common symptoms and measures for prevention. However, more specific advice on disease control is left to direct communication with a phytopathology expert. The knowledge base is an information database created by experts from CliFiPe.

The mobile app allows the farmer to communicate directly with the experts, raising questions and receiving assistance with problems in their crops. The community also offers an opportunity for the farmers to help each other and share knowledge.

Currently, the system is trained to identify symptoms in images of grape leaves. However, the plant disease diagnosis will focus on the main food crops cultivated in Pernambuco, and later in the Northeast region of Brazil. Therefore, one of the project’s goals is to build a dataset of images of the region’s most common and essential crops and the diseases affecting them. We also aim to make this dataset available to all researchers who work on improving plant disease management.

3.2 Crop Clinic Digital Assistant

To correctly identify the presence of disease symptoms in plant leaf images, we developed a system that generates segments from images of plant leaves and uses deep learning techniques to classify whether or not they show disease symptoms.

Section 2 delineates some examples of works that perform similar tasks. However, their images are usually taken in a controlled environment, with supervised conditions and a standard background. Available training datasets, such as PlantVillage [4] and the Digipathos dataset [7], do not include images with variations in lighting, size, or framing and thus do not reflect actual field conditions. As a result, according to [3], models trained on such datasets achieve only an average precision of 31% when applied to authentic field images. Here, we use photos taken in the field to train the neural network model. Since images taken with a mobile phone by inexperienced users will exhibit similar variations in picture quality and detail, the CNN will recognize the presence of disease symptoms more reliably.

Figure 2 shows the general architecture of the digital assistant. The system takes the raw images to be classified as input and consists of three modules:

Fig. 2. Tasks performed by the digital assistant

  • Cropping Module: Cropping the input image centers the leaf and eliminates parts of the background. Manually cropped images constitute the training and validation datasets to guarantee a certain standard, improving the classification performance of the CNN. However, photos submitted by the users will be cropped automatically once the app is more broadly disseminated. The algorithm will use cropping margins based on typical pictures taken by the app users.

  • Pre-processing Module: The division of leaf images into segments depends on the frame proportions. Frames with a height-to-width ratio below a certain threshold are separated into four segments, otherwise into six. The chosen threshold value guarantees that each segment shows a similar degree of detail. Segmentation increases the size of the training dataset and allows parts with limited information to be excluded. It also gives more flexibility in balancing the training dataset by selective inclusion of segments of varying exposure conditions and quality. The generated segments are resized using bilinear interpolation [8] to match the input shape of the neural network (256 \(\times \) 256 \(\times \) 3).

  • Classification Module: A Convolutional Neural Network classifies the input segments as showing “Symptoms” or “No symptoms”. A CNN is a deep learning technique widely used in pattern recognition that employs a single network to learn and classify the image features [9]. We base our model on the pre-trained ResNet50V2 model. Details of the training image dataset will be given in Sect. 4. The digital assistant performs only binary classification and forwards the segments with the most pronounced symptoms (highest probability) to a phytopathology expert for further analysis. Selecting the segments that most clearly demonstrate the disease symptoms is a valuable step in disease identification and agent recognition, whether by human experts or another machine learning model.
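The segmentation rule of the pre-processing module can be sketched as follows. The concrete threshold value and the grid layouts (2 \(\times \) 2 for four segments, 3 \(\times \) 2 for six) are assumptions for illustration, since they are not specified here:

```python
def num_segments(height, width, ratio_threshold=1.2):
    """Decide how many segments a cropped leaf image is split into.

    Frames with a height-to-width ratio below the threshold are
    separated into four segments, otherwise into six. The threshold
    value 1.2 is an assumption for illustration.
    """
    return 4 if height / width < ratio_threshold else 6


def segment_grid(n_segments):
    """Hypothetical grid layouts: 2x2 for four segments, 3x2 for six."""
    return (2, 2) if n_segments == 4 else (3, 2)


# A nearly square frame yields four segments, an elongated one six;
# each segment would then be resized to 256x256x3 (bilinear).
print(num_segments(1000, 900))   # 4
print(num_segments(1800, 900))   # 6
```

Keeping the per-segment aspect ratio similar in both cases is what guarantees a comparable degree of detail across segments.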

4 Hyperparameter Tuning and Training

We discuss the classification results of the digital assistant in two parts. First, in the present Sect. 4, we assess the performance of different trained models in the search for a balanced training dataset and optimal hyperparameters. Then, in Sect. 5, the classification of segments from twelve selected leaf images (that were not part of any training dataset) will be discussed in detail for illustration.

4.1 Dataset Collection

Currently, some online datasets are available for plant disease classification experiments, e.g., PlantVillage [4] and the Digipathos dataset [7]. However, their images are usually taken in a lab and do not present variations in lighting, size, or framing, thus not reflecting field conditions. Therefore, to improve the network performance when applied to images taken in the field, we manually built the dataset used in the network training.

The dataset employed in this study is composed of images collected by us in the Siriji valley region, in Pernambuco’s countryside. There are several smallholder plantations in the area, and we collected images of plant leaves at some of them. Among the cultivated crops, the most important ones for the region’s economy and production include grape, banana, and sugar cane. For this study, we trained a CNN to identify disease symptoms in grape leaf images, but the extension to other crops is straightforward. We took images of leaves in different growth stages, both healthy and diseased, and under different lighting conditions. Then, the photos were annotated by phytopathology experts from CliFiPe, identified as manifesting symptoms or not. After the expert annotation, the dataset size was:

  • 1987 images for the class “No symptoms”

  • 1302 images for the class “Symptoms”

The collected pictures were then manually cropped and segmented, as described in Sect. 3.2. After this step, they were ready to be divided into subsets and used in the experiments.

4.2 Dataset Division

Not all images were used for training the CNN. First, we separated a small number of representative images to illustrate how particular image characteristics influence the classification results of selected trained models. To ensure some variability and balance between classes and conditions, twelve leaf images (six from the class “No Symptoms” and six from the class “Symptoms”) were selected for this final case study, exhibiting different illumination, focus, and contrast conditions. The selected images were not used in the training of any of the models and are not used to extract statistical results. The final case study, reported in Sect. 5, only illustrates possible outcomes when the trained models are applied and emulates the challenges of a large-scale usage scenario.

Second, the remaining images were divided into training and testing subsets in several different ways. This allows the training and verification of distinct neural network models. We will use this approach to estimate the expected performance variability when using these models for the digital assistant in a crop clinic. Moreover, insights into the composition of a well-balanced training dataset are obtained.

After cropping and segmentation of the collected images, the complete dataset contained 5750 images for the class “Symptoms” and 7676 images for the class “No symptoms”. The entire dataset is then randomly divided into two groups, one for training (with 75% of the total pictures) and one for testing (with the remaining 25%). After the division, the set sizes are:

  • Train set: 10069 images (4312 for the class “Symptoms” and 5757 for the class “No Symptoms”)

  • Test set: 3357 images (1438 for the class “Symptoms” and 1919 for the class “No Symptoms”)

The generated train subset is used for the neural network model training. For this reason, the set must be composed of a sufficient number of images covering a significant variety of segments. The test set is used to perform predictions with the trained model. It allows estimating the performance when used as a digital assistant, classifying pictures taken by the mobile app users. Note that the random division is repeated several times to obtain distinct neural network models.
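The repeated random 75/25 division can be sketched in plain Python. The use of the standard `random` module and the truncating division point are implementation assumptions:

```python
import random


def split_dataset(items, train_frac=0.75, seed=None):
    """Randomly divide segment samples into train and test subsets,
    as done before each assessment; a fresh seed yields a fresh split."""
    rng = random.Random(seed)
    shuffled = items[:]            # leave the original list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# With 13426 segments, a 75/25 division yields the set sizes reported
# above (10069 train / 3357 test).
segments = list(range(13426))
train, test = split_dataset(segments, seed=0)
print(len(train), len(test))   # 10069 3357
```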

4.3 Model Training

During the training, the model might face learning problems of varying severity. In extreme situations, the model exhibits under- or overfitting. If the model fails to learn the underlying patterns, it becomes underfitted and performs poorly even on the training data. On the other hand, if the model retains the training dataset patterns too well, including all specific peculiarities, it becomes overfitted and generalizes poorly to unseen data. Thus, the learning progress must be monitored by validating the model after each epoch (each training iteration updating the internal network parameters). To this end, a certain percentage of the training dataset needs to be separated as a validation set. The model’s performance on this subset, evaluated after each epoch, will indicate potential fitting problems. We used 20% of the training set for validation.

Moreover, the remaining 80% of the images (used for training) were carefully balanced according to the number of images belonging to each class. The ratio between the sizes of the two classes determined the required augmentation of the class with fewer images, which was achieved by image rotation and flipping.
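Rotation and flipping can be illustrated in plain Python on a nested-list image. The exact set of variants generated for balancing is not specified, so the eight variants below (four rotations plus their mirrors) are an illustrative assumption:

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D image horizontally."""
    return [row[::-1] for row in img]


def augment(img):
    """Generate rotated and flipped variants of one image, used to
    enlarge the minority class until both classes are balanced."""
    variants = [img]
    current = img
    for _ in range(3):
        current = rotate90(current)
        variants.append(current)
    variants += [hflip(v) for v in variants[:4]]
    return variants


sample = [[1, 2], [3, 4]]
print(len(augment(sample)))   # 8 variants from a single image
```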

The CNN was configured using a sigmoid activation function in the output layer. During training, binary cross-entropy served as the loss function. The chosen optimizer was Adam, with a learning rate of 0.0001. The evaluation metric most closely observed during training should reflect the desired outcome. In this case, the digital assistant shall forward image segments that possibly show symptoms to experts for further inspection. Thus, ideally, segments with no symptoms will be classified correctly (“true negatives”, TN) and not analyzed. Still, any segment possibly showing symptoms needs to be inspected, i.e., should not be classified erroneously (“false negatives”, FN). Therefore, during training, an essential evaluation metric is the recall, computed as \(\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\). Nevertheless, other evaluation metrics were also computed when applying the model to the test dataset, such as accuracy, precision, F1-score, and average precision. We trained the model using a batch size of 128 and a maximum of 100 epochs. However, an early stopping criterion was employed to avoid overfitting, with a patience of 3 epochs without improvement. The classification module was implemented using the TensorFlow framework.
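The two quantities steering the training, the binary cross-entropy loss and the recall metric, can be written out explicitly. This is a plain-Python sketch for illustration; the actual module uses TensorFlow’s built-in implementations:

```python
import math


def recall(tp, fn):
    """Recall = TP / (TP + FN), the metric monitored during training."""
    return tp / (tp + fn)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Loss minimized during training: the mean over segments of
    -[y*log(p) + (1-y)*log(1-p)] for labels y and probabilities p."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A model that misses 5 of 100 symptomatic segments reaches the
# reported recall level; confident correct predictions yield low loss.
print(recall(95, 5))   # 0.95
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```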

Fig. 3. Training results for a well-performing model

Figure 3 shows the training progress and the validation after each epoch for (a) the recall metric and (b) the loss function. The trained model achieves a high recall score of 95%.

4.4 Prediction

For performance evaluation, the trained model is applied to the images of the test set (not used in training). The prediction returns a probability between 0 and 1, indicating the likelihood of the segment showing disease symptoms. A threshold value of 0.5 separates the classes, i.e., a prediction below 0.5 indicates an image segment with “No Symptoms”, otherwise with “Symptoms”. Using the confusion matrix resulting from the prediction, we calculated the accuracy, precision (both for the positive and negative classes), recall, true negative rate, F1-score, and average precision for the performance assessment. However, the metric most closely observed is the recall.
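The thresholding step and the metrics derived from the confusion matrix can be sketched as follows (plain Python for illustration; negative precision and average precision are omitted for brevity):

```python
def evaluate(y_true, y_prob, threshold=0.5):
    """Threshold predicted probabilities (>= 0.5 means "Symptoms") and
    derive the evaluation metrics from the resulting confusion matrix."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": rec,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
        "f1": (2 * precision * rec / (precision + rec)
               if precision + rec else 0.0),
    }


# One of two symptomatic segments is found, one healthy segment is
# wrongly flagged.
metrics = evaluate([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.7])
print(metrics["recall"])   # 0.5
```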

Fig. 4. Flow diagram of the performed computational experiments

4.5 Performance Optimization

Different dataset division strategies and hyperparameter value options were tested and evaluated to determine the ones that maximize the model performance. Figure 4 shows a diagram that summarizes the experiment flow. We implemented a script to repeat the complete assessment (training and testing) a certain number of times (N) while performing a grid search for high-performance models. For each repetition, the dataset is divided differently (maintaining the 75/25 ratio). Then, the trained model is applied to the respective test dataset of the repetition, and the evaluation metrics are computed. After N assessments, the mean and standard deviation of the predicted recall metric from each execution are calculated. Although not representing a systematic cross-dataset validation, the approach will identify outliers, i.e., models that show atypical performance due to a bias accidentally introduced by the randomly selected training dataset. The script also saves the models that produce the best recall, the worst recall, and the recall closest to the mean. In addition, other prediction measures are also observed to guarantee that the model is behaving as expected. Applying these models to the case study dataset, composed equally of healthy and diseased leaf images with different quality levels, gives further insights into the strategy for finding a well-performing model. Due to image variability, a suitable training dataset should not only balance between images showing disease symptoms or not but should also balance between pictures of good and low quality. The training of N distinct models helps to understand the expected performance, capability, and limitations of the neural network, as well as the impact of the training.
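The assessment loop can be sketched as below. The training step is replaced by a stand-in that simulates a recall around the observed 95%, since actual CNN training is out of scope here; the dataset and seed are illustrative:

```python
import random
import statistics


def train_and_eval(train_set, test_set, rng):
    """Stand-in for the real CNN training and evaluation: simulates a
    recall around 0.95 with small run-to-run variation (assumption)."""
    return min(1.0, max(0.0, rng.gauss(0.95, 0.01)))


def assess(dataset, n_repeats=50, seed=0):
    """Repeat the split / train / evaluate cycle N times and summarize
    the recall across repetitions, as in the grid-search script."""
    rng = random.Random(seed)
    recalls = []
    for _ in range(n_repeats):
        rng.shuffle(dataset)                  # fresh random 75/25 division
        cut = int(len(dataset) * 0.75)
        recalls.append(train_and_eval(dataset[:cut], dataset[cut:], rng))
    return (statistics.mean(recalls), statistics.stdev(recalls),
            max(recalls), min(recalls))


mean_r, std_r, best_r, worst_r = assess(list(range(1000)))
print(round(mean_r, 3), round(std_r, 3))
```

The best-, worst-, and mean-recall models kept by the actual script correspond to the `max`, `min`, and mean summarized here.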

Table 1 summarizes the findings of the first group of assessments. A minimum number of repetitions of dataset splits and model training is required to reliably explore the complete dataset. We found that \(N=50\) is sufficient for the collected data. The mean recall and standard deviation do not change significantly for larger N, but the computation time increases. Another interesting finding concerns the dataset division strategy. The data can be divided into train and test sets before or after separating the leaf pictures into segments. The former approach (before segmentation, defined in the table as the split category “Leaves”) ensures that all the segments from a given picture end up in the same group. However, constructing the training set at segment level (splitting after segmentation, defined in the table as the split category “Segments”) increases flexibility. This leads to a smaller standard deviation and a reduced gap between maximum and minimum recall. Finally, the first group of assessments also included an evaluation of the desirable dataset size. We tested the use of the complete dataset, as opposed to the use of a partial one. As shown in Table 1, a training set that is too small leads to a reduced mean and maximum recall, independently of the division strategy. However, the 14076 images collected in a dedicated field expedition are sufficient to train models with good performance and high evaluation metrics. The table also shows the mean values of the precision, accuracy, and F1-score measured in the experiments.

Table 1. Assessment results - Group 1

Table 2 summarizes the findings of the second group of assessments, related to the optimization of the training algorithm. Within this group, some experiments were conducted to determine the best weight to be attributed to each of the two classes during training. Choosing different weights for the classes allows balancing them without resampling. In other words, weights can tell the model to pay more attention to the instances of a particular class. Table 2 shows that attributing weight “1” to class “0” (“No symptoms”) and weight “2” to class “1” (“Symptoms”) gives a higher mean recall. This result is plausible because the importance of the class “Symptoms” is increased, and we are mainly trying to avoid false negatives.
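The effect of the class weights can be illustrated with a weighted binary cross-entropy. This is a plain-Python sketch; in Keras the same effect is obtained by passing `class_weight={0: 1, 1: 2}` to `model.fit`:

```python
import math


def weighted_bce(y_true, y_prob, class_weight=None, eps=1e-7):
    """Binary cross-entropy with per-class weights. Weighting class 1
    ("Symptoms") twice as heavily penalizes false negatives more."""
    if class_weight is None:
        class_weight = {0: 1.0, 1: 2.0}  # the best setting in Table 2
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip for numerical stability
        total += class_weight[y] * -(y * math.log(p)
                                     + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A missed symptomatic segment costs about twice as much as a
# comparably wrong healthy segment.
miss_symptom = weighted_bce([1], [0.1])   # diseased, predicted healthy
miss_healthy = weighted_bce([0], [0.9])   # healthy, predicted diseased
print(miss_symptom / miss_healthy)        # ~2.0
```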

Table 2. Assessment results - Group 2

In addition, the evaluation metric and the classification threshold were also analyzed, but are not shown in the table. Some models were trained using the accuracy metric instead of the recall metric. However, they produced too many false negatives, which is not desirable since segments classified as showing no symptoms will not be forwarded for further inspection by a human expert. We also tested different threshold values for the classification but found that the standard choice of 0.5 gives the most reliable prediction.

As mentioned in Sect. 2, the purpose of the present work is to detect the presence of symptoms in the leaf image, which constitutes a different application scenario from those discussed in the state-of-the-art literature. Therefore, the implemented neural network model was specifically developed for the present application. Consequently, we cannot compare our model to a reference.

Promising models trained in the various assessments were applied to the (previously separated) case study dataset, giving further insights into the expected classification performance when analyzing leaves images of varying quality.

Table 3. Prediction Results of two selected models

5 Case Study and Discussion

Twelve models were chosen for the final inference study. These models gave the best, worst, and mean recall in the assessments presented in the last four lines of Table 1. As stated in Sect. 4, the case study dataset is composed of images from both classes (showing symptoms or not) under various light, focus, and background conditions and with different severity levels of symptoms.

The analysis of the predictions revealed some characteristics of the inference images that influence the classification accuracy. For example, when the area occupied by the background is small, it is easier for the model to classify the leaf segment correctly. However, the classification is more challenging for the model when the sunlight is too bright or the leaf image does not have enough contrast. Thus, a carefully composed and sufficiently large training dataset is essential. As shown in Table 1, all models trained using a smaller dataset (of about 3600 images) achieved poor performance, making many errors (both false positives and false negatives) when applied to the case study dataset (results not shown). Meanwhile, models trained with a more extensive dataset performed better.

Fig. 5. Classification results for selected healthy leaf images obtained from two distinct models

However, defining the correct target metric values for optimal performance is challenging. Table 3 displays the evaluation metrics of two models that produced different recall values, drawn from the \(N=50\) training and test set divisions at segment level (second line of Table 1).

The results indicate how changes in the training dataset are related to the output metrics. For example, the first model achieved better results for metrics indicating the absence of false negatives, like negative precision and recall. On the other hand, the second model performed better for metrics that increase with decreasing false positives, like positive precision and true negative rate. However, this does not necessarily imply that the second model leads to more decisive errors in predicting the negative class (“No Symptoms”).

Figure 5 displays the classification results of the two models applied to the healthy leaf images from the case study dataset. The leaf images are divided into segments, and for each segment the figure shows the model output classification (symptoms “yes” or “no”) and the corresponding probability. The first line under each segment is the prediction obtained from model 1 of Table 3, and the second line is from model 2. Similarly, Fig. 6 shows the diseased leaf images and their classification results when the two models are applied.

Fig. 6. Classification results for selected diseased leaf images obtained from two distinct models

As expected, the probabilities predicted by the first model are higher than those of the second one. Although it is desirable to avoid false negatives (the objective of the training), model 1 leads to more errors in predicting the negative class, i.e., the model classifies segments without symptoms as presenting them (FP). Table 4 shows the confusion matrices resulting from the two models when applied to the case study dataset.

The inference study also revealed potential future improvements, such as the use of multi-label classification. This technique uses the same neural network to classify multiple different labels at the same time. That might allow, for instance, enhancing other prediction metrics while maintaining the recall values.

Table 4. Confusion Matrices for two selected models over the case study dataset

6 Conclusion

The present work proposes using computer vision and machine learning to identify the presence of disease symptoms in plant leaf images. The main objective is to assist experts from the Phytosanitary Clinic of Pernambuco and enable consulting services offered to small-scale agricultural producers for free or at a nominal cost.

The implemented system allows smallholders to communicate with phytopathology experts and the community of users. In addition, the digital assistant successfully solves the problem of classifying whether image segments taken by inexperienced users show disease symptoms or not and, thus, acts as a filter for the subsequent analysis by human experts.

We also tackled the challenge of lacking training data for a Convolutional Neural Network by building our own training set composed of images of essential crops from Pernambuco. The training dataset balances images of good and poor quality to maximize the recall on photos taken by non-experts in the field. The best-performing model achieved a recall of over 95%. Moreover, the annotated dataset will continuously improve with a growing user community. Test and inference results for distinct trained models gave insights into an ideal training dataset and the influence of exposure conditions when taking the photos to be classified in the field. We also found that models compromising on multiple metrics may give better classification results.

Future work includes optimizing the classification system, automating the identification of the disease-causing agent, and using a multi-label approach to improve the model performance.