1 Introduction

The gastrointestinal (GI) tract is a tubular system whose main task is digestion. It comprises the mouth, pharynx, esophagus, stomach, small intestine, large intestine, rectum, and anal canal. Inflammatory diseases associated with the GI Tract have a high incidence in the world population, especially in developing countries, where colon, liver, and stomach cancers are among the most common [15]. According to the World Health Organization (WHO), in 2022 there were approximately 5.1 million cases of GI Tract cancers worldwide, and in the Americas region alone 431,116 deaths were recorded [27].

Given the need to combat the harmful effects of GI Tract pathologies, endoscopy is one of the most widely used techniques for examining the GI Tract, but it is a painful and invasive procedure [12]. In contrast, videos can be acquired using Wireless Capsule Endoscopy (WCE), a less aggressive and less uncomfortable method. During this procedure, the patient swallows a capsule containing a light source, a micro-camera, and a signal emitter, which travels painlessly through the GI Tract until it is naturally expelled by the patient's body. However, because the capsule scans the entire GI Tract, the acquired videos are around 8 h long. Analyzing such a video is therefore an arduous and tedious task, and abnormalities can be missed during the examination because it depends on the analyst's constant concentration on the images [4]. For this reason, automation is important to help the specialist analyze the video.

From this perspective, machine learning techniques, in which a dataset is used to develop a computer-aided diagnosis (CAD) system, have been applied to medical image analysis since the 1990s [13]. These techniques are widely used in medical tasks due to their ability to automatically learn discriminative features from raw data, reducing the need for pre-processing and hand-crafted feature extraction [14]. An example of this type of method is the Convolutional Neural Network (CNN), a computational model inspired by the workings of the human brain and commonly used for processing and analyzing digital images [26]. Another well-known method is YOLO (You Only Look Once), a real-time, single-stage object detector that predicts all objects in a single pass through the image. The adoption of YOLO can thus help in the early detection and diagnosis of GI Tract diseases [19].

Furthermore, identifying the region of the pathology enables a more precise diagnosis, potentially avoiding additional invasive diagnostic procedures [12]. It allows images from subsequent exams to be compared to evaluate disease progression or treatment response, and informing patients of the exact location of their condition can improve their understanding of the disease and increase treatment adherence. Thus, it becomes crucial to enhance the capability of tools to automatically detect and characterize pathologies, increasing the efficiency of CAD systems.

Therefore, this work introduces detection methodologies for WCE images of the GI Tract, providing a solid foundation for subsequent research on the automated detection of diseases in WCE images. To this end, this study develops a method that classifies and detects pathologies using the best-performing CNN architecture among those evaluated, together with YOLO. Next, Sect. 2 highlights the related work, Sect. 3 describes the proposed method, and Sect. 4 presents the results, experiments, comparisons, and discussion of the proposed method.

2 Related Works

In recent years, neural networks based on deep learning methods have proven to be a powerful tool for image analysis, including medical imaging tasks [3]. This has led to several advances in the field, including computer-aided classification and localization of pathologies from the GI Tract. This section presents some of the work that has been done on this subject.

Xu et al. (2019) introduced a multi-task anatomy detection convolutional neural network (MT-AD-CNN) designed to evaluate esophagogastroduodenoscopy inspection quality. The model classifies images and distinguishes between informative and non-informative frames, thus displaying detection boxes only on informative frames and reducing false positives. The MT-AD-CNN achieves a mean average precision (mAP) of 93.74% for the detection task and an accuracy of 98.77% for the classification task.

The work by Oukdach et al. (2024) presented a new framework that synergizes the strengths of CNNs and Vision Transformers (ViTs). This approach incorporates an attention mechanism within a CNN to extract local features. Experiments performed on the Kvasir Capsule, a large-scale dataset of WCE images [21], showed a promising result of 97% in accuracy, precision, recall, and F1-Score.

Muruganantham and Balakrishnan (2022) proposed a method for classifying images with ulcers, bleeding, polyps, or no abnormality using ResNet-50 combined with a self-attention mechanism that aggregates spatial characteristics globally. The method was also tested on the Kvasir-Capsule dataset and achieved a classification accuracy of 94.7%.

In addition, Srivastava et al. (2022) proposed a focal modulation network called FocalConvNet, which uses focal modulation and an additional convolutional block to extract features and improve detection. The study used 13 anatomy and luminal finding classes from the Kvasir-Capsule dataset, obtaining an F1-Score, recall, and MCC (Matthews Correlation Coefficient) of 67.34%, 63.73%, and 29.74, respectively.

As videos are sequences of captured images and the methods mentioned are applied per image (individual video frame), some works [5, 7, 17, 18, 22] do not clearly specify how sampling is done to split the training and test sets. Ideally, all video frames from a patient should be in the same set. Random sampling of video frames can result in “data leakage” [9], as similar (neighboring) frames from the same patient can end up in both the training and test sets, compromising the reliability of the results.
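The patient-level split described above can be sketched as follows; patient IDs and file names are hypothetical, and libraries such as scikit-learn offer the same guarantee via `GroupShuffleSplit`:

```python
from collections import defaultdict

def split_by_patient(frames, train_patients, test_patients):
    """Assign every frame of a patient to a single set, so that
    near-identical neighboring frames never straddle the train/test
    boundary (which would cause data leakage)."""
    sets = defaultdict(list)
    for patient, path in frames:  # each frame: (patient_id, frame_path)
        if patient in train_patients:
            sets["train"].append(path)
        elif patient in test_patients:
            sets["test"].append(path)
    return sets

# Illustrative frames: two patients, three frames each
frames = [("p1", "p1_f0.jpg"), ("p1", "p1_f1.jpg"), ("p1", "p1_f2.jpg"),
          ("p2", "p2_f0.jpg"), ("p2", "p2_f1.jpg"), ("p2", "p2_f2.jpg")]
sets = split_by_patient(frames, train_patients={"p1"}, test_patients={"p2"})

# No patient contributes frames to both sets
train_ids = {p.split("_")[0] for p in sets["train"]}
test_ids = {p.split("_")[0] for p in sets["test"]}
assert not (train_ids & test_ids)
```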

In addition, although there are many published studies to assist in the diagnosis of GI pathologies in WCE images, there is a notable lack of studies focused on the detection of the region in which these anomalies are found. To this end, a pipeline was developed to classify abnormal images using the best-performing CNN architecture in the conducted experiments, followed by the detection of the region with the presence of the disease using YOLOv8 [8], a recent model for detecting objects from images in real-time. This work’s contribution lies in the proposal of an automated method that not only identifies the presence of different types of anomalies in WCE images but also indicates the region affected by the pathology, reducing the time and work involved in analyzing videos of WCE exams.

3 Materials and Method

This section describes the proposed method. Firstly, the image dataset used to train and evaluate the proposed method is defined, and the best-performing CNN architecture is selected to classify its images. The model that performed best during the experiments then forwards the images classified as abnormal to the detection stage. Finally, the class-specific detection models trained with YOLOv8 make predictions on the images received; the predicted region is the one with the highest confidence among the models' predictions. A representation of the process can be seen in Fig. 1.

Fig. 1. Proposed method

3.1 Dataset

Table 1. Kvasir-Capsule images by class

In this work, the experiments were carried out using the Kvasir-Capsule dataset, built by collecting WCE scans from a Norwegian hospital [21]. Kvasir-Capsule is currently the largest and most diverse publicly available WCE dataset with expert-labeled data. It comprises 117 videos collected from WCE exams, totaling around 4.7 million extracted frames.

In addition, the dataset features 47,238 frames labeled by specialists, each 336\(\times \)336 pixels and divided into two categories: anatomical findings, consisting of Pylorus, Ileocecal Valve, and Ampulla of Vater; and luminal findings, composed of Angiectasia, Blood - fresh, Blood - hematin, Erosion, Erythema, Foreign Body, Polyp, Lymphangiectasia, Reduced Mucosal View, and Ulcer, in addition to the Normal Clean Mucosa class, which represents normal images. Figure 2 shows a sample image for each class. The division of the dataset, the number of patients, and the availability of bounding boxes per class can be seen in Table 1.

Fig. 2. Sample of labeled images in Kvasir-Capsule

3.2 Classification Stage

In this stage, we implemented and evaluated several CNN architectures to choose the model to be used in the method, defined the metrics used to evaluate the binary classification models, and trained and tested the best-performing architecture on the dataset.

Selection of CNN Model. After defining the dataset, the best-performing CNN is selected to classify the images. For this task, CNNs widely used in the image classification literature were chosen. The architectures evaluated at this stage were VGG [20], ResNet [6], Inception [23], Xception [1], EfficientNet [24], and EfficientNetV2 [25]. The architectures were used as described by their respective original authors, with the implementation provided by the Keras library [2].
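A minimal sketch of how the candidate backbones can be instantiated by name from `keras.applications`; the binary sigmoid head and average pooling are assumptions, and the Keras import is deferred into the function so the candidate list can be inspected without TensorFlow installed:

```python
# The ten concrete variants evaluated in the experiments (Sect. 4.2)
CANDIDATE_ARCHITECTURES = [
    "VGG16", "ResNet50", "InceptionV3", "Xception",
    "EfficientNetB0", "EfficientNetB1", "EfficientNetB2",
    "EfficientNetV2B0", "EfficientNetV2B1", "EfficientNetV2B2",
]

def build_classifier(name, input_shape=(224, 224, 3)):
    """Instantiate a Keras backbone by name and attach a binary head
    (head design is an assumption, not specified in the text)."""
    from tensorflow import keras  # deferred: only needed at build time

    backbone_cls = getattr(keras.applications, name)
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=input_shape, pooling="avg")
    outputs = keras.layers.Dense(1, activation="sigmoid")(backbone.output)
    return keras.Model(backbone.input, outputs)
```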

Binary Classification Training and Evaluation. After choosing the best architecture to perform the binary classification, the model is evaluated during and after the training process using the following metrics:

$$\begin{aligned} \text {Accuracy } (ACC) = (TP + TN) / (TP + TN + FP + FN)\end{aligned}$$
(1)
$$\begin{aligned} \text {Precision } (PRE) = TP / (TP + FP)\end{aligned}$$
(2)
$$\begin{aligned} \text {Recall } (REC) = TP / (TP + FN)\end{aligned}$$
(3)
$$\begin{aligned} \text {F1-Score } (F1) = 2 \times (PRE \times REC) / (PRE + REC)\end{aligned}$$
(4)

where TP and TN represent correctly classified images with pathology and correctly classified healthy images, respectively. FP represents healthy images wrongly classified as having abnormalities, and FN represents pathological images wrongly classified as healthy.
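Eqs. (1)-(4) map directly onto the confusion-matrix counts; the counts below are made up for illustration:

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-Score (Eqs. 1-4)
    from raw confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    return {"accuracy": acc, "precision": pre, "recall": rec, "f1": f1}

# Example with made-up counts
m = binary_metrics(tp=90, tn=85, fp=10, fn=15)
print({k: round(v, 3) for k, v in m.items()})
# {'accuracy': 0.875, 'precision': 0.9, 'recall': 0.857, 'f1': 0.878}
```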

In the context of the proposed method's objective, accuracy is useful for giving an overview of the model's performance, but it can be misleading on unbalanced datasets such as Kvasir-Capsule, where one class is far more frequent than the other. Thus, precision is also used to ensure that images classified as containing pathologies actually do, reducing false positives that could overload the region detection stage.

In addition, recall is crucial to ensure that as many images with pathologies as possible are identified, minimizing false negatives that could result in undetected pathologies. The F1-Score will be the most important metric to evaluate due to its balance between precision and recall, providing a single performance measure that considers both false positives and false negatives. Furthermore, although specificity is a useful metric in many contexts, in this work, the main focus is ensuring that all images with pathologies are identified (maximizing recall) and that the identified images contain pathologies (maximizing precision).

Thus, using the metrics presented, the model selects the images with pathologies and sends them to the detection phase to determine the region where the classified pathology appears.

3.3 Detection Stage

During this stage, the detection models based on YOLOv8 were trained and the pathological area was predicted based on the model with the highest prediction confidence.

YOLOv8 Architecture. YOLOv8 was unveiled in January 2023 by Ultralytics, providing five versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra-large) [8]. YOLOv8 provides support for classification, segmentation, and object detection tasks. In this work, YOLOv8s was used as a model to detect the regions where anomalies appear in the dataset images. The YOLOv8 architecture can be seen in Fig. 3.

Fig. 3. YOLOv8 architecture [8]

Among the differences of YOLOv8 compared to previous versions is the C2f module (cross-stage partial bottleneck with two convolutions), which combines high-level features to improve detection accuracy based on contextual information. In addition, YOLOv8 decouples its output branches so that each can focus on its own task, improving the overall accuracy of the model. In the output layer, the sigmoid function is used as the activation to represent the probability of a bounding box containing an object, while the softmax function is used for the class probabilities.

YOLOv8 uses the CIoU (Complete Intersection over Union) and DFL (Distribution Focal Loss) loss functions for the bounding-box loss and binary cross-entropy for the classification loss. These losses improve object detection performance, especially when dealing with smaller objects [11]. YOLOv8's architecture uses a modified CSPDarknet53 backbone combined with an SPPF (Spatial Pyramid Pooling Fast) layer, which speeds up computation by pooling features into a fixed-size map. Furthermore, each convolution is followed by batch normalization and SiLU (Sigmoid Linear Unit) activation, and the head of the architecture is decoupled, so the objectness, classification, and regression tasks are processed independently [16].

Detection Model Training. After classifying the images with pathologies, the next step is to train the detection models with YOLOv8. The aim was to train a dedicated model for each class, so that each model could make predictions on the abnormal images received from the previous stage. YOLOv8s was therefore used to train the detection models. Along with precision and recall, the following metrics were used to measure how close the predicted regions are to the true pathological region:
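The per-class training loop can be sketched with the Ultralytics API; the one-YAML-per-class dataset layout and the epoch count are assumptions, and the library import is deferred so the configuration can be inspected without `ultralytics` installed:

```python
# The five luminal-finding classes with bounding boxes (Sect. 4.1)
PATHOLOGY_CLASSES = ["Angiectasia", "Erosion", "Foreign Body",
                     "Lymphangiectasia", "Ulcer"]

def train_dedicated_detectors(data_dir, epochs=100):
    """Train one single-class YOLOv8s detector per pathology.
    Assumes one `<class>.yaml` dataset file per class in `data_dir`
    (hypothetical layout); the epoch count is illustrative."""
    from ultralytics import YOLO  # deferred: only needed at training time

    models = {}
    for cls in PATHOLOGY_CLASSES:
        model = YOLO("yolov8s.pt")  # pretrained small variant
        model.train(data=f"{data_dir}/{cls.lower().replace(' ', '_')}.yaml",
                    epochs=epochs, imgsz=336)
        models[cls] = model
    return models
```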

  • Average Precision (AP): \(\int _{0}^{1} P(R) \, dR \)

  • mean Average Precision (mAP): \(\frac{1}{n} \sum _{i=1}^{n} \text {AP}_i \)

  • Intersection over Union (IoU): \(\frac{\text {Intersection Area}}{\text {Union Area}} \)

mAP is a widely used metric in object detection tasks, as it averages precision over various IoU thresholds, providing a comprehensive assessment of the algorithm's detection accuracy. The mAP calculation is based on the AP, the average of the precisions computed at various recall points, where each recall point is the true-positive rate relative to the total number of positive examples.

Additionally, the IoU metric was used to measure how close the predicted bounding boxes are to the ground-truth boxes annotated in the dataset. This metric is calculated as the area of intersection divided by the area of the union of the prediction and the true region, and it is commonly used to assess the accuracy of object detection. Thus, after training the dedicated models for each class, all these metrics were calculated on a test set to evaluate the performance of each model.
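The IoU computation for axis-aligned boxes is short enough to show in full; the corner coordinates below are made up:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) corner format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes shifted by 5 pixels in x: intersection 50, union 150
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333
```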

Pathology Region Detection. After training the dedicated model for each class, the images classified as abnormal are passed through all the detection models to assess the confidence of the region predicted in each image. From the results stored in text files in the YOLO format, the following prediction information is available: detected class, x and y coordinates of the center of the bounding box, width and height of the predicted bounding box, and the prediction confidence. This step consists of identifying the prediction with the highest confidence and then collecting the coordinates of its bounding box.

The given coordinates are then translated into Cartesian format on the original image size, and the region of intersection between the prediction and the true location of the pathology is computed. Once all the images have been processed and the regions of interest identified, the IoU of all the intersection regions is calculated to obtain an overall evaluation of the proposed method.
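The two operations described above — picking the highest-confidence prediction and converting the normalized YOLO box to pixel corners — can be sketched as follows (the prediction tuples are hypothetical):

```python
def yolo_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a YOLO-format box (normalized center x/y, width, height)
    to pixel corner coordinates (x1, y1, x2, y2) on the original image."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2

def best_prediction(predictions):
    """Pick the prediction with the highest confidence.
    Each prediction: (class_id, cx, cy, w, h, confidence)."""
    return max(predictions, key=lambda p: p[5])

# Two hypothetical predictions on a 336x336 frame
preds = [(0, 0.5, 0.5, 0.2, 0.2, 0.61), (0, 0.3, 0.3, 0.1, 0.1, 0.87)]
best = best_prediction(preds)
corners = yolo_to_corners(*best[1:5], img_w=336, img_h=336)
```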

4 Results and Discussion

This section presents the results achieved using the proposed method for detecting regions containing GI Tract pathologies. The experiments carried out and the impact of the results on the advancement of detection models for GI Tract images are also described.

4.1 Dataset Preparation

Initially, the dataset was divided by patient, given that the images were taken from WCE videos and therefore depict frames of the same patient. This division prevents data leakage, i.e., the influence of extremely similar images of the same patient appearing in more than one set (training, validation, or test). As a result, only classes with at least three patients were used, since each set must contain at least one distinct patient for each pathology. In addition, only luminal finding classes for which bounding boxes were available in the dataset were considered. Therefore, based on Table 1, the classes used were Angiectasia, Erosion, Foreign Body, Lymphangiectasia, and Ulcer.

As a result, there was an imbalance in the number of pathological images belonging to each patient. So, to reduce the difference in the number of samples between sets, the patient with the most images was assigned to the training set, the second to the validation set, the third to the test set, and so on, in that order, until all the patients of the class were distributed. The number of images with pathology in each set can be seen in Table 2.
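The round-robin assignment described above can be sketched as follows; the patient IDs and image counts are hypothetical:

```python
def distribute_patients(patient_counts):
    """Round-robin patients (sorted by descending image count) into
    the train, validation, and test sets, in that fixed order."""
    order = ["train", "val", "test"]
    sets = {name: [] for name in order}
    ranked = sorted(patient_counts.items(), key=lambda kv: kv[1], reverse=True)
    for i, (patient, _count) in enumerate(ranked):
        sets[order[i % 3]].append(patient)
    return sets

# Hypothetical image counts for five patients of one class
counts = {"pA": 120, "pB": 90, "pC": 60, "pD": 30, "pE": 10}
sets = distribute_patients(counts)
# train gets pA and pD, val gets pB and pE, test gets pC
```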

Table 2. Distribution of images by class for each set

Among the healthy images used for classification, many were from the same patient. Therefore, efforts were made to diversify the number of normal images obtained from each patient, ensuring a balance with the set of abnormal images. Thus, 200 images from each of 13 patients were used for the training set, 80 images from each of 10 patients for validation, and 30 images from each of 14 patients for testing, totaling 2600 images for training, 800 for validation, and 420 for testing.

4.2 Abnormal Image Classification

This section describes the results of the stage of selecting the best-performing CNN architecture. All CNN architectures were trained for 100 epochs with the following hyperparameters: Adam optimizer [10] with a learning rate of 0.001, batch size of 32, and early stopping with a patience of 25 epochs. The images were normalized and resized to 224 \(\times \) 224. To determine the best-performing model, experiments were carried out on the following architectures: VGG16, ResNet50, InceptionV3, Xception, EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetV2B0, EfficientNetV2B1, and EfficientNetV2B2. All the architectures were initialized with weights pre-trained on ImageNet. Table 3 shows the results achieved with each architecture and the number of parameters in each model.
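The training setup above translates into Keras as sketched below; the monitored quantity for early stopping and the loss choice are assumptions not stated in the text, and the Keras imports are deferred into the functions:

```python
# Hyperparameters stated in the experiments
HYPERPARAMS = {"epochs": 100, "batch_size": 32, "image_size": (224, 224),
               "learning_rate": 1e-3, "patience": 25}

def make_training_callbacks():
    """Early stopping with a patience of 25 epochs; monitoring
    validation loss is an assumption."""
    from tensorflow import keras
    return [keras.callbacks.EarlyStopping(monitor="val_loss",
                                          patience=HYPERPARAMS["patience"],
                                          restore_best_weights=True)]

def compile_model(model):
    """Adam with learning rate 0.001; binary cross-entropy is assumed
    for the binary (normal vs. abnormal) task."""
    from tensorflow import keras
    model.compile(optimizer=keras.optimizers.Adam(
                      learning_rate=HYPERPARAMS["learning_rate"]),
                  loss="binary_crossentropy",
                  metrics=["accuracy", keras.metrics.Precision(),
                           keras.metrics.Recall()])
    return model
```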

Table 3. Comparison of CNN architectures

Although VGG16 has perfect precision, its accuracy, recall, and F1-score are low, indicating that the model makes few positive predictions and is not ideal for applications requiring balanced detection. Something similar occurs with EfficientNetB0: the model has high accuracy but fails to detect part of the true positives.

Xception and EfficientNetV2B2 presented modest results across all metrics, showing that neither model was well suited to the proposed task. Meanwhile, ResNet50, InceptionV3, and EfficientNetV2B1 achieved good accuracy, but their precision was relatively low, resulting in moderate F1-scores of 78.7% and 70.9%.

EfficientNetB1 and EfficientNetB2 presented relatively good and balanced precision and recall, although EfficientNetB2 had low accuracy compared to the other models, yielding F1-Scores of 82.7% and 84.5%, respectively. For these two architectures, the model rarely misses a positive instance and also does not make many wrong predictions.

Table 4. EfficientNetV2B0 confusion matrix

However, EfficientNetV2B0 presents the best combination of accuracy (87.8%), precision (91.6%), recall (89.1%), and F1-score (90.3%), indicating that it makes many correct predictions and detects most true positives. In other words, it performs the best classification of pathologies among all the models evaluated, making it the most suitable for the proposed task of classifying pathological images. Table 4 shows the confusion matrix for classification with EfficientNetV2B0.

4.3 Detection of Regions with Anomalies

To train the detection models, data augmentation was performed on the images of each class. The operations applied were vertical and horizontal flips and blur variations with filter sizes ranging from 1\(\,\times \,\)1 to 7\(\,\times \,\)7 pixels. The results of training the dedicated detection models for each class are shown in Table 5.
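A minimal sketch of these augmentations on a grayscale array; a simple box (mean) filter stands in for the blur, since the text does not specify the blur type:

```python
import numpy as np

def augment(image, blur_size=None):
    """Generate the augmentations used for detection training:
    vertical flip, horizontal flip, and an optional box blur with a
    kernel from 1x1 (identity) up to 7x7 (blur type is an assumption)."""
    variants = {"vflip": image[::-1, :], "hflip": image[:, ::-1]}
    if blur_size and blur_size > 1:
        k = blur_size
        pad = k // 2
        padded = np.pad(image, pad, mode="edge")
        blurred = np.zeros_like(image, dtype=float)
        for dy in range(k):          # sum the k*k shifted copies,
            for dx in range(k):      # then divide to get the mean filter
                blurred += padded[dy:dy + image.shape[0],
                                  dx:dx + image.shape[1]]
        variants["blur"] = blurred / (k * k)
    return variants

img = np.arange(16, dtype=float).reshape(4, 4)
out = augment(img, blur_size=3)
```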

Table 5. Metrics of the models trained with YOLOv8

The mAP50 returns the average precision of the detections with an IoU threshold of 0.5, while the mAP50-95 returns the results of the mAP with IoU thresholds of 0.5 to 0.95. Finally, the IoU indicates the extent to which the prediction box actually intersected the region with the anomaly. Of particular note were the Angiectasia, Foreign Body and Ulcer classes, which achieved 67.3%, 87.2% and 73.1% IoU, respectively, demonstrating the models’ ability to detect the majority of pathological regions in their respective class.

At the end of detection model training, the next step is to determine the coordinates of the highest-confidence predicted box in the images classified as abnormal. To evaluate the model's prediction against the true pathological region, the IoU was used. The results obtained from the predictions per model and the overall evaluation of all the models can be seen in Table 6.

Table 6. Evaluation of detection regions

To evaluate overall performance, note that normal images that pass through the classification filter into the detection stage are counted as a 100% true-positive value if no prediction is made for them, since there are no abnormalities to be found. Conversely, if any prediction is made on an image without pathology, a 0% true-positive value is counted. In addition, abnormal images classified as normal automatically count as a 0% hit in the method's overall performance, since the detection models never get the chance to predict the pathologies present in these images.

Some examples of the acquired results can be seen in Fig. 4. The red boxes represent the model's prediction, while the green boxes show the true anomaly region, and the IoU value of each prediction is highlighted in the top-left corner of the images. Although the overall result was not yet satisfactory, this work represents an initial step in the detection of pathological regions in WCE images, especially with regard to the criteria for splitting the dataset.
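The per-image scoring rules can be made explicit as a small function; the function name and argument structure are illustrative, not from the original:

```python
def overall_iou_contribution(is_abnormal, classified_abnormal, predictions,
                             iou_with_truth=0.0):
    """Score one image under the evaluation rules described above:
    - normal image reaching the detectors: 1.0 if no box is predicted,
      0.0 otherwise;
    - abnormal image classified as normal: 0.0 (detectors never run);
    - abnormal image with predictions: its IoU against the ground truth.
    """
    if not is_abnormal:
        return 1.0 if not predictions else 0.0
    if not classified_abnormal:
        return 0.0
    return iou_with_truth if predictions else 0.0

# Hypothetical cases
assert overall_iou_contribution(False, True, []) == 1.0       # clean pass-through
assert overall_iou_contribution(False, True, ["box"]) == 0.0  # false detection
assert overall_iou_contribution(True, False, []) == 0.0       # missed by classifier
assert overall_iou_contribution(True, True, ["box"], 0.73) == 0.73
```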

Fig. 4. Sample results of pathology region prediction

4.4 Discussion

The results shown in Tables 5 and 7 demonstrate the superiority of the dedicated per-class models over the single model trained for all classes in detecting pathologies. For the Angiectasia class, the dedicated model obtained a precision of 94.6% and a mAP50 of 79.3%, in contrast to the single model, which showed a precision of only 21.1% and a mAP50 of 29.9%. Although the recall of the single model is slightly higher, the combination of high precision and mAP50 in the dedicated model indicates a more accurate and reliable identification of angiectasia images. In erosion detection, the dedicated model was also superior, with a precision of 46.6% and a mAP50 of 16.9%, while the single model showed substantially lower values.

Table 7. Metrics of the model trained for all classes with YOLOv8

For the Foreign Body class, although both models showed high performance, the dedicated model was still superior. In the Lymphangiectasia and Ulcer classes, the dedicated models significantly outperformed the single model in all evaluated metrics. As for the IoU, the dedicated models were superior in detecting regions containing anomalies in all classes, especially Erosion and Ulcer, where the model trained for all classes obtained negligible results while the dedicated models detected more than half of the pathological regions. These results highlight that training dedicated models for each class not only improves efficiency in detecting specific pathologies but also reduces the false-positive rate and increases confidence in the detections made. Therefore, using dedicated models for pathology detection in WCE is a more effective solution, providing more reliable results than a single model trained for all classes.

In addition, Table 8 compares the proposed method with related binary classification works, all using Kvasir-Capsule. Each study contributed different methodologies and made significant advances in the field. Despite the stricter split criterion adopted in this study, the proposed method still showed good performance compared to state-of-the-art models. In comparison, Muruganantham et al. [17] achieved 94.70% accuracy, Fonseca et al. [5] obtained 97.00% precision, Oukdach et al. [18] achieved an F1-Score of 97.00%, and Jain et al. [7] achieved 98.00% across all metrics with a hybrid network and residual attention approach, all using the same random division of data.

Table 8. Comparison of binary classification works using Kvasir-Capsule

It is emphasized that the patient splitting criterion adopted in the proposed method is crucial to avoid data leakage between training and testing, ensuring more reliable results when evaluating the model’s performance. This method ensures the integrity of the results by mitigating potential biases introduced by less rigorous data-splitting practices. It is worth noting that during the search for research related to this work, no other work was identified that utilized the Kvasir-Capsule dataset and proposed or evaluated a method to classify abnormalities and detect the region in which the pathology is found. As such, this work aims to propose advances in the area of pathology detection in WCE images of the GI Tract, presenting some baseline results for future work in the area.

5 Conclusion

In conclusion, early diagnosis is essential so that the necessary treatment can be carried out, but the detection of GI Tract pathologies in WCE images remains a challenging task due to the complex nature of the gastrointestinal system and the low quality of the images generated by this exam. The proposed method proved to be a promising tool for detecting GI Tract pathologies present in WCE images. The F1-Score of 90% obtained in the classification stage indicates that the EfficientNetV2B0 model is able to help in the early identification of pathological images, allowing the detection models dedicated to each class to work on the images that present the diseases. Thus, the detection showed the ability to find some of the pathological regions based on the strategy of treating the prediction region as the one with the highest confidence among the trained detection models.

As future work, in order to build on the results obtained in this study, an expansion of the proposed method is intended, with pre-processing methods to improve the models’ ability to identify pathologies. In addition, it is proposed to investigate multi-class classification methods to serve as a step between the first classification and detection, to use the best detection model for the frame analyzed. It is also planned to test the method on other datasets in order to validate the robustness of the proposed method further. Therefore, it is understood that this work can help health specialists in the task of identifying pathologies in WCE images, potentially reducing the workload on doctors and improving patient outcomes through early and accurate diagnosis.