1 Introduction

The World Health Organization (WHO) estimates that between 6 and 7 million people are infected with Chagas disease (CD). CD is endemic in Brazil and Latin America, and its incidence is increasing steadily across North America, Europe, Japan and Australia.

CD, caused by the parasite Trypanosoma cruzi, is a persistent and potentially lethal illness. Up to 30% of infected individuals, roughly 1.5 million people worldwide, develop Chagas cardiomyopathy (CC). This condition significantly compromises the health of infected individuals, shortening lifespans and causing death through sudden cardiac death (SCD) [27].

SCD is one of the leading causes of death worldwide, often presenting as an abrupt collapse with a documented loss of vital signs, which may lead to attempts to restore circulation through cardiopulmonary resuscitation (CPR) [8]. In the context of CD, SCD accounts for approximately 45% of deaths among patients diagnosed with CC [19], with an occurrence rate (2.4% per year) higher than that of the general population [20].

The search for early and advanced treatments for patients with CC who are susceptible to SCD has led to measures such as implantable cardioverter-defibrillators (ICDs), which can extend patients’ lifespans [20]. However, these interventions are limited by the unpredictability of SCD, a challenge that is well-documented in the literature [12, 14, 19]. In this context, providing systems that aid in the diagnosis of SCD risk in CC patients becomes a valuable tool, especially given the high prevalence of this disease in some countries and the unpredictable nature of SCD.

Therefore, this study proposes a model architecture that allows for the incorporation of various data types (tabular and serial data) and hybrid ML models, eventually supporting clinical decision-making in the context of SCD in CC. Additionally, this work developed a first prototype of a user-friendly computational system to assist doctors in patient care. The system utilizes a predictive model built with a minimal feature set, focusing on patients not considered at high risk of SCD.

By combining the output of a recurrent neural network (RNN), which effectively captures temporal features, with tabular data, the approach leverages both sequential and non-sequential information. This combined approach ensures the model’s scalability to handle diverse input data types and tasks. Consequently, it enables the exploration of potential impacts and the relevance of new features for predicting SCD risk in CC patients.

This study is organized into five sections. The first section provides an introduction to the research problem. The second section discusses related works that align with the themes of the present study. The third section explores the methodological aspects of the research, outlining the proposed model and system architecture. The fourth section presents the experimental results, detailing the system’s development and the model’s performance. Finally, the last section offers concluding considerations of the developed model and suggestions for future work.

2 Related Work

Intelligent Systems (IS) with machine learning (ML) have a significant impact on assisting diagnosis in various areas of healthcare. Their use can lead to safer and more precise medical practices [13]. In the specific context of CC, some studies have shown that ML algorithms can classify patients with CD with the same or higher precision than physicians [3].

Despite the global significance of SCD [15], its occurrence in the specific context of CC has been scarcely explored, especially regarding the use of ML computational tools by end-users. Furthermore, a comprehensive review of the literature reveals a current lack of studies dedicated to addressing this knowledge gap.

One study took a limited approach, as it only performed classic linear analysis and did not fully explore the potential of combining different parameters [23]. Another study utilized eight variables from 78 patients, applying heart rate variability (HRV) and heart rate turbulence (HRT) techniques to ECG-Holter recordings to investigate SCD in Chagas heart disease (ChHD). The attribute set was reduced using forward and reverse stepwise approaches, and the work employed the k-nearest neighbors (KNN) classifier with leave-one-out (LOO) cross-validation (CV) [1]. However, the absence of reported sensitivity and of a methodology for handling imbalanced data makes this study unfeasible to reproduce.

A third study employed several classic ML methods in a dataset with 218 patients, using seven cardiac restitution metrics (CRM) extracted from a 4-hour segment of a 24-hour ECG-Holter along with clinical data. This study reported sensitivity results of 90% and identified the Gaussian Naive Bayes algorithm as the best-performing model [4]. However, there was no exclusion of high-risk patients in this research.

[18] investigated the use of action potential duration restitution (APDR) dynamics in a cross-sectional study of 221 patients. The authors extracted APDR metrics from 4-hour ECG segments and analyzed them alongside clinical variables using a classification tree. The study identified %QTend TendQ > 1 as the most relevant predictor of SCD, with an AUC of 0.96. However, it is limited by its inclusion criteria, which required that patients: (1) had regular appointments at the Clementino Fraga Filho University Hospital (HUCFF-UFRJ), Rio de Janeiro, Brazil; (2) had resided outside an area with endemic Chagas disease transmission for at least 20 years; (3) were classified as having a low-intermediate Rassi risk score; (4) had their vital status checked until the end of February 2017; and (5) had an established diagnosis of Chagas cardiomyopathy (CC), confirmed by at least two positive serologic tests for antibodies against Trypanosoma cruzi and characteristic 12-lead electrocardiogram changes.

3 Methodology

Regarding the scientific method, this work adheres to the action research methodology, which aims to propose and test solutions to problems within a specific context [17]. In its approach, this study is quantitative, centered on problem-solving through theory testing and the numerical quantification of variables [10]. The project was organized into two parts, developing prediction models and creating a prototype web system, which are detailed below.

3.1 Data Collection

The study utilized data from a clinical follow-up program conducted between 1992 and 2023 at Clementino Fraga Filho University Hospital of the Federal University of Rio de Janeiro. Patients with a high degree of cardiac compromise were excluded, as they already had indications for potential interventions to ensure survival, such as CRT implantation or other treatments. Samples meeting the following criteria were removed from the dataset: NYHA class 4, severe dysfunction classification, high Rassi score, 2005 Classification Guideline of C or D, and all patients with a Teicholz EF below 0.40. After this exclusion, 120 patients remained for analysis, including 19 (16%) who experienced SCD and 101 (84%) who did not. The gender distribution among these patients was 42 (35%) males and 78 (65%) females.

3.2 Standardization Process

The dataset used consists of two subsets: the first derived from tabular data and the second from time series extracted from ECG-Holter signals using the same methods as [16, 21]. Regarding the number of attributes, the dataset consists of 49 tabular features and 27 time series features.

All features are grouped into 7 categories: clinical data, echocardiogram (ECHO), electrocardiogram (ECG), electrocardiogram Holter (ECG-Holter), cardiology guideline classification (CGC), treatments, and cardiac restitution metrics (CRM). Notably, the CRM group exclusively contains the 27 time series features, while the 49 tabular features are distributed among the other six categories. The groups and features used in this work are detailed in Table 1.

Table 1. Description of the sequential and non-sequential features used in this work, grouped into 7 groups. The 49 tabular features are classified into the groups: clinical data, ECHO, ECG, ECG-Holter, CGC, and treatments. The 27 time series features extracted from ECG-Holter are exclusively in the CRM.

Initially, 14 features were excluded due to confidentiality concerns or because they contained information irrelevant to this study, resulting in 49 final tabular attributes. Among these attributes, 30 were binary, 12 were scalar, and 7 were categorical. The seven categorical attributes underwent one-hot encoding, transforming each category into a binary attribute and thereby expanding the dataset to 63 attributes. Subsequently, the data were normalized using the MinMax technique.
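A minimal sketch of this preprocessing step, using a toy three-patient table with hypothetical column names (the real dataset has 49 attributes): the categorical column is one-hot encoded and all columns are then MinMax-normalized.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy table: one binary, one scalar, and one categorical attribute
# (column names are illustrative, not the study's actual features).
df = pd.DataFrame({
    "nsvt": [1, 0, 1],            # binary
    "age": [55.0, 62.0, 48.0],    # scalar
    "nyha": ["I", "II", "III"],   # categorical
})

# One-hot encoding: the single categorical column becomes 3 binary columns.
df = pd.get_dummies(df, columns=["nyha"])

# MinMax normalization: every column rescaled to [0, 1].
X = MinMaxScaler().fit_transform(df)
print(X.shape)  # (3, 5): 2 original columns + 3 one-hot columns
```

The same expansion explains the paper's attribute count: 49 tabular attributes grow to 63 after the 7 categorical columns are one-hot encoded.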

Regarding the time series extracted from the ECG-Holter, each series initially had 30 attributes, but two were excluded—one constant and the other the target variable indicating SCD. This left 28 final attributes: 27 corresponding to CRM and 1 indicating the time of each sample in the series; all of these attributes were scalar. These 27 time series features represent an initial result of an ongoing study that is processing ECG-Holters from Chagas disease patients. In this work, we propose an architecture that utilizes both sequential and non-sequential data for an initial analysis of the results. This architecture concatenates the output of a recurrent neural network (RNN) with tabular data as input to a multilayer perceptron (MLP).

3.3 Tools

The Python programming language was chosen for this study due to its widespread adoption in the field of ML. The Google Colab platform was utilized, taking advantage of its cloud-hosted Jupyter notebooks and the CPU resources provided by Google, which facilitated the execution of experiments for the prediction model.

For model development, two primary libraries were used: scikit-learn and PyTorch. These libraries were selected for their wide range of algorithms and functionalities available for developing and evaluating ML models. The predictive model with the best performance was deployed through an Application Programming Interface (API) using the pickle5 library.

Regarding the system prototype, a REST architecture with Docker was defined for both the backend and frontend applications. The Flask framework was employed for backend construction, integrated with a PostgreSQL database. Frontend development was implemented using Next.js.
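The deployment idea described above can be sketched as follows: a trained model is serialized with pickle and served behind a Flask endpoint. The model, route name, and payload shape here are illustrative stand-ins, not the paper's actual API.

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.dummy import DummyClassifier

# Stand-in for the trained predictive model (the study pickles its best MLP).
model = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)  # loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical payload: the feature vector expected by the model.
    features = request.get_json()["features"]
    proba = clf.predict_proba([features])[0].tolist()
    return jsonify({"risk": int(clf.predict([features])[0]), "proba": proba})

if __name__ == "__main__":
    app.run()  # in the prototype, backend and frontend run under Docker
```

Loading the pickle once at startup keeps per-request latency low; the frontend (Next.js in the prototype) only needs to POST the feature values.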

3.4 Classification Experiments

The experiments to create the predictive model were conducted under two scenarios. In each scenario, the data were split into a 67% training set and a 33% test set for robust evaluation, and all experiments were repeated 10 times to average the metrics. Synthetic data generation was not considered for either scenario due to the specific nature of the data, scope limitations, and time constraints.

In the first scenario, exclusively tabular data from 120 patients was used. Three models were employed: Multilayer Perceptron (MLP), XGBoost (XGB), and Random Forest (RF). The selection of tree-based models was motivated by their established performance in handling tabular data, as documented in the literature [7, 25].

For the experiments in this scenario, scikit-learn was utilized. The hyperparameters for the classification algorithms were chosen from the following values:

  • Multilayer Perceptron (MLP): hidden_layer_size: (200, 50, 30), (100, 50, 10), (100, 50), (200, 100), (500, 250), (20,), (50,), (100,), (10,), (200,); activation: tanh and ReLU; solver: SGD and Adam; alpha: 0.0001, 0.005, 0.05; learning_rate: constant and adaptive.

  • Random Forest (RF): n_estimators: 100, 300, 500, 800, 1200; max_depth: 5, 8, 15, 25, 30; min_samples_split: 2, 5, 10, 15, 100; min_samples_leaf: 1, 2, 5, 10; max_features: 1, 2, 3, 4, 5.

  • XGBoost (XGB): min_child_weight: 1, 5, 10; gamma: 0.5, 1.0, 1.5, 2.0, 5.0; subsample: 0.6, 0.8, 1.0; colsample_bytree: 0.6, 0.8, 1.0; max_depth: 3, 4, 5, 8, 12; eta: 0.3, 0.2, 0.1, 0.05, 0.01, 0.005.
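The grid search over these hyperparameter values can be sketched with scikit-learn, shown here for a trimmed version of the Random Forest grid on synthetic data (the real study used the 120-patient dataset and the full value lists above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 120-patient tabular dataset.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Trimmed grid for brevity; the paper's grid spans the full value lists.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 8],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [1, 2],
}

# Recall is used as the scoring metric, matching the study's priority.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to the MLP (`MLPClassifier`) and XGBoost grids; only `param_grid` and the estimator change.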

For the second scenario, tabular data and time series data from each patient’s ECG-Holter exams were utilized. In this scenario, we present a formal description of a composite system employing an Elman RNN and a Multilayer Perceptron (MLP) for tasks involving time series and tabular data. The overall process can be decomposed into two stages: (1) the Elman RNN, responsible for handling the inherent sequential nature of the time series data [5], and (2) the Multilayer Perceptron, which integrates the outputs generated by the RNN with the provided tabular data.

The purpose of choosing a hybrid model using RNN + MLP is to observe, through the use of the RNN, whether the time series extracted from ECG exams add value to the SCD problem in CC when concatenated with tabular data, in order to obtain a more efficient multimodal hybrid model. Another point is that backpropagation ensures that both parts of the model (RNN + MLP) are adjusted in a coordinated and efficient manner, maximizing the model’s ability to capture and integrate temporal and tabular information.

The output of the RNN at time step \( t \) is given by Equation (i) below, where \( x_t \) is the input vector at time \( t \), \( h_t \) is the hidden state at time \( t \), \( y_t \) is the output at time \( t \), and \( b_y \) is the bias vector for the output.

$$\begin{aligned} y_t = W_{hy} h_t + b_y \qquad (i) \end{aligned}$$

The hidden state \( h_t \) is given by Equation (ii), where \( W_{xh} \) is the weight matrix from input to hidden state, \( W_{hh} \) is the weight matrix from the previous hidden state to the current one, \( W_{hy} \) is the weight matrix from hidden state to output, and \( b_h \) is the bias vector for the hidden state.

$$\begin{aligned} h_t = \sigma (W_{xh} x_t + W_{hh} h_{t-1} + b_h) \qquad (ii) \end{aligned}$$

Let the vector \( z_t \) be the concatenation of the RNN output (\( y_t \)) and the tabular data (\( d_t \)), as in Equation (iii):

$$ z_t = \left[ y_t || d_t \right] \qquad (iii) $$

The input to the MLP is \(a^{(0)} = z_t\). For an MLP with \( L \) hidden layers and an output layer \( (L + 1) \), the final output \( \hat{y} \) is given by Equation (iv), where \( W^{(l)} \) is the weight matrix of layer \( l \), \( b^{(l)} \) is the bias vector of layer \( l \), \( a^{(l)} \) is the activation vector of layer \( l \), and \( \sigma \) is an activation function, such as ReLU, sigmoid, or tanh.

$$ \hat{y} = a^{(L+1)} = \sigma (W^{(L+1)} a^{(L)} + b^{(L+1)}) \qquad (iv) $$

The process used in this work can be summarized in four steps: (i) pass the time series through the RNN to obtain \( y_t \); (ii) concatenate \( y_t \) with \( d_t \) to get \( z_t \); (iii) pass \( z_t \) through the MLP to obtain the final output \( \hat{y} \); and (iv) compute the error and propagate the gradients back through the MLP and then through the RNN to update the weights and biases.
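The four steps above can be sketched in PyTorch as a single module. Layer sizes and the batch dimensions are illustrative; the study's actual hyperparameters came from the grids reported later.

```python
import torch
import torch.nn as nn

class HybridRNNMLP(nn.Module):
    """Elman RNN encodes the time series; its last output is concatenated
    with the tabular vector and fed to an MLP (Equations i-iv)."""

    def __init__(self, n_series=27, n_tabular=63, hidden=32, n_classes=2):
        super().__init__()
        self.rnn = nn.RNN(n_series, hidden, batch_first=True)  # Elman RNN
        self.mlp = nn.Sequential(
            nn.Linear(hidden + n_tabular, 50), nn.ReLU(),
            nn.Linear(50, n_classes),
        )

    def forward(self, series, tabular):
        y, _ = self.rnn(series)                        # steps (i)-(ii)
        z = torch.cat([y[:, -1, :], tabular], dim=1)   # step (iii): z = [y_t || d_t]
        return self.mlp(z)                             # step (iv): MLP output

model = HybridRNNMLP()
series = torch.randn(4, 10, 27)   # batch of 4 series, 10 time steps, 27 CRM vars
tabular = torch.randn(4, 63)      # the 63 one-hot-encoded tabular features
logits = model(series, tabular)

# One backward pass updates both parts jointly, as described in the text.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()
```

Because the concatenation sits inside one computation graph, `loss.backward()` propagates gradients through the MLP and then the RNN in a single coordinated step.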

In scenario 1, the experiments were conducted using a feature selection process, considering subsets of 5, 10, 15, 20, and 25 features, as well as all 63 features. The GridSearchCV method was employed to select the top features in each configuration. All experiments were evaluated using leave-one-out (LOO) and 5-fold CV.
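The two evaluation schemes can be sketched with scikit-learn; the classifier and synthetic data below are stand-ins for the study's models and the 120-patient dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in: 120 samples with 5 selected features.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leave-one-out: 120 folds, each holding out a single patient.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# 5-fold cross-validation on the same data.
kfold_acc = cross_val_score(clf, X, y, cv=5).mean()
print(round(loo_acc, 3), round(kfold_acc, 3))
```

LOO is attractive at this sample size because every patient contributes to both training and testing, at the cost of 120 model fits per configuration.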

In the second scenario, two approaches were executed. In the first approach, the output of the RNN was directly concatenated with all tabular data features. In the second approach, the RNN output was concatenated only with each of the top selected features from the MLP model in scenario 1. Both the RNN and MLP were implemented using PyTorch. To facilitate data manipulation, two tensors were created for the tabular data: one containing the target labels and the other containing the attributes.

During model training in scenario 2, gradients were computed automatically from the CrossEntropyLoss loss function. Backpropagation was carried out with loss.backward(), optimizing the model by learning from its errors. The gradients were propagated through the entire network, encompassing both the RNN layer and the MLP.

To handle the time series extracted from the ECG-Holter exams of the 120 patients, two tensors were created as well. One tensor stored the values of the 27 CRM variables, while the other stored the length of each time series. Hyperparameter selection was carried out using the ParameterGrid function from scikit-learn. The RNN hyperparameters were chosen from the following values:

  • Recurrent Neural Network (RNN): hidden_layer_size: 8, 10, 16, 32, 50, 64, 100, 128; num_layers: 1; nonlinearity: ReLU; solver: SGD and Adam; learning_rate: 0.0001, 0.005, 0.001, 0.05, or 0.01.
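Enumerating this grid with scikit-learn's ParameterGrid looks like the sketch below; the training loop itself is omitted.

```python
from sklearn.model_selection import ParameterGrid

# The RNN grid from the list above: 8 * 1 * 1 * 2 * 5 = 80 configurations.
grid = ParameterGrid({
    "hidden_layer_size": [8, 10, 16, 32, 50, 64, 100, 128],
    "num_layers": [1],
    "nonlinearity": ["relu"],
    "solver": ["sgd", "adam"],
    "learning_rate": [0.0001, 0.005, 0.001, 0.05, 0.01],
})
print(len(list(grid)))

for params in grid:
    # Build and evaluate the PyTorch RNN+MLP with these params (omitted).
    pass
```

Unlike GridSearchCV, ParameterGrid only generates the combinations, which fits the PyTorch setting where the training loop is written by hand.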

4 Results

4.1 Model Prediction Performance

Experiments were conducted using all features (63) and subsets (5, 10, 15, 20, and 25) of the most relevant features selected through a feature selection process. After the feature selection process was completed, the subset of features that yielded the best performance for each algorithm was selected. Interestingly, certain features were frequently selected across the various top-feature combinations that yielded the best performance for the models (MLP, RF, XGB). Table 2 provides a detailed breakdown of the selected features for each model.

Table 2. Selected tabular features present in the best performance of the experiments conducted with the models (MLP, RF, XGBoost).

The results demonstrate four features (NSVT, Amiodarone, “Moderate Classification”, and the “B2 2005 Classification Guideline”) were selected in all models. Additionally, the features Total VE and “Intermediate Rassi Score” appear in two of the models. This consistency indicates the potential relevance of these features to the prediction of SCD, which is crucial for comparing and validating these selected attributes with the state-of-the-art in medical literature.

The best results from the feature selection process and from using all the features are presented for each model. In the first scenario, the best performance of the feature selection process was achieved by the Multilayer Perceptron (MLP) with 5 features, the Random Forest (RF) with 5 features, and the XGBoost (XGB) with 10 features. These models were trained using only tabular data as input. In the second scenario, the RNN output was concatenated with the 5 best-performing features selected from the MLP model in scenario 1 and all 63 tabular features.

To select the best configuration among those tested, we prioritized recall (sensitivity) and AUC (area under the ROC curve). This choice reflects the critical need to accurately identify patients at high risk for SCD, maximizing the detection of potential cases, consistent with other studies [1, 4, 6, 22, 23]. Table 3 displays the models used, the number of features employed, and the evaluation metrics, including ACC (accuracy), AUC, recall (sensitivity), precision, and F1 score, along with their respective results and standard deviations. Note that the feature count of 63 in the table refers to the 49 tabular features after one-hot encoding.

Table 3. Prediction results of the XGB, MLP, and RF models in scenario 1 using only tabular data. Results are presented with all 63 features and with the best feature-selection result, using the two types of cross-validation. (S = scenario; NF = number of features)

Considering only scenario 1 with tabular data, the RF model performed best: with 5 features it obtained 89.33% recall and 95.38% AUC, although with a large standard deviation in recall of almost 20%. In second place, the XGB model with 10 features achieved 83.85% recall and 91.44% AUC, performing slightly better than the MLP with 5 features, which reached 83.68% recall and 90.51% AUC.

When evaluating the three models (MLP, RF, XGB), we observed that feature selection did not result in a significant performance loss; in fact, it improved the recall metric. This finding suggests that some of the features might not be relevant to the context. The high dimensionality of data can be a challenge for developing intelligent systems, and reducing the number of features offers several advantages, including creating systems that are interpretable for domain experts, leading to faster, more accurate decision-making and improved patient care.

In the second scenario, where the time series are added, the experiments obtained 91.63% recall with all features (27 time series + 63 tabular features) and 90.13% with selected features (27 time series + 5 tabular features). Interestingly, the approach that concatenates all 63 tabular features yielded better results on all metrics than concatenating only the 5 selected features.

Analyzing the results of both scenarios, the hybrid RNN+MLP model of scenario 2, which uses both sequential and non-sequential information, improved recall. To assess the significance of this improvement, we performed hypothesis tests at a significance level (alpha) of 0.05, using the Wilcoxon signed-rank test [26] and Student's t-test [9]. The null hypothesis (H0) was defined as follows: the recall obtained by combining sequential and non-sequential data is not significantly higher than the recall using only non-sequential data.

Fig. 1. ROC curves from all models (MLP, RF, XGB, RNN+MLP) in scenarios 1 and 2.

The p-values obtained for the RNN+MLP model with 27 temporal features and 63 tabular features were highly significant (1.95e-3 for the Wilcoxon test and 1.47e-5 for Student's t-test), leading to a strong rejection of H0. Similarly, for the model with 5 features, the p-values (1.95e-2 for the Wilcoxon signed-rank test and 7.21e-3 for Student's t-test) also led to rejecting H0. These results support the conclusion that combining time series and tabular data yields a statistically significant improvement in recall. Figure 1 shows the ROC curves for all models (MLP, RF, XGB, RNN+MLP) in scenarios 1 and 2.
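The one-sided tests can be reproduced with scipy as sketched below. The recall values here are illustrative placeholders standing in for the ten repeated runs of each configuration, not the study's actual measurements.

```python
from scipy import stats

# Hypothetical per-run recall values (10 repetitions of each configuration).
recall_tabular = [0.80, 0.83, 0.79, 0.85, 0.82, 0.81, 0.84, 0.80, 0.83, 0.82]
recall_hybrid = [0.90, 0.92, 0.89, 0.93, 0.91, 0.90, 0.92, 0.91, 0.93, 0.90]

# H0: hybrid recall is not higher than tabular-only recall (one-sided tests).
w = stats.wilcoxon(recall_hybrid, recall_tabular, alternative="greater")
t = stats.ttest_rel(recall_hybrid, recall_tabular, alternative="greater")
print(w.pvalue < 0.05, t.pvalue < 0.05)  # reject H0 at alpha = 0.05?
```

Both tests are paired, matching the design in which the two configurations are evaluated on the same repeated splits.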

Table 4. Performance of the machine learning model of studies on predicting SCD risk, considering sensitivity as a reference parameter.

For comparative purposes, Table 4 shows the performance of SCD-related works in ChHD and non-ChHD contexts using the same parameter (recall) as this study. The results of both our approaches were similar to those of the other works, with the recall of RNN+MLP (27 temporal features + 63 tabular features) being the second highest. Despite methodological differences, our results appear relevant to the SCD scenario in ChHD.

4.2 User Interface

To empower domain experts with the ability to predict SCD in CC, the MLP model with 5 features from scenario 1 was exported using the Pickle5 library. This exported model served as the foundation for an initial web prediction system. We propose a prototype intelligent system with six key functionalities: Login, Patient Registration, Appointment Scheduling, Exam Registration, Prediction, and Model Visualization. The application's login uses OAuth authentication with the Google platform, which offers secure and convenient access by leveraging existing Google credentials. Upon successful login, the user is redirected to the application's main screen, depicted in Fig. 2.

Fig. 2. Home screen that provides access to all system functionalities.

This screen serves as a central hub for managing patients. It lists all registered patients, displaying their names, birth dates, last appointment details, and quick access buttons for requesting relevant cardiac exams (e.g., ECG, ECO, ECG-Holter) to aid in SCD risk assessment. Additionally, users can register patient data and input clinical or exam-related information, allowing the system to function as basic medical record software.

Fig. 3. Cardiology risk prediction listing screen.

Regarding the SCD risk prediction functionality, users can access the dedicated screen by clicking 'Predict' in the top menu of the main screen, as depicted in Fig. 3. This screen displays a grid summarizing all SCD risk predictions made in the system. Each entry shows the patient's name, prediction date, predicted result (Risk or Non-Risk, with probability), and a button to view the feature values used for the prediction.

Fig. 4. Add prediction screen.

After clicking the "Predict +" button, the system redirects the user to a screen for inputting new prediction data (Fig. 4). This screen includes all fields used as input for the predictive model, including patient demographics (name, physician) and the five selected tabular features (NSVT, Amiodarone classification, Rassi score). The simple, user-friendly prototype was developed as a web platform to facilitate easy access and use by doctors, ultimately assisting them in patient care.

5 Conclusion

This work proposes the use of Multimodal and Hybrid Models to assess SCD risk in non-high-risk CC patients. Furthermore, a prototype IS has been developed to support healthcare professionals in identifying patients with potential SCD risk, aiding in early diagnosis and intervention.

Encouragingly, despite a limited sample size, the study achieved promising results. Combining time series data with tabular data using a hybrid RNN+MLP model improved recall, precision, and F1-score performance, regardless of feature selection. It is important to note that time series feature extraction is an ongoing area of research, and this analysis represents an initial exploration of the findings. This work serves as a foundation for future research aimed at developing even more robust models that can assist in the early diagnosis of SCD risk in CC patients who are not classified as high-risk. Such models have the potential to address the current limitations in accurately predicting SCD in this patient population.

Future work includes reprocessing the models applying feature selection techniques to the temporal data, comparing the results of classical models with results of other models that have the capability to handle different types of data, applying XAI techniques to enhance the interpretability of predictions for specialists, and exploring other RNN combination approaches to potentially improve model performance and generalizability.