1 Introduction

Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia, impacting millions of people worldwide [31]. Its prevalence escalates with age, affecting approximately 10% of individuals by the age of 80 [12]. Additionally, underlying cardiac conditions, such as coronary artery disease, significantly heighten the risk of developing AF [12]. AF is characterized by erratic electrical impulses in the atria, leading to an irregular and rapid heart rhythm that disrupts blood flow, causing stasis and increasing the risk of thromboembolic events such as ischemic stroke and systemic embolism [16]. This condition is associated with a fourfold increase in mortality risk, as evidenced by a population-based study in Korea [19].

The primary goal in managing AF is to reduce the risk of thromboembolism and stroke, primarily through anticoagulation therapy [3, 17, 20]. Warfarin, a traditional anticoagulant, has been extensively used for this purpose but poses challenges due to its food and drug interactions and the need for frequent monitoring [26]. Maintaining therapeutic levels of warfarin requires regular international normalized ratio (INR) checks and frequent dose adjustments to mitigate bleeding risks. Recently, direct oral anticoagulants (DOACs), such as rivaroxaban, have emerged as more favorable alternatives due to their improved pharmacological profiles [1, 26].

Machine learning has become a valuable tool in healthcare [9], aiding in the development of robust risk prediction algorithms and enhancing our understanding of disease-related features to improve patient outcomes [33]. These techniques are particularly effective in uncovering patterns and associations in large datasets that may be challenging to discern manually [28]. However, the application of machine learning in healthcare demands a focus on explainability due to the high-stakes nature of medical decision-making.

Explainability involves articulating the internal mechanisms and outputs of an algorithm in a way that is comprehensible to humans. For healthcare professionals to trust and effectively utilize these tools in clinical settings, they must understand the rationale behind a model’s predictions. For example, when a machine learning model forecasts a high risk of stroke in an AF patient, clinicians need to grasp the factors influencing this prediction to make well-informed treatment decisions [14, 21]. Trust in these models is built through transparency and interpretability, aligning model predictions with clinical knowledge and patient data [6]. Similarly, clear and transparent communication of these predictions fosters patient adherence to treatment plans [10].

In this study, we built a dataset comprising biomarkers from 195 individuals, including 109 in the control group (without AF) and 86 patients with AF. Among the AF patients, 47 were using warfarin and 39 were using rivaroxaban. The biomarkers used comprise patient data from characterization, blood count, lipid profile, coagulation, inflammatory and cardiac diseases of the individuals. We built high-performance machine learning models using this data (i) to identify biomarkers that differentiate individuals with AF from those without, and (ii) to compare biomarker responses between patients on different anticoagulant therapies (warfarin \(\times \) rivaroxaban). Additionally, we conducted ablation studies to better understand the significance of two major biomarker groups (coagulation and inflammatory) given their known strong correlation with AF [11, 23]. In summary, we utilize synthetic data generation [32], machine learning algorithms [18], and explainable machine learning [22] to create a framework that provides insights into AF pathophysiology and the effects of anticoagulant therapy.

2 Related Work

Machine learning is changing healthcare by enabling the analysis of vast amounts of patient data to uncover patterns and insights that improve diagnosis, treatment, and patient outcomes [2, 29, 34]. In [25], authors used convolutional neural networks to predict the evolution of Alzheimer’s disease, and explainable machine learning to explain patterns within this evolution. In [5], authors employed Raman Spectroscopy to identify traces of proteins and developed machine learning models to find which proteins are more related to melanoma. In [4], authors developed a machine learning model for predicting the risk of death in COVID-19 patients. These models employ blood count data and present high prognostic performance. Authors in [7] developed a new ensemble algorithm to predict the evolution of pain relief in patients suffering from chronic pain.

Specific patterns in patients with AF were already studied using typical procedures in the biomedical literature. In [11], authors showed that platelet and coagulation activation profiles in patients with AF were reduced by warfarin or rivaroxaban use and that patients with AF using rivaroxaban were less hypocoagulable than patients using warfarin. This means that endogen thrombin potential was less reduced in patients using rivaroxaban compared to those taking warfarin. Similarly, authors in [23] showed that patients with AF (both warfarin and rivaroxaban groups) presented increased levels of inflammatory cytokines in comparison with controls. The use of rivaroxaban was associated with decreased levels of inflammatory cytokines in comparison with warfarin. On the other hand, patients with AF using rivaroxaban presented increased levels of the chemokines (MCP-1 in comparison with warfarin users; MIG and IP-10 in comparison with controls). Finally, authors in [8] showed that Hp levels and Hp1-Hp2 polymorphism are not associated with AF, and authors in [24] provided a basis for the study of inflammatory markers that had not yet been well addressed in AF, especially IP-10, besides supporting evidence about molecules that had previously been associated with the disease.

We employ the aforementioned findings to select the biomarkers to be included in our models. To the best of our knowledge, our study is the first to use machine learning models to provide a deep understanding of AF in terms of its known biomarkers and given the possible therapies.

3 Materials and Methods

Let D be a dataset comprising biomarkers from patients, including a control group without AF and patients diagnosed with AF. Specifically, D consists of individuals in the control group and AF patients who are either on warfarin or rivaroxaban. Further, let \(X = \{ x_1, x_2, \ldots , x_n \}\) be a feature vector containing the biomarker values for each individual. Missing values in numerical features are imputed using the mean of the feature within the corresponding class, and categorical features are imputed using the mode of the feature within the same class. Formally, for a numerical feature \(x_i\) in class \( C_k \), the imputation is given by:

$$\begin{aligned} x_i^{(j)} = \frac{1}{|C_k|} \sum _{x_i \in C_k} x_i \end{aligned}$$
(1)

For a categorical feature \( x_i \) in class \( C_k \), the imputation is:

$$\begin{aligned} x_i^{(j)} = \text {mode}(x_i \in C_k) \end{aligned}$$
(2)

The task is formulated as a binary classification problem, in which we are concerned with two distinct contexts: classifying individuals based on the presence or absence of AF, and classifying AF patients based on their medication (warfaring \(\times \) rivaroxaban). The goal is to learn a function \(f(X; \phi ) \rightarrow \{0, 1\}\), where X represents the biomarker features and \(\phi \) represents the learning parameters of the model. We perform ablation experiments to evaluate and better understand the impact of different biomarker subsets on classification performance. Thus, the evaluated models are trained on three different sets of variables:

  • All biomarkers are used in our models.

  • All biomarkers, excluding coagulation biomarkers.

  • All biomarkers, excluding inflammatory biomarkers.

This ablation approach allows us to understand how different groups of biomarkers contribute to the classification task and identify which features are more informative for distinguishing between the patient groups.

3.1 Data

The dataset for our analysis comprises biomarkers from 195 individuals, including 109 in the control group (without AF) and 86 patients with AF. Among the patients, 47 were using warfarin and 39 were using rivaroxaban. The biomarkers used comprise patient data from characterization, blood count, lipid profile, coagulation, inflammatory and cardiac diseases of the individuals and can be found in Table 1.

Table 1. Biomarkers used to build our machine learning models.

Prior to further analysis, missing feature values for patients were imputed based on their respective classes (control, rivaroxaban, warfarin). Our data was divided into two distinct datasets for separate analysis. The first subset \( D_1 \) segregates individuals based on the presence or absence of atrial fibrillation:

$$\begin{aligned} y_i = {\left\{ \begin{array}{ll} 1 & \text {if AF is present} \\ 0 & \text {if AF is absent} \end{array}\right. } \end{aligned}$$
(3)

The second subset, \( D_2 \), includes only AF patients and further classifies them based on their medication:

$$\begin{aligned} y_i = {\left\{ \begin{array}{ll} 1 & \text {if the patient is on warfarin} \\ 0 & \text {if the patient is on rivaroxaban} \end{array}\right. } \end{aligned}$$
(4)

As previously mentioned, for numerical features imputation was performed using the mean value of the feature within the class in either \( D_1 \) or \( D_2 \). For categorical features, the mode of the values within the class was used. By imputing missing values based on class-specific statistics we obtain results more closely related with the underlying data distribution, resulting in more realistic and reliable values.

Synthetic Data and Training Augmentation: To improve the visualization and analysis of explanatory factors, we generate synthetic data and create an expanded dataset \(D'\) using two techniques: Gaussian Copula [27, 30] and Tabular Variational Autoencoder (TVAE) [32]. The Gaussian Copula method models the dependency structure between variables by transforming the original data into a space where the variables follow a multivariate normal distribution. This involves computing the empirical cumulative distribution function for each variable and transforming them using their inverse of the standard normal to produce data that follows a standard normal distribution. We then proceed to estimate the covariance matrix of these transformed variables to capture their dependencies. New samples are generated from a multivariate normal distribution and are transformed back to the original space using the inverse of the initial transformations. Formally, let \(X = (X_1, X_2, \ldots , X_n)\) represent the original data and \(U = (U_1, U_2, \ldots , U_n)\) be the transformed variables that follow a standard normal distribution. The Gaussian Copula C is defined as:

$$\begin{aligned} C(u_1, u_2, \ldots , u_n) = \Phi _{\Sigma }(\Phi ^{-1}(u_1), \Phi ^{-1}(u_2), \ldots , \Phi ^{-1}(u_n)) \end{aligned}$$
(5)

in which \(\Phi \) represents the CDF of the standard normal distribution, \(\Phi ^{-1}\) its inverse, and \(\Phi _{\Sigma }\) is the CDF of the multivariate normal distribution with the covariance matrix.

In contrast, TVAE uses a variational autoencoder to learn the underlying data distribution and generate synthetic samples. It consists of an encoder and a decoder structure, in which the encoder \(q_{\phi }(Z|X)\) maps the input data X to a latent representation Z using a neural network characterized by a mean \(\mu \) and variance \(\sigma ^2\). Formally:

$$\begin{aligned} q_{\phi }(Z|X) = \mathcal {N}(Z; \mu (X), \sigma ^2(X)) \end{aligned}$$
(6)

The decoder \(p_{\theta }(X|Z)\) then maps the latent representation Z back to the original data space, attempting to reconstruct the original data \(\hat{X}\):

$$\begin{aligned} p_{\theta }(X|Z) = \mathcal {N}(X; \hat{\mu }(Z), \hat{\sigma }^2(Z)) \end{aligned}$$
(7)

We create synthetic data points by sampling from the latent space and feeding it to the decoder. For each technique, synthetic samples equivalent to \(33\%\) of the total number of individuals in each subset are generated, resulting in the expanded dataset \(D'\). While synthetic data generation can raise concerns about the representativeness of the data distribution and the potential for introducing bias in real-world applications [15], validating the model on real datasets helps mitigate these issues and ensures its reliability in real-world scenarios.

3.2 Model Training

For each comparison, the model is trained on three different sets of variables: one including all biomarkers, one excluding coagulation markers, and one excluding inflammatory markers. The target variable corresponds to the group that the individual belongs to. Our models were built using an implementation of the LightGBM algorithm [18]. It works by generating a model made up of hundreds of simple decision trees, these trees are integrated into a unified model through boosting [13].

Explainability: As a key point in any health domain application, we employ explainability tools to better understand the decision-making process of the model. Specifically, we employ SHAP [22]. We represent how model \(f(X; \phi )\) explains a phenomenon as a d-dimensional vector \(E(f(X)) = {e_1, e_2, . . . , e_d }\) showing which features are contributing to the model’s prediction. Specifically, \(e_i\) takes a value that corresponds to the influence that the respective feature \(x_i\) had on the model decision. This influence is quantified by calculating SHAP values, which provide a unified measure of feature importance across different instances.

SHAP values are calculated using Shapley values from cooperative game theory, ensuring that each feature’s contribution to the prediction is fairly attributed. For each prediction, the SHAP values are computed by considering all possible combinations of feature inputs, thus capturing the complex interactions between features. In particular, SHAP contains three desirable properties:

  • Additivity: the explanations are truthfully explaining the model and each evaluated instance. This means that the sum of the SHAP values for all features equals the model output for that instance, ensuring that the attributions are consistent with the model’s actual predictions.

  • Consistency: if a model changes such that some feature’s contribution increases or stays the same regardless of the other features, that feature’s attribution should not decrease. This property ensures that SHAP values remain stable and interpretable even if the underlying model changes.

  • Missingness: A missing feature gets an attribution of zero. That is, a feature not present in a model does not impact its output or explanation.

4 Experimental Results

Model performance was evaluated by the area under the receiver operating characteristic curve (AUROC). It is used to show how much the model is capable of distinguishing between competing classes (i.e., patient groups). In our experiments we employed stratified 5-fold cross-validation by dividing the dataset into five parts and, at each iteration, we use four folds for training and the remaining one as the testing set for the model. No hyperparameter tuning was performed. Following, we discuss the results of each study.

4.1 Atrial Fibrillation and Control Group

Table 2 shows the model performance in differentiating AF patients from the control group, with all models achieving high AUROC values. Literature suggests that physiological patterns in AF patients are markedly different from those in the control group, which the trained models were able to learn. Given the high performance of our models, our main interest lies in understanding these distinctive patterns. We used SHAP values for this analysis.

Table 2. Model performance on atrial fibrillation and control group classification.

When using SHAP to find the most significant features using all biomarkers, we found a great impact of IL-4, Peak, MPEs/ul, IL-2, ETP, F1+2, LDLc, IFN, TRIG, IL-10, TGF\(\beta \), and HDLc in model prediction (Fig. 1 (Top)). If we exclude the coagulation biomarkers the most important features are IL-4, IL-10, IFN, Total Cholesterol, IL-6, sICAM-1, LDLc, MCP-1, IL-2, IL-8, GDF-15, and HDLc (Fig. 1 (Middle)). Furthermore, without inflammation biomarkers the most impacting features are Peak, TAFI, F1+2, Lagtime, TRIG, MPEs/ul, PAI-1, Total Cholesterol, sICAM-1, GDF-15, MPP/ul and ETP (Fig. 1 (Bottom)).

These findings suggest that inflammation, lipid metabolism, and coagulation processes are all critical in the model’s predictions. Without coagulation biomarkers, cytokines and lipid measures remain important, indicating that inflammation and lipid metabolism are still significant. Additional markers like sICAM-1 and MCP-1 become more prominent, highlighting the role of inflammation and immune response. Without inflammation biomarkers, coagulation and lipid measures (such as TAFI, F1+2, Lagtime, PAI-1) are significant. This shift underscores the importance of blood clotting mechanisms and lipid metabolism.

Fig. 1.
figure 1

(Top) All features. (Middle) All, but coagulation biomarkers. (Bottom) All, but inflammatory biomarkers

4.2 Warfarin \(\times \) Rivaroxaban

Table 3 shows the model performance in differentiating AF patients on warfarin from AF patients on rivaroxaban. Again, all models achieved very high AUROC values, and given the high performance of our models, our main interest lies in understanding the distinctive patterns in both groups. We used SHAP values for this analysis.

Table 3. Model performance on rivaroxaban and warfarin group classification.
Fig. 2.
figure 2

(Top) All features. (Middle) All, but coagulation biomarkers. (Bottom) All, but inflammatory biomarkers

When using SHAP to find the most significant features using all biomarkers, we found a great influence of IL-2, F1+2, ETP, GDF-15, sICAM-1, RANTES, TNF, t-PA, IL-4, TGF\(\beta \), PLT, and IL-8 in model prediction (Fig. 2 (Top)). When excluding the coagulation biomarkers the most impacting features became IL-2, IL-4, Genotype Hp (1), IL-6, TGF\(\beta \), IFN, Haptoglobin (Hp), MCP-1, PLT, Dyslipidemia (hypercholesterolemia), TRIG and HAS (Fig. 2 (Middle)). Furthermore, without inflammation biomarkers the most impating features are F1+2, ETP, ttPeak, t-PA, Peak, Dyslipidemia (hypercholesterolemia), TRIG, PLT, Total Cholesterol, HDLc, LDLc, and TAFI (Fig. 2 (Bottom)).

5 Conclusion

This study underscores the potential of machine learning models in providing a deeper understanding of atrial fibrillation (AF). By leveraging ML models, we successfully identified and analyzed key biomarkers that differentiate AF patients from healthy individuals and elucidate the effects of different anticoagulant therapies. Our high-performance models demonstrate robust predictive capabilities, offering significant insights into the distinct biomarker patterns associated with warfarin and rivaroxaban treatments. The application of explainable machine learning methods enabled us to uncover critical factors influencing biomarker behavior, providing a deeper understanding of the pathophysiological processes in AF and the impacts of specific treatments (we can clearly see patients using rivaroxaban as having lower inflammation markers in contrast to those using warfarin). These insights contribute to a more nuanced view of how anticoagulant therapy affects biomarker profiles, which can inform clinical decision-making and optimize treatment strategies. Finally, synthetic data generation proved invaluable in augmenting our dataset, enhancing model robustness, and ensuring reliable predictions despite potential limitations in real-world data availability. This approach highlights the efficacy of combining data augmentation techniques with machine learning to overcome data scarcity issues common in clinical research.