1 Introduction

Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder associated with aging. It can cause cognitive and psychiatric symptoms, potentially leading to disability. According to the World Alzheimer Report, the number of AD cases is projected to reach 131.5 million by 2050 [29]. AD is characterized by synaptic damage and neuronal loss in brain regions responsible for cognitive function [33].

Researchers have been looking for genetic markers linked to Alzheimer’s disease (AD) to develop personalized medicine for individuals at risk and enhance their quality of life. The Human Genome Project, launched in 1990, greatly expanded genetic data availability through DNA sequencing. Today, personalized medicine tailors medical treatment to an individual’s genetic characteristics, including factors like Single Nucleotide Polymorphisms (SNPs) that aid in predicting disease risk [22].

This study seeks to find genetic markers, such as Single Nucleotide Polymorphisms (SNPs), that are linked to AD. This type of research is known as a Genome-Wide Association Study (GWAS). SNPs are variations at specific locations in the DNA chain and are classified based on the type of nucleotide substitution that takes place [10]. SNPs are the most commonly occurring type of genetic polymorphism in human genetics, and can impact various human traits and their development in a specific environment. Additionally, SNPs are evolutionarily stable, meaning that there is little variation among different generations [10].

In GWAS, genotypic and phenotypic data are collected from diverse individuals, and quality control procedures ensure the data’s reliability. After GWAS, analyses are performed to interpret the results. These analyses involve adjusting the significance threshold, presenting results in a Manhattan plot, and conducting bioinformatics analyses to identify biological mechanisms underlying observed associations. GWAS is a powerful approach for identifying genetic variants associated with complex diseases or traits, but it is essential to consider various factors carefully to ensure valid and interpretable results.

Advancements in DNA sequencing technologies have resulted in an abundance of genetic data, leading to improvement of personalized medicine [5, 20, 22, 35]. The rapid growth of data has made machine learning algorithms a valuable tool in genetics. Researchers have shown that Machine Learning techniques can predict diseases based on genetic factors. In the domain of machine learning, algorithms are designed to model the intricate associations between risk Single Nucleotide Polymorphisms (SNPs) and disease phenotypes [17]. Supervised learning techniques are employed, where algorithms are trained using labeled datasets to enable precise classification or prediction. Specifically, regression algorithms have been utilized to discern the most significant variables in extensive datasets comprising hundreds or thousands of SNPs [16]. This study harnesses these supervised Machine Learning methods to predict Alzheimer’s Disease by identifying pivotal SNPs.

This study also focused on explainability to identify the most important features, known as Single Nucleotide Polymorphisms (SNPs). To achieve this, several machine learning algorithms that offer insights into feature importance were used, including Random Forest, XGBoost, and Logistic Regression. These algorithms were applied to the datasets to pinpoint the most significant SNPs, based on their feature importance scores [17].

Both Random Forest and XGBoost are renowned for their capability to manage large datasets with numerous features, such as the genetic data containing thousands of single nucleotide polymorphisms (SNPs). The configuration of key parameters such as maximum tree depth (max_depth), number of trees (n_estimators), and the maximum number of features considered for splitting a node (max_features) significantly impacts these models’ ability to detect intricate interactions among SNPs. For example, increasing max_depth enables the models to discern more detailed patterns, potentially identifying complex SNP interactions that hold biological relevance for Alzheimer’s Disease (AD). However, excessively deep trees risk overfitting, capturing noise rather than valid biological signals. Consequently, careful tuning of these parameters is essential to optimize the balance between accuracy and generalization in the models [26, 36].

Although Logistic Regression is not as commonly utilized as tree-based models in genomic studies, we included it in this study to compare and analyze its predictive power. This model was enriched by tuning hyperparameters such as the penalty type (penalty: L1, L2, or ElasticNet) and regularization strength (C), which influence the model’s sparsity. Specifically, L1 regularization promotes feature selection by zeroing out less important features, thereby enabling the model to highlight SNPs most predictive of Alzheimer’s Disease (AD). This approach not only simplifies the genetic model but also focuses on the most biologically relevant markers.

The remainder of this paper is organized as follows: Sect. 2 presents some related studies; Sect. 3 presents our proposed approach; Sect. 4 presents and discusses our results; and finally, Sect. 5 presents our conclusions and future works.

2 Related Work

Jin et al. [18] conducted a systematic exploration of Machine Learning (ML) and deep learning (DL) techniques for identifying and analyzing biomarkers associated with Alzheimer’s Disease. Their review highlights significant research findings, such as the use of LASSO for feature selection [39], and identification of hub genes associated with immune function and neuroinflammation [40]. The study showcases notable algorithms like Differential Gene Selection TabNet for effective gene-based classification of AD.

The study conducted by Araujo et al. [3] focuses on the use of Random Forest algorithms and gene network analysis to investigate the correlations between SNPs and Alzheimer’s Disease. The research employs Random Forest algorithms due to their effectiveness in handling large datasets and managing the complexity inherent in genetic data. The study’s findings suggest that certain SNPs are significantly associated with the disease, which offers potential new insights into its genetic basis.

A study by Sherif et al. [34] found genetic variations linked with Alzheimer’s Disease using a multi-stage system. They used a supervised Bayesian network and discovered the most AD-related SNP. Their results showed that endothelial-based Markov methods were better than naive Bayes and naive tree-fed Bayes, but their work is still ongoing in drug discovery.

The study in [2] uses Machine Learning algorithms to detect Alzheimer’s disease early by analyzing SNPs. The research focuses on identifying genetic markers predictive of the disease’s early onset. The work showcases the efficacy of combining detailed genetic data with precise Machine Learning techniques to enhance early detection strategies.

Our study differs from previous research by focusing solely on genetic data to identify markers for Alzheimer’s Disease (AD), excluding environmental factors and pre-filtered genetic markers. We utilize interpretable models to ensure transparency and compare our findings with BLUPF90, an advanced mixed linear model approach. Our approach emphasizes the impact of genetic data on advancing our knowledge of AD’s genetic foundations and supporting the creation of precise therapeutic and diagnostic approaches.

Fig. 1. Methodology Architecture

3 Methods

The methodology adopted for this study is structured into five stages: (1) Data Acquisition; (2) Data Processing for Quality Control; (3) GWAS using the BLUP family of programs; (4) Application of Machine Learning methods to the dataset identified as optimal in the previous stage; (5) Comparative analysis of the results achieved in the GWAS and Machine Learning stages.

Figure 1 schematizes the mentioned stages and their respective sub-tasks, which will be detailed in the subsequent sections of this article.

3.1 Data Acquisition

The data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [1]. Launched in 2004 as a public-private partnership, ADNI aims to identify whether brain images, biological markers, clinical evaluations, and neuropsychological assessments can be combined to measure the progression of Mild Cognitive Impairment (MCI) and Alzheimer’s Disease in early stages.

The initial genotypic data contained 620,901 variants, with an average of 30,785 missing variants. The input file included 757 individuals, comprising 449 males and 308 females. The phenotype was divided into three categories: Normal (CN), with 214 samples; Alzheimer’s Disease with 175; and Mild Cognitive Impairment (MCI) with 367. These categories were estimated by ADNI using various biomarkers, which are substances, measurements, or indicators of a biological state that can be identified before the onset of clinical symptoms.

In addition to genotypic data, a dataset of individual phenotypes was also obtained. This dataset contains information on sex, age, ethnicity, and gender.

3.2 Data Processing for Quality Control

In this study, quality control (QC) methods are used to preprocess the input data for the genome-wide association study (GWAS). The data include individual IDs, disease stage, and genotype information. The QC filters employed are: Minor Allele Frequency (MAF), which excludes SNPs whose MAF falls below a chosen threshold in order to filter out rare variants; Linkage Disequilibrium (LD), for detecting SNP clusters linked to specific traits; Hardy-Weinberg Equilibrium (HWE), to identify unusual allele frequencies; and Genotype and Sample Missingness checks, which remove entries exceeding predefined thresholds. These measures are crucial for reducing data complexity and improving the reliability of the analysis.
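As a rough illustration of the per-SNP filters described above, the sketch below computes MAF, genotype missingness, and a Hardy-Weinberg p-value for a single SNP. It is not the pipeline used in this study: the 0/1/2 genotype coding, the use of `None` for missing calls, the chi-square (1 degree of freedom) HWE test, and all function names are assumptions made for the example. The default thresholds are the values reported in Sect. 4.1.

```python
import math

def maf(genotypes):
    """Minor allele frequency from 0/1/2 allele counts; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def missing_rate(genotypes):
    """Fraction of individuals with no genotype call at this SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)

def hwe_pvalue(genotypes):
    """Chi-square test (1 df) of observed vs. Hardy-Weinberg expected genotype counts."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 1.0
    p = sum(called) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic SNP: nothing to test
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, maf_min=0.01, miss_max=0.1, hwe_min=5e-6):
    """Keep a SNP only if it clears the missingness, MAF, and HWE filters."""
    return (
        missing_rate(genotypes) <= miss_max
        and maf(genotypes) >= maf_min
        and hwe_pvalue(genotypes) >= hwe_min
    )

# Hypothetical SNPs for 100 individuals each.
balanced = [0] * 25 + [1] * 50 + [2] * 25   # in Hardy-Weinberg proportions
all_het = [1] * 100                         # strong excess of heterozygotes
rare = [0] * 99 + [1]                       # MAF of 0.005
sparse = [None] * 20 + [0] * 40 + [1] * 40  # 20% missing calls

for name, snp in [("balanced", balanced), ("all_het", all_het),
                  ("rare", rare), ("sparse", sparse)]:
    print(name, passes_qc(snp))
```

Sample-level missingness can be handled in the same way by applying `missing_rate` across the SNPs of each individual rather than across the individuals of each SNP.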

3.3 Best Linear Unbiased Prediction (BLUP)

The Best Linear Unbiased Prediction (BLUP) software suite is employed to select the most suitable dataset after applying a series of quality control filters. The selected dataset is then used in machine learning models to enhance the study of Alzheimer’s Disease, focusing particularly on the significance of genetic markers known as single nucleotide polymorphisms (SNPs). These markers are analyzed to understand their influence on the disease.

Consequently, BLUP serves two primary purposes in this study: firstly, to select the best dataset based on a combination of QC parameters, and secondly, to evaluate the significance of SNPs through its predictive modeling capabilities.

The BLUP suite includes several key programs, each playing a vital role in the analysis:

  a) RUNUM: Creates necessary parameter files for other software components.

  b) THRGIBBS1F90: Employs Gibbs sampling to estimate important genetic variations, essential for analyzing both simple and complex genetic traits.

  c) POSTGIBBSF90: Summarizes statistical samples, providing estimates of genetic variation.

  d) BLUPF90 and POSTGS: These programs calculate p-values for each SNP, identifying key genetic markers for Alzheimer’s Disease.

The Genomic Model. The genetic analysis is conducted by applying a genomic model that predicts genetic susceptibility to Alzheimer’s Disease based on SNP data, represented by the equation:

$$\begin{aligned} y = X\beta + Zu + e \end{aligned}$$
(1)

where:

  • y represents the observed traits (phenotypes).

  • X and Z are matrices that connect observations to fixed and random effects, respectively.

  • \(\beta \) represents fixed effects.

  • u is a vector of random effects.

  • e is the error term.

This model facilitates the exploration of how specific SNPs may be associated with Alzheimer’s Disease.
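For readers unfamiliar with how predictions are obtained from Eq. (1), the following is a standard way to write the solution, given here as a sketch rather than a description of the software's internals. It assumes, beyond what is stated above, that \(u \sim N(0, G\sigma _u^2)\) and \(e \sim N(0, I\sigma _e^2)\), where G is a relationship matrix among individuals and \(\lambda = \sigma _e^2 / \sigma _u^2\); the symbols G, \(\sigma _u^2\), \(\sigma _e^2\), and \(\lambda \) are introduced here for illustration only. Under these assumptions, the estimates \(\hat{\beta }\) and predictions \(\hat{u}\) solve Henderson's mixed model equations:

$$\begin{aligned} \begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + \lambda G^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta } \\ \hat{u} \end{bmatrix} = \begin{bmatrix} X'y \\ Z'y \end{bmatrix} \end{aligned}$$

The term \(\lambda G^{-1}\) shrinks the random effects toward zero, more strongly when the residual variance is large relative to the genetic variance.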

Assessing Significance. The significance of SNP associations is quantified through p-values, which give the probability of observing an effect at least as extreme as the one measured if the SNP had no true association with the disease. Lower p-values indicate stronger statistical evidence of an association with Alzheimer’s Disease, suggesting that certain SNPs may play a role in its development.

3.4 Machine Learning Algorithms

In this study, we employed the Logistic Regression, XGBoost, and Random Forest algorithms. These models were trained using a training set (X_train, y_train) consisting of 80% of the labeled data for Alzheimer’s Disease and Cognitively Normal individuals and then internally validated using 5-fold cross-validation. RandomizedSearchCV was utilized to evaluate various hyperparameter combinations, with optimization based on the F1 score (Eq. 2). To prevent warnings associated with division by zero in the F1 score, the zero_division value was set to 1. Finally, each model was externally evaluated on the test set (X_test, y_test). Table 1 provides a comprehensive list of all hyperparameters used for fine-tuning.

$$\begin{aligned} F_1 = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(2)
Table 1. Hyperparameters for Logistic Regression, Random Forest, and XGBoost

For Logistic Regression, we adjusted various settings including the regularization strength (C), penalty type, solver used, and the L1 ratio. The Random Forest model was fine-tuned by adjusting the number of trees, their maximum depth, and the number of features considered at each split to enhance overall performance. Similarly, the XGBoost model was adjusted to prevent overfitting and improve accuracy by optimizing parameters like the number of trees, learning rate, tree depth, and sampling methods.
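The tuning procedure described in this section can be sketched as follows for the Logistic Regression case. This is an illustrative outline under stated assumptions, not the study's code: the genotype matrix and labels are synthetic, the candidate values for C are placeholders rather than the grid in Table 1, and the fixed L1 penalty with the liblinear solver is just one of the configurations mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a genotype matrix: 200 individuals x 50 SNPs coded 0/1/2.
X = rng.integers(0, 3, size=(200, 50)).astype(float)
# Synthetic labels (1 = AD, 0 = CN) driven by the first two SNPs plus noise.
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 200) > 2).astype(int)

# 80% of the labeled data for training, 20% held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# F1 scorer with zero_division=1, as described in the text.
f1_scorer = make_scorer(f1_score, zero_division=1)

search = RandomizedSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},  # placeholder values
    n_iter=4,
    scoring=f1_scorer,
    cv=5,  # 5-fold cross-validation on the training set
    random_state=0,
)
search.fit(X_train, y_train)

# External evaluation on the held-out test set.
test_f1 = f1_score(y_test, search.predict(X_test), zero_division=1)
# With an L1 penalty, SNPs with a zero coefficient are effectively dropped.
n_selected = int(np.count_nonzero(search.best_estimator_.coef_))

print("best C:", search.best_params_["C"])
print("test F1:", round(test_f1, 3))
print("SNPs with non-zero coefficient:", n_selected)
```

The same pattern applies to the tree-based models by swapping the estimator and the parameter distributions.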

3.5 Gene Retrieval

This study retrieved gene information using Biopython, a tool that interfaces primarily with the National Center for Biotechnology Information (NCBI) databases. This approach allowed for the systematic extraction of gene data based on specific Single Nucleotide Polymorphisms (SNPs). The identified genes were compared to the ones present in the existing literature about Alzheimer’s Disease.

This study used data sourced from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), where participant information is pre-anonymized to protect privacy. Our access to and use of this data strictly adhered to all guidelines provided by ADNI, which are designed to comply with ethical standards for research involving human subjects. Since the data was pre-anonymized and publicly available, this study did not require additional ethical approval from an ethics committee.

4 Results and Discussion

4.1 Data Quality Control

To ensure optimal results, a quality control standard was implemented, which included a minor allele frequency (MAF) threshold of 0.01, a linkage disequilibrium (LD) threshold of 85, a Hardy-Weinberg equilibrium (HWE) significance threshold of 5e-6, a sample missingness rate of 0.01, and a genotype missingness rate of 0.1. Under these settings, the SNP rs7918269 produced the most significant outcome, with a p-value of 1.6e-09. Because the choice of QC parameters can affect the results, selecting appropriate values is crucial.
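To put the reported p-value in context, the short check below compares it against a Bonferroni-corrected threshold. The Bonferroni correction and the family-wise error rate of 0.05 are assumptions chosen for illustration, since the adjustment method is not specified here. The initial variant count from Sect. 3.1 is used as the number of tests; because fewer SNPs remain after QC, this gives a stricter threshold than the post-QC count would.

```python
# Number of tests: initial variant count from Sect. 3.1 (pre-QC, hence conservative).
n_variants = 620_901
# Assumed family-wise error rate; not stated in the text.
alpha = 0.05

# Bonferroni correction: divide the error rate by the number of tests.
threshold = alpha / n_variants

# p-value reported for rs7918269.
top_p = 1.6e-9
is_significant = top_p < threshold

print(f"Bonferroni threshold: {threshold:.3e}")
print(f"rs7918269 below threshold: {is_significant}")
```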

4.2 Machine Learning Hyperparameter Selection

In this study, hyperparameter tuning was critical to optimizing each machine learning model’s performance for predicting Alzheimer’s Disease using SNPs. Table 3 lists the selected hyperparameter settings for the Logistic Regression, Random Forest, and XGBoost algorithms.

4.3 Prediction Performance Analysis

Although the predictive performance of the models was not the primary focus of this study, their ability to make accurate predictions was analyzed. Overall, the models demonstrated similar accuracy but exhibited marked differences in precision and recall. The Random Forest model had higher precision but lower recall compared to the other models, indicating a tendency to frequently predict the negative class (CN). Logistic Regression and XGBoost performed better in maintaining a balance between precision and recall, with XGBoost slightly outperforming Logistic Regression. However, all models have room for improvement, especially in terms of balancing performance metrics.

The detailed results for each model are as follows. The Random Forest model achieved an accuracy of 57.69%, with a perfect precision of 100% but a very low recall of 5.71%, resulting in an F1 score of 10.81%. This indicates a strong bias toward predicting the negative class, limiting its ability to accurately identify the positive class. Logistic Regression also achieved an accuracy of 57.69%, with a precision of 60% and a recall of 17.14%, leading to an F1 score of 26.67%. Compared to Random Forest, Logistic Regression exhibited a slightly better balance between precision and recall. Lastly, the XGBoost model recorded an accuracy of 53.85%, with a precision of 47.62%, a recall of 28.57%, and an F1 score of 35.71%. Although XGBoost had a lower overall accuracy, it showed a more balanced performance across all metrics (Table 2).
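As a consistency check, the reported F1 scores can be recomputed from the reported precision and recall using Eq. (2). The sketch below does this with the rounded percentages given above, so small differences in the last digit are expected; the function and variable names are chosen for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (2)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) taken from the text, as fractions.
reported = {
    "Random Forest": (1.0000, 0.0571, 0.1081),
    "Logistic Regression": (0.6000, 0.1714, 0.2667),
    "XGBoost": (0.4762, 0.2857, 0.3571),
}

# Recompute F1 for each model from its reported precision and recall.
computed = {name: f1(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, reported_f1) in reported.items():
    print(f"{name}: computed F1 = {computed[name]:.4f}, reported F1 = {reported_f1:.4f}")
```

The recomputed values agree with the reported ones to within rounding, which also makes clear how strongly the low recall pulls each F1 score down despite similar accuracy.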

Table 2. AD Prediction Results
Table 3. Selected Hyperparameters for Machine Learning Models

The similar accuracy observed across the Random Forest, Logistic Regression, and XGBoost models can be attributed to several factors, including the complexity of Alzheimer’s Disease, the quality and representation of the genetic data, model-specific characteristics, hyperparameter tuning processes, inherent data limitations, and the choice of evaluation metrics. These factors collectively influence the models’ performance, leading to convergent accuracy levels despite their different architectures and mechanisms.

4.4 Comparative Analysis of Alzheimer’s Disease Associated SNPs

Table 4. 20 most significant SNPs for Logistic Regression and Random Forest.
Table 5. 20 most significant SNPs for XGBoost and GWAS (BLUPF90).

After implementing the Machine Learning techniques, we conducted a genome study and benchmarked the results of each algorithm. This comparison was performed by selecting the top 20 most significant SNPs for each method, as presented in Tables 4 and 5, and comparing the identified genes with those reported in the literature. Missing gene entries in Tables 4 and 5 indicate ‘No Gene Association’. This can be attributed to the occurrence of SNPs in non-coding regions of the genome. While not directly associated with known genes, such regions can still play a crucial role in gene regulation or the production of non-coding RNAs, thereby indirectly influencing gene function [14].

Through this analysis, we identified that the models successfully detected SNPs previously associated with Alzheimer’s disease and other neurological functions. This finding underscores the potential of ML models for identifying genetic markers linked to complex diseases such as Alzheimer’s disease.

Logistic Regression identified specific SNPs with positive and negative coefficients, indicating the presence of both protective and risk alleles for AD. On the other hand, the Random Forest model assigned greater significance to a different set of SNPs, with feature importances on the order of \(10^{-4}\). The XGBoost model identified a collection of SNPs with importances around \(10^{-2}\), emphasizing the relevance of each SNP more equally. Finally, the GWAS analysis using BLUPF90 provided a separate set of SNPs based on highly significant p-values, some of which were not detected by the other models.
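The comparison across methods can be sketched as ranking SNPs by each method's own score and intersecting the resulting top-k lists. The example below is purely illustrative: the SNP names, coefficients, and p-values are hypothetical placeholders, not values from Tables 4 and 5, and k is set to 2 instead of 20 to keep it short. Coefficients are ranked by absolute value, and a negative sign is read as protective, following the interpretation used for Logistic Regression in this section.

```python
def top_k(scores, k, key, reverse):
    """Return the k SNP names ranked best-first by key(score)."""
    ranked = sorted(scores, key=lambda snp: key(scores[snp]), reverse=reverse)
    return ranked[:k]

# Hypothetical Logistic Regression coefficients (sign gives direction of effect).
lr_coef = {"snp_a": -0.9, "snp_b": 0.4, "snp_c": 0.05, "snp_d": -0.6}
# Hypothetical GWAS p-values (smaller means more significant).
gwas_p = {"snp_a": 1e-8, "snp_b": 0.3, "snp_c": 1e-5, "snp_d": 0.02}

# Largest absolute coefficient first for Logistic Regression.
lr_top = top_k(lr_coef, 2, key=abs, reverse=True)
# Smallest p-value first for GWAS.
gwas_top = top_k(gwas_p, 2, key=float, reverse=False)

# SNPs that both methods place in their top-k list.
shared = set(lr_top) & set(gwas_top)
# Negative coefficient read as protective, positive as risk.
direction = {snp: ("protective" if lr_coef[snp] < 0 else "risk") for snp in lr_top}

print("Logistic Regression top SNPs:", lr_top)
print("GWAS top SNPs:", gwas_top)
print("Shared:", sorted(shared))
print("Direction:", direction)
```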

Based on the XGBoost model analysis, several genes and single nucleotide polymorphisms (SNPs) were identified in relation to Alzheimer’s disease (AD). The GALNT13 gene, crucial for neural development, is strongly associated with AD [9, 19]. The rs6507641 SNP in the SLC14A1 gene, involved in urea transport, has also been linked to AD [31]. The rs6037744 SNP, although not directly associated with a specific gene, is linked to increased susceptibility to optic nerve degeneration in glaucoma. Both conditions involve nerve cell death and abnormal proteins, suggesting a broader neurodegenerative process [37]. Studies have also found that glaucoma patients are four times more likely to develop dementia, highlighting a potential connection between glaucoma and AD [28]. Lastly, the rs10933234 SNP in the SPHKAP gene has been suggested to be involved in the AD process [38].

For the Logistic Regression model, it was found that the FGF13 gene is related to ameliorating amyloid-\(\beta \)-induced neuronal damage, acting as a protective factor [24], as shown by the negative coefficient in the logistic regression. The ARHGAP36 gene has been associated with excitatory-inhibitory neuronal analyses within the DG granule cell layer, also serving as a protector [25]. The ABCA4 gene, linked to rs497511, is highly associated with AD and frontotemporal lobar degeneration with TDP-43 protein inclusions [21]. The SYTL2 gene was found among the top genes with significant Alzheimer’s loci in the study conducted by Oxford [11]. The expression of FRMPD4, indicated by rs7880350, is significantly altered in the AD hippocampus [8].

Based on the Random Forest model, the CD207 gene was identified as one of the top 10 most differentially expressed genes in AD, according to a study [15]. The OR51B6 gene, an olfactory receptor, shows the highest number of associated variants and is expressed in temporal cortex neurons [32]. The SOX13 and PLXND1 genes were associated with AD in various studies [6, 27]. The PDE8B gene shows altered mRNA expression in AD brains at different disease stages [30].

BLUP discoveries identified several genes previously associated with AD in the literature, including CNTN6, KLF12, RPS6KA2, MLN, FANCC, ABCA8 and ADCY9. These genes are essential in regulating synaptic plasticity, biomarkers for AD, inflammatory responses, neuritogenesis, and biochemical pathways of AD. These discoveries offer valuable insights into the molecular mechanisms underlying AD and could lead to the development of innovative diagnostic tools and treatments for this debilitating disease [4, 7, 13, 23].

Comparing the significant SNPs identified by each model can offer valuable insights into the genetic mechanisms of AD. The observed variations have potential implications that can guide future research in the field. These findings can reveal new biological pathways or confirm the relevance of already known ones, serving as a basis for further investigations. In turn, this can contribute to the advancement of our understanding of the pathogenesis of AD.

5 Conclusion and Future Work

This study has demonstrated the efficacy of integrating advanced Machine Learning algorithms and genomic analysis to identify significant genetic markers associated with Alzheimer’s Disease (AD). Utilizing methods such as Random Forest, Logistic Regression, XGBoost, and a genome-wide association study via BLUPF90, we have identified a comprehensive set of SNPs, which includes both previously documented and novel genetic markers. These findings not only underscore the potential of Machine Learning to enhance our understanding of AD but also highlight the utility of genomic data in developing personalized medical interventions.

While this study has made significant progress in understanding the genetic basis of Alzheimer’s Disease, it is essential to note that the complexity of the disease involves various genetic, environmental, and lifestyle factors, underscoring the need for more comprehensive research. Future studies should include larger and more diverse datasets that incorporate these factors to improve the accuracy and generalization of the results. Additionally, it is critical to incorporate sophisticated computational models that can handle such multidimensional data. While explainability is a concern, methods such as SHapley Additive exPlanations (SHAP) values can assist with black box models.

Further investigations should also focus on validating the novel SNPs identified in this study, examining their biological relevance, and understanding how they interact with other genetic and environmental factors in the pathogenesis of AD. This validation is essential for translating these findings into clinical practice, where they can inform the development of targeted therapies and diagnostic tools.

In conclusion, our study represents a significant advance in the genetic study of Alzheimer’s Disease, opening new avenues for research and potential therapeutic interventions. The insights gained here provide a foundation for future studies aimed at unraveling the complex genetic networks involved in AD and developing more effective strategies for its prevention, diagnosis, and treatment.