key: cord-0058712-lq3agm72 authors: Braga, José; Ferreira, Flora; Fernandes, Carlos; Gago, Miguel F.; Azevedo, Olga; Sousa, Nuno; Erlhagen, Wolfram; Bicho, Estela title: Gait Characteristics and Their Discriminative Ability in Patients with Fabry Disease with and Without White-Matter Lesions date: 2020-08-20 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58808-3_30 sha: a2c7474aaf1ba7092671eb333538798863cbbf57 doc_id: 58712 cord_uid: lq3agm72 Fabry disease (FD) is a rare disease commonly complicated with white matter lesions (WMLs). WMLs, which have extensively been associated with gait impairment, justify further investigation of its implication in FD. This study aims to identify a set of gait characteristics to discriminate FD patients with/without WMLs and healthy controls. Seventy-six subjects walked through a predefined circuit using gait sensors that continuously acquired different stride features. Data were normalized using multiple regression normalization taking into account the subject physical properties, with the assessment of 32 kinematic gait variables. A filter method (Mann Whitney U test and Pearson correlation) followed by a wrapper method (recursive feature elimination (RFE) for Logistic Regression (LR) and Support Vector Machine (SVM) and information gain for Random Forest (RF)) were used for feature selection. Then, five different classifiers (LR, SVM Linear and RBF kernel, RF, and K-Nearest Neighbors (KNN)) based on different selected set features were evaluated. For FD patients with WMLs versus controls the highest accuracy of 72% was obtained using LR based on 3 gait variables: pushing, foot flat, and maximum toe clearance 2. For FD patients without WMLs versus controls, the best performance was observed using LR and SVM RBF kernel based on loading, foot flat, minimum toe clearance, stride length variability, loading variability, and lift-off angle variability with an accuracy of 83%. 
These findings are the first step to demonstrate the potential of machine learning techniques based on gait variables as a complementary tool to understand the role of WMLs in the gait impairment of FD.
Fabry Disease (FD) is a rare genetic X-linked lysosomal disorder caused by deficient or absent activity of the enzyme α-galactosidase A, resulting in the accumulation of globotriaosylceramide (GL-3) in several organs, including the kidney, heart and brain [10]. Brain manifestations in FD include progressive white matter lesions (WMLs) [4, 14]. Brain WMLs were an early manifestation, affecting 11.1% of males and 26.9% of females under 30 years of age, even without cerebrovascular risk factors outside FD [3]. In [14], WMLs were found in 46% of 1276 patients; they tended to occur earlier in males, and their prevalence increased with patients' age. White matter is the brain region responsible for the transmission of nerve signals and for communication between different parts of the brain. WMLs have been associated with gait impairment [23] and the risk of falls [22, 27]. Gait abnormalities, such as slower gait and postural instability, have been reported in FD [16]. However, not much is known regarding the impact of WMLs on gait performance in patients with FD. The specific location and distribution of WMLs suggest a specific underlying disease [14] and may be reflected in a different gait profile. In fact, gait evaluation can be useful to differentiate pathologies even in the presence of highly overlapping phenotypes, such as the differences found between two types of Parkinsonism (Vascular Parkinsonism versus Idiopathic Parkinson's Disease) in [9]. Furthermore, gait assessment has proven to be a good complementary clinical tool to discriminate adults with and without a pathology such as Parkinson's disease [8, 9, 15], Huntington's disease [17] and, recently, FD [7].
Gait is usually described by its spatio-temporal and foot clearance characteristics such as speed, stride length, stride time, minimum toe clearance (mean gait characteristics), and their respective variability (given by the coefficient of variation or the standard deviation) [9, 15, 21]. Different machine learning (ML) techniques have been used to select the best combination of relevant gait characteristics for gait classification [6, 8, 21]. Recent work [21] shows that a subset of gait characteristics selected using random forest with information gain and the recursive feature elimination (RFE) technique with Support Vector Machine (SVM) and Logistic Regression improves the classification accuracy of Parkinson's Disease. Widely used machine learning models for gait classification are SVM [1, 7, 17, 20, 26], Random Forest (RF) [1, 7, 26] and K-Nearest Neighbor (KNN) [20, 26]. The outcomes of these studies show good accuracy in the classification of pathological gait. In particular, the outcomes of previous work [7] show promising results in the use of gait characteristics to discriminate FD patients and healthy adults. However, the implication of the presence or absence of WMLs in gait performance has not yet been investigated. Hence, the purpose of this study is to identify relevant gait characteristics to discriminate FD patients with and without WMLs from healthy subjects (controls). Our hypothesis is that, from the evaluation of different gait characteristics (e.g., stride length, stride time variability), one can (1) identify unique gait characteristics in the two groups of FD and (2) select relevant gait characteristics that accurately discriminate FD patients with WMLs versus healthy adults, and FD patients without WMLs versus healthy adults.
Data from 39 patients with FD (25 with WMLs and 14 without WMLs) and 37 healthy controls were collected.
From the control group, two groups of controls were constructed, age-matched with the group of FD patients with WMLs and the group of FD patients without WMLs, respectively. The subject demographics of these four groups are summarized in Table 1. For all FD patients, the exclusion criteria were: less than eighteen years of age, the presence of resting tremor, moderate-severe dementia (CDR > 2), depression, extensive intracranial lesions or neurodegenerative disorders, musculoskeletal disease, and rheumatological disorders. The local hospital ethics committee approved the protocol of the study, submitted by ICVS/UM and Center Algoritmi/UM. Written consent was obtained from all subjects or their guardians. Two Physilog® sensors (Gait Up®, Switzerland) positioned on both feet were used to measure different gait variables of each stride. Subjects walked a 60-m continuous course (30-m corridor with one turn) at a self-selected walking speed while the sensors acquired the data. The data consist of the arithmetic mean, calculated over each subject's stride time series, of 11 spatio-temporal variables: speed (velocity of forward walking for one cycle), cycle duration (duration of one cycle), cadence (number of cycles in a minute), stride length (distance between successive footprints on the ground for the same foot), stance (time in which the foot touches the ground), swing (time in which the foot is in the air and does not touch the ground), loading (percent of stance between the heel strike and the foot being entirely on the ground), foot flat (percent of stance where the foot is entirely on the ground), pushing (percent of stance between the foot being entirely on the ground and the toe leaving the ground), double support (percent of the cycle where both feet are touching the ground), and peak swing (maximum angular velocity during swing); and of 6 foot clearance variables: strike angle (angle between the foot and the ground when the heel touches the ground), lift-off angle (angle between the foot and the ground at take-off), maximum heel clearance (maximum height reached by the heel), maximum toe clearance 1 (maximum height reached by the toes just after maximum heel clearance), minimum toe clearance (minimum height of the toes during swing) and maximum toe clearance 2 (maximum height reached by the toes just before heel strike), for a total of 17 gait variables that contain the full step data. The average walking speed of FD patients and controls was 1.335 ± 0.174 m/s and 1.333 ± 0.201 m/s, respectively.
Gait characteristics of a subject are affected by the subject's demographic properties, including height, weight, age, and gender, as well as by walking speed [25, 26] or stride length [2, 7, 9]. To normalize the data, multiple regression (MR) models following the method of Wahid et al. [25, 26] were used. Compared with other methods, such as dimensionless equations and detrending methods, MR normalization gave better results in reducing the interference between subject-specific physical characteristics and gait variables [18, 25, 26], thereby improving gait classification accuracy using machine learning methods [7, 26]. First, to control the multicollinearity among predictor variables within this multiple regression, the Variance Inflation Factor (VIF) was calculated [24]. This test measures the collinearity of the physical characteristics (age, weight, height, and gender), speed, and stride length; a VIF value greater than 5 indicates strong correlation. Table 2 shows the results of the VIF tests. When VIF was tested with all the variables, both speed and stride length had a VIF higher than 5, so they cannot be used simultaneously. When they are used separately, all VIF values are less than 5, so there is no evidence of multicollinearity.
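The VIF check described above can be illustrated with a minimal sketch. It assumes the standard definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The function name `variance_inflation_factors` and the synthetic predictors are hypothetical and are not the study data or the authors' code.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = subjects, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic predictors (assumption): stride length is built to track speed closely.
rng = np.random.default_rng(0)
n_subjects = 37
age = rng.normal(45.0, 12.0, n_subjects)
height = rng.normal(1.68, 0.09, n_subjects)
speed = rng.normal(1.33, 0.20, n_subjects)
stride_length = speed + rng.normal(0.0, 0.03, n_subjects)

names_all = ["age", "height", "speed", "stride_length"]
vif_all = dict(zip(names_all, variance_inflation_factors(
    np.column_stack([age, height, speed, stride_length]))))

# Dropping one of the two collinear predictors removes the inflation.
names_reduced = ["age", "height", "speed"]
vif_without_stride = dict(zip(names_reduced, variance_inflation_factors(
    np.column_stack([age, height, speed]))))

print(vif_all)
print(vif_without_stride)
```

In this synthetic setup the two collinear predictors exceed the threshold of 5 only when both are included, which mirrors the reasoning for never using speed and stride length in the same model.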
Spatio-temporal gait variables and foot clearance variables were normalized as follows:
$$\hat{y}_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \varepsilon_i \qquad (1)$$
where $\hat{y}_i$ represents the prediction of the dependent variable for the ith observation; $x_{ij}$ represents the jth physical property of the ith observation, including age, weight, height, gender, speed or stride length; $\beta_0$ represents the intercept term; $\beta_j$ represents the coefficient for the jth physical property; and $\varepsilon_i$ represents the residual error for the ith observation. The model's coefficients are estimated using the physical properties and the mean values of the gait variables of the 37 healthy controls. Although at least 20 subjects per independent variable are recommended in multiple linear regression [13], based on similar studies [18, 25] with higher sample sizes, MR models were computed for all combinations with 1, 2, and 3 independent variables using a bisquare weight function. For the models with all significant independent variables (p < 0.05), Akaike's information criterion (AIC) [5] and R-squared metrics were used to select the best-fitted model. Statistical assumptions of a linear regression, including linearity, normality, and homoscedasticity, were verified. The models created for each feature are summarized in Table 3. These are similar to the ones found in [7] with 34 controls. Left foot models were also computed, but no major differences were found between the two feet. Therefore, this paper reports only the results for the right foot. In each subject group, the best-fitted MR models are used to normalize each stride gait variable by dividing the original value $y_i$ by the predicted gait variable $\hat{y}_i$ from (1), as follows:
$$y^{n}_{i} = \frac{y_i}{\hat{y}_i} \qquad (2)$$
where $y^{n}_{i}$ represents the normalized value for the ith observation. After normalizing all strides of each of the 16 gait variables, the mean and the standard deviation (SD) of each variable (each gait time series) for all subjects were calculated. In this work, the SD value is used to measure the variability of each gait variable.
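As an illustration of this normalization, the sketch below fits a regression on control subjects, divides each stride value by the model prediction, and then summarizes the normalized strides by their mean and SD. It makes several simplifying assumptions. It uses ordinary least squares rather than the bisquare-weighted robust fit described above, a single predictor (height), the sample SD (`ddof=1`), and synthetic data. All function and variable names are hypothetical.

```python
import numpy as np

def fit_mr_model(physical, gait_mean):
    """Estimate [beta_0, beta_1, ...] by ordinary least squares (not bisquare-weighted)."""
    physical = np.asarray(physical, dtype=float)
    design = np.column_stack([np.ones(len(physical)), physical])
    beta, *_ = np.linalg.lstsq(design, np.asarray(gait_mean, dtype=float), rcond=None)
    return beta

def predict_gait(beta, physical):
    """Predicted gait variable for one or more subjects from their physical properties."""
    physical = np.atleast_2d(np.asarray(physical, dtype=float))
    return beta[0] + physical @ beta[1:]

def normalize_strides(strides, predicted):
    """Divide each stride value by the subject's predicted value."""
    return np.asarray(strides, dtype=float) / predicted

def summarize(normalized):
    """Mean and SD of a normalized stride series (sample SD is an assumption)."""
    return float(np.mean(normalized)), float(np.std(normalized, ddof=1))

# Synthetic controls (assumption): stride length depends linearly on height.
rng = np.random.default_rng(0)
control_height = rng.normal(1.70, 0.08, size=(37, 1))
control_stride = 0.25 + 0.65 * control_height[:, 0] + rng.normal(0.0, 0.03, 37)
beta = fit_mr_model(control_height, control_stride)

# One hypothetical subject, 1.75 m tall, with 40 recorded strides.
subject_strides = rng.normal(1.30, 0.05, 40)
predicted = predict_gait(beta, [1.75])[0]
normalized = normalize_strides(subject_strides, predicted)
mean_feature, sd_feature = summarize(normalized)
print(mean_feature, sd_feature)
```

A normalized mean close to 1 would indicate that the subject's strides match what the control-based model predicts for someone with the same physical properties.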
Due to the high number of gait variables (a total of 32) and the small number of samples, a hybrid method (a filter method followed by a wrapper method) was employed to select the most relevant gait characteristics. First, a filter method based on Mann Whitney U tests and Spearman's correlation between the variables was used to select 12 gait characteristics. Mann Whitney U tests were used to examine the difference between groups (FD with WMLs vs. controls and FD without WMLs vs. controls) for each gait characteristic, and Spearman's correlation was used to evaluate the independence and redundancy between them. Therefore, the 12 selected gait characteristics are the ones that present higher U-values and are not highly correlated with each other (|ρ| < 0.9). Before applying the wrapper method, the 12 selected gait variables were scaled to have zero mean and unit variance. Based on previous work [21], the wrapper was developed using the Recursive Feature Elimination (RFE) technique with three different ML classifiers: Logistic Regression (LR), SVM with Linear kernel, and RF. RFE has some advantages over filter methods [12]. RFE is an iterative method where features are removed one by one without affecting the training error. The selection of the optimal number of features for each model is based on the F1 score evaluated through 5-fold cross-validation. The F1 score is defined as the harmonic mean of precision and recall; it ranges from 0 to 1 and indicates how precise the classifier is [11]. The importance of each gait characteristic was quantified using the model itself (feature importance for LR and SVM with linear kernel, and information gain for RF). The F1 score was used to assess the performance of the different combinations of gait characteristics. Based on the literature [1, 7, 17, 20, 21, 26], four different types of classifiers were employed to distinguish FD patients with and without WMLs from controls: LR, SVM, RF and KNN.
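The wrapper step can be sketched with Scikit-learn, the library named in this paper. The example below is a minimal illustration, not the authors' pipeline. It uses synthetic data from `make_classification` in place of the 12 filtered gait characteristics, shows only the LR variant, and picks `RFECV` as one plausible way to combine RFE with 5-fold cross-validated F1 scoring. The sample size, random seeds, and `max_iter` value are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 12 filtered gait characteristics (NOT the study data).
X, y = make_classification(n_samples=60, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

# Scale to zero mean and unit variance before the wrapper, as described above.
X_scaled = StandardScaler().fit_transform(X)

# Remove one feature per iteration; choose the subset size with the best
# mean F1 score over 5 stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=cv, scoring="f1")
selector.fit(X_scaled, y)

selected = np.flatnonzero(selector.support_)
print("optimal number of features:", selector.n_features_)
print("selected column indices:", selected)
```

Replacing the estimator with `SVC(kernel="linear")` or `RandomForestClassifier()` would give the other two variants, since both expose the coefficients or importances that RFE ranks. Note that Scikit-learn's random forest importances are impurity-based by default, which may differ from the information gain criterion mentioned above.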
All classifier hyperparameters were tuned using randomized search and grid search with 5-fold cross-validation: for LR, the regularization strength constraint and the type of penalty according to that constraint; for SVM, the regularization parameter, gamma, and different types of kernel (with different degrees); for RF, the maximum depth, minimum samples per leaf, minimum samples per split, and the number of estimators; finally, for KNN, the number of neighbors, weights, and metric. All classifiers were implemented in the Python programming language using the Scikit-learn library [19]. To evaluate the performance of the different classification models, the accuracy metric (the ratio of correct predictions) on the training and validation sets was used.
For FD patients with WMLs versus controls, the recursive feature elimination technique was performed on the 12 features remaining from the filter method: cycle duration, cadence, swing, foot flat, pushing, double support, maximum heel clearance, maximum toe clearance 1, maximum toe clearance 2, loading variability, foot flat variability, and maximum heel clearance variability. The results are summarized in Table 4 and Fig. 1. Five gait characteristics were selected by LR with an F1 score of 63.96%, while SVM selected 4 features with an F1 score of 63.96%, and RF selected the largest number of features (9), with an F1 score of 60.21%. Table 4 presents the training and validation accuracies for the optimal models of each algorithm. RF had the highest training accuracy but the lowest validation accuracy. LR and SVM show similar accuracies in training and validation.
Table 5. LR performance increases slightly by
For FD patients without WMLs versus controls, the recursive feature elimination algorithm was performed on the 12 features remaining from the filter method: cycle duration, cadence, loading, foot flat, peak swing, strike angle, minimum toe clearance, stride length variability, loading variability, lift-off angle variability, maximum heel clearance variability, and minimum toe clearance variability. Results are stated in Table 6 and Fig. 2.
LR selected 8 features as the optimal number of gait characteristics with an F1 score of 86.29%, SVM selected 6 features with an F1 score of 74.29%, and RF selected 10 features with an F1 score of 68.88%. Table 6 presents the training and validation accuracies for the optimal models of each algorithm. LR presents the highest training and validation accuracies. Taking into account the contribution of each gait characteristic in the classification model (Fig. 2, right side), five features were common (Top 5): loading, foot flat, minimum toe clearance, loading variability, and lift-off angle variability. Six gait characteristics were common to LR and SVM (Top 6): loading, foot flat, minimum toe clearance, stride length variability, loading variability, and lift-off angle variability. The Top 5 and Top 6, as well as the Top 3 from LR and SVM (stride length variability, loading variability, and lift-off angle variability), were evaluated with five classification models (LR, SVM Linear kernel, SVM RBF kernel, RF and KNN) to identify the optimal combination of gait characteristics and the classification model with the best performance. Results are displayed in Table 7. With the Top 5, SVM RBF kernel, RF and KNN achieved the highest validation accuracy of 76.67%. With the Top 6, LR and SVM RBF kernel showed the highest validation accuracy of 83.33%. Using the Top 3 from LR and SVM, RF showed 78.33% validation accuracy, followed by the SVM Linear kernel, RF, and KNN with 75% validation accuracy, of which KNN had the lowest standard deviation. The validation accuracy of RF and KNN increased when the feature set was reduced from 6 to 3, while the validation accuracy of LR, SVM Linear kernel, and SVM RBF kernel decreased. Overall, LR showed a higher mean validation accuracy but also a higher standard deviation.
The present work aimed to identify discriminatory gait characteristics to distinguish FD with WMLs from controls and FD without WMLs from controls.
Based on previous literature [1, 7, 17, 20, 21, 26], different classification models were evaluated with different sets of gait characteristics selected using a filter method followed by a recursive feature elimination wrapper method with RF, SVM Linear kernel, and LR. Sixteen gait time series were obtained from two wearable sensors. All strides were normalized before developing any ML model, according to previous studies [7, 18, 25]. Then, for each gait time series, the mean and the standard deviation (as a variability measure) were calculated, yielding 32 gait characteristics. From the feature selection analysis, foot flat, pushing, and maximum toe clearance 2 were identified as important characteristics to classify FD with WMLs, while stride length variability, loading variability, and lift-off angle variability, followed by loading, foot flat, and minimum toe clearance, were identified as important gait characteristics to distinguish FD without WMLs from age-matched healthy adults. Previous work [7] reveals that FD patients (with and without WMLs together) present lower percentages in foot flat and higher percentages in pushing compared with healthy adults. For FD patients with WMLs versus controls, a validation accuracy of 62-72% and a similar training accuracy of 69-85% were achieved across the five selected classification models based on the Top 3 gait characteristics, with the LR classifier showing the best performance, with validation and training accuracies of 72% and 71%, respectively. With one more feature (foot flat variability), SVM with either Linear or RBF kernel also revealed good performance, with an accuracy of 70% for validation and 74% for training. These results corroborate the hypothesis that gait characteristics can be used to distinguish FD patients with WMLs from controls. This is in line with the premise that gait is a final outcome of WMLs [22, 23, 27].
Surprisingly, in the classification of FD patients without WMLs versus controls, higher training and validation accuracies of 79-97% and 70-83%, respectively, were obtained based on the Top 6, Top 5, or Top 3 selected features. The LR and SVM RBF kernel classifiers displayed the best performance based on the Top 6, with an accuracy of 83% for validation and of 97% and 85% for training, respectively. By reducing the feature set to the Top 3, an overall validation accuracy of 70-78% was achieved, where for the RF and KNN classifiers the accuracy slightly increased to 75%. Similarly, in [21] an increase in model accuracy was observed with feature reduction. Further, feature selection (reduction) plays an important role in dealing with model overfitting, reducing training time, and enhancing overall ML performance and implementation. These results suggest that the selected gait characteristics could be used as clinical features to support the diagnosis of FD patients, even those without WMLs, from younger ages, since the mean age of these patients is 37.786 ± 10.48 years. Due to the number of subjects involved in this study, the whole dataset was used for the training and validation of the models, and no independent (external) dataset was used to check model performance. To test the robustness of classification models based on the selected gait characteristics, further research with independent datasets is needed.
To the best of our knowledge, this is the first study that explores gait characteristics and their discriminative power in FD patients with WMLs and without WMLs versus controls. For the discrimination of FD patients with WMLs, the best model was built using LR with the variables foot flat, pushing, and maximum toe clearance 2, with an accuracy of 72%.
In contrast, for the discrimination of FD patients without WMLs, the best-suited model was achieved using LR and SVM with RBF kernel based on the variables loading, foot flat, minimum toe clearance, stride length variability, loading variability, and lift-off angle variability, with an accuracy of 83%. The implications of WMLs for gait compromise in FD and the predictive value of each kinematic gait variable remain elusive, warranting further investigation with a more enriched cohort. Still, our findings are the first step to demonstrate the potential of machine learning techniques based on gait variables as a complementary tool to understand the role of WMLs in the gait impairment of FD. For future research, a larger sample size will be used to confirm and extend these findings.
[1] A performance comparison based on machine learning approaches to distinguish Parkinson's disease from Alzheimer disease using spatiotemporal gait signals
[2] Step length determines minimum toe clearance in older adults and people with Parkinson's disease
[3] Natural history of the late-onset phenotype of Fabry disease due to the p.F113L mutation
[4] Central nervous system involvement in Anderson-Fabry disease: a clinical and MRI retrospective study
[5] Multimodel inference: understanding AIC and BIC in model selection
[6] IMU-based classification of Parkinson's disease from gait: a sensitivity analysis on sensor location and feature selection
[7] Gait classification of patients with Fabry's disease based on normalized gait features obtained using multiple regression models
[8] Artificial neural networks classification of patients with Parkinsonism based on gait
[9] Gait stride-to-stride variability and foot clearance pattern analysis in idiopathic Parkinson's disease and vascular parkinsonism
[10] A 15-year perspective of the Fabry outcome survey
[11] Evaluation measures of the classification performance of imbalanced data sets
[12] An introduction to variable and feature selection
[13] Multivariable Analysis: A Practical Guide for Clinicians and Public Health Researchers
[14] Development and clinical consequences of white matter lesions in Fabry disease: a systematic review
[15] Machine learning for large-scale wearable sensor data in Parkinson's disease: concepts, promises, pitfalls, and futures
[16] Clinical prodromes of neurodegeneration in Anderson-Fabry disease
[17] A machine learning framework for gait classification using inertial sensors: application to elderly, poststroke and Huntington's disease patients
[18] Regression analysis of gait parameters and mobility measures in a healthy cohort for subject-specific normative values
[19] Scikit-learn: machine learning in Python
[20] Automated classification of neurological disorders of gait using spatio-temporal gait parameters
[21] Selecting clinically relevant gait characteristics for classification of early Parkinson's disease: a comprehensive machine learning approach
[22] White matter integrity is associated with gait impairment and falls in mild cognitive impairment. Results from the gait and brain study
[23] Brain white matter lesions detected by magnetic resonance imaging are associated with balance and gait speed
[24] Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results
[25] A multiple regression approach to normalization of spatiotemporal gait features
[26] Classification of Parkinson's disease gait using spatial-temporal gait features
[27] Brain white matter hyperintensities, executive dysfunction, instability, and falls in older people: a prospective cohort study