title: Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale
authors: Neumann, Michael; Roesler, Oliver; Liscombe, Jackson; Kothare, Hardik; Suendermann-Oeft, David; Pautler, David; Navar, Indu; Anvar, Aria; Kumm, Jochen; Norel, Raquel; Fraenkel, Ernest; Sherman, Alexander V.; Berry, James D.; Pattee, Gary L.; Wang, Jun; Green, Jordan R.; Ramanarayanan, Vikram
date: 2021-04-15

We propose a cloud-based multimodal dialog platform for the remote assessment and monitoring of Amyotrophic Lateral Sclerosis (ALS) at scale. This paper presents our vision, technology setup, and an initial investigation of the efficacy of the various acoustic and visual speech metrics automatically extracted by the platform. A total of 82 healthy controls and 54 people with ALS (pALS) interacted with the platform and completed a battery of speaking tasks designed to probe the acoustic, articulatory, phonatory, and respiratory aspects of their speech. We find that multiple acoustic (rate, duration, voicing) and visual (higher-order statistics of the jaw and lip) speech metrics show statistically significant differences between controls, bulbar symptomatic, and bulbar pre-symptomatic patients. We report the sensitivity and specificity of these metrics using five-fold cross-validation. We further conducted a LASSO-LARS regression analysis to uncover the relative contributions of various acoustic and visual features in predicting the severity of patients' ALS, as measured by their self-reported ALSFRS-R scores. Our results provide encouraging evidence of the utility of automatically extracted audiovisual analytics for scalable remote patient assessment and monitoring in ALS.

Obtaining timely assessment and monitoring of neurological and mental health conditions can be challenging for patients for various reasons, including, but not limited to: (i) no access to neurologists or psychiatrists; (ii) lack of awareness of a given condition and of the need to see a specialist; (iii) lack of an effective standardized diagnostic or endpoint; (iv) the substantial cost and transportation involved in conventional solutions; and, in some cases, (v) a lack of medical specialists in these fields [3]. The NEurological and Mental health Screening Instrument (NEMSI) [4] has been developed to bridge this gap. NEMSI is a cloud-based multimodal dialog system that can be used to elicit the evidence required for the detection or progress monitoring of neurological or mental health conditions through automated screening interviews conducted over the phone or via web browser. While intelligent virtual agents have been proposed in previous work for such diagnosis and monitoring purposes [5, 6], NEMSI offers three significant innovations. First, NEMSI uses readily available devices (a web browser or mobile app), in contrast to dedicated, locally administered hardware such as cameras, servers, and audio devices. Second, NEMSI's backend is deployed in an automatically scalable cloud environment, allowing it to serve an arbitrary number of end users at a small cost per interaction.
Third, the NEMSI system is natively equipped with real-time analytics modules that extract a variety of speech and video features of direct relevance to clinicians in the neurological space, such as speech and pause duration for the assessment of ALS, or geometric features derived from facial landmarks for the automated detection of orofacial impairment in stroke.

This paper investigates the utility of audio and video metrics collected via NEMSI for early diagnosis and monitoring of ALS. We specifically investigate two research questions. First, which metrics demonstrate statistically significant differences between (a) healthy controls and bulbar pre-symptomatic people with ALS (pALS), thereby assisting in early diagnosis, and (b) bulbar pre-symptomatic and bulbar symptomatic patients, thereby assisting in progress monitoring? Second, for the pALS cohorts, which metrics are most predictive of their self-reported ALS Functional Rating Scale-Revised (ALSFRS-R [7]) score? Before addressing these questions, we briefly summarize the current state of ALS research and describe our data collection and metric extraction process.

ALS is a neurodegenerative disease that affects roughly 4 to 6 people per 100,000 of the general population [8, 9]. Early detection and continuous monitoring of ALS symptoms are crucial to providing optimal patient care [10]. For instance, decline in speech intelligibility has been reported to have a negative impact on patients' quality of life [11, 12], and continuous monitoring of speech intelligibility could therefore prove valuable for patient care. Traditionally, subjective measures, such as patient reports or ratings by clinicians, are used to detect and monitor speech impairment. However, recent studies show that objective measures allow for earlier detection of ALS symptoms [13, 14, 15, 16, 17, 18, 19], support the stratification and classification of patients [20], and can provide markers of disease onset, progression, and severity [21, 22, 23, 24, 25]. These objective measures can be extracted automatically, thereby allowing for more frequent monitoring and potentially improving treatment. The success of the Beiwe Research Platform [26] in tracking ALS disease progression demonstrates the viability of such remote monitoring solutions.

NEMSI end users are provided with a link to the secure screening portal and login credentials by their caregiver or study liaison (a physician, clinic, referring website, or patient portal). After completing microphone and camera checks, subjects participate in a conversation with "Nina", a virtual dialog agent. Nina's virtual image appears in a web window, and subjects are able to see their own video. During the conversation, Nina engages subjects in a mixture of structured speaking tasks and open-ended questions to elicit speech and facial behaviors relevant to the condition being screened for. Analytics modules automatically extract speech features (e.g., speaking rate, duration measures, fundamental frequency (F0)) and video features (e.g., range and speed of movement of various facial landmarks) in real time and store them in a database, along with metadata about the interaction, such as call duration and completion status. All of this information can be accessed by the study liaison through an easy-to-use dashboard, which provides a summary of the interaction (including access to the video recording and the computed analytic measures) as well as a detailed breakdown of the metrics by individual interaction turn.
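To make the kind of data the platform stores concrete, the sketch below shows one way a per-turn analytics record could be represented in Python before being written to a database. It is purely illustrative: the field names, types, and example values are hypothetical and do not reflect NEMSI's actual schema or API.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class TurnMetrics:
    """One per-turn analytics record, as a dialog platform of this kind might store it.
    Field names are illustrative only, not NEMSI's actual schema."""
    session_id: str
    turn_id: int
    task_type: str                                # e.g. "held_vowel", "SIT", "bamboo", "DDK", "picture"
    speaking_duration_s: Optional[float] = None   # acoustic timing measure
    articulation_rate_wpm: Optional[float] = None
    mean_f0_hz: Optional[float] = None
    jaw_velocity_max: Optional[float] = None      # normalized facial-landmark metric
    call_completed: bool = True                   # interaction metadata

# A single hypothetical record, ready to be inserted into a database or exported as JSON.
record = TurnMetrics(session_id="demo-001", turn_id=3, task_type="SIT",
                     speaking_duration_s=4.2, articulation_rate_wpm=168.0,
                     mean_f0_hz=112.5, jaw_velocity_max=0.31)
print(asdict(record))
```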
Data from 136 participants (see Table 1) were collected between September 2020 and March 2021 in cooperation with EverythingALS and the Peter Cohen Foundation. For this cross-sectional study, we included one dialog session per subject. The conversational protocol elicits the following types of speech samples from participants, inspired by prior work [27, 28, 29, 30]: (a) sustained vowel phonation, (b) read speech, (c) a measure of diadochokinetic rate (rapid repetition of the syllables /pAtAkA/), and (d) free speech (a picture description task). For (b) read speech, the dialog contains six speech intelligibility test (SIT) sentences of increasing length (5 to 15 words) and one passage reading task (the Bamboo Passage; 99 words).

After completing the dialog, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [7]. The questionnaire consists of 12 questions about physical functions in activities of daily living. Each question offers five answer options, ranging from normal function (score 4) to severe disability (score 0). The total ALSFRS-R score is the sum of all sub-scores and therefore ranges from 0 to 48. The ALSFRS-R covers four domains affected by the disease: bulbar function, fine motor skills, gross motor skills, and respiratory function. We stratified subjects into three groups for statistical analysis: (a) healthy controls (CON); (b) pALS with a bulbar sub-score (the sum of the first three ALSFRS-R questions) below 12, labeled bulbar symptomatic (BUL); and (c) pALS with a bulbar sub-score of 12, labeled bulbar pre-symptomatic (PRE).

Similar to [14], we aim to identify acoustic and visual speech measures that show significant differences between these groups. We use measures commonly established for clinical speech analysis with regard to ALS [14], including timing measures, frequency-domain measures, and measures specific to the diadochokinesia (DDK) task, such as syllable rate and cycle-to-cycle temporal variation (cTV) [31]. Table 2 shows the metrics and the speech task types from which they are extracted. Additionally, speech intensity (mean energy in dB SPL, excluding pauses) was extracted for all utterances. The picture description task (free speech) was not used for this analysis. All acoustic measures were automatically extracted with the speech analysis software Praat [32]. Speaking and articulation rates are computed from the expected number of words because forced alignment is error-prone for dysarthric speech [33]. As a result, these measures can be noisy, for example if a patient did not finish the reading passage. Hence, we automatically remove outliers based on thresholds for the Bamboo task: recordings with a speaking rate > 250 words/min, an articulation rate > 350 words/min, or a PPT > 80% are excluded.

Table 2: Acoustic metrics collected for each speech task type.
  Held vowel:      mean F0 (Hz), jitter (%), shimmer (%), HNR (dB), CPP (dB)
  SIT and Bamboo:  speaking and articulation duration (sec), speaking and articulation rate (words/min), PPT
  DDK:             speaking and articulation duration (sec), syllable rate (syllables/sec), number of produced syllables, cTV (sec)

Table 3: Extracted facial metrics. Maximum (suffix _max) and average (_avg) values were extracted for movement and surface metrics; maximum, average, and minimum (_min) values were extracted for velocity, acceleration, and jerk metrics.
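To make the rate measures and the outlier filter described above concrete, the following Python sketch computes speaking and articulation rates from the expected word count (avoiding forced alignment) and applies the exclusion thresholds reported for the Bamboo passage. This is a minimal sketch rather than the authors' implementation; in particular, the definitions of articulation duration (speaking duration minus pause duration) and PPT (percent pause time) are assumptions.

```python
def rate_metrics(expected_words: int, speaking_dur_s: float, pause_dur_s: float) -> dict:
    """Rate measures computed from the *expected* word count, since forced alignment
    is unreliable for dysarthric speech. The definitions below are assumptions."""
    articulation_dur_s = speaking_dur_s - pause_dur_s   # assumed: speaking duration includes pauses
    return {
        "speaking_rate_wpm": 60.0 * expected_words / speaking_dur_s,
        "articulation_rate_wpm": 60.0 * expected_words / articulation_dur_s,
        "ppt_percent": 100.0 * pause_dur_s / speaking_dur_s,   # assumed definition of PPT (percent pause time)
    }

def keep_bamboo_sample(m: dict) -> bool:
    """Outlier filter for the Bamboo passage, using the thresholds stated in the text."""
    return (m["speaking_rate_wpm"] <= 250
            and m["articulation_rate_wpm"] <= 350
            and m["ppt_percent"] <= 80)

# Example: Bamboo passage (99 words) with hypothetical durations.
m = rate_metrics(expected_words=99, speaking_dur_s=38.0, pause_dur_s=6.5)
print(m, keep_bamboo_sample(m))
```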
Facial metrics were calculated for each utterance in three steps: (i) face detection using the Dlib face detector, which uses five histograms of oriented gradients to determine the (x, y)-coordinates of one or more faces in every input frame [34]; (ii) facial landmark extraction using the Dlib facial landmark detector, which uses an ensemble of regression trees proposed in [35] to extract 68 facial landmarks according to the Multi-PIE scheme [36]; and (iii) facial metric calculation, which uses 20 of these facial landmarks to compute the metrics shown in Table 3 (cf. [37] for details). Finally, all facial metrics measured in pixels were normalized within every subject by dividing the values by that subject's interlachrymal distance in pixels (measured as the distance between the right corner of the left eye and the left corner of the right eye). To normalize for sex-specific differences in metrics (such as F0), we z-scored all metrics by sex group. Additionally, all metrics reported below (except speaking and articulation duration) were averaged across speech task type. An important caveat to all the analyses presented here is the imbalance in sample size between the cohorts; future extensions of this work will need larger BUL and PRE cohorts to make robust and generalizable statistical claims.

Figure 1: Effect sizes of acoustic and visual metrics that show statistically significant differences at p < 0.05. Effect sizes are shown with a 95% confidence interval and are ranked by the BUL-CON group pair.

We conducted a non-parametric Kruskal-Wallis test for every acoustic and facial metric to identify the metrics that showed a statistically significant difference between the cohorts. For all metrics with p < 0.05, a post-hoc analysis (again Kruskal-Wallis) was conducted between every pair of cohorts to determine which groups can be distinguished. Figure 1 shows the effect size, measured as Glass' ∆ [38], for all metrics that show a statistically significant difference (p < 0.05) between subject groups. In addition to the statistical tests, we conducted 5-fold cross-validation with logistic regression to investigate the binary classification performance, and in turn the sensitivity and specificity, of the aforementioned metrics in distinguishing the CON vs. PRE groups (with applications to early diagnosis) and the PRE vs. BUL groups (progress monitoring). We investigated both the full feature set and feature selection with recursive feature elimination, and found that the latter performed better, since using the full feature set tends to overfit given our small sample size. Receiver operating characteristic (ROC) curves for these classification experiments, encapsulating sensitivities and specificities, are presented in Figure 2. For the CON vs. PRE case, the mean unweighted average recall (UAR) across the 5 cross-validation folds was 0.63 ± 0.08, significantly above chance, suggesting promising applications for early diagnosis. For the PRE vs. BUL case (and therefore progress monitoring), the result was even better: 0.77 ± 0.05. The BUL vs. CON case, unsurprisingly, produced the best results.

Looking at acoustic features, we found that timing measures (speaking and articulation duration and rate, PPT, syllable rate, and cTV) exhibit strong differences between groups and that the effect sizes of these metrics are highest between the BUL and CON groups. Mean F0 also showed a significant difference, albeit with small effect sizes.
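Before turning to the visual metrics, the following scikit-learn sketch illustrates the kind of cross-validation setup described above: 5-fold cross-validation of a logistic regression with recursive feature elimination, scored by unweighted average recall. The feature matrix, labels, and number of features retained by RFE are placeholders, not the study's data or settings; balanced accuracy is used as the scorer because it equals UAR.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))      # placeholder feature matrix (acoustic + facial metrics)
y = rng.integers(0, 2, size=60)    # placeholder binary labels, e.g. PRE (0) vs. BUL (1)

clf = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),  # feature count is illustrative
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# balanced_accuracy is equivalent to unweighted average recall (UAR)
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
print(f"UAR: {scores.mean():.2f} +/- {scores.std():.2f}")
```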
For visual metrics, the results indicate that velocity, acceleration, and jerk measures are generally the best indicators of ALS. Additionally, while the jaw center (JC) seems to be more important than the lower lip (LL) for detecting ALS, further investigation is necessary to ensure that the difference between the JC and LL metrics is not merely due to a difference in facial landmark detection accuracy.

Which metrics contribute the most toward predicting the ALSFRS-R score? In a regression analysis, we investigated the predictive power of the extracted metrics for both the BUL and PRE pALS cohorts. To this end, we employed a LASSO (least absolute shrinkage and selection operator) regression with the objective of predicting the total ALSFRS-R score, implemented using the least-angle regression (LARS) algorithm [39]. The algorithm is similar to forward stepwise regression, but instead of fully including features at each step, the estimated coefficients are increased in a direction equiangular to each feature's correlation with the residual. Figure 3a shows the final 17 features, in the order they were selected by the LASSO-LARS regression on data from 19 PRE samples, along with the cumulative model R² at each step. (The numbers of samples in the classification and regression analyses differ from Table 1 because not all metrics were available for all subjects, either because a task was not performed correctly or because of system errors.) We observe that both facial metrics (mouth opening and symmetry ratio, and higher-order statistics of the jaw and lips) and acoustic metrics (particularly voice quality metrics such as jitter, shimmer, and mean F0) added useful predictive power to the model, suggesting that these might be useful for modeling severity in bulbar pre-symptomatic pALS. For the BUL cohort, on the other hand, Figure 3b shows a slightly different set of 20 features obtained using LASSO-LARS (based on 26 participants). We observe that facial metrics (eye blinks and brow positions, in addition to higher-order statistics of the jaw and lips) add more predictive power to the model than acoustic metrics (such as cTV, CPP, and mean F0), suggesting that these might find utility in modeling severity in bulbar symptomatic pALS.

Our findings demonstrate the utility of multimodal dialog technology for assisting in the early diagnosis and monitoring of pALS. Multiple automatically extracted acoustic (rate, duration, voicing) and visual (higher-order statistics of the jaw and lip) speech metrics show significant promise both for early diagnosis of bulbar pre-symptomatic ALS versus healthy controls and for progress monitoring in pALS. Moreover, using LASSO-LARS to model the relative contributions of these features in predicting the ALSFRS-R score highlights the utility of incorporating different speech and facial metrics for modeling severity in bulbar pre-symptomatic and bulbar symptomatic pALS. While higher-order statistics of the jaw and lower lip and timing-, pausing-, and rate-based speech features were useful across the board in both cases, voice quality and mouth opening and area metrics appear more useful for the bulbar pre-symptomatic group, whereas spectral and eye-related metrics are more relevant for the bulbar symptomatic group. Future work will expand these analyses to more speakers to ensure the statistical robustness and generalizability of these trends.
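For readers who wish to reproduce this type of analysis, the sketch below fits a LASSO-LARS model with scikit-learn and recovers the order in which features enter the model along the regularization path, together with the model R². The data, dimensions, and regularization strength are placeholders and do not correspond to the study's feature set, settings, or results.

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n_samples, n_features = 26, 40                # placeholder sizes (the BUL analysis, e.g., used 26 participants)
X = rng.normal(size=(n_samples, n_features))  # placeholder z-scored acoustic + facial metrics
y = rng.uniform(0, 48, size=n_samples)        # placeholder total ALSFRS-R scores (range 0-48)

model = LassoLars(alpha=0.01).fit(X, y)       # alpha is illustrative, not the paper's setting

# Determine the order in which features first enter the model along the LARS path.
entry_step = {}
for step, coefs in enumerate(model.coef_path_.T):
    for j in np.flatnonzero(coefs):
        entry_step.setdefault(j, step)
selected = sorted(entry_step, key=entry_step.get)

print("features in order of selection:", selected)
print("R^2 of the final model:", round(model.score(X, y), 3))
```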
References
[1] Mobile health: Revolutionizing healthcare through transdisciplinary research
[2] Telemedicine for management of patients with amyotrophic lateral sclerosis through COVID-19 tail
[3] Can mobile health technologies transform health care
[4] NEMSI: A multimodal dialog system for screening of neurological or mental conditions
[5] SimSensei Kiosk: A virtual human interviewer for healthcare decision support
[6] Now all together: Overview of virtual health assistants emulating face-to-face health interview experience
[7] The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function
[8] Genetic epidemiology of amyotrophic lateral sclerosis
[9] The epidemiology of ALS: A conspiracy of genes, environment and time
[10] Diagnostic timelines and delays in diagnosing amyotrophic lateral sclerosis (ALS)
[11] A cross sectional study on determinants of quality of life in ALS
[12] Speech intelligibility and marital communication in amyotrophic lateral sclerosis: An exploratory study
[13] Speech in ALS: Longitudinal changes in lips and jaw movements and vowel acoustics
[14] The diagnostic utility of patient-report and speech-language pathologists' ratings for detecting the early onset of bulbar symptoms due to ALS
[15] Articulation acoustic kinematics in ALS speech
[16] Detection of amyotrophic lateral sclerosis (ALS) via acoustic analysis
[17] Kinematic features of jaw and lips distinguish symptomatic from presymptomatic stages of bulbar decline in amyotrophic lateral sclerosis
[18] Lingual and jaw kinematic abnormalities precede speech and swallowing impairments in ALS
[19] The effects of symptom onset location on automatic amyotrophic lateral sclerosis detection using the correlation structure of articulatory movements
[20] A speech measure for early stratification of fast and slow progressors of bulbar amyotrophic lateral sclerosis: Lip movement jitter
[21] Predicting speech intelligibility decline in amyotrophic lateral sclerosis based on the deterioration of individual speech subsystems
[22] Automatic prediction of intelligible speaking rate for individuals with ALS from speech acoustic and articulatory samples
[23] Speech deterioration in amyotrophic lateral sclerosis (ALS) after manifestation of bulbar symptoms
[24] Detecting bulbar motor involvement in ALS: Comparing speech and chewing tasks
[25] Reliability and validity of speech & pause measures during passage reading in ALS
[26] Design and results of a smartphone-based digital phenotyping study to quantify ALS progression
[27] Acoustic analysis of voice in individuals with amyotrophic lateral sclerosis and perceptually normal vocal quality
[28] Dysarthria in amyotrophic lateral sclerosis: A review
[29] Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis
[30] Analyzing progression of motor and speech impairment in ALS
[31] Automated acoustic analysis of oral diadochokinesis to assess bulbar motor involvement in amyotrophic lateral sclerosis
[32] Speak and unSpeak with Praat
[33] Improving automatic forced alignment for dysarthric speech transcription
[34] Histograms of oriented gradients for human detection
[35] One millisecond face alignment with an ensemble of regression trees
[36] Multi-PIE
[37] On the utility of audiovisual dialog technologies and signal analytics for real-time remote monitoring of depression biomarkers
[38] Meta-analysis in social research
[39] Least angle regression