key: cord-0133905-z1trwxxd authors: Singh, Rita; Shah, Ankit; Dhamyal, Hira title: An Overview of Techniques for Biomarker Discovery in Voice Signal date: 2021-10-10 journal: nan DOI: nan sha: 07c19aa98557f1d52dddb7a538bab05cce0303da doc_id: 133905 cord_uid: z1trwxxd

This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers and allow them to be measured and used for predictive and diagnostic purposes. These approaches include proxy techniques, model-based analytical techniques and data-driven AI techniques.

Based on their effects on human voice, medical conditions that are known to affect humans can be divided into four categories. The first comprises diseases that have no effect on voice, such as certain dermatological and hair-related conditions. In contrast, the category of conditions expected to have the most obvious effects on voice includes diseases that directly affect the structures of the vocal tract: vocal folds, larynx, glottis, respiratory tract, articulators etc. Examples of such diseases are otolaryngological diseases of various etiology. A third category is that of diseases that indirectly affect the processes that drive voice production, including cognitive, neuromuscular, biomechanical and auditory feedback processes. These conditions cause varied effects on voice, ranging from fairly intense and obvious effects to very subtle or almost imperceptible ones. This category includes diseases such as those listed in Table 1, and syndromes caused by secondary effects of drugs, intoxicants and other harmful substances.
The fourth category, and the focus of the mechanisms presented in this paper, comprises diseases for which the existence of voice changes is hypothesized, but for which these changes may not be evident through standard analytical examinations of the voice signal. These include disease subcategories that affect intellectual abilities, visual function, temperament, personality, behavior etc. For these diseases, biomarkers in voice may be hypothesized to be present, but are elusive and must be searched for (in effect designed or created) for use in data-driven applications that attempt to detect the presence of these diseases from voice.

The changes in voice that are alluded to above are in fact biomarker patterns, or biomarkers. The term biomarker in this context refers to specific patterns of change(s) in the voice signal that carry information about the health conditions that cause them. Such changes may be thought of as perturbations or deviations from a hypothetical "normal" voice signal, within its frequency, duration, intensity, amplitude and other entities that characterize it. The changes may be wide-ranging and coarse in nature, or temporally transient, occurring within micro-durations of the signal. In other words, biomarker patterns may range from highly perceptible to completely imperceptible, to the extent that they may be undetectable even within standard analytical representations of the speech signal.

The rest of this paper introduces three categories of techniques that can potentially uncover such elusive biomarkers. Such techniques can be thought of as biomarker discovery or feature engineering techniques focused on the design or re-design of biomarker features in voice. The approaches described below include proxy techniques, model-based analytical techniques and data-driven AI techniques.

* Second and third authors contributed equally.

"© 20XX IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."

Table 1
Condition | How it affects voice
Attention Deficit Hyperactivity Disorder (ADHD) | Prosodic variations in loudness and fundamental frequency [1]
Amyotrophic Lateral Sclerosis (ALS) | Voice tremor, flutter [2], incomplete vocal fold closure, dysarthria [3]; Dystonia, dysarthria [4]; Low-frequency (< 4 Hz) tremor [5]
Alzheimer's Disease (Dementia) | Abnormal fundamental frequency, pause and voice-break patterns, reduction in vocal range [6, 7]; Dysphonia [8]
Arthritis: Fibromyalgia | Changes in jitter, shimmer, harmonic-to-noise ratio, and phonation time [9]
Cerebral Palsy | Dysphonia [10]; Breathiness, Asthenia, Roughness, Strain [11]
Cholera | Husky voice [12]; high-pitched, asthenia [13]; "Cholera voice" [

Proxy techniques may be used for changes in voice that are observable by humans but not easily measurable. For example, many of the correlations established between various diseases and voice changes in the medical literature refer to changes in voice quality. The changes, although subjective, comprise biomarkers that could potentially be used in machine learning systems for prediction of the corresponding diseases from voice. The problem, however, is that for the most part the entities that constitute the set of voice qualities are subjectively specified. For many, methodologies for objective measurement do not exist. For example, the voice sub-qualities of nasality (or nasalance), roughness, breathiness, asthenia etc. have no objective measures associated with them.
They are in fact rated by human experts on standardized clinical rating scales, such as the Voice Rating Scale (VRS) [32], the Voice Disability Coping Questionnaire (VDCQ) [33], the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [34] etc. Examples of such subjective correlations of various diseases to voice changes are listed in Table 1.

Features that are subjectively rated and have no specific method of objective measurement can be measured through proxy. The term "proxy" here refers to the act of using (or creating) replacement features that can be measured instead of the original ones. For this to be viable, the proxy features must be highly correlated with the subjective features that we desire to measure. There are two potential mechanisms to create such proxy features: the use of physical models of voice production to produce signals that can be more easily measured, and the use of AI mechanisms for transfer learning that can generate measurable features that exhibit the same patterns as the subjective features they proxy for.

2.1.1. Proxy features from models of voice production: measurement through emulation

As an example, we consider physical models that can generate just one specific aspect of a continuous speech signal: the set of phonated sounds embedded in it. The idea is to use such models to generate or approximate the actual motion of the vocal folds during the process of phonation, producing a glottal flow signal that has characteristics (or specific voice sub-qualities) similar to those of a given recorded speech signal. Once this is achieved, the parameters of the model used can proxy for the speech quality characteristics of the original signal, which may not have been directly measurable. We can also call this process "measurement through emulation." Examples of physical models of phonation that can be used include the 1-mass, mass-spring model [35], the 2-mass, mass-spring model [36] etc.
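The idea of measurement through emulation can be illustrated with a minimal sketch. It does not reproduce the models in [35] or [36]: a single van der Pol-type oscillator stands in for a phonation model, a simulated displacement stands in for a signal estimated from a recording, and a plain grid search stands in for a proper parameter estimation procedure. The names `simulate_oscillator`, `fit_damping` and the parameter `mu` are hypothetical and chosen only for illustration.

```python
def simulate_oscillator(mu, x0=0.5, v0=0.0, dt=0.01, steps=4000):
    """Integrate x'' - mu*(1 - x^2)*x' + x = 0 with classical RK4.

    Returns the phase-space trajectory as a list of (displacement, velocity).
    This equation is an illustrative stand-in, not the model of [35] or [36].
    """
    def deriv(x, v):
        return v, mu * (1.0 - x * x) * v - x

    x, v = x0, v0
    trajectory = []
    for _ in range(steps):
        k1x, k1v = deriv(x, v)
        k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = deriv(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
        trajectory.append((x, v))
    return trajectory


def squared_error(a, b):
    """Sum of squared differences between two equally sampled signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fit_damping(target_displacement, candidates):
    """Return the candidate mu whose simulated displacement best matches the target."""
    def error(mu):
        simulated = [x for x, _ in simulate_oscillator(mu)]
        return squared_error(simulated, target_displacement)
    return min(candidates, key=error)


# Stand-in for a measured signal; in practice it would be derived from a recording.
target = [x for x, _ in simulate_oscillator(1.5)]

# The fitted parameter is the proxy feature for the (hypothetical) recording.
proxy_feature = fit_damping(target, [0.5, 1.0, 1.5, 2.0])
```

In this toy setting the fitted value of `mu`, or any statistic of the phase-space trajectory returned by `simulate_oscillator`, plays the role of the proxy feature that would be passed to a downstream predictor.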
For a given recording, the parameters of these models are derivable through the ADLES algorithm, which minimizes the squared error between a glottal flow signal estimated through inverse filtering and the glottal flow signal generated through the model. The discriminability of such features is evident from the highly characteristic patterns exhibited by the corresponding model in its phase space. For example, while we know that there are changes in voice in response to various diseases of the vocal folds, and may have observed changes in voice as a result of COVID-19, it is hard to characterize, identify or measure the exact changes in spectrographic and other signal representations. However, we can also see that the vocal fold movements (the phonation process) would be affected by these diseases, and that the effects would be highly correlated with the actual vocal fold oscillations as an affected person speaks. This is borne out by the phase space patterns exhibited by a 1-mass model, as shown in Fig. 1. We can therefore use the corresponding parameter values (and even other measurements that pertain to the phase space trajectories shown) to build predictors for the underlying conditions. The model parameters are then the proxy features we are looking for.

Proxy features can also be derived using neural networks and other classifiers that learn to perform classification tasks on the signals within which we desire to identify or measure biomarkers. When such classifiers are trained to perform auxiliary tasks that are equivalent to those that would be performed using objective measures of the biomarker, the scores they generate act as proxy features for the biomarkers in question. For example, it is easy to identify, but difficult to measure, changes in the "nasality" of speech signals. However, it is relatively easy to identify and isolate nasal and non-nasal phonemes in speech, and to build classifiers to discriminate between them.
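A minimal sketch of a classifier score used as a proxy feature is given below. Everything in it is an assumption made for illustration: the two-dimensional "frames" are synthetic numbers rather than acoustic features of real phonemes, a hand-written logistic regression stands in for whatever classifier would actually be used, and the names `train_logistic` and `nasality_score` are hypothetical.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(samples, labels, lr=0.1, epochs=200):
    """Fit weights and bias by full-batch gradient descent on the log-loss."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def nasality_score(x, w, b):
    """Classifier output in (0, 1); used here as the proxy feature."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic stand-ins for features of nasal (label 1) and non-nasal (label 0) phonemes.
rng = random.Random(0)


def make_frames(center, count):
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(count)]


nasal = make_frames(1.0, 100)
non_nasal = make_frames(-1.0, 100)
samples = nasal + non_nasal
labels = [1] * 100 + [0] * 100

w, b = train_logistic(samples, labels)
```

Calling `nasality_score(frame, w, b)` on a new frame yields a continuous value that can be fed to a downstream predictor in place of a direct measurement of nasality.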
When properly trained, an accurate classifier would generate scores that are discriminative of nasality characteristics. If this were not so, the classifier would not be able to accurately perform the task of discriminating between nasal and non-nasal phonemes. Once trained, the classifier can be used to generate scores for training newer predictors of underlying conditions based on nasality.

With the advancements in voice profiling techniques, it has become evident that for a large set of diseases for which correlations to voice were not known to exist, the presence of such correlations can in fact be hypothesized and scientifically supported. Since we know that for these diseases the biomarkers are not perceptible or evident within standard analytical representations of the voice signal, we must devise techniques to discover them, or to create them in appropriate mathematical spaces within which they can be shaped and measured. One example of such a biomarker discovery framework is illustrated in Fig. 2. This represents a generic setup which we call the ABCDE framework (Autoencoder based Biomarker Creation and Discovery Engine). Its exact formulation can vary from application to application.

In this framework, a speech signal is first converted into a numerical representation that is hypothesized to contain the biomarker-related information that we seek to uncover or extract. The representation is then "projected" into a neural kernel space, where we can impose objective criteria on it that are tailored to the specific biomarker. In Fig. 2, the cuboid for SPi represents a stack of the same "type" of DSP features, e.g. a stack of spectrograms obtained at different time-frequency resolutions, or a correlogram. These are input into a neural network E (e.g. a convolutional neural network), which can be viewed as an "encoder" that transforms them into a kernel space, yielding a latent feature representation Z.
Within this space, different constraints (including those based on prior knowledge) can be imposed on Z, so that it has the desired properties of the biomarker for the targeted health condition. One obvious property is that it must be discriminative for the underlying medical condition it is expected to encode. To ensure that the information present in the input representation is preserved in the process of transformation into the kernel space, a decoder D is trained to reconstruct the original representation from Z, while minimizing the loss between the reconstructed and original representations. All loss functions are represented by red double-lines in Fig. 2. In the training process, which is that of parameter optimization of the aggregate neural framework through gradient descent, such a loss function would be minimized.

To ensure that only the information in the original input representation is engineered for the feature at hand, the same latent representation Z is input into a generator G that can recreate the original speech signal from it (as G(Z)). Alternatively, G can act on the output of the decoder, SPo (as G(SPo)), to generate the voice signal. Time-domain signals with the same information content can differ in their voice acoustics. More robust losses are therefore defined by comparing DSP features from the original signal with those predicted from the voice signal generated by G. Alternatively, the DSP features can be "simulated" by a neural stack, as shown in the figure. An additional level of detail that allows voice quality features to play a role in this process of discovery is introduced by specifying rough functional relationships between DSP features and various voice qualities V. Losses based on direct comparisons between these can also play a part in the learning process.
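As a rough illustration of how a reconstruction loss and a discriminative constraint on Z can be optimized together, the following sketch trains a toy model on synthetic data. It is not the ABCDE framework itself: a linear map to a scalar latent stands in for the encoder E, a linear map back stands in for the decoder D, a logistic unit on Z stands in for the discriminative constraint, and the generator G and the voice-quality losses are omitted. The weight `lam`, the pattern `DIRECTION` and all function names are assumptions made for this example.

```python
import math
import random

rng = random.Random(0)
DIRECTION = [1.0, 0.5, -0.5, 0.2]  # hypothetical condition-dependent pattern


def make_sample(label):
    """Synthetic 4-D input whose sign pattern depends on the condition label."""
    sign = 1.0 if label == 1 else -1.0
    return [sign * d + rng.gauss(0.0, 0.1) for d in DIRECTION]


labels = [i % 2 for i in range(40)]
inputs = [make_sample(y) for y in labels]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def init_params():
    return {"enc": [0.1] * 4, "dec": [0.1] * 4, "a": 0.0, "b": 0.0}


def forward(x, p):
    z = sum(w * xi for w, xi in zip(p["enc"], x))  # stand-in for encoder E
    recon = [w * z for w in p["dec"]]              # stand-in for decoder D
    prob = sigmoid(p["a"] * z + p["b"])            # discriminative head on Z
    return z, recon, prob


def total_loss(data, targets, p, lam=1.0):
    """Mean of reconstruction error plus lam times cross-entropy on Z."""
    loss = 0.0
    for x, y in zip(data, targets):
        _, recon, prob = forward(x, p)
        prob = min(max(prob, 1e-9), 1.0 - 1e-9)
        rec = sum((r - xi) ** 2 for r, xi in zip(recon, x))
        bce = -(y * math.log(prob) + (1 - y) * math.log(1.0 - prob))
        loss += rec + lam * bce
    return loss / len(data)


def train(data, targets, p, lam=1.0, lr=0.01, epochs=50):
    """Per-sample gradient descent on the combined loss (gradients derived by hand)."""
    for _ in range(epochs):
        for x, y in zip(data, targets):
            z, recon, prob = forward(x, p)
            resid = [r - xi for r, xi in zip(recon, x)]
            # Gradient of the combined loss with respect to the latent z.
            dz = sum(2.0 * r * w for r, w in zip(resid, p["dec"]))
            dz += lam * (prob - y) * p["a"]
            p["dec"] = [w - lr * 2.0 * r * z for w, r in zip(p["dec"], resid)]
            p["enc"] = [w - lr * dz * xi for w, xi in zip(p["enc"], x)]
            p["a"] -= lr * lam * (prob - y) * z
            p["b"] -= lr * lam * (prob - y)
    return p


initial_loss = total_loss(inputs, labels, init_params())
params = train(inputs, labels, init_params())
final_loss = total_loss(inputs, labels, params)
```

Because both terms share the latent value, the encoder is pushed toward a Z that supports reconstruction and separates the two labels at once, which is the joint behaviour the text describes for the full framework.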
The entire framework is optimized simultaneously, but it is conceivable to express this process of discovery within more sophisticated AI learning frameworks that prioritize interpretability and can be trained in parts. Here, Z represents the "discovered" feature carrying the biomarker whose existence was hypothesized.

The biomarker discovery mechanisms given above are generic ones, and may have varied formulations in different settings. In contrast to traditional features derived from audio signals for use with machine learning algorithms, these features are likely to perform better with less training data, since they are designed to be more discriminating and less ambiguous for the specific disease for which they are designed. Such features are useful in many ways. For example, they can be used in applications that serve as diagnostic aids in clinical settings, to build tools for early detection of certain diseases, for self-monitoring of health by disabled, elderly and under-resourced people, etc.

This material is based upon work supported by the Defence Science and Technology Agency, Singapore under contract number A025959. Its content does not reflect the position or policy of DSTA and no official endorsement should be inferred.

[1] Predicting adult attention deficit hyperactivity disorder (ADHD) using vocal acoustic features
[2] Rapid voice tremor, or "flutter," in amyotrophic lateral sclerosis
[3] Otolaryngologic presentations of amyotrophic lateral sclerosis
[4] Neurologic diseases and their effect on voice
[5] Characterizing vocal tremor in progressive neurological diseases via automated acoustic analyses
[6] Speech in Alzheimer's disease: can temporal and acoustic parameters discriminate dementia?
[7] Ten years of research on automatic voice and speech analysis of people with Alzheimer's disease and mild cognitive impairment: A systematic review article
[8] Dysphonia in the aging: physiology versus disease
[9] Voice disorder in patients with fibromyalgia
[10] Voice in people with cerebral palsy
[11] Changes in voice quality after speech-language therapy intervention in older children with cerebral palsy
[12] Notes of cases of cholera treated by sulphurous acid
[13] Cholera: diagnosis and treatment
[14] The Cholera and Its Homoeopathic Treatment, Radde
[15] Voice quality evaluation in patients with COVID-19: An acoustic analysis
[16] Detection of COVID-19 through the analysis of vocal fold oscillations
[17] Interpreting glottal flow dynamics for detecting COVID-19 from voice
[18] Vocal characteristics in patients with type 2 diabetes mellitus
[19] Intonation and phonation in young adults with Down syndrome
[20] Temporal lobe epilepsy alters auditory-motor integration for voice control
[21] A study on etiopathogenesis of vocal cord paresis and palsy in a tertiary centre
[22] The hot patient: acute drug-induced hyperthermia
[23] Diagnosis and treatment of drug-induced hyperthermia
[24] Hypothermia: evaluation, electrocardiographic manifestations, and management
[25] The nature and severity of voice disorders in lung cancer patients
[26] Value of the voice in diagnosis of myxoedema in the elderly
[27] Aspiration pneumonia and dysphagia in the elderly
[28] Hoarseness of voice: a presenting manifestation of primary pulmonary hypertension
[29] Physical task stress and speaker variability in voice quality
[30] Voice, stress, and emotion
[31] Health and voice quality in smokers: an exploratory investigation
[32] Differences in self-rated, perceived, and acoustic voice qualities between high- and low-fatigue groups
[33] How do individuals cope with voice disorders? Introducing the voice disability coping questionnaire
[34] Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol
[35] Modeling vocal fold asymmetries with coupled van der Pol oscillators
[36] Dynamics of the two-mass model of the vocal folds: Equilibria, bifurcations, and oscillation region