key: cord-104138-qagyaegp authors: Magee, Michelle; Lewis, Courtney; Noffs, Gustavo; Reece, Hannah; Chan, Jess C. S.; Zaga, Charissa J.; Paynter, Camille; Birchall, Olga; Azocar, Sandra Rojas; Ediriweera, Angela; Caverlé, Marja W.; Schultz, Benjamin G.; Vogel, Adam P. title: Effects of face masks on acoustic analysis and speech perception: Implications for peri-pandemic protocols date: 2020-10-08 journal: bioRxiv DOI: 10.1101/2020.10.06.327452 sha: doc_id: 104138 cord_uid: qagyaegp Wearing face masks (alongside physical distancing) provides some protection against infection from COVID-19. Face masks can also change how we communicate and subsequently affect speech signal quality. Here we investigated how three face mask types (N95, surgical and cloth) affect acoustic analysis of speech and perceived intelligibility in healthy subjects. We compared speech produced with and without the different masks on acoustic measures of timing, frequency, perturbation and power spectral density. Speech clarity was also examined using a standardized intelligibility tool by blinded raters. Mask type impacted the power distribution in frequencies above 3kHz for both the N95 and surgical masks. Measures of timing and spectral tilt also differed across mask conditions. Cepstral and harmonics to noise ratios remained flat across mask type. No differences were observed across conditions for word or sentence intelligibility measures. Our data show that face masks change the speech signal, but some specific acoustic features remain largely unaffected (e.g., measures of voice quality) irrespective of mask type. Outcomes have bearing on how future speech studies are run when personal protective equipment is worn. Face masks (alongside physical distancing) provide some protection against infection from Coronavirus disease (Chu et al., 2020) . Their use in public spaces and healthcare settings is either recommended or mandatory in many jurisdictions internationally. In the United States, the Center for Disease Control (CDC, 2020) recommends mask use to minimize droplet dispersion and aerosolization of the virus (Bahl et al., 2020) . Clinical trials and healthcare settings continue to assess speech production, which generates respiratory droplets while unrestricted exposure increases the likelihood of disease contraction (Stadnytskyi et al., 2020) . Risk of transmission increases through behaviors common in many speech assessment tasks including continuous and loud speech (Asadi et al., 2019) . At the same time, acknowledgement of the necessity of personal protective equipment to minimize virus transmission has increased internationally (Asadi et al., 2019; Stadnytskyi et al., 2020; Zaga et al., 2020) . Masks, however, alter the speech signal with downstream effects on intelligibility of a speaker. The use of personal protective equipment poses some unique challenges for speech assessment. We evaluated the impact wearing a mask has on acoustic output and speech perception. We examined how different face mask types (surgical, cloth and N95), in combination with microphone location variations (headset vs. tabletop), affect speech recordings and intelligibility. Four subjects, aged 29.0 ± 5.8 years, range 23-38; 2 males: 2 females, were included in the study. All speakers were English speaking with no dysphonia, cognitive or neurological impairments. One male and female had English as their second language. The speech battery was elicited by trained staff and consisted of sustaining an open vowel /aː/ for approximately six seconds reproduced ten times and reading a phonetically balanced text, the Grandfather Passage (Van Riper, 1963) , reproduced five times. The speech battery was repeated under four conditions in a randomized order: 1) no mask; 2) standard surgical mask (regulated under 21 CFR 878.4040); 3) cloth mask (2-layered cotton); and 4) N95 mask (disposable mask made from electrostatic non-woven polypropylene fiber containing a filtration layer). Subjects were instructed to speak in a natural manner at a comfortable pitch and pace. Speech samples were recorded using two standardized methods: 1) Using a head-mounted cardioid condenser microphone (AKG520, Harman International, United States) positioned 2 inches from the corner of the subject's mouth (minimum sensitivity of -43dB, near flat frequency response) and coupled with a QUAD-CAPTURE USB 2.0 Audio Interface (Roland Corporation, Shizuoka, Japan) connected to a laptop computer; and 2) Using a Blue Yeti (Blue Microphones, United States) tabletop microphone (sensitivity 4.5mV/Pa) connected to a laptop computer. The microphone was positioned 5 ft. from the subject to simulate physical distancing measures. Standardization of the recording environment was achieved by recording in the absence of traffic, electrical, appliance, or other background noise. All recordings were sampled at 44.1 kHz with 32-bit quantization. Speech intelligibility was evaluated using the Assessment of Intelligibility of Dysarthria Speech (ASSIDS) (Yorkston and Beukelman, 1984) . For each condition subjects read aloud a randomized list of single words (one and two syllables in length) and sentences (5 to 28 syllables in length). Two blinded raters transcribed ASSIDS words and sentences, with the percentage of correct items calculated for each condition. Audio files were screened for deviations and synchronized between microphones to ensure uniformity of length. Acoustic analysis of sustained vowel and reading tasks were performed using Praat software (Boersma, 2002) . Two groups of speech features were analyzed, one to describe responsiveness to speech and silence, and another to determine agreement between measurements taken by different microphone conditions. The speech spectrum was used to describe the impact of mask type on the complex voice waveform. The interaction between intensity and frequency was characterized using the power spectral density (PSD, dB/kHz relative 2x10 -5 Pa) in the longterm average spectrum on the reading task. PSD provides information on how "each frequency" contributes to the total sound power. Frequency bands were fixed at 1kHz. PSD was averaged across subjects for each mask condition and compared between masks not subjects. Center-of-gravity (CoG, in Hz) was calculated from the power spectrum to inform frequency responsiveness of the conditions. CoG is the mean power-weighted frequency, i.e. the frequency that divides the power spectrum in equal halves above and below CoG. The intensity of background noise (floor) was determined as equal to the average intensity during the quietest three seconds of each files (i.e., in the absence of vocalization). Floor intensity was subtracted from the average intensity (during vocalization) for each task (vowel and reading) to determine the speech intensity prominence per mask condition. Features of interest included cepstral peak prominence smoothed (CPPS), harmonic-to-noise ratio (HNR), local jitter and shimmer for the sustained vowel, and average and standard deviation of pause length for the reading task. Fundamental frequency was calculated through autocorrelation within a restricted range (70Hz -250Hz for males, 100Hz -300Hz for females) (Vogel et al., 2009 ). The analysis window was 43ms and 30ms respectively, and window shift fixed at 10ms. The maximum number of formants was set at 5 with a maximum of 5500Hz for formant detection. All other parameters were maintained at default software settings. The detection of silence-speech and speech-silence transitions was done using an energy threshold on the time domain (Rosen et al., 2010; Vogel et al., 2017) . The threshold was set to 65% of the 95 th percentile, with minimum silence length set to 20ms and minimum speech length to 30ms. To examine differences of each acoustic parameter under each mask condition (no mask, surgical, N95, and cloth), a linear mixed-effects model analysis using restricted maximum likelihood estimation was applied. Mask type was modeled as a fixed factor, and subject and order of mask as a random factor. Bonferroni corrected post hoc pairwise comparisons were conducted to determine differences in mask type (surgical, N95, and cloth) compared to no mask. To investigate power spectral density, the interaction effect between mask and frequency band was investigated. Where the interaction was significant, planned comparisons were made for each 1Khz frequency band to determine differences between masks types compared to no mask. SPSS was used for all statistical analyses (IBM SPSS Version 26.0). Intelligibility varied between the speakers and across mask conditions. On average, intelligibility remained above 92% for all mask conditions, irrespective of single words (Figure 1a Frequency bands were collapsed into 1kHz slices to explore differences in PSD between mask type. There was a Mask × 1kHz frequency band interaction effect (F27,755=2.50, p=0.006). Post hoc comparisons showed power (dB/Hz 2 ) was significantly lower between 3-10 kHz for N95 mask and 5-10kHz for surgical and cloth masks when compared to no mask on recordings made using the head-mounted microphone (Figure 2a) . No significant differences were observed between mask conditions on recordings made using the tabletop microphone (F27,757=1.41, p=0.082; Figure 2b ). -Insert Figure 2 about here- showed that recordings produced with the N95 mask increased percentage of pauses (p=0.023) (Table 1) . Spectral tilt was lower in recordings produced with the surgical (p=0.016) and N95 masks (p=0.001). For recordings produced with the tabletop microphone, there was a significant effect of mask type for percentage of pauses (F3,7.87=8.17, p=0.008), and spectral tilt (F3,8.39=15.43, p=0.001) ( Table 1) . Post hoc comparisons revealed that the N95 and cloth masks yielded higher percentage of pauses (N95 p=0.022; Cloth p=0.029) no mask. As with the head-mounted microphone, recordings produced with the tabletop microphone yielded lower spectral tilt values with both the surgical (p=0.006) and N95 masks (p=0.002). No significant differences were observed in acoustic parameters extracted from the sustained vowel recorded using either the headmounted or tabletop microphone. -Insert Table 1 about here- The type of mask affected the speech signal. We observed significant differences in acoustic power distribution across relevant frequency bands for speech in all three mask conditions compared to no mask. The differences were not observed in frequencies below 3kHz. Differences in signal for higher frequencies led to altered acoustic outcomes including spectral tilt. The masks however did not significantly influence listener-perceived intelligibility or acoustic measures of perturbation (e.g., NHR, CPPS). Measures of speech rate were lower for N95 and surgical masks, possibly as speakers compensate when wearing masks to improve intelligibility. It is also possible that speech timing differences were related to how speech boundaries are identified in the analysis scripts (i.e., our timing analysis relied on identification of phoneme/word boundaries via intensity thresholds). Intelligibility scores varied between raters and between mask condition. Intelligibility remained above 92% for words and sentences. Anecdotally, it can be difficult to understand people when they wear a mask (Goldin et al., 2020) . Our small dataset suggests mask type does not systematically impact intelligibility in controlled environments. Our recordings were made with high-quality microphones in quiet environments. Raters listened to samples in ideal listening conditions away from distractions and background noise but without visual aid (lips and jaw movement) for all mask conditions. In loud environments, communication can be challenging with multiple distractors, background noise, and a lower signal-to-noise ratios (SNR). Noise in ecological situations may further decrease speech intelligibility, when complementary visual cues blocked by use of face masks play a role in communication. It is clear that face masks change the acoustic speech signal, but some specific perceptual features remain largely unaffected (e.g., acoustic measures of voice quality) irrespective of mask type. These results have implications for clinical assessments and speech research where PPE is required. It is easy to assume that subjects in a speech study will simply remove PPE during assessments; however, subjects and researchers may be reluctant to do so if it leads to potential exposure to airborne viruses. In longitudinal studies with data collection before, during, and after pandemics requiring PPE, researchers should consider how to mitigate against changes to protocols that affect speech (see Figure 3 ) (Redenlab, 2020) . Mean power spectra density displayed between 1-10kHz based on mask type. Shaded areas represent the standard error of mean. *p≤0.05 no mask vs mask type at each frequency bin. Red stars denote significant differences between no mask and N95, blue stars denote significant differences between no mask and surgical masks while orange stars denote significant differences between no mask and N95. *Disclaimer: Please be advised that nothing completely eliminates bacteria or viruses and the guidelines contained in this document are measures attempting to limit the spread of a virus. Further, these guidelines do not supersede medical practitioner recommendations or the COVID-19 safety policies implemented by your business or institution. It is your responsibility to follow the recommendations and safety policies applicable to your business or institution. To reduce risk, it is recommended assessors wear masks throughout assessments, the microphone's metal surfaces are sanitized between subjects, and all windscreens are washed at the end of each use. Aerosol emission and superemission during human speech increase with voice loudness Face coverings and mask to minimise droplet dispersion and aerosolisation: a video case study Use of Masks to Help Slow the Spread of COVID-19 Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis How do medical masks degrade speech reception? Guidance on minimizing risk to patients and staff during speech recordings Automatic method of pause measurement for normal and dysarthric speech The airborne lifetime of small speech droplets and their potential importance in SARS-CoV-2 transmission Standardization of pitch-range settings in voice acoustic analysis Motor speech signature of behavioral variant frontotemporal dementia: Refining the phenotype Assessment of Intelligibility of Dysarthric Speech Speech-Language Pathology Guidance for Tracheostomy During the COVID-19 Pandemic: An International Multidisciplinary Perspective