title: Face masks impair reconstruction of acoustic speech features and higher-level segmentational features in the presence of a distractor speaker
authors: Haider, Chandra Leon; Suess, Nina; Hauswald, Anne; Park, Hyojin; Weisz, Nathan
date: 2021-09-30
journal: bioRxiv
DOI: 10.1101/2021.09.28.461909

Face masks have become a prevalent measure during the Covid-19 pandemic to counteract the transmission of SARS-CoV-2. An unintended "side effect" of face masks is their adverse influence on speech perception, especially in challenging listening situations. So far, behavioural studies have not pinpointed exactly which feature(s) of speech processing face masks affect in such listening situations. We conducted an audiovisual (AV) multi-speaker experiment using naturalistic speech (i.e. an audiobook). In half of the trials, the target speaker wore a (surgical) face mask, while we measured the brain activity of normal-hearing participants via magnetoencephalography (MEG). A decoding model trained on clear AV speech (i.e. no additional speaker and target speaker not wearing a face mask) was used to reconstruct crucial speech features in each condition. We found significant main effects of face masks on the reconstruction of acoustic features, such as the speech envelope and spectral speech features (i.e. pitch and formant frequencies), while reconstruction of higher-level features of speech segmentation (phoneme and word onsets) was especially impaired by masks in difficult listening situations, i.e. when a distracting speaker was also presented. Our findings demonstrate the detrimental impact face masks have on listening and speech perception, thus extending previous behavioural results. Supporting the idea of visual facilitation of speech is the fact that we used surgical face masks in our study, which have only mild effects on speech acoustics. This idea is in line with recent research, also by our group, showing that visual cortical regions track spectral modulations. Since hearing impairment usually affects higher frequencies, the detrimental effect of face masks might pose a particular challenge for individuals who likely need the visual information about higher frequencies (e.g. formants) to compensate.

In the ongoing Covid-19 pandemic, face masks have become a prevalent and effective measure around the globe to counteract the transmission of SARS-CoV-2 (Rahne et al., 2021). However, face masks pose issues in everyday listening situations. Face masks have been shown to impair acoustic details of the speech signal in several studies, with N95/FFP2 masks showing quite a strong attenuation in mid to high frequencies, while surgical face masks show only a moderate influence on the acoustic speech signal (Caniato et al., 2021; Corey et al., 2020). Confirming this impact behaviourally, Rahne et al. (2021) found impaired speech perception in noise when wearing a surgical face mask compared to no mask, while participants performed even worse when using an N95 mask. As well as reducing acoustic details, face masks completely remove or severely reduce the information transmitted via visual input. It has been established early on that visual cues originating from the mouth and lips facilitate speech intelligibility, especially in situations with a low speech-to-noise ratio (Sumby & Pollack, 1954).
Through analysing continuous speech using decoding models to reconstruct acoustic speech features from the neural signal obtained by electroencephalography (EEG) or magnetoencephalography (MEG), evidence has accumulated that the brain directly tracks crucial speech-specific components, like the speech envelope (Brodbeck & Simon, 2020; Ding & Simon, 2014) and speech-critical resonant frequencies called formants (Peelle & Sommers, 2015). Not only does the brain engage in direct tracking of the acoustic signal, but seeing the talker's face also substantially facilitates acoustic signal tracking (Crosse et al., 2015; Golumbic et al., 2013; Park et al., 2016). On the one hand, this audiovisual facilitation might be explained by simple temporal cues (i.e. the opening and closing of the mouth) when having to attend to auditory stimuli. On the other hand, visual information might preselect certain possible stimuli (e.g. phonemes) and therefore enhance subsequent auditory processing as a form of crossmodal integration. Using the additive model (i.e. comparing event-related potentials (ERPs) to auditory stimuli plus ERPs to visual stimuli (A+V) with ERPs to audiovisual stimuli (AV)), past studies indeed suggested that the brain integrates information from the visible lip movements and the auditory input early (~120 ms after stimulus onset) for efficient speech processing (Besle et al., 2004, 2009). In addition to these effects in auditory processing regions, we have provided evidence for a direct visuo-phonological transformation when individuals only process visual information (i.e. silent video recordings of speakers), by showing that the acoustic speech envelope is tracked in visual cortical regions when individuals observe lip movements (Hauswald et al., 2018; Suess et al., 2021). Furthermore, the visual cortex also tracks spectral modulations in the range of the pitch, as well as of the second (F2) and third formant (F3), which mainly reflect sounds produced with the visible part of the mouth (Suess et al., 2021). This aligns well with previous findings by Chandrasekaran et al. (2009), whose results indicate that the area of mouth opening correlates most strongly with spectral components of speech in the range of 1 kHz - 3 kHz, corresponding to the frequency range of F2 and F3. Together, these results reveal that visual lip movements are transformed in order to track acoustic speech features such as the speech envelope and formant frequencies. In the context of face masks, a large online study investigated their effects on audiovisual (AV) speech behaviourally (Brown et al., 2021). The authors found no differences in sentence intelligibility between clear AV speech (i.e. no face mask) and face masks of several types (e.g. surgical face mask and N95 mask) in conditions with a quiet background, but differences became apparent in conditions with moderate and high background noise. These drawbacks of face masks might be even more pronounced in individuals with hearing impairments, as they profit more from observing a talking face congruent with the auditory input (Puschmann et al., 2019). Despite these well-established effects, the behavioural studies have left open which (degraded) speech features are driving these findings. Decoding distinct speech features from the neural signal can be used to address this issue.
Putting the aforementioned findings together, face masks might adversely impact the processing of diverse speech characteristics at different hierarchical levels, resulting in poor behavioural performance. With face masks still common in everyday life as a measure against Covid-19 and continuing to remain important in medical settings, understanding precisely which features of speech are less well tracked by the brain can help guide decisions on which face mask to use. These considerations are especially important when dealing with hearing-impaired individuals. In the current MEG study, we investigated how neural tracking of a variety of speech features (purely acoustic features and lexical/phonetic boundaries) in an audiovisual naturalistic speech paradigm is impaired by (surgical) face masks. Special emphasis is placed on an interaction between face masks and difficult listening situations induced via an audio-only distractor speaker, as recent studies emphasised the visual benefit when acoustics are unclear (Brown et al., 2021; Mitchel & Weiss, 2014; Park et al., 2016). We trained a backward model on clear speech in order to reconstruct the speech characteristics from the participants' brain data for each condition. Additionally, we measured participants' comprehension performance and subjective difficulty ratings. We hypothesised particularly strong effects on speech features that have been shown to be detectable from visual input (i.e. lip movements), such as the speech envelope, pitch, the averaged F2 and F3 (F2/3) as well as segmentation (i.e. phoneme and word onsets). We found strong adverse general effects of the face mask on the reconstruction of acoustic features (i.e. speech envelope and spectral features) and on difficulty ratings, irrespective of a distractor speaker. Importantly, for features of segmentation (word and phoneme onsets), face masks revealed their adverse impact especially in difficult listening situations.

Participants
29 German native speakers (12 female) aged between 22 and 41 years (M = 26.79, SD = 4.86) took part in our study. All participants had self-reported normal hearing, verified by a standard clinical audiometry. Further exclusion criteria were non-removable magnetic objects as well as a history of psychiatric or neurological conditions. Recruitment was done via social media and university lectures. One participant was excluded because signal source separation could not be applied to the MEG dataset. All participants signed an informed consent form and were compensated with €10 per hour or course credit. The experimental protocol was approved by the ethics committee of the University of Salzburg and was carried out in accordance with the Declaration of Helsinki.

We used excerpts of four different stories, read out in German, for our recordings. 'Die Schokoladenvilla - Zeit des Schicksals. Die Vorgeschichte zu Band 3' ('The Chocolate Mansion, The Legacy' - prequel to Volume 3) by Maria Nikolai and 'Die Federn des Windes' ('The Feathers of the Wind') by Manuel Timm were read out by a female speaker. 'Das Gestüt am See. Charlottes großer Traum' ('The Stud Farm by the Lake. Charlotte's Great Dream') by Paula Mattis and 'Gegen den Willen der Väter' ('Against the Will of Their Fathers') by Klaus Tiberius Schmidt were read out by a male speaker. Stimuli were recorded using a Sony FS100 camera with a sampling rate of 25 Hz and a Rode NTG 2 microphone with a sampling rate of 48 kHz.
We aimed at a duration of approximately ten minutes per story, which was cut into ten videos of around one minute each (range: 56 s - 76 s, M = 64 s, SD = 4.8 s). All stories were recorded twice, once without the speaker wearing a surgical face mask and once with the speaker wearing a surgical face mask (Type IIR, three-layer single-use medical face mask, see Figure 1A). After cutting the videos, we ended up with 80 videos of approximately one minute each. Forty of those were presented to each participant (20 with a female speaker, 20 with a male speaker) in order to rule out sex-specific effects. The audio track was extracted and stored separately. The audio files were then normalised using the Python tool 'ffmpeg-normalize' with default options (a sketch of this step is shown below, after the procedure description). Pre-recorded audiobooks read out by different speakers (one female, one male) were used for the distractor speaker and normalised using the same method. These audio files contained either a (different) single male or female speaker. The syllable rate was analysed using a Praat script (Boersma & Weenink, 2001; de Jong & Wempe, 2009). The target speakers' syllable rates varied between 3.7 Hz and 4.6 Hz (M = 4.1 Hz). Target and distractor stimuli were all played to the participant at the same volume, which was individually set to a comfortable level at the start of the experiment.

Before the start of the experiment, we performed a standard clinical audiometry using an AS608 Basic audiometer (Interacoustics, Middelfart, Denmark) in order to assess participants' individual hearing ability. Afterwards, participants were prepared for MEG (see Data acquisition). We started the MEG measurement with five minutes of resting-state activity (not included in this manuscript). We then assessed the participants' individual hearing threshold in order to adjust our stimulation volume. If the participant stated afterwards that the stimulation was not comfortable or not loud enough, we adjusted the volume again manually to the participant's requirements. Of the four stories, half were randomly chosen with the target speakers wearing face masks in the recording; in the remaining half, speakers did not wear a face mask. Each story presentation functioned as one stimulation block, resulting in four blocks overall. One block consisted of ten ~1-minute-long trials. In three randomly selected trials per block (i.e. 30% of trials), a same-sex audio-only distractor speaker was added at equal volume to the target speaker. We only added a distractor speaker in 30% of trials in order to retain enough data to train our backward model on clear speech (see Stimulus reconstruction section). Distractor speaker presentation started five seconds after target speaker video and audio onset in order to give the participants time to pay attention to the target speaker. Within the blocks, the story presentation followed a consistent storyline across trials. After each trial, two unstandardised 'true or false' statements regarding semantic content were presented to assess comprehension performance and keep participants focused (Figure 1A). Additionally, participants rated subjective difficulty and motivation four times per block on a five-point Likert scale (not depicted in Figure 1A). The participants' answers were given via button presses. In one half of the four blocks a female target speaker was presented, in the other half a male target speaker.
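For illustration, the loudness normalisation step referenced above could be reproduced roughly along the following lines with the ffmpeg-normalize command-line tool and its default options; the folder names are hypothetical placeholders and this is a sketch, not the authors' actual script.

```python
# Minimal sketch (not the authors' script): batch loudness normalisation of
# the extracted audio tracks with the ffmpeg-normalize CLI, default options.
# Folder names are placeholders; ffmpeg and ffmpeg-normalize must be installed
# (e.g. pip install ffmpeg-normalize).
import subprocess
from pathlib import Path

AUDIO_DIR = Path("stimuli/audio")            # extracted .wav tracks (placeholder path)
OUT_DIR = Path("stimuli/audio_normalised")   # output folder (placeholder path)
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    # With default options ffmpeg-normalize applies EBU R128 loudness
    # normalisation; '-o' names the output file.
    subprocess.run(
        ["ffmpeg-normalize", str(wav), "-o", str(OUT_DIR / wav.name)],
        check=True,
    )
```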
Videos were back-projected onto a translucent screen with a screen diagonal of 74 cm via a Propixx DLP projector (VPixx Technologies, Canada) positioned ~110 cm in front of the participants, at a refresh rate of 120 Hz and a resolution of 1920 x 1080 pixels. Including preparation, the experiment took about 2 hours per participant. The experiment was coded and conducted with Psychtoolbox-3 (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) with an additional class-based library ('Objective Psychophysics Toolbox', o_ptb) on top of it (Hartmann & Weisz, 2020).

We recorded brain data at a sampling rate of 1 kHz from 306 channels (204 first-order planar gradiometers and 102 magnetometers) with a Triux MEG system (MEGIN, Helsinki, Finland). The acquisition was performed in a magnetically shielded room (AK3B, Vacuumschmelze). Online bandpass filtering was performed from 0.1 Hz to 330 Hz. Prior to the acquisition, cardinal head points (nasion and pre-auricular points) were digitised with a Polhemus Fastrak Digitizer (Polhemus) along with around 300 points on the scalp in order to assess individual head shapes. Using a signal space separation algorithm provided by the MEG manufacturer (Maxfilter, version 2.2.15), we filtered out noise resulting from sources outside the head and realigned the data to a standard head position, which was measured at the beginning of each block.

Figure 1 (caption fragment): ... (left or right button). On the right, a block is depicted with the male speaker wearing a face mask across the ten trials of the block; otherwise the procedure is the same as for the block without a face mask. Clear speech is defined as the condition without a mask and without a distractor speaker. The two depicted blocks were repeated with a female speaker, resulting in a total of four blocks. Figure images have been removed/obscured due to a bioRxiv policy on the inclusion of faces.

All the speech features investigated are depicted in Figure 1B. The speech envelope was extracted using the Chimera toolbox. For this purpose, the speech signal was filtered forward and in reverse with a 4th-order Butterworth bandpass filter at nine different frequency bands equidistantly spaced between 100 and 10000 Hz according to the cochlear map (Smith et al., 2002). A Hilbert transformation was then performed to extract the envelopes from the resulting signals. These nine envelopes were summed into one general speech envelope and normalised. The pitch (fundamental frequency, F0) was extracted using the built-in Matlab Audio Toolbox function pitch.m and downsampled to 50 Hz. The speech formants (first, second, third and the averaged second and third formant) were extracted using FormantPro (Xu & Gao, 2018), a tool for automatic formant detection via Praat (Boersma & Weenink, 2001), at 50 Hz with an integration window length of 20 ms and a smoothing window of 10 ms. Phoneme and word onset values were generated using forced alignment with the MAUS web services (Kisler et al., 2017; Schiel, 1999) in order to obtain a measure of speech segmentation. We generated two time series with binary values indicating the onset of a phoneme or word, respectively. We then smoothed these binary time series using a Gaussian window with a width of 10 ms. Finally, all features were downsampled to 50 Hz to match the sampling rate of the corresponding brain signal, as most speech-relevant signals present themselves below 25 Hz (Crosse et al., 2021).
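For illustration, below is a minimal Python/scipy analogue of this broadband envelope computation (the original analysis used the Chimera toolbox in Matlab). The log-spaced band edges are an assumption standing in for the toolbox's cochlear map, and the file name in the usage example is a placeholder.

```python
# Sketch of the broadband speech-envelope computation described above:
# nine band-pass filters between 100 and 10000 Hz, Hilbert envelopes,
# summation, normalisation and downsampling to 50 Hz.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def speech_envelope(audio, fs, n_bands=9, f_lo=100.0, f_hi=10000.0, fs_out=50):
    """Sum of sub-band Hilbert envelopes, downsampled to fs_out Hz."""
    audio = audio.astype(float)
    # Assumption: band edges equidistant on a log axis as a stand-in for the
    # Chimera toolbox's cochlear map (Smith et al., 2002).
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)
    env = np.zeros_like(audio)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)   # zero-phase (forward and reverse) filtering
        env += np.abs(hilbert(band))     # sub-band Hilbert envelope
    env /= env.max()                     # simple normalisation
    return resample_poly(env, fs_out, int(fs))  # downsample to the 50 Hz analysis rate

# Usage with a hypothetical mono 48 kHz stimulus file:
# fs, audio = wavfile.read("stimulus_01.wav")
# env50 = speech_envelope(audio, fs)
```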
The raw data were analysed using Matlab R2020b (The MathWorks, Natick, Massachusetts, USA) and the FieldTrip toolbox (Oostenveld et al., 2011). First, we computed 50 independent components to remove eye and heart artifacts, removing on average 2.38 components per participant (SD = .68). We further filtered the data using a sixth-order zero-phase Butterworth bandpass filter between 0.1 and 25 Hz. Afterwards, we epoched the data into 2.5 s segments. Finally, we downsampled our data to 50 Hz.

To reconstruct the different speech characteristics (speech envelope, pitch, resonant frequencies as well as word and phoneme onsets) from the brain data, we utilised the mTRF Toolbox (Crosse, Di Liberto, Bednar, et al., 2016). The goal of this approach is to map brain responses (i.e. all 306 MEG channels) back onto the stimulus feature of interest (e.g. the speech envelope) using linear models, in order to obtain a measure of how well a certain characteristic is encoded in the brain. According to our 2x2 experimental design, the stimulus features were reconstructed for each condition. As the distractor speaker started five seconds after trial onset, these first five seconds were not assigned to the Distractor condition but reassigned to the respective single-speaker condition. The stimulus features and the brain data of all 306 MEG channels were z-scored and the epochs were shuffled. We then used the clear speech condition (no mask and no distractor speaker presented) to train the backward model with ridge regression. In order to also test the model on a clear audio data set, we split it into seven parts and trained our model on six parts, while using the remaining part for testing. This resulted in approximately twelve minutes of data for training the model. We defined the time lags for training our model from -150 ms to 450 ms. We then performed seven-fold leave-one-out cross-validation on our training dataset to find the optimal regularisation parameter (Willmore & Smyth, 2003) and used the same data with the obtained regularisation parameter to train our backward model. For each condition, we used the same backward model trained on clear speech to reconstruct the speech characteristics of interest, namely the speech envelope, pitch, resonant frequencies (F1-F3 and F2/3) and segmentational features (phoneme and word onsets). As we used clear audio trials for training the decoding model and added a distractor speaker in only 30% of trials (see Experimental procedure, Figure 1A), this resulted in test data sets of variable length: ~2 minutes in the 'no mask/no distractor' condition, ~14 minutes in the 'mask/no distractor' condition and ~6 minutes each in the 'no mask/distractor' and 'mask/distractor' conditions. The process was repeated so that each of the seven subsets of the clear speech condition served once as the test set while the remaining subsets were used for training. For each participant, each speech feature and each of the four conditions, we computed the correlation coefficient (Pearson's r) between the reconstructed and the original feature as a measure of reconstruction accuracy, Fisher z-transforming and averaging the respective correlation coefficients across the test sets of the seven repetitions.
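To make the decoding step concrete, below is a minimal numpy sketch of a time-lagged backward (decoding) model trained with ridge regression. It is an illustrative analogue of the mTRF-toolbox analysis, not the toolbox itself: the simulated data, the reduced channel count and the fixed regularisation parameter are assumptions, whereas in the actual analysis the regularisation parameter was selected by cross-validation as described above.

```python
import numpy as np

def lagged_design(meg, lags):
    """Design matrix whose row t contains copies of all MEG channels
    shifted by each lag (samples shifted beyond the edges are zeroed)."""
    n_times, n_ch = meg.shape
    X = np.zeros((n_times, n_ch * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(meg, lag, axis=0)
        if lag > 0:
            shifted[:lag] = 0      # remove samples wrapped around from the end
        elif lag < 0:
            shifted[lag:] = 0      # remove samples wrapped around from the start
        X[:, i * n_ch:(i + 1) * n_ch] = shifted
    return X

def train_decoder(meg, feature, lags, lam=1.0):
    """Backward model via ridge regression: w = (X'X + lam*I)^(-1) X'y."""
    X = lagged_design(meg, lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ feature)

def reconstruct(meg, weights, lags):
    """Reconstruct the stimulus feature from (new) MEG data."""
    return lagged_design(meg, lags) @ weights

# Toy example at 50 Hz with lags spanning -150 ms to 450 ms; only 60 simulated
# channels (instead of the 306 MEG channels) to keep the demo light.
fs = 50
lags = np.arange(int(-0.15 * fs), int(0.45 * fs) + 1)   # -7 .. 22 samples
rng = np.random.default_rng(0)
meg = rng.standard_normal((3000, 60))                   # simulated (time x channels) data
envelope = rng.standard_normal(3000)                    # simulated target feature
w = train_decoder(meg[:2400], envelope[:2400], lags)    # train on one part ...
recon = reconstruct(meg[2400:], w, lags)                # ... test on the held-out part
r = np.corrcoef(recon, envelope[2400:])[0, 1]           # reconstruction accuracy (Pearson's r)
print(f"reconstruction accuracy r = {r:.3f}")
```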
We performed a repeated measures ANOVA with the within-factors Mask (no face mask vs. face mask) and Distractor (no distractor speaker vs. distractor speaker) and the obtained Fisher z-transformed correlation coefficients (i.e. reconstruction accuracy) as dependent variables. For the behavioural results (comprehension performance and subjective difficulty), we also used a repeated measures ANOVA with the same factors Mask and Distractor, using comprehension performance scores (i.e. the percentage of correct answers) and averaged subjective difficulty ratings, respectively, as dependent variables. The statistical analyses for reconstruction accuracies and behavioural data were performed using pingouin, a statistics package for Python 3 (Vallat, 2018). In case of a significant interaction or a trend, a simple effects test was performed via Matlab's Statistics and Machine Learning Toolbox in order to pinpoint the nature of the interaction. Furthermore, comparisons of spectral fine details between face masks and no masks were computed in Matlab with the Measures of Effect Size toolbox (Hentschke & Stüttgen, 2011).

Comprehension performance scores were generated using two 'true or false' comprehension questions at the end of each of the 40 trials. We used a two-way repeated measures ANOVA to investigate the influence of the factors Mask and Distractor on comprehension performance. Apart from the effect for the distractor speaker ( F(1,28)

Using a stimulus reconstruction approach based on the recorded MEG data, we studied which speech-related features are impaired through face masks, with a special focus on difficult listening situations. We therefore analysed the correlation coefficients (Pearson's r) obtained using a backward model (Crosse, Di Liberto, Bednar, et al., 2016). The correlation coefficient represents how well the specific stimulus characteristic was reconstructed from brain data and serves as a proxy for how well these features are represented in the neural signal. With this approach, we generated one correlation coefficient for each condition per participant. This process was repeated for each speech feature of interest. To analyse the effect of the face mask and the distractor speaker, we performed a two-way repeated measures ANOVA, with the Fisher z-transformed correlation coefficients as dependent variables. Detailed results and statistical values can be found in the supplementary material (see Table S1). As expected, the results show a strong effect (all p < .001) of the distractor speaker on the stimulus reconstruction across all stimulus characteristics of interest. Figure 2A shows example reconstructions for the speech envelope and the averaged second and third formant (Formant 2/3 or F2/3); mean reconstruction accuracies for clear audiovisual speech (i.e. stimulation material with no mask and no distractor) are shown in Figure 2B. Bars denote the 95% confidence interval.

We investigated how the stimulus reconstruction of the speech envelope is impaired through face masks, with a particular focus on difficult listening situations induced by a distractor speaker. Apart from the negative impact of the distractor speaker (F(1,28) = 161.09, p < .001, ηp² = .85), we observed a strong negative effect of face masks on reconstruction accuracies of the speech envelope (F(1,28) = 24.42, p < .001, ηp² = .47, Figure 3A). We found no significant interaction between the factors Mask and Distractor (F(1,28) = .25, p = .619, ηp² = .01, Figure 3B and Figure 3C).
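For concreteness, here is a minimal sketch of the two-way repeated-measures ANOVA (within-factors Mask and Distractor) on the Fisher z-transformed reconstruction accuracies, using pingouin as named in the statistical analysis section; the long-format data frame, its column names and the simulated values are placeholders rather than the actual data.

```python
# Sketch of the repeated-measures ANOVA on reconstruction accuracies.
# All values below are simulated; only the analysis structure follows the text.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
subjects = [f"s{i:02d}" for i in range(1, 30)]           # 29 participants
rows = []
for sub in subjects:
    for mask in ("no mask", "mask"):
        for dist in ("no distractor", "distractor"):
            rows.append({
                "subject": sub,
                "Mask": mask,
                "Distractor": dist,
                # Fisher z-transformed reconstruction accuracy (simulated here)
                "z_accuracy": rng.normal(0.1, 0.05),
            })
df = pd.DataFrame(rows)

# Two-way repeated-measures ANOVA: main effects of Mask and Distractor and
# their interaction on reconstruction accuracy.
aov = pg.rm_anova(data=df, dv="z_accuracy",
                  within=["Mask", "Distractor"], subject="subject")
print(aov)
```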
As the speech envelope conveys crucial information about the syntactic structure of speech (Giraud & Poeppel, 2012; Poeppel & Assaneo, 2020), reduced reconstruction accuracy points to difficulties in deriving this information for further higher-level speech processing.

Moreover, we wanted to investigate the influence of face masks on the spectral fine details of speech. In this study, we specifically analysed pitch (or fundamental frequency, F0), the first formant (F1), the second formant (F2) and the third formant (F3). When facing concurrent speakers, a listener must segregate the speech signal into different speech streams, and pitch is suggested to serve a fundamental role in this process (Bregman, 1990). Formants are especially interesting, as they are vital for identifying vowels. Additionally, we investigated the averaged F2 and F3 (F2/3), as these two formants, generated in the front cavity, converge into 'focal points' after specific vowel-consonant combinations (Badin et al., 1990) and their frequency range has been shown to correlate strongly with lip movements (Chandrasekaran et al., 2009). Furthermore, this characteristic has been shown to be tracked from visual-only speech and is therefore prone to be affected by face masks (Suess et al., 2021). Detailed results for F1, F2 and F3 are depicted in Table S1 (see Supplementary Material). Effect sizes of the main effects are presented graphically in Figure 3A and for the interactions in Figure 3B. With a distractor speaker, reconstruction of pitch

Detecting lexical boundaries is important for chunking the continuous speech stream into meaningful, interpretable units. As a last step, we therefore investigated how face masks impair the reconstruction of phoneme and word onsets. For phoneme onsets, we found significant main effects on reconstruction accuracies for the factor Distractor (F(1,28) = 187.81, p < .001, ηp² = .87) and Mask (F(1,28) = 16.63, p < .001, ηp² = .37), as well as a strong significant interaction of Mask and Distractor (F(1,28) = 10.75, p = .003, ηp² = .28). Similar results were found for word onset reconstruction accuracies, with significant main effects of the Distractor (F(1,28) = 278.19, p < .001, ηp² = .91), Mask (F(1,28) = 19.95, p < .001, ηp² = .42) and the interaction (F(1,28)

In order to rule out that our findings were influenced by major acoustic differences between mask-wearing speakers and speakers without masks, we computed paired-sample t-tests for pitch and formants. The features (pitch and formants) extracted from the 40 stimuli with no mask were compared to their counterparts with speakers wearing a mask. Mean frequencies split up by sex of speaker are depicted in Table S2 (see Supplementary Material). Results showed no significant mean difference for pitch (t(39) = .79, p = .43, g = .02) or the first formant (F1) (t(39) = 1.90, p = .06, g = .14). We noted strong effects for a reduction of the mean frequency of the second formant (F2) (t(39) = 11.89, p < .001, g = .26), the third formant (F3) (t(39) = 14.01, p < .001, g = .26) and the averaged second and third formant (Formant 2/3 or F2/3) (t(39) = 14.98, p < .001, g = .26) in stimuli with face masks compared to no face masks. However, in absolute terms the differences are relatively small (Hedges' g of around .2).
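A minimal Python sketch of this acoustic stimulus check, i.e. a paired t-test plus Hedges' g per spectral feature (the original analysis was run in Matlab with the Measures of Effect Size toolbox); the per-stimulus F2/3 values below are simulated placeholders.

```python
# Sketch: paired comparison of a spectral feature (here F2/3) between the 40
# no-mask stimuli and their mask counterparts, with Hedges' g as effect size.
# The values are simulated placeholders, not the actual stimulus measurements.
import numpy as np
from scipy.stats import ttest_rel
import pingouin as pg

rng = np.random.default_rng(2)
f23_no_mask = rng.normal(2200, 150, size=40)            # mean F2/3 per stimulus, no mask (Hz)
f23_mask = f23_no_mask - rng.normal(30, 10, size=40)    # counterparts with a face mask (Hz)

t, p = ttest_rel(f23_no_mask, f23_mask)                 # paired-sample t-test
g = pg.compute_effsize(f23_no_mask, f23_mask,
                       paired=True, eftype="hedges")    # Hedges' g
print(f"t(39) = {t:.2f}, p = {p:.3g}, g = {g:.2f}")
```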
The adverse effect of face masks on speech comprehension has been investigated in various studies on a behavioural level (Brown et al., 2021; Giovanelli et al., 2021; Rahne et al., 2021; Toscano & Toscano, 2021; Yi et al., 2021). Despite the overall agreement on the adverse effects of face masks on speech comprehension, it has remained unclear which features of speech processing are specifically affected. Our results show that the tracking of features responsible for successful processing of naturalistic speech is impaired by (surgical) face masks. From general temporal modulations of the speech envelope to modulations of spectral fine details (pitch and formants) and the segmentation of speech (phoneme and word onsets), a face mask significantly reduces the decodability of these features from brain data. However, not all these speech features are affected by the face mask in the same way. While the brain's tracking of low-level acoustic features (i.e. the speech envelope and spectral fine details) is affected generally, the higher-level segmentational features phoneme onset and word onset show a particularly strong reduction of reconstruction accuracy through face masks when facing a challenging listening situation (i.e. with a distractor speaker). We used surgical face masks in our study, which have a small influence on speech acoustics and attenuate only higher frequencies above 3 kHz (Corey et al., 2020; Toscano & Toscano, 2021). We therefore attribute these interactions mostly to the missing visual input. Moreover, the investigated spectral fine details (namely pitch and formants) present themselves at frequencies below 3 kHz (Peterson & Barney, 1952). Furthermore, we found only small acoustic differences between stimuli with and without a face mask.

The speech envelope, mostly associated with conveying syntactic and phonetic information (Giraud & Poeppel, 2012; Poeppel & Assaneo, 2020), has been deemed a core characteristic regarding speech tracking (Brodbeck & Simon, 2020). In multi-speaker listening situations, attending to the target speaker is related to enhanced tracking of the envelope of the attended speech compared to the unattended speech (Park et al., 2016). A reduced tracking of this characteristic might represent a difficulty in following and segmenting the target speech stream when confronted with face masks. Nonetheless, it is important to note that the speech envelope does not convey specific information regarding certain phonetic objects, like vowels. Formants, on the other hand, define vowels directly (Peterson & Barney, 1952). While the first (F1) and second (F2) formants are generally considered the core formants in speech (Peterson & Barney, 1952), using an averaged F2 and F3 (F2/3) instead of F2 has proven beneficial, as it smooths transitions from one vowel to the other (Stevens, 2000) and because of their convergence in the front cavity (Badin et al., 1990). With regard to visual speech tracking, the frequencies encompassed by F2 and F3 correlate strongly with lip movements (Chandrasekaran et al., 2009), so that these frequencies likely contribute to a visuo-phonological transformation (cf. Hauswald et al., 2018). While Hauswald et al. (2018) proposed a role of the visually conveyed envelope information for a visuo-phonological transformation, a study by our group further suggests that the visually transported formant information also contributes to such a transformation (Suess et al., 2021).
Further highlighting their importance, a recent study (Plass et al., 2020) demonstrated even stronger crossmodal AV enhancement for formant frequencies than for the speech envelope. Finally, the reconstruction of voice pitch or fundamental frequency, used to segregate concurrent speech streams (Bregman, 1990), is also reduced through face masks, which might lead to difficulties in disentangling the target speech stream and the distractor speech stream. Taking the effects face masks have on the envelope, pitch and formants together, face coverings might lead to subsequent difficulties in identifying phonemes and, as a consequence, also words.

Tracking of phoneme and word onsets is affected in such a way that face masks impair chunking especially strongly in challenging listening situations. Studies investigating simple event-related potentials (ERPs) during listening to continuous speech found enhanced responses to word onsets (Sanders et al., 2002; Sanders & Neville, 2003), pointing to an internal chunking mechanism of the brain for optimal speech processing. On a lower level, brain responses induced by phoneme onsets are reliably predicted by encoding models (Brodbeck et al., 2018; Daube et al., 2019; Di Liberto et al., 2015), implying chunking already at this level. Our findings suggest that, when deprived of visual cues (through face masks) and in noisy acoustic environments, individuals have problems segmenting the continuous speech stream into meaningful units (i.e. words and phonemes). As the speech envelope is associated with conveying syntactic information (Giraud & Poeppel, 2012), reduced detail of this feature can cause difficulties in identifying semantically meaningful units. Furthermore, formant frequencies might also play an important role in detecting syllables and, more importantly, phonemes and their boundaries (Plass et al., 2020). To compensate for this degradation in challenging listening situations, watching the speaker's face provides important information for word segmentation (Mitchel & Weiss, 2014). Highlighting this even further, visual cues from the mouth area have been found to enhance phonetic discrimination by providing visemic information (Fisher, 1968). Taken together, depriving listeners of these visual cues by covering the mouth affects an important step of unit identification (words and phonemes), which helps chunk the stream for further processing.

With this study, we put past findings about AV speech processing and tracking into the context of face masks. Expectations about the influence of face masks on speech characteristics were confirmed in the sense that compensating visual information partly makes up for the degraded speech acoustics in difficult listening situations. This effect shows up in higher-level features of speech segmentation (i.e. phoneme onsets and word onsets) in the form of an interaction between the face mask and the distractor speaker, while the reconstruction of acoustic information is impaired generally. One possible mechanism compensating in noisy environments is a visuo-phonological transformation from visual input to a phonetic representation in the range of F2 and F3, which is, however, not possible when speakers wear a face mask. Regarding our behavioural results, we observed significantly decreased performance through a distractor speaker, but not through the face mask.
This is in line with previous findings on audio-only speech (Toscano & Toscano, 2021), which found no significant effect of surgical face masks on word recognition in easy and challenging listening situations. However, another study with audiovisual speech found significant effects of a surgical face mask on sentence intelligibility in conditions of low (-3 dB SNR) and high (-9 dB SNR) background pink noise (Brown et al., 2021). As our study used longer-duration audiobooks, our behavioural measurements might not have been precise enough (i.e. only two binary unstandardised 'true or false' statements regarding semantic comprehension at the end of each trial) to detect this influence. We also found that subjective ratings of listening difficulty were significantly higher when speakers wore a face mask, independent of a distractor speaker. An explanation for this is that removing congruent visual cues increases linguistic ambiguity, resulting in more effortful mental correction by the listener. This increased effort might at the same time be negating the negative influence on the aforementioned comprehension performance (Winn & Teece, 2021). Despite comparable speech comprehension performance with and without masks, increased listening effort has been found to be associated with social withdrawal in the hearing-impaired population (Hughes et al., 2018) and should not be disregarded. Still, our behavioural results contradict previous findings, which showed an effect of face masks on listening effort only when combined with background noise (Brown et al., 2021). Again, differences in study design (one-minute audiobooks vs. single sentences) may account for this difference. Furthermore, only participants without hearing impairment participated in our study. It has been shown that individuals with hearing loss profit significantly more from visual input (Puschmann et al., 2019), in terms of behavioural performance as well as neural tracking of speech features. For future studies, we expect even stronger effects of face masks for individuals with hearing impairments compared to normal-hearing persons, already occurring in situations with only a single speaker.

Following our findings, the use of transparent face masks, especially in critical settings like medical consultations, where poor communication can lead to poorer medical outcomes (Chodosh et al., 2020), is in principle favourable. However, some of the current transparent models come with significantly reduced transmission of acoustic detail (Corey et al., 2020), resulting in reduced intelligibility and increased difficulty ratings when presented in noisy environments (Brown et al., 2021). It is important to consider that this study investigated normal-hearing subjects, and results for individuals with hearing loss might be different. In line with this notion, data collected before the Covid-19 pandemic suggest strong benefits of transparent face masks for listeners with hearing loss (Atcherson et al., 2017). A recent study confirms this by comparing the impact of surgical face masks to (transparent) face shields (Homans & Vroegop, 2021). Despite the face shield's larger impact on acoustics compared to the surgical face mask, individuals with hearing loss showed no significant decrease in speech intelligibility when confronted with a face shield compared to no facial covering, while scores were significantly worse when a surgical face mask was worn.
Unfortunately, face shields are not a good option with regard to virus transmission, as they provide essentially no blocking of cough aerosols (Lindsley et al., 2021). Therefore, a transparent face mask which ensures that acoustic details are not lost may be the best solution for reducing mask-related speech processing costs.

With this study, we showed that face masks impair the tracking of several important speech features in normal-hearing individuals. Face masks show a more general impairment of the tracking of spectral features such as pitch and F2/3 and of speech envelope modulations, while higher-level segmentational features are especially impaired by face masks in difficult listening situations. We attribute these findings to missing visual information, which could facilitate speech tracking via a visuo-phonological transformation, partly compensating for the noisy acoustic input. Our behavioural results point towards a generally increased listening effort through face masks, which can lead to social withdrawal, particularly for hearing-impaired individuals. Following our findings, the use of acoustics-preserving transparent face masks can substantially reduce listening-related difficulties and should be considered when dealing with the hearing impaired. However, as we only tested normal-hearing individuals, future research should focus on hearing-impaired individuals.

Table S1. Results of main effects and interactions for reconstruction accuracy. Note. Bold letters indicate significant effects.

Table S2. Mean frequencies of spectral fine details. Note. Numbers in parentheses denote the standard deviation. Bold letters indicate significant differences. For descriptive purposes, the mean frequencies are split up for male and female speakers. The statistical analysis was performed using paired-sample t-tests comparing features (pitch and formants) extracted from the 40 stimuli with no mask to their counterparts with speakers wearing a mask.

References
The Effect of Conventional and Transparent Surgical Masks on Speech Understanding in Individuals with and without Hearing Loss
Vocalic nomograms: Acoustic and articulatory considerations upon formant convergences
Electrophysiological (EEG, sEEG, MEG) evidence for multiple audiovisual interactions in the human auditory cortex
Bimodal speech: Early suppressive visual effects in human auditory cortex
PRAAT, a system for doing phonetics by computer
The Psychophysics Toolbox
Auditory Scene Analysis: The Perceptual Organization of Sound
Rapid Transformation from Auditory to Linguistic Representations of Continuous Speech
Continuous speech processing
Face mask type affects audiovisual speech intelligibility and subjective listening effort in young and older adults
How much COVID-19 face protections influence speech intelligibility in classrooms? Applied Acoustics
The Natural Statistics of Audiovisual Speech
Face masks can be devastating for people with hearing loss
Acoustic effects of medical, cloth, and transparent face masks on speech signals
Congruent Visual Speech Enhances Cortical Entrainment to Continuous Auditory Speech in Noise-Free Conditions
The Multivariate Temporal Response Function (mTRF) Toolbox: A MATLAB Toolbox for Relating Neural Signals to Continuous Stimuli
Eye Can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration
Linear Modeling of Neurophysiological Responses to Naturalistic Stimuli: Methodological Considerations for Applied Research
Simple Acoustic Features Can Explain Phoneme-Based Predictions of Cortical Responses to Speech
Praat script to detect syllable nuclei and measure speech rate automatically
Low-Frequency Cortical Entrainment to Speech Reflects Phoneme-Level Processing
Cortical entrainment to continuous speech: Functional roles and interpretations
Confusions Among Visually Perceived Consonants
Unmasking the Difficulty of Listening to Talkers With Masks: Lessons from the COVID-19 pandemic
Cortical oscillations and speech processing: Emerging computational principles and operations
Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a "Cocktail Party"
An Introduction to the Objective Psychophysics Toolbox
A Visual Cortical Network for Deriving Phonological Information from Intelligible Lip Movements
Computation of measures of effect size for neuroscience data sets
The impact of face masks on the communication of adults with hearing loss during COVID-19 in a clinical setting
Social Connectedness and Perceived Listening Effort in Adult Cochlear Implant Users: A Grounded Theory to Establish Content Validity for a New Patient-Reported Outcome Measure
Multilingual processing of speech via web services
What's new in psychtoolbox-3
Efficacy of face masks, neck gaiters and face shields for reducing the expulsion of simulated cough-generated aerosols
Visual speech segmentation: Using facial cues to locate word boundaries in continuous speech
FieldTrip: Open Source Software for Advanced Analysis of MEG, EEG, and Invasive Electrophysiological Data. Computational Intelligence and Neuroscience
Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG
Lip movements entrain the observers' low-frequency brain oscillations to facilitate speech intelligibility. ELife, 5, e14521
Prediction and constraint in audiovisual speech perception
The VideoToolbox software for visual psychophysics: Transforming numbers into movies
Control Methods Used in a Study of the Vowels
Vision perceptually restores auditory spectral dynamics in speech
Speech rhythms and their neural foundations
Hearing-impaired listeners show increased audiovisual benefit when listening to speech in noise
Influence of face surgical and N95 face masks on speech perception and listening effort in noise
An ERP study of continuous speech processing: I. Segmentation, semantics, and syntax in native speakers
Segmenting nonsense: An event-related potential index of perceived onsets in continuous speech
Automatic Phonetic Transcription of Non-Prompted Speech
Chimaeric sounds reveal dichotomies in auditory perception
Acoustic Phonetics
Age-related decreases of cortical visuo-phonological transformation of unheard spectral fine-details
Visual Contribution to Speech Intelligibility in Noise
Effects of face masks on speech recognition in multi-talker babble noise
Pingouin: Statistics in Python
Methods for first-order kernel estimation: Simple-cell receptive fields from responses to natural scenes
Listening Effort Is Not the Same as Speech Intelligibility Score
FormantPro as a Tool for Speech Analysis and Segmentation / FormantPro como uma ferramenta para a análise e segmentação da fala
The adverse effect of wearing a face mask during the COVID-19 pandemic and benefits of wearing transparent face masks and using clear speech on speech intelligibility

This work is supported by the Austrian Science Fund, P34237 ("Impact of face masks on speech comprehension"). Sound icon made by Smashicon from www.flaticon.com. Thanks to the whole research team. Special thanks to Fabian Schmidt for providing support for the early days of PhD life and for graphical design. The authors have declared no competing interest.