key: cord-0426376-dzxvjkeh
authors: Polonenko, Melissa J; Maddox, Ross K
title: Exposing distinct subcortical components of the auditory brainstem response evoked by continuous naturalistic speech
date: 2020-08-20
journal: bioRxiv
DOI: 10.1101/2020.08.20.258301
sha: 7fec5e6c5d95551e3adcde15445515c420356010
doc_id: 426376
cord_uid: dzxvjkeh

The auditory brainstem is important for processing speech, yet we have much to learn regarding the contributions of different subcortical structures. These deep neural generators respond quickly, making them difficult to study during dynamic, ongoing speech. Recently developed techniques have paved the way to use natural speech stimuli, but the auditory brainstem responses (ABRs) they provide are temporally broad and thus have ambiguous neural sources. Here we describe a new method that uses re-synthesized "peaky" speech stimuli and deconvolution analysis of EEG data to measure canonical ABRs to continuous naturalistic speech of male and female narrators. We show that in adults with normal hearing, peaky speech quickly yields robust ABRs that can be used to investigate speech processing at distinct subcortical structures from auditory nerve to rostral brainstem. We further demonstrate the versatility of peaky speech by simultaneously measuring bilateral and ear-specific responses across different frequency bands. Thus, the peaky speech method holds promise as a powerful tool for investigating speech processing and for clinical applications.

Understanding speech is an important, complex process that spans the auditory system from cochlea to cortex. A temporally precise network transforms the strikingly dynamic fluctuations in amplitude and spectral content of natural, ongoing speech into meaningful information, and modifies that information based on attention or other priors (Mesgarani et al., 2009).
Subcortical structures play a critical role in this process: they do not merely relay information from the periphery to the cortex, but also perform important functions for speech understanding, such as localizing sound (e.g., Grothe and Pecka, 2014) and encoding vowels across different levels and in background noise (e.g., Carney et al., 2015). Furthermore, subcortical structures receive descending information from the cortex through corticofugal pathways (Bajo et al., 2010; Bajo and King, 2012; Winer, 2005), suggesting they may also play an important role in modulating speech and auditory streaming. Given the complexity of speech processing, it is important to parse and understand contributions from different neural generators. However, these subcortical structures are deep and respond to stimuli with very short latencies, making them difficult to study during ecologically salient stimuli such as continuous and naturalistic speech. We created a novel paradigm aimed at elucidating the contributions of distinct subcortical structures to ongoing, naturalistic speech.

Activity in deep brainstem structures can be "imaged" by the latency of waves in a surface electrical potential (electroencephalography, EEG) called the auditory brainstem response (ABR). The ABR's component waves have been attributed to activity in different subcortical structures with characteristic latencies: the auditory nerve contributes to waves I and II (~1.5-3 ms), the cochlear nucleus to wave III (~4 ms), the superior olivary complex and lateral lemniscus to wave IV (~5 ms), and the lateral lemniscus and inferior colliculus to wave V (~6 ms) (Møller and Jannetta, 1983; review by Moore, 1987; Starr and Hamilton, 1976). Waves I, III, and V are the most easily distinguished in the human response.
Subcortical structures may also contribute to the earlier P0 (12-14 ms) and Na (15-25 ms) waves (Hashimoto, 1982; Kileny et al., 1987; Picton et al., 1974) of the middle latency response (MLR), which are then followed by thalamo-cortically generated waves Pa, Nb, and Pb/P1 (Geisler et al., 1958; Goldstein and Rodman, 1967). ABR and MLR waves have a low signal-to-noise ratio (SNR) and require numerous stimulus repetitions to record a good response. Furthermore, they are quick and often occur before the stimulus has ended. Therefore, out of necessity, most human brainstem studies have focused on brief stimuli such as clicks, tone pips, or speech syllables, rather than more natural speech.

Recent analytical techniques have overcome these limitations on stimuli, allowing continuous, naturally uttered speech to be used. One such technique extracts the fundamental waveform from the speech stimulus and finds the envelope of the cross-correlation between that waveform and the recorded EEG data (Forte et al., 2017). The response has an average peak time of about 9 ms, with contributions primarily from the inferior colliculus (Saiz-Alia and Reichenbach, 2020). A second technique considers the rectified broadband speech waveform as the input to a linear system and the EEG data as the output, and uses deconvolution to compute the ABR waveform as the impulse response of the system (Maddox and Lee, 2018). The speech-derived ABR shows a wave V peak whose latency is highly correlated with the click response wave V across subjects, demonstrating that the component is generated in the rostral brainstem. A third technique averages responses to each chirp (a click-like transient that quickly increases in frequency) in re-synthesized "cheech" stimuli (CHirp spEECH; Miller et al., 2017), which interleave alternating octave frequency bands of speech and of chirps aligned with some glottal pulses (Backer et al., 2019).
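The core computation of the second (deconvolution) technique can be sketched in a few lines. This is a minimal illustration, not the authors' code: a regressor x (the rectified speech waveform here; a glottal pulse train later in this paper) is treated as the input to a linear system whose output is the EEG y, and the impulse response is recovered by frequency-domain least squares. The function name, window limits, and the small regularization constant are illustrative assumptions.

```python
import numpy as np

def deconvolve_abr(x, y, fs, t_min=-0.2, t_max=0.6):
    """Estimate the impulse response (the ABR) of a linear system with
    regressor x as input and EEG y as output.

    Frequency-domain division of the cross-spectrum by the regressor's
    power spectrum; for a long stationary regressor this is equivalent
    to ordinary least squares."""
    n = len(x)
    X = np.fft.rfft(x)
    Y = np.fft.rfft(y)
    # Small constant (illustrative) guards against division by ~0 bins.
    h = np.fft.irfft(X.conj() * Y / (np.abs(X) ** 2 + 1e-12), n=n)
    # Extract lags from t_min to t_max (negative lags wrap to the end).
    i_min, i_max = int(t_min * fs), int(t_max * fs)
    lags = np.arange(i_min, i_max) / fs
    if i_min < 0:
        w = np.concatenate([h[i_min:], h[:i_max]])
    else:
        w = h[i_min:i_max]
    return lags, w
```

With a pulse-train regressor and EEG that is exactly a filtered version of it, the recovered waveform peaks at the filter's latency, which is the sense in which the ABR is the system's impulse response.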
Brainstem responses to these stimuli also show a wave V, but do not show earlier waves unless presented monaurally over headphones (Backer et al., 2019; Miller et al., 2017). While these methods reflect subcortical activity, the first two provide temporally broad responses with a lack of specificity regarding the underlying neural sources. None of the three methods shows the earlier canonical components, such as waves I and III, that would allow rostral brainstem activity to be distinguished from, for example, the auditory nerve. Such activity is important to assess, especially given the current interest in the potential contributions of auditory nerve loss to disordered processing of speech in noise (Bramhall et al.).

We asked if we could further assess underlying speech processing in multiple distinct early stages of the auditory system by 1) evoking additional waves beyond wave V of the canonical ABR, and 2) measuring responses to different frequency ranges of speech (corresponding to different places of origin on the cochlea). The ABR is strongest to very short stimuli such as clicks, so we created "peaky" speech. The design goal of peaky speech is to re-synthesize natural speech so that its defining spectrotemporal content is unaltered but its pressure waveform consists of maximally sharp peaks, so that it drives the ABR as effectively as possible. The results show that peaky speech evokes canonical brainstem responses and frequency-specific responses, paving the way for novel studies of subcortical contributions to speech processing.

RESULTS

Broadband peaky speech yields more robust responses than unaltered speech

Broadband peaky speech elicits canonical brainstem responses

In previous work, brainstem responses to natural, ongoing speech exhibited a temporally broad wave V but no earlier waves (Maddox and Lee, 2018).
We re-synthesized speech to be "peaky" with the primary aim of evoking additional, earlier waves of the ABR that identify different neural generators. Indeed, Figure 1 shows that waves I, III, and V of the canonical ABR are clearly visible in the group average and in the individual responses to broadband peaky speech. This means that broadband peaky speech, unlike unaltered speech, can be used to assess naturalistic speech processing at discrete parts of the subcortical auditory system, from the auditory nerve to the rostral brainstem. These responses represent weighted averaged data from ~43 minutes of continuous speech (40 epochs of 64 s each), and were filtered with a typical high-pass cutoff of 150 Hz to highlight the earlier ABR waves.

Morphology of the broadband peaky speech ABR was inspected and waves were marked by a trained audiologist (MJP) on 2 occasions 3 months apart. The intraclass correlation coefficients for absolute agreement (ICC3) were ≥ 0.91 (the lowest ICC3 95% confidence interval was 0.78-0.96 for wave I, p < 0.01), indicating excellent reliability of the chosen peak latencies. Waves I and V were identifiable in responses from all subjects (N = 22), and wave III was clearly identifiable in 16 of the 22 subjects. These waves are marked on the individual responses in Figure 1. Mean ± SEM peak latencies for ABR waves I, III, and V were 2.95 ± 0.10 ms, 5.11 ± 0.09 ms, and 6.96 ± 0.07 ms respectively. These mean peak latencies are shown superimposed on the group average response in Figure 1 (bottom right). Inter-wave latencies were 2.13 ± 0.05 ms (N = 16) for I-III, 1.78 ± 0.06 ms (N = 16) for III-V, and 4.01 ± 0.07 ms (N = 22) for I-V.
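The reliability statistic used here follows from a simple two-way ANOVA decomposition of the wave-by-occasion rating matrix. The sketch below computes ICC(3,1) directly from that decomposition; it is an illustration under the assumption of an n-waves-by-k-occasions latency matrix, not necessarily the implementation the authors used (a stats package would give the same number).

```python
import numpy as np

def icc3(ratings):
    """ICC(3,1): two-way mixed-effects, single-rater consistency.

    `ratings` is an (n_targets, k_occasions) array, e.g. peak latencies
    for n waves marked on k occasions."""
    r = np.asarray(ratings, float)
    n, k = r.shape
    grand = r.mean()
    ss_rows = k * np.sum((r.mean(axis=1) - grand) ** 2)    # between targets
    ss_cols = n * np.sum((r.mean(axis=0) - grand) ** 2)    # between occasions
    ss_err = np.sum((r - grand) ** 2) - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

Because ICC(3,1) removes the occasion (column) effect, a constant marking offset between the two sessions does not lower the coefficient; only inconsistent wave-by-wave disagreement does.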
These peak inter-wave latencies fall within the range expected for brainstem responses, but the absolute peak latencies were later than those reported for a click ABR at a level of 60 dB sensation level (SL) and rates between 50 and 100 Hz (Burkard and Hecox, 1983; Chiappa et al., 1979; Don et al., 1977).

To compare both ABR and MLR components of responses to unaltered and broadband peaky speech, a high-pass filter with a 30 Hz cutoff was used on the responses to ~43 minutes of each type of speech. Figure 2A shows that overall there were morphological similarities between responses to both types of speech; however, there were more early and late component waves in response to broadband peaky speech. More specifically, whereas both types of speech evoked waves V, Na, and Pa, broadband peaky speech also evoked waves I, often III (14-16 of 22 subjects, depending on whether a 30 or 150 Hz high-pass filter cutoff was used), and P0. With the lower high-pass cutoff, wave III rode on the slope of wave V and was less identifiable in the grand average shown in Figure 2A than with the higher cutoff in Figure 1. Wave V was more robust and sharper for broadband peaky speech but peaked slightly later than the broader wave V for unaltered speech. For reasons unknown to us, the half-rectified speech method missed the MLR wave P0, and consequently had a broader and earlier Na than the broadband peaky speech method, though this missing P0 was consistent with the results of Maddox and Lee (2018). These waveforms indicate that broadband peaky speech is better than unaltered speech at evoking canonical responses that distinguish activity from distinct subcortical and cortical neural generators.

Peak latencies for the waves common to both types of speech are shown in Figure 2B. Again, there was good agreement in peak wave choices for each type of speech, with ICC3 ≥ 0.94 (the lowest two ICC3 95% confidence intervals were 0.87-0.98 and 0.92-0.99 for waves V and Na to unaltered speech respectively).
As suggested by the waveforms in Figure 2A, mean peak latencies for waves V, Na, and Pa were longer for broadband peaky than for unaltered speech.

The group average waveforms for female- and male-narrated broadband peaky speech showed similar canonical morphologies, but the female-narrated ABRs were smaller and later (Figure 3A), much as they would be for click stimuli presented at higher rates (e.g., Burkard et al., 1990; Figure 3B).

To determine whether this stimulus dependence was significantly different from the variability introduced by averaging only half the epochs (i.e., splitting by male- and female-narrated epochs), we reanalyzed the data split into even and odd epochs. Each of the even/odd splits contained the same number of male- and female-narrated epochs, and these were evenly distributed over the entire recording session. The median (interquartile range) odd-even correlation coefficients were 0.86 (0.80-0.95) for ABR lags and 0.47 (0.36-0.80) for ABR/MLR lags. These odd-even coefficients were significantly higher than the male-female coefficients for the ABR (W(10) = 0.0, p < 0.001; Wilcoxon signed-rank test) but not when the response included the MLR (W(10) = 18.0, p = 0.206), indicating that the choice of narrator for peaky speech impacts the morphology of the early response.

As expected from the waveforms, peak latencies of component waves differed between male- and female-narrated broadband peaky speech (Figure 3C). As before, ICC3 ≥ 0.83 indicated good agreement in peak wave choices.

Not only are brainstem responses used to evaluate processing at different stages of the auditory system; ABRs can also be used to assess hearing function across different frequencies. Traditionally, frequency-specific ABRs are measured using clicks with high-pass masking noise or frequency-specific tone pips.
We tested the flexibility of our new peaky speech technique for investigating how speech processing differs across frequency regions, such as the 0-1, 1-2, 2-4, and 4-8 kHz frequency bands. To do this, we created new pulse trains with slightly different fundamental waveforms for each filtered frequency region of speech, and then combined those filtered frequency bands together as multiband speech (for details, see the Multiband peaky speech subsection of Methods). Using this method, we took advantage of the fact that, over time, stimuli with slightly different fundamental frequencies will be independent, yielding independent auditory brainstem responses. Therefore, the same EEG was regressed with each band's pulse train to derive the ABR and MLR to each frequency band.

Mean ± SEM responses from 22 subjects to the 4 frequency bands (0-1, 1-2, 2-4, and 4-8 kHz) of ~43 minutes of male-narrated multiband peaky speech are shown as colored waveforms with solid lines in Figure 4A. A high-pass filter with a cutoff of 30 Hz was used. Each frequency band response comprises a frequency-band-specific component as well as a band-independent common component, both of which are due to spectral characteristics of the stimuli and neural activity. The pulse trains are independent over time in the vocal frequency range, thereby allowing us to pull out responses to each different pulse train and frequency band from the same EEG, but they become coherent at frequencies lower than 72 Hz for the male-narrated speech and 126 Hz for the female speech (see Figure 13 in Methods). This coherence was due to all pulse trains beginning and ending together at the onset and offset of voiced segments, and was the source of the low-frequency common component of each band's response. To remove the common component, there are two options.
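The band-specific pulse trains can be illustrated with a toy phase accumulator: each band receives a slightly scaled copy of the fundamental frequency trajectory, so its pulses drift relative to the other bands' pulses and the trains decorrelate over time. The flat f0 and the scale factors below are hypothetical; the actual stimuli were built from the narrator's time-varying f0 (see Methods).

```python
import numpy as np

def pulse_train(f0, fs, scale=1.0):
    """Pulse train whose instantaneous rate is scale * f0 (f0 in Hz,
    sampled at fs). A pulse is placed each time the accumulated phase
    crosses an integer number of cycles (one pulse per glottal cycle)."""
    phase = np.cumsum(scale * f0 / fs)  # fundamental phase, in cycles
    train = np.zeros_like(f0)
    train[np.diff(np.floor(phase), prepend=0.0) >= 1] = 1.0
    return train

fs = 10_000
f0 = np.full(fs, 120.0)                # 1 s of a flat 120 Hz fundamental
# Slightly different rates per band (hypothetical offsets) keep the
# trains independent over time, so each can serve as its own regressor.
scales = [1.00, 1.01, 1.02, 1.03]
trains = [pulse_train(f0, fs, s) for s in scales]
```

Each train would then be used, exactly as in the broadband case, as the deconvolution regressor against the same EEG to recover that band's response.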
First, we could simply high-pass filter the response at 150 Hz to remove the regions of spectral coherence in the stimuli, as shown by the waveforms with dashed lines in Figure 4B. However, this method reduces the amplitude of the responses, which in turn affects response SNR and detectability. The second option is to calculate the common activity across the frequency band responses and subtract this waveform from each of them. This common component was calculated by regressing the EEG to multiband speech with 6 independent "fake" pulse trains (pulse trains with slightly different fundamental frequencies that were not used to create the multiband peaky speech stimuli presented during the experiment) and then averaging across these 6 responses. This common component waveform is shown by the dot-dashed gray line, superimposed on each frequency band response in Figure 4A. The subtracted, frequency-specific waveforms for each frequency band are shown by the solid lines in Figure 4B. Of course, the subtracted waveforms could also be high-pass filtered at 150 Hz to highlight the earlier waves of the brainstem responses, as shown by the dashed lines in Figure 4B. Overall, the frequency-specific responses showed characteristic ABR and MLR waves with longer latencies for lower frequency bands, as would be expected of responses arising from different cochlear regions. Also, waves I and III of the ABR were visible in the group average waveforms of the 2-4 kHz (≥ 41% of subjects) and 4-8 kHz (≥ 86% of subjects) bands, whereas the MLR waves were more prominent in the 0-1 kHz (≥ 95% of subjects) and 1-2 kHz (≥ 54% of subjects) bands.

These frequency-dependent latency changes for the frequency-specific responses are highlighted further in Figure 4C, which shows mean ± SEM peak latencies and the number of subjects who had a clearly identifiable wave.
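The second option amounts to averaging the responses derived from the never-presented "fake" pulse trains and removing that average from each band's response. A minimal sketch (array names are illustrative):

```python
import numpy as np

def frequency_specific(band_resps, fake_resps):
    """Remove the band-independent common component from band responses.

    band_resps: (n_bands, n_lags) responses derived from the real pulse
        trains, each containing band-specific + common activity.
    fake_resps: (n_fake, n_lags) responses derived from "fake" pulse
        trains that were never presented; their average estimates the
        activity shared across bands (driven by common voiced
        onsets/offsets), with no band-specific part.
    Returns (frequency-specific responses, common component)."""
    common = np.mean(fake_resps, axis=0)
    return band_resps - common, common
```

Averaging over several fake trains reduces the noise in the common-component estimate before it is subtracted, which is why 6 independent fake trains were used rather than one.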
ICC3 ≥ 0.89 indicated good agreement in peak wave choices (the lowest two 95% confidence intervals were 0.82-0.93 for Pa and 0.88-0.95 for Na). The nonlinear change in peak latency with frequency band was modeled using mixed effects regression, including orthogonal linear and quadratic terms for frequency band and their interactions with wave, as well as random effects of intercept and each frequency band term for each subject. A model was completed for each filter cutoff of 30 and 150 Hz. There were insufficient numbers of subjects with identifiable waves I and III for the 0-1 kHz and 1-2 kHz bands, so these waves were not included in the full model. Details of each model are described in Supplemental Table 1.

Next, the frequency-specific responses (i.e., multiband responses with the common component subtracted) were summed and the common component added back to derive the entire response to multiband peaky speech. As shown in Figure 5, this summed response was similar to the response to broadband peaky speech, suggesting that the frequency bands did not interact to a substantial degree. The similarity also verified that the additional changes we made to create re-synthesized multiband peaky speech did not significantly affect responses compared to broadband peaky speech.

Frequency-specific responses also differ by narrator

We also investigated the effects of narrator on multiband peaky speech by deriving responses to 32 minutes (30 epochs of 64 s each) each of male- and female-narrated multiband peaky speech in the same 11 subjects. As with broadband peaky speech, responses to both narrators showed similar morphology, but the responses were smaller and the MLR waves more variable for the female than the male narrator (Figure 6A). Figure 6B shows the correlations between responses to the two narrators.

As expected from the grand average waveforms and male-female correlations, there were fewer subjects with identifiable waves across frequency bands for female- than for male-narrated speech.
These numbers are shown in Figure 6C, along with the mean ± SEM peak latencies for each wave, frequency band, and narrator. ICC3 ≥ 0.93 indicated good agreement in peak latency choices (the two lowest 95% confidence intervals were 0.83-0.97 for wave III and 0.98-0.99 for Pa, both with a high-pass filter cutoff of 30 Hz). Again, there were few subjects with identifiable waves I and III in the lower frequency bands. Therefore, the mixed effects model was completed for waves V, P0, Na, and Pa of responses in the 4 frequency bands that were high-pass filtered at 30 Hz. The model included fixed effects of narrator, wave, linear and quadratic terms for frequency band, the interaction between narrator and wave, and the interactions between wave and the frequency band terms, as well as a random intercept and random frequency band terms per subject. Details of the model are described in Supplemental Table 2.

For those subjects with identifiable waves, the peak latencies shown in Figure 6C differed by wave (p < 0.001 for the effects of each wave on the intercept), and latency decreased with increasing frequency band (p < 0.001 for the linear term, slope). This change with frequency was greater (i.e., a steeper slope) for each MLR wave compared to wave V (p < 0.013 for all interactions between wave and the linear frequency band term). There was no change in slope across bands (p = 0.190 for the quadratic term, p > 0.318 for the interactions between wave and the quadratic term). There was also no main effect of narrator on peak latencies (narrator p = 0.481), except that latencies for Pa were earlier for the female than the male narrator (Pa-narrator interaction p = 0.003; other wave-narrator interactions p > 0.195).
Therefore, as with broadband peaky speech, frequency-specific peaky responses were more robust with the male narrator; but unlike the broadband responses, the frequency-specific responses did not peak earlier for the narrator with the lower fundamental frequency.

The focus so far has been on scientific uses of peaky speech, but there are also potential clinical applications. Frequency-specific ABRs to tone pips are traditionally used to assess hearing function in each ear across octave bands with center frequencies of 500-8000 Hz. Applying the same principles used to generate multiband peaky speech, we investigated whether ear-specific responses could be evoked across 5 standard, clinically relevant (audiological) frequency bands using dichotic multiband speech. For dichotic (stereo) audiological multiband peaky speech we created 10 independent pulse trains, 2 for each ear in each of the 5 frequency bands (see Multiband peaky speech and Band filters in Methods).

We recorded responses to 64 minutes (60 epochs of 64 s each) each of male- and female-narrated dichotic multiband peaky speech in 11 subjects. The frequency-specific (i.e., common-component-subtracted) group average waveforms for each ear and frequency band are shown in Figure 7A. The ten waveforms were small, especially for female-narrated speech, but a wave V was identifiable for both narrators. MLR waves were not clearly identifiable for responses to female-narrated speech. Therefore, correlations between responses were performed for ABR lags between 0 and 15 ms.

Figure 7C shows the mean ± SEM peak latencies of wave V for each ear and frequency band for the male- and female-narrated dichotic multiband peaky speech. The ICC3 for wave V was 0.98 (95% confidence interval 0.98-0.99), indicating reliable peak latency choices.
The nonlinear change in wave V latency with frequency was modeled using mixed effects regression with fixed effects of narrator, ear, linear and quadratic terms for frequency band, and the interactions between narrator and the frequency band terms. Random effects included an intercept and both frequency band terms for each subject. Details of the model are described in Supplemental Table 3. Wave V latency was significantly longer for female- than male-narrated multiband peaky speech in the 0.5 kHz band (narrator effect on the intercept, p = 0.001), decreased at a steeper rate with frequency band (interaction between narrator and the linear frequency band term, p < 0.001), and had a significantly different rate of change with frequency (interaction between narrator and the quadratic frequency band term, p < 0.001). Overall, latency did not differ between ears (p = 0.116). Taken together, these results confirm that, while small in amplitude, frequency-specific responses can be elicited in both ears across 5 different frequency bands, and that they show characteristic latency changes across those bands.

Responses are obtained quickly for male-narrated broadband peaky speech but not multiband speech

Having demonstrated that peaky broadband and multiband speech provide canonical waveforms with characteristic changes in latency with frequency, we next evaluated the acquisition time required for waveforms to reach a decent SNR. We chose a criterion of 0 dB SNR based on visual assessment of when waveforms became clearly identifiable. SNR was calculated by comparing the variance in the MLR time interval 0-30 ms (for responses high-pass filtered at 30 Hz) or the ABR time interval 0-15 ms (for responses high-pass filtered at 150 Hz) to the variance in the pre-stimulus noise interval −480 to −20 ms (see Response SNR calculation in Methods for details).

Figure 8 shows the cumulative proportion of subjects who had responses with ≥ 0 dB SNR to unaltered and broadband peaky speech as a function of recording time.
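The SNR criterion can be made concrete with a small function. One common definition (used by Maddox and Lee, 2018) treats the response-interval variance as signal plus noise and the pre-stimulus variance as noise alone; the noise-variance subtraction below is an assumption about this paper's exact formula (see Methods), and the function name is illustrative.

```python
import numpy as np

def response_snr_db(w, lags, signal_win, noise_win=(-0.48, -0.02)):
    """SNR in dB of a response waveform `w` sampled at times `lags` (s).

    Variance in the signal window (0-15 ms for ABR, 0-30 ms for MLR) is
    compared against variance in the pre-stimulus noise window; the
    noise variance is first subtracted from the signal-window variance
    (an assumption; see the paper's Methods for the exact definition)."""
    sig = w[(lags >= signal_win[0]) & (lags < signal_win[1])]
    noi = w[(lags >= noise_win[0]) & (lags < noise_win[1])]
    var_sig, var_noi = np.var(sig), np.var(noi)
    # Clamp so a response weaker than the noise floor stays finite.
    return 10 * np.log10(max(var_sig - var_noi, np.finfo(float).tiny) / var_noi)
```

Under this definition, 0 dB SNR means the variance attributable to the response equals the noise variance, i.e., the response interval carries twice the noise-floor variance.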
Acquisition times for 22 subjects were similar for responses to unaltered and broadband peaky male-narrated speech, with 0 dB SNR achieved by 8 minutes in 50% of subjects and by 18 and 20 minutes respectively in 100% of subjects. These times reduced to 2 and 5 minutes for 50% and 100% of subjects respectively for broadband peaky responses high-pass filtered at 150 Hz to highlight the ABR (0-15 ms interval). The times for male-narrated broadband peaky speech were confirmed in our second cohort of 11 subjects, all of whom achieved 0 dB SNR within 26 minutes for the 0-30 ms MLR interval (10/11 subjects in 18 minutes; 50% by 10 minutes) and within 4 minutes for the 0-15 ms ABR interval (50% by 2 minutes). However, acquisition times were at least 3.6 times, and up to over 10 times, longer for female-narrated broadband peaky speech, with 50% of subjects achieving 0 dB SNR by 36 minutes for the MLR interval and 8 minutes for the ABR interval. In contrast to male-narrated speech, not all subjects achieved this threshold for female-narrated speech by the end of the 32-minute recording (45% and 63% for the MLR and ABR intervals respectively). Taken together, these acquisition times confirm that responses with useful SNRs can be measured quickly for male-narrated broadband peaky speech, but that longer recording sessions are necessary for narrators with higher fundamental frequencies.

The longer recording times necessary for a female narrator became more pronounced for the multiband peaky speech.
The smaller and broader responses in the low frequency bands limited this testing time (Figure 9A): for male-narrated speech, at least 90% of subjects had 2-4 and 4-8 kHz responses (diotic 4-band speech) with ≥ 0 dB SNR in 15 minutes, 70% of subjects had the 6 higher frequency-specific responses (2, 4, and 8 kHz bands in both ears for dichotic speech) by the end of the recording, and 50% of subjects had those 6 responses within 40 minutes. These substantial recording times suggest that deriving multiple frequency-specific responses will require more than 30 minutes per condition for fewer than 5 bands, and more than an hour-long session for one condition of peaky multiband speech with 10 bands.

DISCUSSION

The major goal of this work was to develop a method to investigate early stages of naturalistic speech processing. We re-synthesized continuous speech taken from audiobooks so that the phases of all harmonics aligned at each glottal pulse during voiced segments, thereby making the speech as impulse-like (peaky) as possible in order to drive the auditory brainstem. Then we used the glottal pulse trains as the regressor in deconvolution to derive the responses. Indeed, comparing waveforms to broadband peaky and unaltered speech validated the superior ability of peaky speech to evoke additional waves of the canonical ABR and MLR, reflecting neural activity from multiple subcortical structures. Robust ABR and MLR responses were recorded in less than 5 and 20 minutes respectively for all subjects, with half of the subjects exhibiting a strong ABR within 2 minutes and an MLR within 8 minutes. Longer recording times were required for the smaller responses generated by a narrator with a higher fundamental frequency.
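The core idea of the re-synthesis (harmonics that all share zero phase at each glottal pulse, so they reinforce into a sharp peak once per glottal cycle) can be demonstrated with a toy signal. This toy uses a flat f0 and equal-amplitude cosine harmonics; the real stimuli preserved the speech's time-varying f0 and spectral envelope and left unvoiced segments unaltered (see Methods).

```python
import numpy as np

def peaky_voiced(f0, fs, n_harm=10):
    """Toy 'peaky' voiced waveform: cosine harmonics of the fundamental
    phase all hit phase zero together at every glottal pulse (integer
    crossings of the fundamental phase), so the waveform is maximally
    impulse-like while keeping a harmonic voiced-speech spectrum."""
    phi = np.cumsum(f0 / fs)               # fundamental phase, in cycles
    h = np.arange(1, n_harm + 1)[:, None]  # harmonic numbers 1..n_harm
    return np.cos(2 * np.pi * h * phi[None, :]).sum(axis=0) / n_harm

fs = 10_000
y = peaky_voiced(np.full(fs, 120.0), fs)   # 1 s of a flat 120 Hz f0
```

At each integer crossing of phi, every harmonic equals its maximum simultaneously, giving the sharp pressure peaks that drive the ABR; between pulses the harmonics largely cancel.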
We also demonstrated the flexibility of this stimulus paradigm by simultaneously recording up to 10 frequency-specific responses to multiband peaky speech that was presented either diotically or dichotically, although these responses required much longer recording times. Taken together, peaky speech evoked responses with canonical morphology comprising waves I, III, V, P0, Na, and Pa (Figure 1), reflecting neural activity from distinct stages of the auditory system, from the auditory nerve to the thalamus and primary auditory cortex (e.g., Picton et al., 1974). The presence of these additional waves allows for new investigations into the contributions of each of these neural generators to speech processing while using a continuously dynamic and ecologically salient stimulus.

The same ABR waves evoked here were also evoked by a method using embedded chirps intermixed within alternating bands of speech (Backer et al., 2019; Miller et al., 2017). Because speech transients are broad, clicks, chirps, and even our previous speech stimuli (which were high-pass filtered at 1 kHz; Maddox and Lee, 2018) have relatively greater high-frequency energy than the unaltered and peaky broadband speech used in the present work. Neurons with higher characteristic frequencies respond earlier due to their basal cochlear location and contribute relatively more to brainstem responses (e.g., Abdala and Folsom, 1995), leading to quicker latencies for stimuli that have greater high-frequency energy. Also consistent with having greater low-frequency energy, our unaltered and peaky speech responses were later than the responses to the same speech segments high-pass filtered at 1 kHz (Maddox and Lee, 2018). In fact, the ABR to broadband peaky speech bore a close resemblance to the summation of each frequency-specific response and the common component of peaky multiband speech (Figure 5), with peak wave latencies representing the relative contribution of each frequency band.
The latency of the peaky speech-evoked response also differed from the non-standard, broad responses to unaltered speech. However, latencies of these waveforms are difficult to compare due to their differing morphology and the different analyses used to derive the responses. Evidence for the effect of analysis comes from the fact that the same EEG collected in response to peaky speech could be regressed with pulse trains to give canonical ABRs (Figures 1, 2), or regressed with the half-wave rectified peaky speech to give a different, broad waveform (Supplemental Figure 1). Furthermore, non-peaky continuous speech stimuli with similar ranges of fundamental frequencies (between 100 and 300 Hz) evoke non-standard, broad brainstem responses that also differ in morphology and latency depending on whether the EEG is analyzed by deconvolution with the half-wave rectified speech (Figure 2) or by other methods.

We measured frequency-specific responses to multiband peaky speech with 4 frequency bands presented diotically (Figures 4, 6) and 5 frequency bands presented dichotically (Figure 7). Stimulus energy decreased with increasing frequency, resulting in a ~30 dB difference between the lowest and highest frequency bands (Figure 12). A greater response elicited by higher frequency bands is consistent with the relatively greater contribution of neurons with higher characteristic frequencies to ABRs (Abdala and Folsom, 1995), as well as with the need for higher levels to elicit low-frequency responses to tone pips that are close to threshold (Gorga et al., 1993, 2006; Hyde, 2008; Stapells and Oates, 1997). This paradigm may be especially useful for individuals who do not provide reliable behavioral responses, as they may be more amenable to sitting for longer periods of time while listening to a narrated story than to a series of tone pips. An extension of this assessment would be to evaluate neural speech processing in the context of hearing loss, as well as of rehabilitation strategies such as hearing aids and cochlear implants.
Therefore, the ability of peaky speech to yield both canonical waveforms and frequency-specific responses makes this paradigm a flexible method that assesses speech processing in new ways.

Having established that peaky speech is a flexible stimulus for investigating different aspects of speech processing, there are several practical considerations for using the peaky speech paradigm. First, filtering should be performed carefully. As recommended in Maddox and Lee (2018), causal filters, whose impulse responses are zero at negative lags, should be used to ensure cortical activity at later peak latencies does not spuriously influence earlier peaks corresponding to subcortical origins. Applying less aggressive, low-order filters (i.e., broadband with shallow roll-offs) will help reduce the effects of causal filtering on delaying response latency. The choice of high-pass cutoff will also affect the response amplitude and morphology. After evaluating several orders and cutoffs for the high-pass filters, we determined that early waves of the peaky broadband ABRs were best visualized with a 150 Hz cutoff, whereas a lower cutoff frequency of 30 Hz was necessary to view the ABR and MLR of the broadband responses. For multiband responses, the 150 Hz high-pass filter significantly reduced the response but also decreased the low-frequency noise in the pre-stimulus interval. For the 4-band multiband peaky speech the 150 Hz and 30 Hz filters provided similar acquisition times for 0 dB SNR, but better SNRs were obtained more quickly with 150 Hz filtering for the 10-band multiband peaky speech.

Second, the choice of narrator impacts the responses to both broadband and multiband peaky speech.
Although overall morphology was similar, the male-narrated responses were larger, contained more clearly identifiable component waves in a greater proportion of subjects, and achieved a 0 dB SNR at least 3.6 to over 10 times faster than those evoked by a female narrator. These differences likely stemmed from the ~77 Hz difference in average pitch, as higher stimulation rates evoke smaller responses due to adaptation and refractoriness (e.g., Burkard, 2018). However, using a narrator with a higher fundamental frequency could increase testing time by 3- to over 10-fold. In this experiment, at most 2 conditions per hour could be tested with the female-narrated broadband peaky speech. Furthermore, longer testing times are likely needed, even for male-narrated speech, in order to reliably compare differences in the smaller amplitude component waves I and III of the ABR. Our 30- to 40-minute recording sessions provided robust responses with very good SNRs to evaluate the earlier ABR waves, but recordings this long may be unnecessary. The cumulative distribution functions in Figure 8 suggest that between 12 to 20 minutes should constitute ample time to generate comparable responses with highly positive SNRs. Unlike broadband peaky speech, the testing times required for all frequency-specific responses to reach 0 dB SNR were significantly longer, making only 1 condition feasible within a recording session. At least 30 minutes was necessary for the diotically presented multiband peaky speech with 4 frequency bands, but based on extrapolated testing times, about 56-88 minutes is required for 90% of subjects to achieve this threshold for all 4 bands. For dichotically presented multiband peaky speech with 5 frequency bands (for a total of 10 frequency-specific waveforms), only 36% of the responses achieved 0 dB SNR within an hour.
Extrapolated testing times suggest that over 2 hours is required for at least 75% of subjects, limiting the feasibility or utility of multiband peaky speech with several frequency bands.

Fourth, as mentioned above, the number of frequency bands incorporated into multiband peaky speech decreases SNR and increases testing time. Although it is possible to simultaneously record up to 10 frequency-specific responses, the significant time required to obtain decent SNRs reduces the feasibility of testing multiple conditions or of keeping recording sessions under 1-2 hours. However, pursuing shorter testing times with multiband peaky speech is possible. Depending on the experimental question, different multiband options could be considered. For male-narrated speech, the 2-4 and 4-8 kHz responses had good SNRs and exhibited waves I, III, and V within 15 minutes for 90% of subjects. Therefore, if researchers were more interested in comparing responses in these higher frequency bands, they could stop recording once these bands reach threshold but before the lower frequency bands reach criterion (i.e., within 15 minutes). Alternatively, the lower frequencies could be combined into a single broader band in order to reduce the total number of bands, or the intensity could be increased to evoke responses with larger amplitudes. Therefore, different band and parameter considerations could reduce testing time and improve the feasibility, and thus utility, of multiband peaky speech.

Fifth, and finally, a major advantage of deconvolution analysis is that the analysis window for the response can be extended arbitrarily in either direction to include a broader range of latencies (Maddox and Lee, 2018).
Extending the pre-stimulus window leftward provides a better estimate of the SNR, and extending the window rightward allows parts of the response that come after the ABR and MLR, which are driven by the cortex, to be analyzed as well. These later responses can be evaluated in response to broadband peaky speech, but as shown in Figures 6 and 7, only ABR and early MLR waves are present in the frequency-specific responses. The same broadband peaky speech data from Figure 3 are displayed with an extended time window in Figure 10, which shows component waves of the ABR, MLR and late latency responses (LLR). Thus, this method allows us to simultaneously investigate speech processing ranging from the earliest level of the auditory nerve all the way through the cortex without requiring extra recording time. Usually the LLR is larger than the ABR/MLR, but our subjects were encouraged to relax and rest, yielding a passive LLR response. Awake and attentive subjects may improve the LLR; however, other studies that present continuous speech to attentive subjects also report smaller and different LLRs (Backer et al., 2019; Maddox and Lee, 2018), possibly from cortical adaptation to a continuous stimulus. Here we used a simple 2-channel montage that is optimized for recording ABRs, but a full multi-channel montage could also be used to more fully explore the interactions between subcortical and cortical processing of naturalistic speech. The potential for new knowledge about how the brain processes naturalistic and engaging stimuli cannot be overstated. Finally, as mentioned above, the ability to customize peaky speech for measuring frequency-specific responses provides potential applications to clinical research in the context of facilitating assessment of supra-threshold hearing function and changes following intervention strategies and technologies.
In summary, the peaky speech paradigm is a viable method for recording canonical waveforms and frequency-specific responses to an engaging, continuous speech stimulus.

In each experiment, subjects listened to 128 minutes of continuous speech stimuli while reclined in a darkened sound booth. They were not required to attend to the speech and were encouraged to relax and to sleep. Speech was presented at an average level of 65 dB SPL over ER-2 insert earphones (Etymotic Research, Elk Grove, IL) plugged into an RME Babyface Pro digital soundcard (RME, Haimhausen, Germany) via an HB7 headphone amplifier (Tucker Davis Technologies, Alachua, FL). Stimulus presentation was controlled by a custom Python script using publicly available software (available at https://github.com/LABSN/expyfun; Larson et al., 2014). We interleaved conditions in order to prevent slow impedance drifts or transient periods of higher EEG noise from unevenly affecting one condition over the others. Physical measures to reduce stimulus artifact included: 1) hanging earphones from the ceiling so that they were as far away from the EEG cap as possible; and 2) sending an inverted signal to a dummy earphone (blocked tube) attached in the same physical orientation to the stimulus presentation earphones in order to cancel electromagnetic fields away from the transducers. The soundcard also produced a digital signal at the start of each epoch, which was converted to trigger pulses through a custom trigger box (modified from a design by the National Acoustic Laboratories, Sydney, NSW, Australia) and sent to the EEG system so that audio and EEG data could be synchronized with sub-millisecond precision.

EEG was recorded using BrainVision's PyCorder software. Ag/AgCl electrodes were placed at the high forehead (FCz, active non-inverting), left and right earlobes (A1, A2, inverting references), and the frontal pole (Fpz, ground).
These were plugged into an EP-Preamp system specifically designed for recording ABRs, connected to an ActiCHamp recording system, both manufactured by BrainVision. Data were sampled at 10,000 Hz and high-pass filtered at 0.1 Hz. Offline, raw data were high-pass filtered at 1 Hz using a first-order causal Butterworth filter to remove slow drift in the signal, and then notch filtered with 5 Hz wide second-order infinite impulse response (IIR) notch filters to remove 60 Hz and its first 3 odd harmonics (180, 300, 420 Hz). To optimize parameters for viewing the ABR and MLR components of peaky speech responses, we evaluated several orders and high-pass cutoffs for the filters. Early waves of the broadband peaky ABRs were best visualized with a 150 Hz cutoff, whereas a lower cutoff frequency of 30 Hz was necessary to view the ABR and MLR of the broadband responses. Conservative filtering with a first-order filter was sufficient with these cutoff frequencies.

Speech stimuli were prepared similarly to our previous study (Maddox and Lee, 2018), except that in that study a gentle high-pass filter was applied, which was not done for this study. Briefly, the audiobooks were resampled to 44,100 Hz and then silent pauses were truncated to 0.5 s. Speech was segmented into 64 s epochs with 1 s raised cosine fade-in and fade-out. Because conditions were interleaved, the last 4 s of a segment were repeated in the next segment so that subjects could pick up where they left off if they were listening.

In experiment 1 subjects listened to 3 conditions of male speech (42.7 min each): unaltered speech, re-synthesized broadband peaky speech, and re-synthesized multiband peaky speech (see below for a description of re-synthesized speech). In experiment 2 subjects listened to 4 conditions of re-synthesized peaky speech (32 minutes each): male and female narrators of both broadband and multiband peaky speech. For these first 2 experiments, speech was presented diotically (same speech to both ears).
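The offline EEG filtering chain described above (a first-order causal 1 Hz Butterworth high-pass followed by 5 Hz wide second-order notches at 60 Hz and its first three odd harmonics) might be sketched with scipy.signal as follows; the function name and the quality factor derived from the 5 Hz notch width are our own illustrative choices, not code from the study:

```python
import numpy as np
from scipy import signal

fs = 10_000  # EEG sampling rate from the text (Hz)

def preprocess_eeg(raw, fs=fs):
    """Causal 1 Hz high-pass (first-order Butterworth), then 5 Hz wide
    second-order IIR notches at 60 Hz and its first 3 odd harmonics."""
    b, a = signal.butter(1, 1.0, btype="highpass", fs=fs)
    out = signal.lfilter(b, a, raw)
    for f0 in (60, 180, 300, 420):
        # a 5 Hz wide notch corresponds to a quality factor Q = f0 / 5
        bn, an = signal.iirnotch(f0, Q=f0 / 5.0, fs=fs)
        out = signal.lfilter(bn, an, out)
    return out
```

Because `lfilter` applies each filter causally, no response energy can leak backward in time, in keeping with the causal-filtering recommendation above.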
In experiment 3 subjects listened to both male and female dichotic (different speech in each ear) multiband peaky speech designed for audiological applications (64 min each). The same 64 s of speech was presented simultaneously to each ear, but the stimuli were dichotic due to how the re-synthesized multiband speech was created (see below).

The brainstem responds best to impulse-like stimuli, so we re-synthesized the speech segments from the audiobooks (termed "unaltered") to create 3 types of "peaky" speech, with the objectives of 1) evoking additional waves of the ABR reflecting other neural generators, and 2) measuring responses to different frequency regions of the speech. The process is described in detail below, but is best read in tandem with the code that will be publicly available (https://github.com/maddoxlab). Figure 11 compares the unaltered speech and re-synthesized broadband and multiband peaky speech. Comparing the pressure waveforms shows that the peaky speech is as click-like as possible, but comparing the spectrograms (how sound varies in amplitude at every frequency and time point) shows that the overall spectrotemporal content that defines speech is basically unchanged by the re-synthesis.

Sections of speech containing glottal pulses separated by no more than 17 ms were considered voiced (vowels and voiced consonants like /z/). 17 ms is the longest inter-pulse interval one would expect in natural speech because it is the inverse of 60 Hz, the lowest pitch at which someone with a deep voice would likely speak. A longer gap in pulse times was considered a break between voiced sections. These segments were identified in a "mixer" function of time, with indices of 1 indicating unvoiced and 0 indicating voiced segments (and which would later be responsible for time-dependent blending of re-synthesized and natural speech, hence its name).
Transitions of the binary mixer function were smoothed using a raised cosine envelope spanning the time between the first and second pulses, as well as the last two pulses, of each voiced segment. During voiced segments, the glottal pulses set the fundamental frequency of speech (i.e., pitch), which was allowed to vary from a minimum to maximum of 60-350 Hz for the male narrator and 90-500 Hz for the female narrator. For the male and female narrators, these pulses gave a mean ± SD fundamental frequency (i.e., pulse rate) in voiced segments of 115.1 ± 6.7 Hz and 198.1 ± 20 Hz respectively, and a mean ± SD pulses per second over the entire 64 s, inclusive of unvoiced periods and silences, of 69.1 ± 5.7 Hz and 110.8 ± 11.4 Hz respectively. These pulse times were smoothed using 10 iterations of replacing each pulse time tk with the mean of pulse times tk−1 to tk+1 if the absolute log2 difference between the adjacent inter-pulse intervals was less than log2(1.6).

Figure 11. Unaltered speech waveform (top left) and spectrogram (top right) compared to re-synthesized broadband peaky speech (middle left and right) and multiband peaky speech (bottom left and right). Comparing waveforms shows that the peaky speech is as "click-like" as possible, while comparing the spectrograms shows that the overall spectrotemporal content that defines speech is basically unchanged by the re-synthesis. A naïve listener is unlikely to notice that any modification has been performed, and subjective listening confirms the similarity. Yellow/lighter colors represent larger amplitudes than purple/darker colors in the spectrogram. See supplementary files for audio examples of each stimulus type for both narrators.

The fundamental frequency of voiced speech is dynamic, but the signal always consists of a set of integer-related frequencies (harmonics) with different amplitudes and phases.
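The pulse-time smoothing step described above might look like the following sketch; the exact update rule is our reading of the text (the `ratio_limit` of 1.6 corresponds to the log2(1.6) criterion), not the study's released code:

```python
import numpy as np

def smooth_pulses(times, n_iter=10, ratio_limit=1.6):
    """Replace pulse time t[k] by the mean of t[k-1], t[k], t[k+1]
    whenever the two adjacent inter-pulse intervals differ by less
    than a factor of ratio_limit; iterate n_iter times."""
    t = np.asarray(times, dtype=float).copy()
    for _ in range(n_iter):
        d_prev = t[1:-1] - t[:-2]   # interval before each interior pulse
        d_next = t[2:] - t[1:-1]    # interval after each interior pulse
        ok = np.abs(np.log2(d_next / d_prev)) < np.log2(ratio_limit)
        t[1:-1] = np.where(ok, (t[:-2] + t[1:-1] + t[2:]) / 3.0, t[1:-1])
    return t
```

Pulses whose neighboring intervals differ by more than the ratio limit (e.g., across a voicing break) are left untouched, so only locally regular pulse trains are smoothed.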
To create the waveform component at the fundamental frequency, f0(t), we first created a phase function, φ(t), which increased smoothly by 2π between glottal pulses within the voiced sections as a result of cubic interpolation. We then computed the spectrogram of the unaltered speech waveform, which is a way of analyzing sound that shows its amplitude at every time and frequency (Figure 11, top-right), and which we called S[t, f]. We then created the fundamental component of the peaky speech waveform as:

w0(t) = S[t, f0(t)] · cos(φ(t)).

This waveform has an amplitude that changes according to the spectrogram but always peaks at the time of the glottal pulses. Next the harmonics of the speech were synthesized. The hth harmonic of speech is at a frequency of (h + 1)f0, so we synthesized each harmonic waveform as:

wh(t) = S[t, (h + 1)f0(t)] · cos((h + 1)φ(t)).

Each of these harmonic waveforms has multiple peaks per period of the fundamental, but every harmonic also has a peak at exactly the time of the glottal pulse. Because of these coincident peaks, when the harmonics are summed to create the re-synthesized voiced speech, there is always a large peak at the time of the glottal pulse. In other words, the phases of all the harmonics align at each glottal pulse, making the pressure waveform of the speech appear "peaky" (left-middle panel of Figure 11).

The resultant re-synthesized speech contained only the voiced segments of speech and was missing unvoiced sounds like /s/ and /k/. Thus the last step was to mix the re-synthesized voiced segments with the original unvoiced parts. This was done by cross-fading back and forth between the unaltered speech and re-synthesized speech during the unvoiced and voiced segments respectively, using the binary mixer function created when determining where the voiced segments occurred.
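The core of the re-synthesis, a cubic-interpolated phase that advances by 2π per glottal pulse and harmonics that all peak at the pulse times, can be illustrated with a simplified sketch. Flat harmonic amplitudes stand in for the spectrogram amplitude lookup, and all names are our own:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def peaky_voiced(pulse_times, fs=44_100, n_harmonics=10):
    """Phase advances smoothly by 2*pi between successive glottal
    pulses (cubic interpolation), so every harmonic cos((h+1)*phi(t))
    peaks at each pulse time and their sum is 'peaky' there."""
    pulse_times = np.asarray(pulse_times, dtype=float)
    phi = CubicSpline(pulse_times, 2 * np.pi * np.arange(len(pulse_times)))
    t = np.arange(int(pulse_times[-1] * fs) + 1) / fs
    ph = phi(t)
    return t, sum(np.cos((h + 1) * ph) for h in range(n_harmonics))
```

At each pulse the phase is a multiple of 2π, so every cosine term equals 1 and the sum reaches its maximum, which is exactly the alignment the text describes.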
We also filtered the peaky speech to an upper limit of 8 kHz, and used the unaltered speech above 8 kHz, to improve the quality of voiced consonants such as /z/. Filter properties for the broadband peaky speech are further described below in the "Band filters" subsection.

Multiband peaky speech

The same principles used to generate broadband peaky speech were applied to create stimuli designed to investigate the brainstem's response to the different frequency bands that comprise speech. This makes use of the fact that over time, speech signals with slightly different f0 are independent, or have (nearly) zero cross-correlation, at the lags relevant for the ABR. To make each frequency band of interest independent, we shifted the fundamental frequency by a small amount Δf for each band and created the fundamental waveform and its harmonics from the correspondingly shifted phase function, φ(t) + 2πΔf·t. In these studies, we increased the fundamental for each successive frequency band by the square root of each successive prime number minus one, resulting in a few tenths of a hertz difference between bands. The first, lowest frequency band contained the un-shifted f0. Responses to this lowest, un-shifted frequency band showed some differences from the common component for latencies > 30 ms that were not present in the other, higher frequency bands (Figure 4, 0-1 kHz band), suggesting some low-frequency privilege/bias in this response. Therefore, we suggest that future studies create independent frequency bands by synthesizing a new fundamental for each band. The static shifts described above could be used, but we suggest an alternative method that introduces random dynamic frequency shifts of up to ±1 Hz over the duration of the stimulus. From this random frequency shift we can compute a dynamic random phase shift, to which we also add a random starting phase, Δφ, which is drawn from a uniform distribution between 0 and 2π.
The phase function from the above set of formulae would then be replaced with this random dynamic phase function:

φb(t) = φ(t) + 2π ∫0..t δfb(τ) dτ + Δφb,

where δfb(τ) is the random dynamic frequency shift for band b. This random f0 shift method is described further in the supplementary material, and validation data from one subject are provided in Supplemental Figure 2. Responses from all four bands show more consistent resemblance to the common component, indicating that this method is effective at reducing stimulus-related bias. However, low-frequency dependent differences remained, suggesting there is also unique neural-based low-frequency activity in the speech-evoked responses.

This re-synthesized speech was then band-pass filtered to the frequency band of interest (e.g., from 0-1 kHz or 2-4 kHz). This process was repeated for each independent frequency band, then the bands were mixed together, and then these re-synthesized voiced parts were mixed with the original unaltered voiceless speech. This peaky speech comprised octave bands with center frequencies of 707, 1414, 2828, 5656 Hz for experiments 1 and 2, and of 500, 1000, 2000, 4000, 8000 Hz for experiment 3. Note that for the lowest band, the actual center frequency was slightly lower because the filters were set to pass all frequencies below the upper cutoff. Filter properties for these two types of multiband speech are shown in the middle and right panels of Figure 14 and further described below in the "Band filters" subsection. For the dichotic multiband peaky speech, we created 10 fundamental waveforms (2 in each of the five filter bands, one for each ear), making the output audio file stereo (or dichotic). We also filtered this dichotic multiband peaky speech to an upper limit of 11.36 kHz to allow the highest band to have a center frequency of 8 kHz and octave width.
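One way to realize the suggested random dynamic f0 shift is to integrate a slowly varying, bounded frequency offset into a phase offset and add a random starting phase. The random-walk shape of the offset below is our own illustrative choice, not the supplementary material's exact method:

```python
import numpy as np

def random_phase_shift(n, fs, max_shift_hz=1.0, seed=0):
    """A slowly varying frequency offset bounded at +/- max_shift_hz,
    integrated into a phase offset, plus a random starting phase
    drawn uniformly from [0, 2*pi)."""
    rng = np.random.default_rng(seed)
    walk = np.cumsum(rng.standard_normal(n)) / np.sqrt(fs)
    df = np.clip(walk, -max_shift_hz, max_shift_hz)  # Hz offset over time
    return 2 * np.pi * np.cumsum(df) / fs + rng.uniform(0, 2 * np.pi)
```

Adding this offset to the phase function of each band keeps the instantaneous pitch within about 1 Hz of the original while decorrelating the bands' pulse trains over time.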
The relative mean-squared magnitudes in decibels for components of the multiband peaky speech (4 filter bands) and dichotic (audiological) multiband peaky speech (5 filter bands) are shown in Figure 12.

For peaky speech, the re-synthesized speech waveform was presented during the experiment but the pulse trains were used as the input stimulus for calculating the response (i.e., the regressor; see Response derivation section below). These pulse trains all began and ended together in conjunction with the onset and offset of voiced sections of the speech. To verify which frequency ranges of the multiband pulse trains were independent across frequency bands, and would thus yield truly band-specific responses, we conducted a spectral coherence analysis on the pulse trains. All 60 unique 64 s sections of each male- and female-narrated multiband peaky speech used in the three experiments were sliced into 1 s segments for a total of 3,840 slices. Phase coherence across frequency was then computed across these slices for each combination of pulse trains according to the formula:

Cxy(f) = |(1/N) Σn Xn(f) Yn*(f) / (|Xn(f)| |Yn(f)|)|,

where N is the number of slices and Xn and Yn are the Fourier transforms of the nth slices of the two pulse trains. Spectral coherence for each narrator is shown in Figure 13. For the 4-band multiband peaky speech used in experiments 1 and 2 there were 6 pulse train comparisons. For the audiological multiband peaky speech used in experiment 3, there were 5 bands for each of 2 ears, resulting in 10 pulse trains and 45 comparisons. All 45 comparisons are shown in Figure 13. Pulse trains were coherent (> 0.1) up to a maximum of 71 and 126 Hz for male- and female-narrated speech respectively, which roughly corresponds to the mean ± SD pulse rates (calculated as total pulses / 64 s) of 69.1 ± 5.7 Hz and 110.8 ± 11.4 Hz respectively. This means that above ~130 Hz the stimuli were no longer coherent and evoked frequency-specific responses.
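The slice-wise phase coherence might be computed as in the sketch below, where a value of 1 means the two pulse trains are fully phase-locked at that frequency and a value near 0 means they are independent (function and argument names are ours):

```python
import numpy as np

def phase_coherence(x_slices, y_slices):
    """Magnitude of the mean (across slices) of the normalized
    cross-spectrum. Inputs are (n_slices, n_samples) arrays holding
    the corresponding 1 s slices of two pulse trains."""
    cross = (np.fft.rfft(x_slices, axis=-1)
             * np.conj(np.fft.rfft(y_slices, axis=-1)))
    cross /= np.abs(cross) + 1e-20  # keep only the phase of each slice
    return np.abs(cross.mean(axis=0))  # one coherence value per frequency
```

Averaging unit phasors across slices makes random phase relationships cancel toward zero while consistent ones survive, which is why coherence falls off above the shared pulse-rate region.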
Importantly, responses would be to correlated stimuli (i.e., not frequency-specific) at frequencies below this cutoff, and would result in a low-frequency response component that is present in (or common to) all band responses.

To identify the effect of the low-frequency stimulus coherence on the responses, we computed the common component across pulse trains by creating an averaged response to 6 additional "fake" pulse trains that were created during stimulus design but were not used during creation of the multiband peaky speech wav files. The common component was assessed for both "fake" pulse trains taken from shifts lower than the original fundamental frequency and those taken from shifts higher than the highest "true" re-synthesized fundamental frequency. To assess frequency-specific responses to multiband speech, we subtracted this common component from the band responses. Alternatively, one could simply high-pass the stimuli at 150 Hz using a first-order causal Butterworth filter (being mindful of edge artifacts). However, this high-pass filtering reduces response amplitude and may affect response detection (see Results for more details).

We also verified the independence of the stimulus bands by treating the regressor pulse train as the input to a system whose output was the rectified stimulus audio and performing deconvolution (see Deconvolution and Response derivation sections below). Further details are provided in the supplementary material. The responses are given in Supplemental Figure 5, and showed that non-zero responses only occurred when the correct pulse train was paired with the correct audio.

Figure 13. Spectral coherence of pulse trains for multiband peaky speech narrated by a male (left) and female (right). Spectral coherence was computed across 1 s slices from 60 unique 64 s multiband peaky speech segments (3,840 total slices) for each combination of bands.
Each light gray line represents the coherence for one band comparison. There were 45 comparisons across the 10-band (audiological) speech used in experiment 3 (5 frequency bands x 2 ears). Pulse trains (i.e., the input stimuli, or regressors, for the deconvolution) were frequency-dependent (coherent) below 72 Hz for the male multiband speech and 126 Hz for the female multiband speech.

Because the fundamental frequencies for each frequency band were designed to be independent over time, the band filters for the speech were designed to cross over in frequency at half power. To make each filter, the amplitude was set by taking the square root of the specified power at each frequency. Octave band filters were constructed in the frequency domain by applying trapezoids with center bandwidth and roll-off widths of 0.5 octaves. For the first (lowest frequency) band, all frequencies below the high-pass cutoff were set to 1, and likewise all frequencies above the low-pass cutoff for the last (highest frequency) band were set to 1 (Figure 14, top row). The impulse response of the filters was assessed by shifting the inverse FFT (IFFT) of the bands so that time zero was in the center, and then applying a Nuttall window, thereby truncating the impulse response to a length of 5 ms (Figure 14, middle row). The actual frequency response of the filter bands was assessed by taking the FFT of the impulse response and plotting the magnitude (Figure 14, bottom row).

As mentioned above, broadband peaky speech was filtered to an upper limit of 8 kHz for diotic peaky speech and 11.36 kHz for dichotic peaky speech. This band filter was constructed from the second-to-last octave band filter from the multiband filters (i.e., the 4-8 kHz band from the top-middle of Figure 14, dark red line) by setting the amplitude of all frequencies less than the high-pass cutoff frequency to 1 (Figure 14, top-left panel, blue line).
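A numpy-only sketch of the band-filter construction follows: trapezoidal power in log-frequency with 0.5-octave roll-offs centered on the band edges (so adjacent bands cross at half power), amplitude as the square root of power, then IFFT, centering, and Nuttall windowing. The FFT size and tap count are illustrative, and the Nuttall window is re-implemented to keep the sketch dependency-free:

```python
import numpy as np

def nuttall_window(m):
    """Nuttall 4-term window (stand-in for scipy.signal.windows.nuttall)."""
    n = np.arange(m)
    a = (0.3635819, 0.4891775, 0.1365995, 0.0106411)
    return (a[0]
            - a[1] * np.cos(2 * np.pi * n / (m - 1))
            + a[2] * np.cos(4 * np.pi * n / (m - 1))
            - a[3] * np.cos(6 * np.pi * n / (m - 1)))

def octave_band_filter(lo, hi, fs, n_fft=8192, n_tap=221):
    """Trapezoid in power over log2 frequency: flat between lo and hi,
    0.5-octave linear roll-offs centered on the edges (half power at
    lo and hi). Amplitude = sqrt(power); the IFFT is centered and
    Nuttall-windowed to n_tap samples (about 5 ms at 44.1 kHz)."""
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    octs = np.log2(np.maximum(freqs, 1e-6))
    lo_o, hi_o = np.log2(lo), np.log2(hi)
    power = np.clip(np.minimum((octs - (lo_o - 0.25)) / 0.5,
                               ((hi_o + 0.25) - octs) / 0.5), 0, 1)
    h = np.fft.irfft(np.sqrt(power), n_fft)
    return np.roll(h, n_tap // 2)[:n_tap] * nuttall_window(n_tap)
```

Because the roll-offs are complementary linear ramps in power, two adjacent bands sum to unit power throughout the crossover region.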
As mentioned above, unaltered (unvoiced) speech above 8 kHz (diotic) or 11.36 kHz (dichotic) was mixed with the broadband and multiband peaky speech, which was accomplished by applying the last (highest) octave band filter (the 8+ or 11.36+ kHz band, black line) to the unaltered speech and mixing this band with the re-synthesized speech using the other bands.

Figure 14. Octave band filters used to create re-synthesized broadband peaky speech (left, blue), diotic multiband peaky speech with 4 bands (middle, red), and dichotic multiband peaky speech using 5 bands with audiological center frequencies (right, red). The last band (2nd, 5th, 6th respectively, black line) was used to filter the high frequencies of unaltered speech during mixing to improve the quality of voiced consonants. The designed frequency responses using trapezoids (top) were converted into the time domain using the IFFT, shifted and Nuttall windowed to create impulse responses (middle), which were then used to assess the actual frequency response by converting into the frequency domain using the FFT (bottom).

To limit stimulus artifact, we also alternated polarity between segments of speech. To identify regions at which to flip polarity, the envelope of the speech was extracted using a first-order causal Butterworth low-pass filter with a cutoff frequency of 6 Hz applied to the absolute value of the waveform. Flip indices were then identified where the envelope became less than 1 percent of the median envelope value, and a function that changed back and forth between 1 and −1 at each flip index was created. This function of spikes was smoothed using another first-order causal Butterworth low-pass filter with a cutoff frequency of 10,000 Hz, which was then multiplied with the re-synthesized speech before saving to a wav file.
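The polarity-alternation step might be sketched as follows; the onset-detection details (toggling at each entry into a below-threshold region) are our own reading of the text:

```python
import numpy as np
from scipy import signal

def alternate_polarity(audio, fs):
    """Flip polarity between quiet-separated segments of speech:
    extract a slow envelope (first-order causal 6 Hz low-pass on the
    absolute waveform), toggle a +/-1 function wherever the envelope
    dips below 1% of its median, smooth the flips, and multiply."""
    b, a = signal.butter(1, 6.0, btype="lowpass", fs=fs)
    env = signal.lfilter(b, a, np.abs(audio))
    below = env < 0.01 * np.median(env)
    onsets = np.flatnonzero(np.diff(below.astype(int)) == 1)
    flips = np.ones(len(audio))
    sign, last = 1.0, 0
    for i in onsets:                 # toggle at each dip into silence
        flips[last:i] = sign
        sign, last = -sign, i
    flips[last:] = sign
    # smooth the square flip function (the text uses a 10 kHz low-pass)
    bs, as_ = signal.butter(1, min(10_000.0, 0.45 * fs), btype="lowpass", fs=fs)
    return audio * signal.lfilter(bs, as_, flips)
```

Flipping only inside quiet gaps means the sign changes are inaudible, while the stimulus artifact in the EEG (which follows polarity) tends to cancel across segments.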
Deconvolution

The peaky-speech ABR was derived using deconvolution, as in previous work (Maddox and Lee, 2018), though the computation was performed in the frequency domain for efficiency. The speech was considered the input to a linear system whose output was the recorded EEG signal, with the ABR computed as the system's impulse response. As in Maddox and Lee (2018), for the unaltered speech we used the half-wave rectified audio as the input waveform. Half-wave rectification was accomplished by separately calculating the response to all positive and all negative values of the input waveform for each epoch and then combining the responses together during averaging. For our new re-synthesized peaky speech, the input waveform was the sequence of impulses that occurred at the glottal pulse times and corresponded to the peaks in the waveform. Figure 15 shows a section of stimulus and the corresponding input signal of glottal pulses used in the deconvolution.

The half-wave rectified waveforms and glottal pulse sequences were down-sampled to the EEG sampling frequency prior to deconvolution. To avoid temporal splatter due to standard downsampling, the pulse sequences were resampled by placing unit impulses at the sample indices closest to each pulse time. Regularization was not necessary because the amplitude spectra of these regressors were sufficiently broadband. The deconvolution was performed on the EEG data for each epoch, using methods incorporated into the mne-python package (Gramfort et al., 2013). In practice, we made adaptations to this computation to improve SNR with Bayesian-like averaging (see below). For multiband peaky speech the same EEG was deconvolved with the pulse train of each band separately, and then with an additional 6 "fake" pulse trains to derive the common component across bands due to the pulse train coherence at low frequencies (shown in Figure 13).
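With a pulse-train regressor, the frequency-domain deconvolution reduces to a spectral division, roughly as below. This is a sketch of the principle, not the mne-python implementation, and it omits regularization since the text notes the pulse-train spectra were broadband enough not to need it:

```python
import numpy as np

def deconvolve(eeg, pulses):
    """Treat the pulse train x as input to a linear system with the
    EEG y as output, and recover the impulse response (the ABR) as
    IFFT( FFT(y) / FFT(x) ), i.e., circular deconvolution."""
    return np.fft.irfft(np.fft.rfft(eeg) / np.fft.rfft(pulses), n=len(eeg))
```

Because the regressor is a train of unit impulses, the recovered response carries the same units as the EEG itself (microvolts), a point the text returns to below.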
The averaged response across these 6 fake pulse trains, or common component, was then subtracted from the multiband responses to identify the frequency-specific band responses.

The quality of the ABR waveforms as a function of each type of stimulus was of interest, so we calculated the averaged response after each 64 s epoch. We followed a Bayesian-like process (Elberling and Wahlgreen, 1985) to account for variations in noise level across the recording time (such as slow drifts or movement artifacts) and to avoid rejecting data based on thresholds. Averaging was performed after the response was shifted to the middle of the 64 s time window. To remove high-frequency noise and some low-frequency noise, the average waveform was band-pass filtered between 30-2000 Hz using a first-order causal Butterworth filter. An example of this weighted average response to broadband peaky speech is shown in the right panel of Figure 15. This bandwidth of 30 to 2000 Hz is sufficient to identify additional waves in the brainstem and middle latency responses (ABR and MLR respectively). To further identify earlier waves of the auditory brainstem response (i.e., waves I and III), responses were high-pass filtered at 150 Hz using a first-order causal Butterworth filter. This filter was determined to provide the best morphology without compromising the response by comparing responses filtered with common high-pass cutoffs of 1, 30, 50, 100 and 150 Hz, each combined with first-, second- and fourth-order causal Butterworth filters.

An advantage of this method over our previous one (Maddox and Lee, 2018) is that because the regressor comprises unit impulses, the deconvolved response is given in meaningful units which are the same as the EEG recording, namely microvolts. With a continuous regressor, like the half-wave rectified speech waveform, this is not the case.
Therefore, to compare responses to half-wave rectified speech versus glottal pulses, we calculated a normalization factor, g, based on data from all subjects:

g = (1/N) Σ_i (SD_P,i / SD_U,i)

where N is the number of subjects, SD_U,i is the SD of subject i's response to unaltered speech between 0–20 ms, and SD_P,i is the same for the broadband peaky speech. Each subject's responses to unaltered speech were multiplied by this normalization factor to bring these responses within a comparable amplitude range as those to broadband peaky speech. Consequently, amplitudes were not compared between responses to unaltered and peaky speech. This was not our prime interest; rather, we were interested in the latency and presence of canonical component waves. In this study the normalization factor was 0.26, which cannot be applied to other studies because this number also depends on the scale used when storing the digital audio. In our study, this unitless scale was based on a root-mean-square amplitude of 0.01. The same normalization factor was used when the half-wave rectified speech was used as the regressor with EEG collected in response to unaltered speech, broadband peaky speech and multiband peaky speech (Figure 2, Supplemental Figure 1).

We were also interested in the recording time required to obtain robust responses to re-synthesized peaky speech. Therefore, we calculated the time it took for the ABR and MLR to reach a 0 dB SNR. The SNR of each waveform in dB, SNR_dB, was estimated as:

SNR_dB = 10 log10((σ²_signal − σ²_noise) / σ²_noise)

where σ²_signal is the variance of the average waveform in the response time window and σ²_noise is the noise variance, estimated from segments preceding the response. Peak latencies were compared using linear mixed-effects models in R (R Core Team, 2020). A likelihood ratio test was used to determine that the model significantly improved upon adding an orthogonal 2nd-order polynomial of frequency band (transformed so that each coefficient is independent), with linear and quadratic components estimating the decrease in latency (slope) and the change in the rate of latency decrease with increasing frequency band, respectively.
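The SNR estimate used for the 0 dB criterion can be sketched as follows (the window boundaries passed in here are illustrative assumptions, not the exact choices used for the ABR and MLR):

```python
import numpy as np

def snr_db(waveform, t, signal_win, noise_win):
    """SNR in dB of an averaged waveform: the variance in the response
    window, minus the noise variance, relative to the noise variance.
    Windows are (start, stop) tuples in seconds; noise is estimated from
    a segment preceding the response. Returns nan if noise exceeds signal."""
    sig = waveform[(t >= signal_win[0]) & (t < signal_win[1])]
    noise = waveform[(t >= noise_win[0]) & (t < noise_win[1])]
    var_s, var_n = sig.var(), noise.var()
    return 10.0 * np.log10((var_s - var_n) / var_n)
```

Recomputing this value after each additional epoch gives the recording time needed for the response to reach 0 dB SNR.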
Random effects of subject and each frequency band term were included to account for individual variability that is not generalizable to the fixed effects. A power analysis was completed using the simR package (Green and MacLeod, 2016), which uses a likelihood ratio test.

References

- Frequency contribution to the click-evoked auditory brain-stem response in human adults and infants
- A novel EEG paradigm to simultaneously and rapidly assess the functioning of auditory and visual pathways
- Cortical modulation of auditory processing in the midbrain
- The descending corticocollicular pathway mediates learning-induced auditory plasticity
- Fitting linear mixed-effects models using lme4
- Auditory and auditory-visual intelligibility of speech in fluctuating maskers for normal-hearing and hearing-impaired listeners
- Praat: doing phonetics by computer
- Canlon B. 2019. The search for noise-induced cochlear synaptopathy in humans: mission impossible?
- The effect of broadband noise on the human brainstem auditory evoked response. I. Rate and intensity effects
- A comparison of maximum length and Legendre sequences for the derivation of brain-stem auditory-evoked responses at rapid rates of stimulation
- Speech coding in the brain: representation of vowel formants by midbrain neurons tuned to sound fluctuations
- Brain stem auditory evoked responses: studies of waveform variations in 50 normal human subjects
- Auditory brainstem responses with optimized chirp signals compensating basilar-membrane dispersion
- Effect of click rate on the latency of auditory brain stem responses in humans
- Auditory brainstem responses to a chirp stimulus designed from derived-band latencies in normal-hearing subjects
- Estimation of auditory brainstem response, ABR, by means of Bayesian inference
- The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention
- Extracranial responses to acoustic clicks in man
- Early components of averaged evoked responses to rapidly repeated auditory stimuli
- Using a combination of click- and tone burst-evoked auditory brain stem response measurements to estimate pure-tone thresholds
- Auditory brainstem responses to tone bursts in normally hearing subjects
- A comparison of auditory brain stem response thresholds and latencies elicited by air- and bone-conducted stimuli
- MEG and EEG data analysis with MNE-Python
- Integration efficiency for speech perception within and across sensory modalities by normal-hearing and hearing-impaired individuals
- SIMR: an R package for power analysis of generalized linear mixed models by simulation
- The natural history of sound localization in mammals: a story of neuronal inhibition
- Auditory evoked potentials from the human midbrain: slow brain stem responses
- Ontario Infant Hearing Program Audiologic Assessment Protocol, Version 3.1
- Age-related changes in BAER at different click rates from neonates to adults
- Effects of cortical lesions on middle-latency auditory evoked responses (MLR)
- lmerTest package: tests in linear mixed effects models
- A Wrinkle in Time: 50th Anniversary Commemorative Edition
- Toward a differential diagnosis of hidden hearing loss in humans
- Auditory brainstem responses to continuous natural speech in human listeners. eNeuro 5
- Influence of context and behavior on stimulus reconstruction from neural activity in primary auditory cortex
- Frequency-multiplexed speech-sound stimuli for hierarchical neural characterization of speech processing
- Interpretation of brainstem auditory evoked potentials: results from intracranial recordings in humans
- The human auditory brain stem as a generator of auditory evoked potentials
- Look at me when I'm talking to you: selective attention at a multisensory cocktail party can be decoded using stimulus reconstruction and alpha power modulations
- Human auditory evoked potentials. I. Evaluation of components
- The Parallel Auditory Brainstem Response
- Effects of noise exposure on young adults with normal audiograms I: electrophysiology
- Latency of tone-burst-evoked auditory brain stem responses and otoacoustic emissions: level, frequency, and rise-time effects
- Individual differences in the attentional modulation of the human auditory brainstem response to speech inform on speech-in-noise deficits
- Computational modeling of the auditory brainstem response to continuous speech
- The Alchemyst: The Secrets of the Immortal Nicholas Flamel
- High-synchrony cochlear compound action potentials evoked by rising frequency-swept tone bursts
- Estimation of the pure-tone audiogram by the auditory brainstem response: a review
- Correlation between confirmed sites of neurological lesions and abnormalities of far-field auditory brainstem responses
- EEG decoding of the target speaker in a cocktail party scenario: considerations regarding dynamic switching of talker location

Figure legends

…ABR and MLR responses were similar to both types of input but were smaller for female-narrated speech, which has a higher glottal pulse rate. Peak latencies for female-narrated speech were delayed at ABR time lags but earlier at early MLR time lags.

Comparison of responses to ~43 minutes of male-narrated multiband peaky speech (ABR) with ~43 minutes of broadband peaky speech. Areas for the group average show ± 1 SEM. Responses were high-pass filtered at 150 Hz using a first-order Butterworth filter. Waves I, III, and V of the canonical ABR are evident in most of the single-subject responses (N = 22, 16, 22, respectively), and are marked by the average peak latencies on the average response.

…Group averages (areas show ± 1 SEM) are shown for responses measured to 32 minutes of broadband peaky speech narrated by a male (dark blue) and a female (light blue). Responses were high-pass filtered at 30 Hz using a first-order Butterworth filter. Canonical waves of the ABR, MLR and LLR are labeled for the male-narrated speech. Due to adaptation, amplitudes of the late potentials are smaller than typically seen with other stimuli that are shorter in duration with longer inter-stimulus intervals than our continuous speech. Waves I and III become more clearly visible by applying a 150 Hz high-pass cutoff.

Figure 11. Unaltered speech waveform (top left) and spectrogram (top right) compared to re-synthesized broadband peaky speech (middle left and right) and multiband peaky speech (bottom left and right). Comparing waveforms shows that the peaky speech is as "click-like" as possible, while comparing the spectrograms shows that the overall spectrotemporal content that defines speech is essentially unchanged by the re-synthesis. A naïve listener is unlikely to notice that any modification has been performed.

…multiband peaky speech with 4 bands (middle, red), and dichotic multiband peaky speech using 5 bands with audiological center frequencies (right, red). The last band (2nd, 5th, and 6th, respectively; black line) was used to filter the high frequencies of unaltered speech during mixing to improve the quality of voiced consonants. The designed frequency responses, built from trapezoids (top), were converted into the time domain using the IFFT, then shifted and Nuttall-windowed to create impulse responses (middle), which were then used to assess the actual frequency response by conversion back into the frequency domain using the FFT (bottom).

…peaky speech response from a single subject. The response shows ABR waves I, III, and V at ~3, 5, and 7 ms, respectively. It also shows later peaks corresponding to thalamic and cortical activity at ~17 and 27 ms, respectively.