title: Anticipating words during spoken discourse comprehension: A large-scale, pre-registered replication study using brain potentials
authors: Nieuwland, Mante S.; Arkhipova, Yana; Rodríguez-Gómez, Pablo
date: 2020-09-30
journal: Cortex
DOI: 10.1016/j.cortex.2020.09.007

Numerous studies report brain potential evidence for the anticipation of specific words during language comprehension. In the most convincing demonstrations, highly predictable nouns exert an influence on processing even before they appear to a reader or listener, as indicated by the brain's neural response to a prenominal adjective or article when it mismatches the expectations about the upcoming noun. However, recent studies suggest that some well-known demonstrations of prediction may be hard to replicate. This could signal the use of data-contingent analysis, but might also mean that readers and listeners do not always use prediction-relevant information in the way that psycholinguistic theories typically suggest. To shed light on this issue, we performed a close replication of one of the best-cited ERP studies on word anticipation (Van Berkum, Brown, Zwitserlood, Kooijman & Hagoort, 2005; Experiment 1), in which participants listened to Dutch spoken mini-stories. In the original study, the marking of grammatical gender on pre-nominal adjectives ('groot/grote') elicited an early positivity when mismatching the gender of an unseen, highly predictable noun, compared to matching gender. The current pre-registered study involved that same manipulation, but used a novel set of materials twice the size of the original set, an increased sample size (N=187), and Bayesian mixed-effects model analyses that better accounted for known sources of variance than the original. In our study, mismatching gender elicited more negative voltage than matching gender at posterior electrodes. However, this N400-like effect was small in size and lacked support from Bayes Factors. In contrast, we successfully replicated the original's noun effects. While our results yielded some support for prediction, they do not support the Van Berkum et al. effect and highlight the risks associated with commonly employed data-contingent analyses and small sample sizes. Our results also raise the question of whether Dutch listeners reliably or consistently use adjectival inflection information to inform their noun predictions.

According to current theories of language comprehension, people implicitly and routinely anticipate upcoming words by activating their meaning and possibly other features in advance (e.g., Altmann & Mirkovic, 2009; Dell & Chang, 2014; Levy, 2008; Pickering & Gambi, 2018; Pickering & Garrod, 2013). The most convincing evidence for word anticipation or prediction is when a neural effect of a predictable word is obtained before that word is presented, for example, on a preceding article or adjective. When readers or listeners predict a specific noun, pre-nominal words that mismatch the predicted word elicit an enhanced event-related potential (ERP) response compared to matching words (for a review, see Kutas, DeLong & Smith, 2011). Several studies have provided such evidence in the past decades (e.g., DeLong, Urbach & Kutas, 2005; Van Berkum, Brown, Zwitserlood, Kooijman & Hagoort, 2005; Otten & Van Berkum, 2008, 2009; Wicha, Moreno & Kutas, 2004).
However, the replicability and consistency of the observed patterns have recently come into question (Ito, Martin & Nieuwland, 2017a,b; Kochari & Flecken, 2019; Nieuwland, Politzer-Ahles, Heyselaar, Segaert, Darley, Kazanina, et al., 2018), which has highlighted the need for replication of key findings. The current study aims to replicate one such highly influential ERP study on linguistic prediction (Experiment 1 of Van Berkum et al., 2005, henceforth VB05), which tested for effects of prediction on pre-nominal adjectives during comprehension of spoken Dutch mini-stories.

It has long been known that the language comprehension system works highly incrementally, incorporating novel input into the unfolding interpretation as soon as possible (e.g., Marslen-Wilson, 1975; Marslen-Wilson & Tyler, 1980). Rather than waiting for a word or sentence to finish, listeners can use the initial sound of a word to identify its potential meaning or reference (e.g., Connolly & Phillips, 1994). Moreover, nouns that are predictable from their context elicit a reduced N400 compared to less predictable nouns. But an amplitude reduction of the noun-elicited N400 alone does not provide clear evidence that a specific word was predicted ('lexical prediction'), because it is also compatible with a more passive pre-activation of related semantic content (which may emerge naturally from comprehension of the preceding context; for discussion, see Baggio, 2018; Van Berkum, 2009). In addition, the observed effect could mean that the predictable noun is easier to integrate with the sentence than an unpredictable word, simply because it is a more plausible sentence continuation (for discussion, see Federmeier & Kutas, 1999), regardless of whether it was actually predicted. For these reasons, researchers typically argue that evidence for lexical prediction is strongest when it is observed before the predicted noun is heard or read, and is obtained by comparing ERPs to words that themselves have little semantic meaning (e.g., the English articles 'a/an') and/or do not differ in meaning (e.g., the Dutch adjectives 'groot/grote', which have the same meaning but differ in the presence of the inflectional suffix '-e' to mark grammatical gender). Differential effects elicited by these prenominal critical words cannot be due to a difference in the meaning of the words themselves, and are therefore thought to arise from the phonological or grammatical relationship between the prenominal word and the predicted noun.

ERP evidence for lexical prediction has been reported from various prenominal manipulations in Spanish, English and Dutch. In a pioneering study by Wicha, Bates, Moreno and Kutas (2003), native Spanish speakers listened to sentence pairs in which a relatively predictable (i.e., moderate-to-high cloze probability) or an unexpected and incongruent noun was replaced with a drawing. Crucially, gender-marked prenominal articles elicited an enhanced N400 ERP when their gender did not match that of the predictable nouns. A follow-up experiment with written Spanish sentences found a very similar N400 effect of gender-mismatch with a predictable noun. In a subsequent experiment with written Spanish sentences but without accompanying drawings, Wicha et al. (2004) found that gender-mismatching articles elicited a different pattern, namely a positive ERP effect (P600) compared to matching articles. In more recent studies on comprehension of written sentences, gender-mismatch on prenominal articles was associated with N400-like effects, i.e.,
an enhanced negativity in the typical N400 time window (Dutch: Fleur, Flecken, Rommers & Nieuwland, 2020; Otten & Van Berkum, 2009; Spanish: Foucart, Martin, Moreno & Costa, 2014; Martin, Branzi & Bar, 2018; Molinaro, Gianelle, Caffarra & Martin, 2014), although sometimes with a time course or scalp distribution unlike that of the typical N400 effects elicited by nouns.

In a series of studies on comprehension of Dutch mini-stories, VB05 reported evidence for prediction from a different manipulation also involving grammatical gender. These studies capitalized on the Dutch grammatical rule by which adjectives are marked with an inflectional suffix when they modify an indefinite noun of common gender but not of neuter gender. For example, the suffix on 'grote' in 'een grote boekenkast' (a large bookcase) agrees with the common gender of 'boekenkast', whereas the lack of the inflectional suffix '-e' on 'groot' in 'een groot schilderij' (a large painting) agrees with the neuter gender of 'schilderij'. In Experiment 1 of VB05, participants listened to a two-sentence context that presumably led people to predict a specific noun (e.g., 'schilderij', painting). This context was followed by an adjective phrase that contained two gender-marked adjectives and either the predictable noun (e.g., 'groot maar onopvallend schilderij', big but unobtrusive painting) or a less predictable noun of a different gender (e.g., 'grote maar onopvallende boekenkast', big but unobtrusive bookcase). VB05 found that a mismatch between the inflectional suffix (or lack thereof) on the first adjective and the gender of a predictable noun (e.g., 'grote' when the predictable noun was 'schilderij') elicited a more positive ERP than a match. This positive ERP effect had a very early onset, namely about 50-250 ms after the first acoustic difference between words with and without inflection, although when ERPs were time-locked to adjective onset the ERP difference had a later time course (500-800 ms).

In Experiment 2, participants listened to only the target sentences, which by themselves presumably did not lead to a specific noun prediction. Consistent with this hypothesis, no statistically significant ERP effect was obtained for the adjectives. In Experiment 3, participants read a subset of the materials from Experiment 1 in a self-paced reading study, in which participants press a button to make each next word appear on the screen. Participants slowed down when the second adjective mismatched the predictable noun, compared to a match. Although no such effect was obtained on the first adjective (which would be a behavioral equivalent of the ERP results obtained in Experiment 1), these reading time results nevertheless supported the hypothesis that people anticipated the predictable noun (but see Guerra, Nicenboim & Helo, 2018, for a recent failure to find prediction-related reading time results for Spanish sentence comprehension). In two follow-up studies (Otten, Nieuwland & Van Berkum, 2007; Otten & Van Berkum, 2008), the same suffix-based manipulation elicited ERP effects that were different from those obtained in Experiment 1 of VB05.
In a study with spoken materials (Otten et al., 2007; henceforth OT07), prediction-mismatching adjectives elicited a negative ERP effect at right-frontal electrodes, time-locked to adjective onset, that started at about 300 ms and lasted until 600 ms (based on the associated scalp distribution, the authors were reluctant to interpret this effect as an N400 modulation). In a study with written materials (Otten & Van Berkum, 2008), prediction-inconsistent adjectives elicited a negative ERP effect that appeared as late as 900-1200 ms after adjective onset.

The gender effects reported by Wicha and colleagues, by OT07 and VB05, as well as by various others, indeed suggest prediction of a specific noun, but various questions about the functional significance of these effects remain. For example, it is unclear whether pre-activated information (which is presumably already available before an article or adjective is presented) includes grammatical gender (e.g., Pickering & Gambi, 2018; Wicha et al., 2004). It is possible that the initial prediction is limited to word meaning, and that people use gender information to evaluate whether the specific word can still appear. A second question is whether effects of gender-mismatch reflect the detection of a prediction mismatch or (also) the updating or revision of a prediction (for discussion, see Nieuwland et al., 2018b). Importantly, aside from these questions about interpretation, a major obstacle to any unitary interpretation of the available results is that qualitatively different types of effects have been obtained with (sometimes very) similar gender-based manipulations (for discussion, see Ito et al., 2017b; Kochari & Flecken, 2019). This could signal something meaningful, namely that different processes are engaged in each of these studies. However, it could also signal the problem with statistically significant effects obtained in noisy, small-sample settings, which are associated with an increased probability of overestimated or wrong-sign effect estimates (e.g., Gelman & Carlin, 2014; Vasishth, Mertzen, Jäger & Gelman, 2018).

This problem with small-sample effects has also surfaced in another well-known demonstration of prediction, a study on English sentence comprehension by DeLong, Urbach and Kutas (2005), who capitalized on the phonological rule for indefinite articles (i.e., 'a/an' signals that the next word will start with a consonant or a vowel, respectively). Indefinite articles that mismatched a predictable noun in terms of phonology (e.g., 'an' if the predictable word was 'kite') elicited an enhanced N400 compared to matching articles. This effect is therefore sometimes taken to demonstrate phonological prediction (Pickering & Garrod, 2013). However, the DeLong et al. results have proven controversial. A study by Martin, Thierry, Kuipers, Boutonnet and Costa (2013) reported a similar effect for mismatching articles, but differences in the analysis complicated a quantitative and qualitative comparison to the DeLong et al. results (for discussion, see Ito, Martin & Nieuwland, 2017a,b). Another study with this manipulation (Ito et al., 2017) did not obtain a reliable article-mismatch effect. Moreover, a recent, large-scale (N=334) direct replication study (Nieuwland et al., 2018b) failed to replicate the result of DeLong et al.
in an analysis that duplicated the original, and found no statistically significant effect in an additional analysis that took into account subject- and item-level variance. Nieuwland et al. concluded that the 'a/an' article effect may indeed be non-zero, but that it is likely far smaller than originally reported and too small to observe without very large sample sizes. Nieuwland et al. further speculated that the a/an manipulation does not elicit reliable or strong prediction effects because these articles are diagnostic of the next word, which need not be a noun (e.g., 'an old kite'). Unexpected articles thus do not actually refute the upcoming noun altogether, but signal that the noun cannot appear immediately after the article. Stronger or more reliable effects might therefore be obtained with gender-marked articles or adjectives, which can disconfirm the predicted noun because they agree with that noun in gender irrespective of intervening words.

Given the strength of gender agreement relationships, the apparent lack of consistent patterns across studies with gender-based manipulations may seem disconcerting. Establishing the nature and timing of such effects is critical to developing hypotheses about the mechanisms that underlie the generation and evaluation of predictions. For example, prediction-mismatching suffixes elicited an early-onset, positive ERP effect in VB05, whose authors did not commit to a specific functional interpretation of this effect beyond the conclusion that it demonstrated lexical prediction (see also Van Berkum, 2004). However, this positive ERP response could be related to P600 effects seen for syntactically unexpected information (e.g., Osterhout, Holcomb & Swinney, 1994) and for morphosyntactic agreement mismatch (e.g., Tanner, Grey & van Hell, 2017; Wicha et al., 2004). Such P600 effects are thought to reflect a reanalysis or syntactic integration process (e.g., Kaan, Harris, Gibson & Holcomb, 2000; Kaan & Swaab, 2003). In contrast, studies reporting N400 or N400-like effects suggest that predictions impact the activation of word meaning (lexical access), and have led some authors to argue that people even predict the specific form of the prenominal article itself along with the noun (DeLong et al., 2005).

However, before attempting to explain the different effects of prediction and their association with specific linguistic manipulations or experimental procedures, the field needs to establish which of the key findings can be replicated with similar methods and materials, in a sample large enough to obtain a sufficiently reliable and plausible effect estimate. This is not a trivial issue in a research field where it has long been, and still is, rather common to select a dependent variable based on visual inspection of low-sample ERP data (Kilner, 2013; Luck & Gaspelin, 2017), a practice that leads to over-estimated effect sizes and higher rates of false positives (e.g., Gelman & Loken, 2013; Gelman & Carlin, 2014; Vul, Harris, Winkielman & Pashler, 2009). Moving away from ERP analysis based on visual inspection, recent ERP studies on language comprehension have pre-registered data processing steps and statistical analyses, and explicitly distinguished between confirmatory and exploratory analyses (e.g., Fleur et al., 2020; Nieuwland et al., 2018b; Sassenhagen & Bornkessel-Schlesewsky, 2015).
The current study tries to replicate the main result obtained in Experiment 1 of VB05, along with that of OT07, which used the same manipulation and similar materials. Like DeLong et al. (2005), VB05 is an influential and highly cited ERP study (at time of writing, 843 citations on Google Scholar) that features in major theoretical reviews on linguistic prediction (e.g., Altmann & Mirkovic, 2009; Kutas & Federmeier, 2011; Pickering & Garrod, 2013). However, like DeLong et al., VB05 has yet to be successfully replicated. The only available study with the same gender-based inflection manipulation found an effect in the opposite direction (OT07). Moreover, the key evidence reported by VB05 came from an analysis in a time window that was based on visual inspection of the grand-average ERP waveforms (p. 448). This procedure has long been, and probably still is, common (see Kilner, 2013; Luck & Gaspelin, 2017; see also Nieuwland, 2019, for related discussion), although it is not always explicitly mentioned in Methods sections (for example, in some work by the first author of this paper; see Nieuwland, Ditman & Kuperberg, 2010; Nieuwland & Kuperberg, 2008). Selection of a spatial and/or temporal region-of-interest is an appropriate method to sidestep the requirement for multiple-comparison correction. However, it is only robust and valid when the selection is independent of the data; it carries an increased risk of false positives if the selection is based on where an effect looks strongest in grand-average ERPs (e.g., Kilner, 2013; Luck & Gaspelin, 2017; see also http://deevybee.blogspot.com/2013/06/interpreting-unexpectedsignificant.html). Therefore, we deemed it important to perform a pre-registered attempt to replicate the key results of VB05 in a sufficiently powered study (see Methods for power analyses).

The ERP effect of prediction-mismatching inflections is also important because this manipulation tests for online prediction in a unique, possibly quite subtle way. Dutch adjective-suffix inflection is a better cue to noun gender than definite articles: in Dutch, 'het' is also used for diminutive nouns, whereas 'de' is used for plural nouns, irrespective of noun gender. This pattern is different for adjectives following the indefinite article 'een', which rules out a plural noun. The absence of a suffix is compatible with a diminutive noun irrespective of gender ('een leuk boekje/tafeltje', a nice little book/table), but suffix presence is only consistent with a common-gender noun. Absence and presence of the suffix therefore have different repercussions for whether the general semantic meaning of the predicted noun can still follow: if one predicted 'boek', then 'een leuke' disconfirms that lexical meaning; if one predicted 'tafel', then 'een leuk' does not entirely disconfirm that meaning, as the diminutive 'tafeltje' could follow. Even in an experiment where no such diminutives appear, participants may be sensitive to the possibility of the expected word appearing in diminutive form and may therefore not take 'missing' inflection as a cue that the predicted meaning is wrong (see also Nieuwland et al., 2018b). Aside from this issue of 'cue reliability', gender-mismatching inflections might be relatively hard to detect compared to prediction-mismatching gender-marked articles ('el/la' in Spanish, 'de/het' in Dutch), and therefore be less likely to yield prediction effects.
The inflection manipulation relies on the detection of an absent or present inflection within a short time frame, whereas the article manipulation relies on detecting two entirely different lexical items. In addition, it involves the detection of prediction-relevant information from an adjective that itself is relatively unexpected and contains novel semantic content. This differs from detection of a gender-marked article that itself might be predicted along with the noun and might therefore generate stronger effects. For these reasons, the inflection-based gender manipulation might pose a stronger test of the predictive use of gender-relevant information during language comprehension than the more commonly used article-based gender manipulation (e.g., Foucart et al., 2014; Martin et al., 2018; Wicha et al., 2004).

The current study aims to replicate previously observed patterns for prediction-mismatching adjective inflections (VB05; OT07). We use a novel set of experimental materials that is twice the size of that in VB05, based on materials from a recent ERP study on lexical prediction (Fleur et al., 2020). Fleur et al. constructed two-sentence mini-stories that either suggested a definite noun phrase (e.g., 'het boek', the book) or an indefinite noun phrase ('een boek', a book) as the most likely continuation. Following these contexts, participants saw a definite noun phrase with either the expected noun ('het boek') or an unexpected, different-gender noun ('de roman'). Using pre-registered data preprocessing procedures and statistical analyses, Fleur et al. found that gender-mismatching articles elicited an enhanced N400 compared to gender-matching articles (see also Otten & Van Berkum, 2009, for similar results). These findings are relevant for the current study because they show that our materials have already demonstrated relevant ERP effects of prediction. In the current study, we only used a subset of the story contexts of Fleur et al., namely those in which an indefinite noun phrase was the expected continuation, with a minimum cloze value of 75% for both articles and nouns (the cut-off used for the main analysis in VB05). To match the manipulation of VB05, we added two adjectives to each target noun phrase (see Table 1 for an example story).

Using Bayesian mixed-effects model analyses, we take a spatiotemporal region-of-interest approach to test for effects of prediction-match on average voltage values for each trial. The choice of ROIs and time windows was based on VB05 and OT07. These analyses aim to answer the question of whether previous neural evidence of lexical prediction from gender-marked, pre-nominal adjectives can be replicated. We pre-register further exploratory analyses to test the effect of prediction-match with traditional ANOVAs (on average values per condition per subject, following VB05 and OT07), and to test whether the prediction-match effect is similar for common and neuter gender nouns.

An initial, minimum sample size was determined by a mixed-effects a priori power analysis with the SIMR package (Green & MacLeod, 2016). Because no single-trial data were readily available from VB05 and OT07, single-trial data were adapted from a previously published study (Ito et al., 2017a,b; Nieuwland et al., 2018). For the simulation, the number of items for the model was extended to 150 to match the number used in the current study; a sketch of this simulation approach is given below.
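As an illustration, this kind of simulation-based power analysis could look roughly as follows in R. The data frame d_ito, its column names, and the random-effects structure are hypothetical stand-ins, not the authors' actual script.

```r
# Minimal sketch of the SIMR power analysis, assuming a hypothetical
# single-trial data frame 'd_ito' (columns: voltage, match, subject, item)
# adapted from earlier published data; all names are illustrative.
library(lme4)
library(simr)

fit <- lmer(voltage ~ match + (1 + match | subject) + (1 + match | item),
            data = d_ito)

# Extend the design to 150 items (matching the current study's materials)
# and 90 subjects, then simulate power for the fixed effect of 'match'.
fit <- extend(fit, along = "item", n = 150)
fit <- extend(fit, along = "subject", n = 90)
powerSim(fit, test = fixed("match"), nsim = 1000, alpha = 0.02)
```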
Power analysis by simulation (number of simulations = 1000) showed that a sample of 90 subjects was sufficient to detect the effect at a significance level of alpha = 0.02 with 90% power. We set the initial sample size slightly higher, at N = 100, which is more than 4 times the sample size of VB05; this number refers to the participants ultimately used for statistical analysis, and thus to the minimum number of participants to be tested. Participants were excluded from the statistical analysis using pre-defined criteria (a response accuracy under 75%, or insufficient trials after artefact rejection, described in the data preprocessing section). Each excluded participant was replaced by another participant, and the number of excluded participants is reported. However, this sample size was set as a minimum, because the simulation did not take into account the potential effects of the absence/presence of the inflection, and does not guarantee the Bayes Factor evidence strength required by this journal.

In the original studies, which used ANOVAs, the absence/presence of inflection was approximately balanced across items. In the current study, however, this factor is explicitly accounted for in the model (see Sassenhagen & Alday, 2016), using a more powerful analysis that simultaneously takes into account sources of variance (subjects, items, presence of inflection) that were not included in the original studies' ANOVAs. This is important because, not only may the effect of match differ for different-gender adjectives (see the pre-registered exploratory analysis), unaccounted-for variation that is orthogonal to the effect of interest (e.g., random intercept variation) can reduce power, while unaccounted-for variation that is confounded with our effect of interest (e.g., random slope variation) can drive differences between means, with increased risk of false positives and overestimation of effect size (for discussion, see Barr et al., 2013). Therefore, our final sample size was not based on the a priori power analysis: we continued to increase our sample size from 100 towards 200 in steps of 20 participants until we reached the Bayes Factor evidence strength required by this journal (see Statistical Analysis). However, our laboratory was shut down due to the unfolding covid-19/coronavirus pandemic when we had reached 189 participants. Because we had no prospect of continuing testing in the foreseeable future, and because we were confident that the remaining 11 to-be-tested participants would not change our conclusions, the editor granted us an early end to sampling.

Instead of the materials used in the original study (VB05), we used a suitable set of stimuli readily available from a previous study (Fleur et al., 2020). This set of materials was created following a similar procedure to that of the original, was larger than that of the original, and had already been normed for cloze probability. Moreover, as stated in the introduction, Fleur et al. had already demonstrated a prediction-consistency N400 effect on pre-nominal articles using these stimuli (see also Otten & Van Berkum, 2009). The critical stimuli for the current study consisted of 150 mini-stories, each of which had one context sentence and two possible target sentences. Each mini-story was written to suggest a specific combination of an indefinite article plus predictable noun in the target sentence.
In a cloze test (N=20), each mini-story was truncated before the article⁴, and participants were asked to complete each story. The stories in the current study were completed by at least 75% of the respondents with the same combination of the expected indefinite article and noun⁵. The average cloze probability of the indefinite articles was 95.9% (SD = 5.9%, range 75-100%), and that of the nouns was 93.4% (SD = 6.4%, range 75-100%). These values are numerically higher than those of the original studies⁶. Of the 150 predictable nouns, 80 were common-gender 'de' words, and 70 were neuter-gender 'het' words.

⁴ Materials in the cloze test of VB05 were truncated after the indefinite articles. Our procedure differed in where the materials were truncated because we also wanted to obtain cloze values for the prenominal articles (Fleur et al., 2020).
⁵ Different spellings of the same word were permitted and counted towards the same response, such as 'tattoo/tatoeage' or 'tv/televisie'.
⁶ For comparison, the expected nouns in VB05 had an average cloze of 86% (SD = 6%, minimum = 75%) while the unexpected nouns had an average of 2% (SD = 3%). The expected nouns of OT07 had an average of 74% (SD = 14%, range 53-100%), whereas the unexpected nouns had an average of 3% (SD = 6%). Although the difference in noun cloze values between the current study and the original studies is small, it is unlikely to make it harder to find a prediction effect in the current study, given that higher noun cloze is associated with more, not less, predictive processing (e.g., Kutas & Federmeier, 2011).

After the norming, we added two adjectives (separated by a function word, e.g., 'groot en sterk' or 'grote en sterke', which both mean 'big and strong') in between the indefinite article and the critical noun. If one or more participants in the cloze test had used an adjective to complete a sentence fragment (which was rare), that adjective was not used for that item in the EEG experiment. For each mini-story, we created a prediction-matching and a prediction-mismatching condition. In the matching condition, the absence or presence of an overtly realized suffix '-e' on both adjectives matched the gender of the predictable noun ('dik en spannend' when the predictable noun is 'boek'), and the second adjective was followed by the predictable noun. In the mismatching condition, the absence or presence of the suffix mismatched the gender of the predictable noun ('dikke en spannende' when the predictable noun is 'boek'), and the second adjective was followed by a different-gender noun that was semantically possible but less predictable (e.g., 'roman'). These alternative nouns had appeared at most once in the cloze responses (average cloze value 0.2%, SD = 0.9%, range 0-5%). As in the materials of VB05 and OT07, there was no overlap between the set of predictable nouns and the set of alternative nouns (which means that a direct comparison between these nouns may be confounded by lexical variables). All sentences were grammatically correct. The critical nouns were never sentence-final, and all subsequent words were identical for the matching and mismatching condition of a given story.

An additional set of 120 filler mini-stories of two sentences each was added to the materials. Sixty fillers were similar to the prediction-match condition: they also contained high-cloze nouns (average cloze 74.5%, SD = 17.7%, range 29.4-95%), preceded by pre-nominal adjectives.
Due to these fillers, highly constraining stories in the entire experiment were almost twice as likely to end in a predictable as in an unpredictable noun. We added these fillers to counter the argument that participants would adapt to unexpected syntactic information (i.e., start expecting unexpected information) and therefore not show prediction-consistency effects. Such adaptation effects have been reported, albeit only for frequent repetition of an unexpected syntactic structure (see Fine, Jaeger, Farmer, & Qian, 2013; but see also a recent failure to replicate this type of adaptation effect, Stack, James & Watson, 2018), not for varied sentence structures as used in the current experiment. We also added 60 relatively non-constraining stories to increase the variability of our materials and to make the ratio between the experimental and filler stories more similar to that of VB05. These were adapted from a subset of low-constraint materials used by Otten and Van Berkum (2009; 'prime control' stories), and they did not contain nouns preceded by two adjectives in the second sentence. The current study thus used different fillers and a slightly different ratio between experimental and filler items (150/120) than VB05 and OT07 (120/150 and 160/90, respectively), although it was similar to these previous studies in using only grammatically correct and semantically coherent/plausible mini-stories as fillers. Although the possible effect of the number and type of fillers on comprehension is not precisely known, some ERP studies suggest that a high proportion of prediction-licensing materials actually boosts predictive processing (e.g., Brothers, Swaab & Traxler, 2017; Lau, Holcomb & Kuperberg, 2013). In addition, Fleur et al. (2020) obtained N400 evidence for prediction on prenominal articles in a study that only included high-constraint sentences, of which 66% contained a critical 'de/het' article. For these reasons, we do not think there is a convincing a priori argument that our materials would elicit less predictive processing than those of VB05 and OT07.

The mini-stories were recorded at a normal speaking rate and with normal intonation by a female native speaker. Recordings followed the procedure described in VB05. Target sentences of the critical stories were recorded in both conditions. The context sentence was recorded once, together with either the prediction-matching target sentence or the prediction-mismatching target sentence (counterbalanced over stories). We ensured that the critical inflections were always clearly distinguishable from the subsequent word. The recordings of context and target sentences were stored separately. From the target sentence recordings, we identified the acoustic onset of the critical adjectives, of the critical inflections therein, and of the critical noun. Inflection onset was determined as the moment at which adjectives began to differ between conditions in terms of their respective phonemes, following VB05.

Unlike VB05 and OT07, we included comprehension questions to encourage participants to pay attention to the meaning of the stories, and as a means to exclude participants who did not pay sufficient attention. A potential null effect of prediction is then unlikely to result from participants not paying attention to the meaning of the stories. In our view, the importance of ruling out such a 'lack of attention' account balances out the slight deviation from the original studies.
It is not known whether and how comprehension questions, which are orthogonal to the manipulation of interest, change the way that people process the meaning of linguistic stimuli. To the best of our knowledge, there is no evidence to suggest that comprehension questions cause participants to process stimuli less predictively than when there are no comprehension questions. Moreover, our materials including the comprehension questions largely overlap with and are highly similar to those of Fleur et al. (2020), who obtained relevant ERP evidence for prediction on pre-nominal, gender-marked articles, as has also been reported in experiments without comprehension questions (Otten & Van Berkum, 2009). In our study, 80 of the 270 stories (29%) were followed by a yes/no comprehension question (38 followed a filler story, 42 followed a critical story; 40 questions were designed to elicit a 'yes' response, and the other 40 a 'no' response). Each question was answerable from the preceding mini-story, irrespective of the critical manipulation when it followed a critical story. Participants who answered fewer than 60 of the 80 questions correctly (75%) would have been replaced by new participants presented with the same materials. However, none of our participants met this exclusion criterion, and they answered on average 96% of the questions correctly (SD = 3%, range = 83-100%).

Materials were organized into two trial lists. In one list, one half of the critical stories was presented in the matching condition and the other half in the mismatching condition, and vice versa in the second list. Conditions within each list were pseudorandomized such that no more than 2 filler stories were presented successively and no more than 3 matching or mismatching critical stories were presented successively. Both lists started with 5 practice stories, including two practice comprehension questions.

Procedure

After electrode application, subjects sat in a sound-attenuating booth and listened to the stories from a speaker placed in front of them. Although participants in VB05 listened to the stories over earphones, we decided to present the stories over speakers, which was also done in later work by Van Berkum and colleagues, including OT07. This method of delivery is more comfortable for participants and, unlike a headphone setup, does not introduce additional artefacts. Our participants were asked to listen attentively to the stories and answer the comprehension questions. After the 5 practice stories, the 270 stories were presented in 5 blocks of 54 items, separated by rest periods. Participants self-paced through the experiment by button press. Upon pressing, each trial started with a 300 ms auditory tone, followed by 700 ms of silence, the spoken context sentence, 1000 ms of silence, and the spoken target sentence. Participants were instructed to sit still and to refrain from eye-movements and blinks during the second, target sentence. To signal to subjects when to sit still and refrain from eye-blinks and eye-movements, an asterisk was displayed from 500 ms before the onset of the target sentence until 1000 ms after the sentence offset. If a trial was followed by a comprehension question, the question appeared in its entirety on the screen upon disappearance of the asterisk.

Participants also completed a set of individual-differences tests. One of these involved sentence constructions that are used by some groups of speakers but that are considered ungrammatical by others.
The third test was the Peabody vocabulary test (Dunn & Dunn, 1997; Schlichting, 2005), in which participants hear words and have to select matching pictures from sets of four options. The fourth test was the Dutch version of STAIRS4WORDS, an adaptive test for assessing receptive vocabulary size (Hintz, Jongman, Dijkhuis, van 't Hoff, Damian, Schröder et al., 2018). On each trial, the participant saw a word or a non-word foil (ratio 3:1) and indicated whether or not they knew the item.

The EEG was recorded from 27 active electrodes (Fz, FCz, Cz, Pz, Oz, F7/8, F3/4, FC5/6, FC1/2, T7/8, C3/4, CP5/6, CP1/2, P7/8, P3/4, O1/2) relative to a left-mastoid reference electrode, along with activity at a right-mastoid reference channel and 4 EOG channels. The electrode locations were similar but not identical to those of VB05 and OT07, but they still allowed for a quadrant-based ROI analysis similar to the ones reported in these earlier studies. Data were recorded with a BrainAmp DC amplifier, at a sampling rate of 1000 Hz, using a time constant of 10 s (0.016 Hz) and a high cut-off of 250 Hz in the hardware filter (this high cut-off differed from the 70 Hz used in VB05 but matched that of OT07 and allowed for later analysis in the 30-100 Hz gamma frequency band), with an additional high cut-off of 100 Hz in the recording software⁹. Electrode impedance was kept below 20 kΩ where possible. This differed from the procedure in VB05 (who used passive electrodes), but it is within the guidelines of the hardware manufacturer, and lowering impedances to under 3 kΩ takes prohibitively long and could cause too much discomfort to participants. Because we had a large number of trials and participants, and because the recordings took place in an air-conditioned room to ensure a cool and dry environment, such impedance differences were very unlikely to meaningfully impact statistical power (see Kappenman & Luck, 2010). In addition, our sample size calculation was based on high-impedance (Biosemi) data, and we had already demonstrated prediction ERP effects in our lab with a less restrictive impedance threshold (<25 kΩ; Fleur et al., 2020). Regardless, impedance values were stored before and after the experiment for potential checks.

⁹ Because the BrainRecorder software could only achieve the pre-registered filter outcome by a combination of hardware and software filtering, we deviated from the pre-registered protocol in the filter settings. To err on the safe side, we furthermore doubled the pre-registered sampling rate. We matched the precise VB05 hardware filter settings during offline filtering.

Data pre-processing was performed using BrainVision Analyzer. First, bad EEG channels were identified through visual inspection (as electrodes showing poor signal for at least half of the experiment due to blocking, faulty connectivity or other large-amplitude artefacts) and interpolated through spherical splines. Our pre-registration stipulated interpolation of maximally 4 EEG channels per participant. We ended up interpolating 1 channel for 13 participants, 2 channels for 6 participants, 3 channels for 3 participants, and 4 channels for 1 participant. An interpolation procedure was not used or mentioned in VB05, but we used it to avoid unnecessary data loss. Then, the continuous data were filtered with a 0.02-70 Hz (24 dB/oct) Butterworth IIR band-pass filter (we also included a 50 Hz notch filter, which was not preregistered) to match the hardware filter of VB05.
We then segmented the data into epochs from 150 ms before to 2100 ms after the onset of the first critical adjective. However, because this epoch did not always extend 1000 ms beyond the later noun as in VB05, we created separate segments for the nouns, ranging from 150 ms before to 1000 ms after noun onset. We used separate segments instead of longer segments that included all the critical adjective and noun data, because longer segments would have also included artefact-rich post-sentence data in many of the trials. Each segment was then screened for large muscle artifacts, electrode drift, and amplifier blocking. We corrected for artefacts in the segments (due to eye-movements, blinks, cardiac or steady muscle activity) using Independent Component Analysis (trained on segments extracted from continuous data that were filtered with a 0.1-70 Hz zero-phase-shift Butterworth band-pass filter plus a notch filter). This procedure was not included in VB05 but was used in OT07 and avoids unnecessary data loss, which is important because participants might find it hard or distracting to avoid blinking and eye-movements. We subsequently extracted smaller segments running from 150 ms before to 1000 ms after the onset of adjectives, inflections and nouns. For each segment, baseline correction was performed by subtracting the average voltage in the relevant (of three) 150 ms pre-stimulus interval from the entire segment. We then applied an automated artefact rejection procedure that rejected epochs with values exceeding +/-100 μV. Although no such rejection procedure was applied in VB05, we felt it was important to apply one objective artefact criterion instead of relying only on visual inspection, also to remove segments with artefacts that had been overlooked during visual inspection.

In VB05, no participant exclusion criteria were mentioned, but it was stated that, on average, 20.5% of all trials were rejected (based on visual inspection of the data), and that there were no asymmetries between conditions. Here, participants were excluded if their comprehension question accuracy was under 75%, if they had fewer than 40 remaining trials of the initial 75 trials (53%) in any of the 6 conditions (matching/mismatching adjectives time-locked to onset or inflection, matching/mismatching nouns), or if they had fewer than 50 remaining trials (66.7%) on average across all 6 conditions. Two participants were excluded, which left us with a total of 187 participants, who had, on average, 72 match and 72 mismatch trials time-locked to inflection, 72 match and 72 mismatch trials time-locked to onset, and 73 match and 73 mismatch noun trials (corresponding to a trial rejection rate of approximately 4% for inflection and adjective onset trials, and 3% for noun trials). All raw data and pre-processed data are available at https://osf.io/jqhpz. Before statistical analysis, we downsampled the data segments to 500 Hz, matching that of VB05.

For the adjectives and inflections, we defined five spatiotemporal ROIs based on VB05 and OT07: the four scalp quadrants and the midline, using the 50-250 ms window after inflection onset and the 300-600 ms window after adjective onset. We tested every spatiotemporal ROI, so that we did not miss effects at ROIs where VB05/OT07 did not observe statistically significant effects. We defined three spatiotemporal ROIs for the nouns (based on VB05), such that we averaged activity within a 300-500 ms time window after noun onset within the left-posterior quadrant, the right-posterior quadrant, and the midline.
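For concreteness, the ROI averaging that produces one value per trial might be sketched as follows. Here, 'eeg' is a hypothetical long-format data frame, and the channel grouping is illustrative, since the exact channel-to-quadrant assignment is not spelled out in this section.

```r
# Sketch of spatiotemporal ROI averaging for the noun analysis, assuming a
# hypothetical long-format data frame 'eeg' with columns subject, item,
# trial, match, channel, time (ms relative to noun onset), and voltage.
library(dplyr)

left_posterior <- c("CP5", "CP1", "P7", "P3", "O1")  # illustrative quadrant

noun_lp <- eeg %>%
  filter(channel %in% left_posterior, time >= 300, time <= 500) %>%
  group_by(subject, item, trial, match) %>%
  summarise(voltage = mean(voltage), .groups = "drop")  # one value per trial
```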
Using the trial-level data from these ROIs, we performed Bayesian linear mixed-effects model analyses using the 'brms' package (Bürkner, 2017) in R (R Core Team, 2018), which fits Bayesian multilevel models in the Stan programming language (Stan Development Team, 2018) with formula syntax that is similar to that of the 'lme4' package (Bates, Maechler, Bolker & Walker, 2014). In addition to Bayes Factor hypothesis testing, this analysis allows for Bayesian estimation of effect sizes. Each model included the deviation-coded fixed factor 'match' (match/mismatch with the predictable word) and the deviation-coded fixed factor 'gender' (common/neuter suffix). The factor 'gender' captures the potential effect of the absence or presence of the inflectional suffix '-e', which is not manipulated within each item (and therefore not included as a random slope for 'item'):

voltage ~ match + gender + (match + gender | subject) + (match | item)

We followed the suggestions for replication studies by Dienes (2014) and Dienes and McLatchie (2018), namely to use, for the effect of interest, a prior with a zero mean and a standard deviation that is the previously reported effect size. The previous effect sizes went in both directions (i.e., a positivity in VB05, a negativity in OT07) and corresponded to roughly a .75 μV difference. To perform replication tests of those studies, the prior on the effect of match was a normal distribution with a zero mean and SD = .75 μV (here and elsewhere, the normal distribution was used lacking an obvious reason to assume non-normality). In other words, there is a 95% prior probability that the parameter lies between -1.5 and 1.5 μV (of note, these are two-sided tests, but we make up for the associated overall lower prior density in our sampling plan described below). We also included a prior for the effect of gender, centered on zero with a normal distribution because the effect could be positive or negative, and with the same prior SD as for match (assuming the effect of gender is unlikely to be bigger than that of match), such that there is a 95% probability that the parameter lies between -1.5 and 1.5 μV.

We included an intercept prior with a normal distribution, a mean of zero and an SD of 1.5, such that there is a 95% probability that the intercept parameter lies between -3 and 3 μV. This decision was informed by intercept parameters in previous studies (Fleur et al., 2020; OT07; VB05), and appeared suitable for analyses time-locked to inflections, adjectives or nouns. We did not include priors for the standard deviations of group-level ('random') effects, but used the corresponding default priors, which "are used (a) to be only very weakly informative in order to influence results as little as possible, while (b) providing at least some regularization to considerably improve convergence and sampling efficiency" (https://rdrr.io/cran/brms/man/set_prior.html; Bürkner, 2017). Likewise, we also did not include a prior for the standard deviation of the residual error. We did include a prior for the correlations of group-level ('random') effects, namely the LKJ(2) prior.

To sum up, we performed 10 prediction-match replication tests (5 using inflection-onset time-locked data, 5 involving the adjective-onset time-locked data), after having collected data from 100 participants who met the inclusion criteria. The required strength of the evidence for statistical inference was set at a Bayes Factor of at least 12, either for the alternative hypothesis or the null hypothesis.
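A minimal brms sketch of one such confirmatory analysis, under the assumption that 'match' and 'gender' are deviation-coded numeric predictors in a hypothetical trial-level data frame 'roi_data'; sampler settings are illustrative, not the authors' exact configuration.

```r
# Sketch of one pre-registered adjective/inflection analysis in brms,
# with the priors described above; 'roi_data' is hypothetical.
library(brms)

priors <- c(
  set_prior("normal(0, 0.75)", class = "b", coef = "match"),   # replication prior
  set_prior("normal(0, 0.75)", class = "b", coef = "gender"),
  set_prior("normal(0, 1.5)", class = "Intercept"),
  set_prior("lkj(2)", class = "cor")                           # random-effect correlations
)

fit <- brm(voltage ~ match + gender + (match + gender | subject) + (match | item),
           data = roi_data, prior = priors,
           sample_prior = "yes",  # required for the density-ratio Bayes Factor
           iter = 10000, cores = 4)

# Ratio of posterior to prior density at zero (Savage-Dickey): the
# Evid.Ratio column gives the Bayes Factor in favor of 'match = 0'.
hypothesis(fit, "match = 0")
```

With sample_prior = "yes", the hypothesis() call implements the density-ratio (Savage-Dickey) Bayes Factor that the figures below visualize as prior and posterior density at zero.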
This Bayes Factor criterion of 12 was double the evidence strength required by this journal, which we used to 'make up' for the lower prior density of our normal prior (i.e., a two-sided test) compared to a half-normal prior (i.e., a one-sided test). We took sufficiently strong evidence (Bayes Factor > 12) for the alternative hypothesis at any ROI selection as a successful replication, in the sense that it demonstrates the use of inflection information. In that scenario, the observed effect would be either positive or negative, which we would take as a replication of VB05 or OT07, respectively. Based on VB05, we expected the effect polarity to be the same in the two time windows. If this was not the case, this could signal a problem with the inflection-onset time-locked analysis, perhaps due to application of a baseline correction on an already unfolding effect (for example, because participants pick up on relevant acoustic differences between conditions before inflection onset; see discussion in VB05). The adjective-time-locked effect then receives priority in guiding our conclusions, since baseline differences are less likely to be a problem for that analysis. Importantly, we took sufficiently strong evidence for the null hypothesis at all these selections as a failure to replicate both VB05 and OT07.

If none of the obtained Bayes Factors for the alternative hypothesis reached 12 and not all obtained Bayes Factors for the null hypothesis reached 12, we followed the guidelines of this journal and tested additional participants until that evidence strength was reached (one Bayes Factor that sufficiently supports the alternative hypothesis, or all Bayes Factors sufficiently supporting the null hypothesis). Sample size was increased in steps of 20 participants, to be capped at a maximum of 200 participants for practical considerations. However, we were forced to halt testing before reaching that number because of a covid-19 pandemic-related lockdown of our institute, and we therefore report analyses on a total sample size of 187 participants. After completion of data collection, we performed additional analyses with different priors for the effect size of 'match' to investigate the robustness of the obtained results (normal(0, 1) and normal(0, .5), to cover a wider/narrower range of plausible values).

The noun analyses were performed once the data collection had stopped, and served as positive controls, because N400 effects of predictable versus unpredictable words are highly common throughout the psycholinguistic literature (for review, see Kutas & Federmeier, 2011) and typically of a relatively large effect size compared to prenominal manipulations. If no effect is observed for the adjectives and no N400 effect of match is observed for the nouns, no valid inference about predictive processing can be drawn from the adjective data (see also DeLong et al., 2005; Nieuwland et al., 2018b; VB05). For the three noun ROIs, we tested the following model:

voltage ~ match + (match | subject) + (match | item)

Similar to the adjective analyses, the prior on the estimated effect size had a normal distribution and a zero mean (although unexpected nouns were expected to elicit more negative voltage than expected nouns). The prior standard deviation of the match parameter was set at 2 μV, corresponding to the approximate effect sizes reported by VB05 and OT07. The intercept prior was the same as for the adjective analysis, as were the other priors (but no prior was included for 'gender').
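The noun positive-control model could be sketched along the same lines, reusing the hypothetical objects from the previous sketches; the 2 μV prior SD is the value described above.

```r
# Sketch of the noun positive-control analysis; 'noun_lp' is the
# hypothetical trial-level ROI data frame from the averaging sketch.
library(brms)

noun_priors <- c(
  set_prior("normal(0, 2)", class = "b", coef = "match"),  # prior SD = 2 microvolts
  set_prior("normal(0, 1.5)", class = "Intercept"),
  set_prior("lkj(2)", class = "cor")
)

fit_noun <- brm(voltage ~ match + (match | subject) + (match | item),
                data = noun_lp, prior = noun_priors,
                sample_prior = "yes", iter = 10000, cores = 4)
hypothesis(fit_noun, "match = 0")

# The robustness analyses refit the adjective model after replacing the
# 'match' prior with normal(0, 1) or normal(0, 0.5).
```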
As with the nouns in VB05 and OT07, there was no overlap between predictable and unpredictable nouns, which means that the noun comparison was confounded by lexical variables (e.g., frequency) and contextual variables (e.g., plausibility) that are known to influence N400 amplitude (e.g., Kutas & Federmeier, 2011; Nieuwland et al., 2018b). In addition, VB05, OT07, and this study used different nouns altogether. Despite these caveats about between-study differences, an additional model was tested with a stronger noun prior, to test whether the obtained N400 effect changed the support for the specific noun effect size reported by VB05. For this analysis, we used a normal prior for 'match' with a mean of -2.2 μV and a standard deviation of .50 μV, corresponding to the strongest effect reported in VB05 (at the left-posterior quadrant). This prior defines a 95% probability that the 'match' parameter lies between -3.2 and -1.2 μV.

The factor 'gender' was included in the analyses because, in principle, it was considered a nuisance variable. Instead of only counterbalancing gender across items, our confirmatory analyses explicitly accounted for the associated variance when testing the effect of prediction-match (see also Sassenhagen & Alday, 2016). However, we further considered an effect of 'gender' in an exploratory analysis, because the effect of prediction-match could depend on gender (see also Loerts, Wieling, & Schmid, 2013). For example, the prediction effect could be greater when the predictable noun has common gender compared to neuter gender, perhaps because it is easier to detect a mismatch on an overt suffix than on the absence of a suffix, or perhaps because an overt suffix rules out the expected noun meaning, whereas the absence of the suffix does not entirely rule out the expected noun meaning (it could signal an upcoming diminutive noun irrespective of gender; see also Loerts et al., 2013, for relevant discussion). An alternative scenario is also possible, namely that the prediction-match effect is greater when the predictable noun has neuter gender as opposed to common gender, perhaps because people find it harder to detect a mismatch on an overt suffix. This could be related to the fact that language learners tend to add the suffix incorrectly more often than they omit it incorrectly; both young and old language learners tend to overgeneralize the suffix, as in 'een moeilijke boek' (Weerman, Bisschop & Punt, 2014; for a review on Dutch adjectival inflection, see Van de Velde & Weerman, 2014). It is possible that such overgeneralizations by L2 learners change inflection processing even in L1 speakers, or that overgeneralizations in childhood continue to impact inflection processing later in life, even if only very subtly. In both these hypothetical scenarios, one could expect to observe an interaction between 'match' and 'gender' on inflection-elicited ERPs. There was no strong a priori reason to assume that this interaction term would yield a strong effect, given that previous studies approximately balanced the 'gender' factor across items. Nevertheless, we considered this possible interaction effect and performed exploratory tests that included the interaction term as a fixed effect and included a by-subject random slope in the brms model:
voltage ~ match * gender + (match * gender | subject) + (match | item)

For this analysis, we added one prior to the brms model for the adjective and inflection analyses, namely a normal distribution prior with a mean of zero and an SD of 1 for the slope of the interaction parameter. This prior defines a 95% probability that the parameter for the interaction term, i.e., the difference in the match effect for common and neuter gender adjectives, lies between -2 and 2 μV.

At the request of the reviewers, we conducted repeated-measures ANOVAs that closely follow the analyses of VB05 and OT07¹¹. However, we note that the primary basis for our conclusions is the Bayesian mixed-effects analysis approach described previously. The ANOVA analyses were performed after data collection had ended, using the function 'aov_car' from the 'afex' R package (Singmann, Bolker, Westfall, Aust, & Ben-Shachar, 2019; in our pre-registration we planned on using JASP, 2018). Averaging over items, we analyzed mean amplitude values per condition per subject in the 50-250 ms time window after adjective inflection onset (VB05), and in the 300-600 ms time window after adjective onset (OT07). For the N400 effects at the noun, only the 300-500 ms time window was analysed. We conducted repeated-measures ANOVAs on the five ROIs defined above. In the midline region, Prediction-match (matching vs. mismatching) was fully crossed with the five electrodes (Fz, FCz, Cz, Pz, Oz). The analysis at the four quadrants was carried out by crossing Prediction-match with Hemisphere (left vs. right) and Anteriority (anterior vs. posterior). In all analyses, Greenhouse-Geisser correction was applied to F-tests with more than one degree of freedom (Greenhouse & Geisser, 1959). We report the mean voltage and SD for each condition, the estimated difference between conditions with the 95% CI, and Cohen's d measure of effect size.

¹¹ Using G*Power with 'SPSS standards' (Faul, Erdfelder, Lang, & Buchner, 2007), we established that our minimum sample size of N=100 also yielded sufficient a priori power for these ANOVA analyses. For the lowest relevant F-value from VB05 (F(1,23) = 5.16), i.e., the most conservative estimate, the partial eta-squared measure of effect size was ηp² = 0.183 (see Lakens, 2013). The lowest F-value in OT07 was F(1,28) = 4.5, which yielded ηp² = 0.138. To detect this latter, more conservative effect size with a power of .9 at an alpha level of 0.02, the required sample size is N = 86. This would be the required sample size if we used only about half of our items.

Figure 1. Inflection effects. The graphs show the grand-average ERPs elicited by gender-matching (solid blue lines) and gender-mismatching inflections (dotted red lines) at the 5 pre-registered ROIs. In this and the following figures, color-shaded areas show the within-subject standard error of the condition mean (Cousineau, 2005; Morey, 2008; calculated with the 'Rmisc' package in R). We emphasize that these ERP plots do not directly correspond to the results of our statistical analyses, which account for variance associated with different items and with common/neuter gender.

Time-locked to inflection onset and to adjective onset, the mismatch estimate is negative at all ROIs and a majority of the posterior distribution falls under zero. This is also visible in Figure 5.
Figure 1. Inflection effects. The graphs show the grand-average ERPs elicited by gender-matching (solid blue lines) and gender-mismatching inflection (dotted red lines) at the 5 pre-registered ROIs. In these and following figures, color-shaded areas show the within-subject standard error of the condition mean (Cousineau, 2005; Morey, 2008; calculated with the 'Rmisc' package in R). We emphasize that these ERP plots do not directly correspond to the results of our statistical analyses, which account for variance associated with different items and with common/neuter gender.

Time-locked to inflection onset and to adjective onset, the mismatch estimate is negative at all ROIs and a majority of the posterior distribution falls below zero. This is also visible in Figure 5.

Figure 5. Results from the Bayesian hypothesis tests for the gender-mismatch effect time-locked to inflection onset. Graphs depict the prior (light blue) and posterior (dark blue) density, with prior and posterior density at zero marked by a yellow and a red dot, respectively. The ratio of the density values at zero, the Bayes Factor, is labelled on each graph, here showing the Bayes Factor evidence in support of the null hypothesis (BFnull), with higher values corresponding to increased belief in the null hypothesis given our data.

Results from the Bayesian hypothesis tests for the noun mismatch effects at the three pre-registered ROIs: although the graphs highlight posterior density at zero with a red dot, the posterior samples did not contain the value zero, which is why BFnull is labelled as zero.

We also computed Bayes Factors with different priors to investigate the robustness of the obtained results (Table 3). For effects time-locked to inflection or adjective onset, use of a wider prior (SD = 1) increased the obtained BFnull at each ROI, with half of the ROIs yielding moderate evidence for the null hypothesis and the other half yielding anecdotal evidence. With a narrower prior (SD = 0.5), the tests yielded anecdotal evidence. For comparison, we also performed exploratory analyses on ERPs time-locked to inflection with a prior mean and SD roughly based on VB05 (M = 0.75, SD = 0.375) and OT07 (M = -0.75, SD = 0.375), reported in Appendix Table A.1. With this 'stronger' VB05 prior, we found strong evidence for the null hypothesis (BFnull ranging from 12.7 to 22.7 across the 5 ROIs). With the OT07 prior, we found anecdotal-to-moderate evidence for the null hypothesis (BFnull ranging from 2.1 to 6.8), which suggests that while the obtained effect is likely a negativity, it is probably much smaller than the effect reported by OT07. For the noun effects, using a prior that corresponded to the strongest effect reported in VB05 did not impact the results, because our posterior samples never included zero. Hence, even though the obtained N400 effects had mean estimates that were lower than those reported in VB05, our results nevertheless yielded extreme evidence against the null due to the high precision of our estimates.

Table 3. Bayes Factor results (BFnull) from analyses with different priors. For effects time-locked to inflection or adjective onset, the new prior for the standard deviation of the mismatch effect was either wider (SD = 1) or narrower (SD = 0.5) than in the main analyses. For noun effects, the prior corresponded to the strongest effect reported in VB05. (Table body not reproduced here: rows are the five ROIs; columns are the inflection onset prior at SD = 1 and SD = 0.5, the adjective onset prior at SD = 1 and SD = 0.5, and the noun prior.)

As shown in Figure 8, mismatching inflection elicited a more pronounced negativity for common gender than for neuter gender, at least at frontal ROIs (see Appendix Figures A.4 and A.5 for effects at individual ERP channels, and see Figure 4 for scalp distributions of the mismatch effects). At the left-anterior ROI, this negativity started as early as 0 ms and lasted for 1000 ms, whereas the other ROIs showed a negativity mostly in the 150-600 ms time window. For neuter nouns, the negativity was also visible at posterior ROIs, but anterior ROIs showed a positivity, at least from approximately 300 ms onwards. The corresponding adjective onset effects are shown in Figure 9. These results lend only some support to the interaction pattern, and only weakly so.
In all analyses, the credible interval included zero and the BFnull yielded only anecdotal evidence, although the posterior probability of the effect being negative was high for inflection-locked effects at anterior ROIs. Figure 10 shows the pairwise mismatch effects for common and neuter nouns separately. Although not reported in detail here, estimates for the mismatch effect were very similar to those from the models without the interaction term.

Table 4. Results from the pre-registered exploratory analysis of the interaction between gender and gender-mismatch. Each cell gives the corresponding estimate (b) in μV for the interaction term (negative values correspond to a more negative mismatch effect for common gender than for neuter gender), the associated credible interval (CrI), the posterior probability of the effect being negative (p(b < 0), the percentage of posterior samples below zero), and the BFnull.

Figure 10. Results from the pre-registered exploratory analyses. The graphs show the mismatch effects (mismatch minus match) at each ROI, time-locked to inflection and adjective onset, for common gender (purple) and neuter gender (orange). Dots represent the marginal mean, whiskers represent the 95% credible interval.

For ERPs time-locked to inflection onset, repeated-measures ANOVAs on the four quadrants revealed a marginally significant prediction-mismatch effect (F(1, 186) = 3.21, p = 0.074, mean difference = -0.14 μV, 95% CI [...]).

Our pre-registered analyses averaged activity from selected electrodes and time points within spatiotemporal ROIs based on VB05/OT07. To better characterize the effects of interest inside and outside of the ROIs, we performed exploratory mass regression analyses. First, we downsampled the pre-processed, segmented adjective onset and inflection data to 100 Hz (i.e., one sample for every 10 ms) to speed up the analysis. Then, we performed a mixed-effects model analysis using the 'lme4' package (Bates et al., 2014) for each channel and for each sample between -150 and 500 ms relative to inflection onset (this shorter window minimized distortion from effects associated with noun onset) and between -150 and 1000 ms relative to adjective onset. We first tried analyses with the same fixed and random effects as in the pre-registered exploratory analysis, but because all models failed to converge, even after removing random correlations, we opted for simpler models, also to further speed up the analysis. We here report results from a model with the main effects of and interaction between match and gender as fixed effects, and a by-subject random slope for match. Here too, the effect fluctuated at an alpha-range frequency, especially between 700 and 1000 ms after onset. Corresponding results for the interaction terms are available on our OSF page. Taken together, these results suggest that the effect of gender-mismatch was strongest towards the end of or after the pre-registered time windows, and had a posterior scalp distribution. However, we emphasize that the results of these exploratory analyses serve only a descriptive purpose. Using a rather conservative method to control the false discovery rate, one that does not take into account spatiotemporal contingencies in the data (Benjamini & Hochberg, 1995), the tested samples did not survive correction for multiple comparisons.
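This mass regression procedure can be sketched as follows; the long-format data frame 'eeg' and its columns are our assumptions, and the original's full random-effects structure is only partly recoverable from the text.

    library(lme4)
    library(lmerTest)  # provides p-values for the fixed effects

    # One model per channel x sample; returns the p-value for the match effect.
    fit_sample <- function(d) {
      m <- lmer(voltage ~ match * gender + (match | subject), data = d)
      coef(summary(m))["match", "Pr(>|t|)"]
    }

    grid <- expand.grid(channel = unique(as.character(eeg$channel)),
                        time = unique(eeg$time),
                        stringsAsFactors = FALSE)
    grid$p <- mapply(
      function(ch, t) fit_sample(subset(eeg, channel == ch & time == t)),
      grid$channel, grid$time
    )

    # Benjamini-Hochberg control of the false discovery rate across all tests
    grid$p_fdr <- p.adjust(grid$p, method = "BH")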
We performed a pre-registered, close replication of Van Berkum et al. (2005, Experiment 1), a canonical ERP study on lexical prediction during spoken discourse comprehension. In the original study, the marking of grammatical gender on prenominal adjectives ('groot/grote') elicited an early positivity when it mismatched the gender of an unseen, highly predictable noun, compared to matching gender. In our large-scale (N = 187) replication effort, we did not obtain this pattern of effects but, if anything, a reverse pattern: mismatching gender elicited an enhanced negativity compared to matching gender, reminiscent of the effects reported by Otten and colleagues (2007). We observed enhanced negativity at all spatiotemporal ROIs, whether time-locked to onset of the inflection or of the adjective. However, this enhanced negativity was generally very small (approximately between -0.15 and -0.20 μV at the different ROIs), and our Bayes Factor hypothesis tests either anecdotally or moderately favoured the null hypothesis. In contrast, we successfully replicated VB05's prediction-mismatch N400 effect for the nouns, observing extreme evidence against the null hypothesis even when our prior corresponded to the strongest noun-elicited effect reported in VB05.

Pre-registered exploratory analyses showed that, at the anterior and midline ROIs, the negativity obtained in the inflection time-locked analysis was primarily generated by common gender adjectives ('grote') and was close to zero for neuter gender adjectives ('groot'). However, like the main effect of gender-mismatch, the observed gender by mismatch interaction effect was weak and not supported by our Bayes Factor tests. Further exploratory analyses suggested that the main effect of gender-mismatch was most pronounced at posterior electrodes, where it was similar for common and neuter gender, and strongest near the end of or even after the pre-registered time windows. Taken together, these results do not support the effect reported by VB05. However, the results did not yield clear evidence against lexical prediction more generally, and in fact yielded some evidence in support of prediction. In the sections below, we discuss our results in more detail and briefly consider their implications for theory and research on predictive language comprehension.

Interpreting our pre-nominal results is not entirely straightforward, because different sources of evidence point in different directions. As our primary and pre-registered source of evidence, the obtained Bayes Factors weakly favoured the null hypothesis. Of the 10 BFnull values (quantifying support for the null hypothesis at each of the 5 ROIs, time-locked to inflection or adjective), 9 were over 1, and 4 were over 3 ('moderate evidence'). However, these values were generally low, and nowhere near the pre-registered threshold (BF = 12) that would have allowed us to halt sampling and claim replication success or failure. At the same time, the gender-mismatch effect estimates themselves were clearly suggestive of a negativity. This was evident from the posterior probabilities of the effect being negative, which were consistently higher than 78% across all pre-registered tests, and from the exploratory analyses. This discrepancy is primarily caused by the prior, which influences the Bayes Factor much more strongly than the estimate, as also demonstrated by our results obtained with varying priors.
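In code, such a sensitivity check amounts to refitting the model under different priors and recomputing the Bayes Factor; a minimal sketch, reusing the hypothetical 'fit' object and coefficient name from the earlier brms sketch:

    # Refit under wider and narrower zero-mean priors for the match effect
    # and recompute the Savage-Dickey Bayes Factor each time.
    for (prior_sd in c(0.5, 1)) {
      fit_s <- update(
        fit,
        prior = set_prior(sprintf("normal(0, %s)", prior_sd),
                          class = "b", coef = "match"),
        sample_prior = TRUE
      )
      print(hypothesis(fit_s, "match = 0"))  # Evid.Ratio = BFnull
    }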
With pre-registered, widened, and narrowed zero-mean priors, Bayes Factor support for the null hypothesis increased and decreased, respectively, without a noticeable effect on the obtained estimates. With exploratory priors centered on the estimates reported by VB05, we obtained strong Bayes Factor support for the null hypothesis while the estimate became less negative by only about 0.05 μV. For this reason, it is generally advisable to pre-register a range of informative and plausible priors. The influence of the prior can be considered either a bug or a feature of Bayesian null-hypothesis testing, depending on one's perspective (for discussion, see Kruschke, 2011; Kruschke & Liddell, 2018; Rouder, Haaf, & Vandekerckhove, 2018; van Ravenzwaaij & Wagenmakers, 2019).

Weighing the two sources of evidence is ultimately a judgment call. Despite insufficient Bayes Factor support to claim a replication failure, we are fairly confident that our results do not replicate VB05's positivity. However, we are much less confident regarding a replication of OT07's negativity, because our effect appears quantitatively and qualitatively different. Our effect was only approximately one-third the size of the OT07 effect. Whereas the OT07 effect had a clear right-anterior maximum, our effect was not particularly lateralized and was most prominent at posterior channels (and therefore not unlike an N400 effect in terms of scalp distribution and timing). Nevertheless, and despite the Bayes Factor evidence supporting the null hypothesis, our results do suggest that if a true population-level effect exists at all, it is likely small and negative.

Although defining a close or exact replication remains controversial (e.g., Simons, 2014; Zwaan, Etz, Lucas, & Donnellan, 2018), we consider our study to be a close replication of VB05 and OT07, not an exact one. Readers might therefore be tempted to attribute the difference in results to a difference in methods. While influences of methodological differences cannot be ruled out, we consider it unlikely that they are the primary cause of the different results. For example, we used a different and larger set of prediction-inducing stories, but our stories were constructed in the same way as those of VB05/OT07, and had a cloze probability manipulation that was at least as strong as that of the originals. Moreover, our items were based on a set of items that has twice demonstrated a prenominal prediction effect on gender-marked articles ('de/het') with much smaller sample sizes (N = 48 and N = 80; Fleur et al., 2020). We also used different filler items, but retained a similar experimental-to-filler ratio as VB05/OT07. We used a different speaker, but this speaker was not faster than that of VB05. Differences between our study and VB05/OT07 are detailed and justified in the Methods section, and are also summarized in the online Supplementary Table 1. We also summarize differences between our study and our pre-registration in Supplementary Table 2.

Perhaps the strongest argument against the role of methodological differences does not involve a comparison between our study and VB05/OT07, but between VB05 and OT07. These studies were highly similar to each other, but nevertheless yielded two different types of effects. This discrepancy has previously been discussed in terms of as-yet-unidentified differences.
For example, when discussing VB05, OT07 and other studies, Otten and Van Berkum (2009, p. 96) note that "a systematic inventarization across all studies shows that this variability cannot be accounted for by differences in language, stimulus modality, type of prediction probe, or differences in working memory capacity of participants. One possibility is that perhaps the broader context in which stimuli are presented (i.e. the type of filler that is used, the length of the experiment) matters more than commonly assumed, but we refrain from speculating about specific other factors that could critically influence the way people make predictions, or process prediction-inconsistent data". Otten and Van Berkum thus discussed the discrepant effects as two meaningful demonstrations of lexical prediction (i.e., as two 'true' effects), as is typical for the broader psycholinguistic literature (e.g., Ito, Corley, Pickering, Martin & Nieuwland, 2016; Pickering & Gambi, 2018). The current study, however, was premised on the assumption that only one of the original effects can be a 'true' effect, and that the other effect is therefore likely a false positive. False positives and wrong-sign estimates are to be expected in noisy, small-sample settings (Gelman & Carlin, 2014), especially when analysis choices are contingent on the data (e.g., based on visual inspection of ERP waveforms, as was the case in VB05 and OT07). Our results suggest that the positivity reported by VB05 is more likely to be a false positive finding than the negativity reported by OT07.

We anticipated a potential role of adjective gender and the concomitant inflection in shaping the neural response to gender-mismatch, and considered two potential scenarios. The mismatch effect could be greater for common gender than for neuter gender, either because people find it easier to detect a mismatch on overt suffixes than on absent suffixes, or because overt suffixes have a bigger impact than absent ones, given that only overt suffixes rule out the expected noun entirely. Alternatively, the mismatch effect could be greater for neuter gender than for common gender, possibly because detection of a mismatch on overt suffixes is more difficult for language developmental reasons (e.g., Weerman et al., 2014). The results from our pre-registered exploratory analyses did not conclusively favour one scenario over the other. ERPs time-locked to inflection suggested a stronger mismatch effect at the left-anterior ROI for common gender than for neuter gender. However, this pattern was very weak and its significance remains unclear. One obstacle to interpretation is the early positive ERP effect that was visible right after adjective onset, and therefore not elicited by the inflections (whose onset occurred at least 200 ms after adjective onset). This positive effect may have shown up as a negativity when we time-locked to inflection onset, as an artefact introduced by the baselining procedure (footnote 13).

This brings us to two general caveats. First, our design was not optimized for this interaction analysis. Because predictable common and neuter gender nouns were preceded by different contexts and adjectives, the mismatch effects for each gender involve a comparison between different sets of adjectives. How this impacted the results is not known, but it could have generated patterns such as the early positivity for common gender adjectives.
Second, while our sample size is much larger than in typical ERP experiments on language comprehension, it was also not optimized (and probably too small) to reliably detect a gender by mismatch interaction. One additional relevant observation pertains to the left-posterior ROI. Of all 5 ROIs, the gender-mismatch effect was strongest there, whereas the interaction effect was weakest and near zero, reflecting similar gender-mismatch effects for common and neuter gender. We conclude, therefore, that our results are most consistent with a gender-mismatch effect for both common and neuter gender adjectives.

While our gender-mismatch effect may appear surprisingly weak, a weaker effect than those reported by VB05 and OT07 was to be expected. The original effects were both only just statistically significant at the α = 0.05 level. In small-sample, noisy data sets, such effects already tend to have an overestimated effect size and an increased chance of a wrong sign (Gelman & Carlin, 2014). Moreover, their effects were based on data selected via visual inspection, a procedure that further inflates effect-size estimates. In the current study, the weakness of the observed effect was partly the result of the pre-registered time windows based on VB05/OT07; the effect appeared strongest towards the end of or even after the pre-registered time windows.

Beyond the comparison to VB05/OT07, the pre-nominal prediction effect on Dutch adjectival inflection may be generally smaller than other pre-nominal prediction effects, for several potential reasons. One reason is the unexpectedness of the gender-marked adjectives themselves, as suggested by recent results from our laboratory on written language comprehension (Fleur et al., 2020). In Fleur et al., gender-mismatching definite articles elicited enhanced negativity in the N400 time window compared to matching articles when the context presumably led participants to expect a particular, gender-marked article-noun combination (e.g., 'de' when they expected 'het boek'). This effect was found in two identical experiments with pre-registered analyses and much smaller sample sizes (N = 48 and N = 80) than the current one. However, when participants expected an indefinite article-noun combination that lacks gender-marking (e.g., 'een boek'), there was only a small gender-mismatch effect on definite articles in the N400 time window. In other words, gender-mismatch effects may be relatively small, and therefore harder to detect, when participants do not expect a gender-marked word in the first place, as may have been the case in our study. That said, there is no principled reason why prediction effects cannot be obtained on adjectives at all, even when they are unexpected. A highly predictive language comprehension system should be able to make do (footnote 14).

Footnote 14: Unless perhaps the meaning of the adjective is incompatible with the predicted noun or changes the noun prediction. In the current study, VB05 and OT07, the critical adjectives were selected for being semantically compatible or congruent with the high-cloze noun, and it is assumed that the meaning of the adjective does not change the noun prediction. Whether gender-mismatch effects can be obtained on semantically incongruent adjectives (e.g., 'blue' if the predicted noun is 'banana') is an open question, but we think this is unlikely.

Another reason, one that we find more plausible, could be the difficulty of detecting a mismatch from fleeting, relatively subtle information in the spoken signal.
We emphasize that we do not claim that people predict less when listening than when reading. However, our manipulation of word-final inflections is arguably more subtle than a comparison between two entirely different words (e.g., 'de/het', 'el/la'), because our participants needed to distinguish between a schwa sound and an inter-word 'silent' period. This relatively small acoustic/phonetic difference might be hard to discern (e.g., Bailey & Hahn, 2005), and is sometimes further distorted by coarticulation effects (influences on pronunciation associated with preceding or subsequent sounds; footnote 15). We would expect a larger pre-nominal prediction effect when the mismatching condition differs more strongly acoustically from the matching condition (e.g., a spoken version of the 'de/het' manipulation). People typically need only one or two phonemes to detect a deviation from a predicted noun (e.g., Van Berkum et al., 2005; Van Petten et al., 1999). This was also demonstrated by our noun results; prediction-mismatching nouns elicited strong N400 effects starting as early as 100 ms after noun onset.

Footnote 15: In some items, coarticulation might make the conditions phonemically more dissimilar (e.g., /d/ sounds different in 'verkleed' versus 'verklede'; see also VB05 and our Methods section).

The weak nature of our pre-nominal prediction effect should not be taken as evidence against lexical prediction more generally. It does raise the question, however, whether listeners reliably or consistently use adjectival inflection information to inform their noun predictions. When a misprediction is evident, people may use the available gender information to revise their initial noun prediction, and perhaps even change their initial prediction to a new noun (as demonstrated by concomitant effects on noun-elicited N400s; e.g., Fleur et al., 2020; Szewczyk & Wodniecka, 2020). However, when evidence for misprediction is less compelling or ambiguous, people might be 'reluctant' to let go of their initial noun prediction (e.g., Nieuwland et al., 2018). Such reluctance could make sense because our comprehension system must deal with or compensate for coarticulation effects, disfluencies, and noisy real-world environments (e.g., Corley & Stewart, 2008; Mattys, Davis, Bradlow & Scott, 2012; Norris, McQueen & Cutler, 2016). Future research efforts should elucidate which pre-nominal manipulations elicit more reliable spoken language prediction effects than others. Especially when combined with computational modelling (e.g., Norris et al., 2016), such efforts can reveal the speech processing mechanisms involved in evaluating discourse-based lexical predictions. Furthermore, it remains to be established whether the adjectival inflection manipulation has different effects on predictive processing during reading and listening (e.g., Otten et al., 2007; Otten & Van Berkum, 2008). In VB05's self-paced reading experiment (Experiment 3), readers slowed down upon encountering gender-mismatching adjectives compared to matching adjectives. However, this effect did not occur at the first of two gender-mismatching adjectives, but at the second one, appearing three words downstream (e.g., 'onopvallende' in 'grote maar nogal onopvallende'; English translation: 'unobtrusive' in 'big but rather unobtrusive').
In a written language version of OT07, Otten and Van Berkum (2008) observed enhanced negativity for gender-mismatching adjectives compared to matching adjectives, but this effect occurred as late as 900-1200 ms after word onset. In sum, while gender-mismatching adjectives elicited rather weak effects in the current spoken language study, their effects during reading may be even weaker or less consistent.

In conclusion, we performed a large-scale, pre-registered, close replication of Van Berkum et al. (2005, Experiment 1), in which participants listened to Dutch spoken mini-stories. In the original study, the marking of grammatical gender on pre-nominal adjectives ('groot/grote') elicited an early positivity when mismatching the gender of an unseen, highly predictable noun, compared to matching gender. In our large-scale, pre-registered replication effort, we did not obtain such a positivity, but found enhanced negativity instead. However, this negativity was small and our pre-registered Bayes Factor analyses generally favoured the null hypothesis. Although reminiscent of the right-anterior negativity reported in a similar study by Otten et al. (2007), the current negativity was much smaller and had a posterior scalp distribution. Our results highlight the risks associated with data-contingent analysis. Given that data-contingent analysis has been, and still is, common in the psycholinguistic literature, especially in EEG research, some key findings in this literature may prove hard to replicate (e.g., Nieuwland et al., 2018; Nieuwland, 2019). The weak nature of our pre-nominal prediction effect should not be taken as evidence against lexical prediction more generally. Recent work from our laboratory, for example, observed strong pre-nominal prediction effects on gender-marked articles during reading (Fleur et al., 2020), with pre-registered analyses but smaller sample sizes. The weak nature of the current effect may instead reflect the difficulty of detecting gender-mismatch from fleeting, relatively subtle information in the spoken signal. Our results therefore raise the question whether Dutch listeners reliably or consistently use adjectival inflection information to inform their noun predictions.

Figure A.8. Effect of gender-mismatch (mismatch minus match) on ERPs time-locked to inflection onset, plotted as the voltage estimate and corresponding 95% confidence interval (gray area) at each time point and channel. Dots underneath the voltage estimates indicate statistically significant samples (not corrected for multiple comparisons). N.B. samples beyond 480 ms after inflection onset may be distorted by effects associated with noun onset.

Figure A.9. Results from the mass regression analyses. Effect of gender-mismatch (mismatch minus match) on ERPs time-locked to adjective onset, plotted as the voltage estimate and corresponding 95% confidence interval (gray area) at each time point and channel. Dots underneath the voltage estimates indicate statistically significant samples (not corrected for multiple comparisons). N.B. samples beyond 800 ms after adjective onset may be distorted by effects associated with noun onset.
References

New and updated tests of print exposure and reading abilities in college students
Incrementality and Prediction in Human Sentence Processing
Mixed-effects modeling with crossed random effects for subjects and items
Meaning in the Brain
Phoneme similarity and confusability
Random effects structure for confirmatory hypothesis testing: Keep it maximal
Fitting Linear Mixed-Effects Models Using lme4
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Goals and strategies influence lexical prediction during sentence comprehension
brms: An R Package for Bayesian Multilevel Models Using Stan
Event-related potential components reflect phonological and semantic processing of the terminal word of spoken sentences
Hesitation disfluencies in spontaneous speech: The meaning of um
The P-chain: relating sentence production and its disorders to comprehension and acquisition
Praat script to detect syllable nuclei and measure speech rate automatically
Probabilistic word pre-activation during language comprehension inferred from electrical brain activity
Using Bayes to get the most out of non-significant results
Four reasons to prefer Bayesian analyses over significance testing
PPVT-III: Peabody picture vocabulary test
G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences
Does Reading Ability Predict Individual Differences In The Syntactic Processing Of Spoken Language? Poster presented at the International Meeting of the
A rose by any other name: Long-term memory structure and sentence processing
Rapid expectation adaptation during syntactic comprehension
Definitely saw it coming? The dual nature of the pre-nominal prediction effect
Can Bilinguals See It Coming? Word Anticipation in L2 Sentence Reading
Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors
SIMR: an R package for power analysis of generalized linear mixed models by simulation
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time
A crack in the crystal ball: Evidence against pre-activation of gender features in sentence comprehension
Taking perspective: personal pronouns affect experiential aspects of literary reading
STAIRS4WORDS: A new adaptive test for assessing receptive vocabulary size in English, Dutch, and German. Poster presented at Architectures and Mechanisms of Language Processing
Rmisc: Ryan Miscellaneous. R package version 1
How the brain processes violations of the grammatical norm: An fMRI study
Predicting form and meaning: Evidence from brain potentials
How robust are prediction effects in language comprehension? Failure to replicate article-elicited N400 effects
Why the A/AN prediction effect may be hard to replicate: a rebuttal to Delong
JASP (Version 0.9)
Theory of probability
The P600 as an index of syntactic integration difficulty
Repair, revision, and complexity in syntactic analysis: An electrophysiological differentiation
The effects of electrode impedance on data quality and statistical significance in ERP recordings
Bias in a common EEG and MEG statistical analysis and how to avoid it
Lexical prediction in language comprehension: a replication study of grammatical gender effects in Dutch. Language
Bayesian assessment of null values via parameter estimation and model comparison
A look around at what lies ahead: Prediction and predictability in language processing
Thirty years and counting: finding meaning in the N400 component of the event-related brain potential (ERP)
Reading senseless sentences: brain potentials reflect semantic incongruity
Brain potentials during reading reflect word expectancy and semantic association
Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs
A lexical basis for N400 context effects: evidence from MEG
Dissociating N400 effects of prediction from association in single-word contexts
A cortical network for semantics: (de)constructing the N400
Expectation-based syntactic comprehension
Generating random correlation matrices based on vines and extended onion method
Neuter is not common in Dutch: Eye movements reveal asymmetrical gender processing
How to get statistically significant effects in any ERP experiment (and why you shouldn't)
Vocabulary knowledge predicts lexical processing: Evidence from a group of participants with diverse educational backgrounds
The temporal structure of spoken language understanding
Sentence perception as an interactive parallel process
Prediction is Production: The missing link between language production and comprehension
Bilinguals reading in their second language do not predict upcoming words as native readers do
Speech recognition in adverse conditions: A review. Language and Cognitive Processes
Hierarchical levels of representation in language prediction: The influence of first language acquisition in highly proficient bilinguals
The Peer Reviewers' Openness Initiative: incentivizing open research practices through peer review
Do 'early' brain responses reveal word form prediction during language comprehension? A critical review
Dissociable effects of prediction and integration during language comprehension: Evidence from a large-scale study using brain potentials
On the incrementality of pragmatic processing: An ERP investigation of informativeness and pragmatic abilities
When the truth is not too hard to handle: An event-related potential study on the pragmatics of negation
Large-scale replication study reveals a limit on probabilistic prediction in language comprehension
Prediction, Bayesian inference and feedback in speech recognition. Language, Cognition and Neuroscience
Brain potentials elicited by garden-path sentences: evidence of the application of verb information during parsing
Great expectations: specific lexical anticipation influences the processing of spoken language
Discourse-Based Word Anticipation During Language Processing: Prediction or Priming? Discourse Processes
Does working memory capacity affect the ability to predict upcoming words in discourse?
Predicting while comprehending language: A theory and review
An integrated theory of language production and comprehension
R: A language and environment for statistical computing. R Foundation for Statistical Computing
Bayesian inference for psychology, part IV: Parameter estimation and Bayes factors
A common misapplication of statistical inference: Nuisance control with null-hypothesis significance tests
The P600 as a correlate of ventral attention network reorientation
Peabody picture vocabulary test-III-NL
afex: Analysis of Factorial Experiments. R package version 0
A failure to replicate rapid syntactic adaptation in comprehension
RStan: the R interface to Stan
Exposure to print and orthographic processing
The mechanisms of prediction updating that impact the processing of upcoming words: An event-related potential study on sentence comprehension
Dissociating retrieval interference and reanalysis in the P600 during sentence comprehension
How inappropriate high-pass filters can produce artifactual effects and incorrect conclusions in ERP studies of language and cognition
Sentence comprehension in a wider discourse: Can we use ERPs to keep track of things?
The neuropragmatics of 'simple' utterance comprehension: An ERP review
Anticipating upcoming words in discourse: evidence from ERPs and reading times
Studying texts in a second language: The importance of test type
The resilient nature of adjectival inflection in Dutch
Time course of word identification and semantic integration in spoken language
Advantages masquerading as 'issues' in Bayesian hypothesis testing: A commentary on Tendeiro and Kiers
Bayesian data analysis in the phonetic sciences: A tutorial introduction
The statistical significance filter leads to overoptimistic expectations of replicability
Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition
Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method
L1 and L2 acquisition of Dutch adjectival inflection
Anticipating words and their gender: an event-related brain potential study of semantic integration, gender expectancy, and gender agreement in Spanish sentence reading
Potato not Pope: human brain potentials to gender expectation and agreement in Spanish spoken sentences
Expecting gender: An event-related brain potential study on the role of grammatical gender in comprehending a line drawing within a written sentence in Spanish
Reshaping Data with the reshape Package
ggplot2: Elegant Graphics for Data Analysis
stringr: Simple, Consistent Wrappers for Common String Operations
forcats: Tools for Working with Categorical Variables (Factors)
cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'
Making replication mainstream

Acknowledgements

For help with creating the stimulus materials and/or data collection, we thank …