Transactions of the Association for Computational Linguistics, 1 (2013) 125–138. Action Editor: Sharon Goldwater. Submitted 10/2012; Revised 3/2013; Published 5/2013. © 2013 Association for Computational Linguistics.

Modeling Child Divergences from Adult Grammar

Sam Sahakian, University of Wisconsin-Madison, sahakian@cs.wisc.edu
Benjamin Snyder, University of Wisconsin-Madison, bsnyder@cs.wisc.edu

Abstract

During the course of first language acquisition, children produce linguistic forms that do not conform to adult grammar. In this paper, we introduce a data set and approach for systematically modeling this child-adult grammar divergence. Our corpus consists of child sentences with corrected adult forms. We bridge the gap between these forms with a discriminatively reranked noisy channel model that translates child sentences into equivalent adult utterances. Our method outperforms MT and ESL baselines, reducing child error by 20%. Our model allows us to chart specific aspects of grammar development in longitudinal studies of children, and investigate the hypothesis that children share a common developmental path in language acquisition.

1 Introduction

Since the publication of the Brown study (1973), the existence of standard stages of development has been an underlying assumption in the study of first language learning. As a child moves towards language mastery, their language use grows predictably to include more complex syntactic structures, eventually converging to full adult usage. In the course of this process, children may produce linguistic forms that do not conform to the grammatical standard. From the adult point of view these are language errors, a label which implies a faulty production. Considering the work-in-progress nature of a child language learner, these divergences could also be described as expressions of the structural differences between child and adult grammar. The predictability of these divergences has been observed by psychologists, linguists and parents (Owens, 2008).[1]

Our work leverages the differences between child and adult language to make two contributions towards the study of language acquisition. First, we provide a corpus of errorful child sentences annotated with adult-like rephrasings. This data will allow researchers to test hypotheses and build models relating the development of child language to adult forms. Our second contribution is a probabilistic model trained on our corpus that predicts a grammatical rephrasing given an errorful child sentence.

The generative assumption of our model is that sentences begin in underlying adult forms, and are then stochastically transformed into observed child utterances. Given an observed child utterance s, we calculate the probability of the corrected adult translation t as

P(t|s) ∝ P(s|t)P(t),

where P(t) is an adult language model and P(s|t) is a noise model crafted to capture child grammar errors like omission of certain function words and corruptions of tense or declension. The parameters of this noise model are estimated using our corpus of child and adult-form utterances, using EM to handle unobserved word alignments. We use this generative model to produce n-best lists of candidate corrections, which are then reranked using long-range sentence features in a discriminative framework (Collins and Roark, 2004).

[1] For the remainder of this paper we use "error" and "divergence" interchangeably.
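To make the decomposition concrete, here is a minimal sketch (ours, not the authors' released code) of how a noisy-channel corrector ranks candidate adult forms; channel_logprob and lm_logprob are hypothetical stand-ins for the trained noise model P(s|t) and adult language model P(t) described below.

```python
def rank_corrections(s, candidates, channel_logprob, lm_logprob):
    """Rank candidate adult forms t of child utterance s by
    log P(s|t) + log P(t), the unnormalized noisy-channel score."""
    scored = [(channel_logprob(s, t) + lm_logprob(t), t) for t in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored    # best-first; the top n entries feed the reranker
```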
One could argue that our noisy channel model mirrors the cognitive process of child language production by appealing to the hypothesis that children rapidly learn adult-like grammar but produce errors due to performance factors (Bloom, 1990; Hamburger and Crain, 1984). That being said, our primary goal in this paper is not cognitive plausibility, but rather the creation of a practical tool to aid in the empirical study of language acquisition. By automatically inferring adult-like forms of child sentences, our model can highlight and compare developmental trends of children over time using large quantities of data, while minimizing the need for human annotation.

Besides this, our model's predictive success itself has theoretical implications. By aggregating training and testing data across children, our model instantiates the Brown hypothesis of a shared developmental path. Even when adequate per-child training data exists, using data only from other children leads to no degradation in performance, suggesting that the learned parameters capture general child language phenomena and not just individual habits. Besides aggregating across children, our model coarsely lumps together all stages of development, providing a frozen snapshot of child grammar. This establishes a baseline for more cognitively plausible and temporally dynamic models.

We compare our correction system against two baselines: a phrase-based Machine Translation (MT) system and a model designed for English as a Second Language (ESL) error correction. Relative to the best performing baseline, our approach achieves a 30% decrease in word error rate and a four point increase in BLEU score. We analyze the performance of our system on various child error categories, highlighting our model's strengths (correcting "be" drops and morphological overgeneralizations) as well as its weaknesses (correcting pronoun and auxiliary drops). We also assess the learning rate of our model, showing that very little annotation is needed to achieve high performance. Finally, to showcase a potential application, we use our model to chart one aspect of four children's grammar acquisition over time. While generally vindicating the Brown thesis of a common developmental path, the results point to subtleties in variation across individuals that merit further investigation.

2 Background and Related Work

While child error correction is a novel task, computational methods are frequently used to study first language acquisition. The computational study of speech is facilitated by TalkBank (MacWhinney, 2007), a large database of transcribed dialogues including CHILDES (MacWhinney, 2000), a subsection composed entirely of child conversation data. Computational tools have been developed specifically for the large-scale analysis of CHILDES. These tools enable further computational study, such as the automatic calculation of the language development metrics IPSYN (Sagae et al., 2005) and D-Level (Lu, 2009), or the automatic formulation of novel language development metrics themselves (Sahakian and Snyder, 2012).

The availability of child language data is also key to the design of computational models of language learning (Alishahi, 2010), which can support the plausibility of proposed human strategies for tasks like semantic role labeling (Connor et al., 2008) or word learning (Regier, 2005). To our knowledge this paper is the first work on error correction in the first language learning domain.
Previous work has employed a classifier-based approach to identify speech errors indicative of language disorders in children (Morley and Prud'hommeaux, 2012).

Automatic correction of second language (L2) writing is a common objective in computer-assisted language learning (CALL). These tasks generally target high-frequency error categories including article, word-form, and preposition choice. Previous work in CALL error correction includes identifying word choice errors in TOEFL essays based on context (Chodorow and Leacock, 2000), correcting errors with a generative lattice and PCFG reranking (Lee and Seneff, 2006), and identifying a broad range of errors in ESL essays by examining linguistic features of words in sequence (Gamon, 2011). In a 2011 shared ESL correction task (Dale and Kilgarriff, 2011), the best performing system (Rozovskaya et al., 2011) corrected preposition, article, punctuation and spelling errors by building classifiers for each category. This line of work is grounded in the practical application of automatic error correction as a learning tool for ESL students.

Statistical Machine Translation (SMT) has been applied in diverse contexts including grammar correction as well as paraphrasing (Quirk et al., 2004), question answering (Echihabi and Marcu, 2003) and prediction of Twitter responses (Ritter et al., 2011). In the realm of error correction, SMT has been applied to identify and correct spelling errors in internet search queries (Sun et al., 2010). Within CALL, Park and Levy (2011) took an unsupervised SMT approach to ESL error correction using weighted finite state transducers (FSTs). The work described in this paper is inspired by that of Park and Levy, and in Section 6 we detail differences between our approaches. We also include their model as a baseline.

3 Data

To train and evaluate our translation system, we first collected a corpus of 1,000 errorful child-language utterances from the American English portion of the CHILDES database. To encourage diversity in the grammatical divergences captured by our corpus, our data is drawn from a large pool of studies (see the bibliography for the full list of citations).

In the annotation process, candidate child sentences were randomly selected from the pool and classified by hand as either grammatically correct, divergent, or unclassifiable (when it was not possible to tell what the child was trying to say). We continued this process until 1,000 divergent sentences were found. Along the way we also encountered 5,197 grammatically correct utterances and 909 that were unclassifiable.[2] Because CHILDES includes speech samples from children of diverse age, background and language ability, our corpus does not capture any specific stage of language development. Instead, the corpus represents a general snapshot of a learner who has not yet mastered English as their first language.

To provide the grammatically correct counterpart to the child data, our errorful sentences were corrected by workers on Amazon's Mechanical Turk web service. Given a child utterance and its surrounding conversational context, annotators were instructed to translate the child utterance into adult-like English. We limited eligible workers to native English speakers residing in the US.

[2] These hand-classified sentences are available online along with our set of errorful sentences.
We also required annotators to follow a brief tutorial in which they practice correcting sample utterances according to our guidelines. These guidelines instructed workers to minimally alter sentences to be grammatically consistent with a conversation or written letter, without altering underlying meaning. Annotators were evaluated on a worker-by-worker basis and rejected in the rare case that they ignored our guidelines. Accepted workers were paid 7 cents for correcting each set of 5 sentences. To achieve a consistent judgment, we posted each set of sentences for correction by 7 different annotators.

Once multiple reference translations were obtained, we selected a single best correction by plurality, arbitrating ties as necessary. There were several cases in which corrections obtained by plurality decision did not perfectly follow instructions. These were manually corrected. Both the raw translations provided by individual annotators and the curated final adult forms are provided online as part of our data set.[3] The resulting pairs of errorful child sentences and their adult-like corrections were split into 73% training, 7% development and 20% test data, which we use to build, tune and evaluate our grammar correction system. In the final test phase, development data is included in the training set.

[3] Data is available at http://pages.cs.wisc.edu/~bsnyder

4 Model

According to our generative model, adult-like utterances are formed and then transformed by a noisy channel to become child sentences. The structure of our noise model is tailored to match our observations of common child errors. These include function word insertions, function word deletions, swaps of function words, and inflectional changes to content words. Examples of each error type are given in Table 1. Our model does not allow reorderings, and can thus be described in terms of word-by-word stochastic transformations to the adult sentence.

Error Type          Child Utterance
Insertion           I did locked it.
Inflection          More cookie?
Deletion            That not how.
Lemma Choice        I got grain.
Overgeneralization  I drawed it.

Table 1: Examples of error types captured by our model.

We use 10 word classes to parameterize our model: pronouns, negators, wh-words, conjunctions, prepositions, determiners, modal verbs, "be" verbs, other auxiliary verbs, and lexical content words. The list of words in each class is provided as part of our data set. For each input adult word w, the model generates an output word w′ as a hierarchical series of draws from multinomial distributions, conditioned on the original word w and its class c. All distributions receive an asymmetric Dirichlet prior which favors retention of the adult word. With the sole exception of word insertions, the distributions are parameterized and learned during training. Our model consists of 217 multinomial distributions, with 6,718 free parameters.

The precise form and parameterization of our model were handcrafted for performance on the development data, using trial and error. We also considered more fine-grained model forms (e.g., one parameter for each non-lexical input-output word pair), as well as coarser parameterizations (e.g., a single shared parameter denoting any inflection change). The model we describe here seemed to achieve the best balance of specificity and generalization. We now present pseudocode describing the noise model's operation upon processing each word, along with a brief description of each step.
        insdel ← 0
        for word w with class c, inflection f, lemma ℓ do
     3:   if insdel = 2 then
            a ← swap
          else
     6:     a ∼ {insert, delete, swap} | c
          end if
          if a = delete then
     9:     insdel++
            c′ ← ε
            w′ ← ε
    12:   else if a = insert then
            insdel++
            c′ ∼ classes | c_PREV, insert
    15:     w′ ∼ words in c′ | insert
          else
            insdel ← 0
    18:     c′ ← c
            if c ∈ uninflected-classes then
              w′ ∼ words in c | w, swap
    21:     else if c = aux then
              ℓ′ ∼ aux-lemmas | ℓ, swap
              f′ ∼ inflections | f, swap
    24:       w′ ← COMBINE(ℓ′, f′)
            else
              f′ ∼ inflections | f, swap
    27:       w′ ← COMBINE(ℓ, f′)
            end if
          end if
    30:   if w′ ∈ irregular then
            w′ ∼ OVERGEN(w′) ∪ {w′}
          end if
    33:   if a = insert then
            goto line 3
          end if
    36: end for

Action selection (lines 3-7): On reading an input word, an action category a is selected from a probability distribution conditioned on the input word's class. Our model allows up to two function word insertions or deletions in a row before a swap is required. Lexical content words may not be deleted or inserted, only swapped.

Insert and Delete (lines 8-15): The deletion case requires no decision after action selection. In the insertion case, the class of the inserted word, c′, is selected conditioned on c_PREV, the class of the previous adult word. The precise identity of the inserted word is then drawn from a uniform distribution over words in class c′. It is important to note that in the insertion case, the input word at a given iteration will be re-processed at the next iteration (lines 33-35).

Swap (lines 16-29): In the swap case, a word of a given class is substituted for another word in the same class. Depending on the source word's class, swaps are handled in slightly different ways. If the word is a modal, conjunction, determiner, preposition, wh-word or negator, it is considered "uninflected." In these cases, a new word w′ is selected from all words in class c, conditioned on the source word w.

If w is an auxiliary verb, the swap procedure consists of two parallel steps. A lemma is selected from possible auxiliary lemmas, conditioned on the lemma of the source word.[4] In the second step, an output inflection type is selected from a distribution conditioned on the source word's inflection. The precise output word is fully specified by the choice of lemma and conjugation.

[4] Auxiliary lemmas include have, do, go, will, and get.

If w is not in either of the above two categories, it is a lexical word, and our model only allows changes in conjugation or declension. If the source word is a noun it may swap to singular or plural form conditioned on the source form. If the word is a verb, it may swap to any conjugated or non-finite form, again conditioned on the source form. Lexical words that are not marked by CELEX (Baayen et al., 1996) as nouns or verbs may only swap to the exact same word.

Overgeneralization (lines 30-32): Finally, the noisy channel considers the possibility of producing overgeneralized word forms (like "maked" and "childs") in place of their correct irregular forms. The OVERGEN function produces the incorrect overgeneralized form. We draw from a distribution which chooses between this form and the correct original word. Our model maintains separate distributions for nouns (overgeneralized plurals) and verbs (overgeneralized past tense).
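As a concrete (and much simplified) rendering of the pseudocode above, the following sketch samples a child form of one adult word. The `dists` object and its methods are a hypothetical interface to the learned multinomials of this section, and `overgen_table` reduces irregular-form handling to a single lookup; none of these names come from the paper's implementation.

```python
def transduce_word(w, c, insdel, dists, overgen_table):
    """Sample a child-language rendering of adult word w (class c).
    Returns (list of output words, new insdel counter).

    `dists` bundles the learned multinomials (hypothetical interface);
    `overgen_table` maps irregular forms to overgeneralized ones,
    e.g. {"made": "maked", "children": "childs"}."""
    if insdel == 2 or c == "lexical":
        action = "swap"      # forced: two inserts/deletes in a row,
                             # or a lexical content word
    else:
        action = dists.sample_action(c)   # ~ {insert, delete, swap} | c

    if action == "delete":
        return [], insdel + 1             # adult word maps to epsilon
    if action == "insert":
        # The full model conditions the inserted class on the previous
        # adult word's class; simplified here. The current word is then
        # re-processed, as in pseudocode lines 33-35.
        w_ins = dists.sample_insertion(c)
        rest, insdel = transduce_word(w, c, insdel + 1, dists, overgen_table)
        return [w_ins] + rest, insdel

    w_out = dists.sample_swap(w, c)       # word / lemma / inflection draws
    if w_out in overgen_table and dists.sample_overgen(w_out):
        w_out = overgen_table[w_out]      # e.g. "made" -> "maked"
    return [w_out], 0
```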
5 Implementation

In this section, we describe the steps necessary to build, train and test our error correction model. The weighted finite state transducers (FSTs) used in our model are constructed with OpenFst (Allauzen et al., 2007).

5.1 Sentence FSTs

These FSTs provide the basis for our translation process. We represent sentences by building a simple linear-chain FST, progressing from node to node with each arc accepting and yielding one word in the sentence. All arcs are weighted with probability one.

5.2 Noise FST

The noise model provides a conditional probability over child sentences given an adult sentence. We encode this model as an FST with several states, allowing us to track the number of consecutive insertions or deletions. We allow only two of these operations in a row, thereby constraining the length of the output sentence. This constraint results in three states (insdel = 0, insdel = 1, insdel = 2), along with an end state. In our training data, only 2 sentence pairs cannot be described by the noise model due to this constraint.

Each arc in the FST has an ε or adult-language word as input symbol, and a possibly errorful child-language word or ε as output symbol. Each arc weight is the probability of transducing the input word to the output word, determined according to the parameterized distributions described in Section 4. Arcs corresponding to insertions or deletions lead to a new state (insdel++) and are not allowed from state insdel = 2. Substitution arcs all lead back to state insdel = 0. Word class information is given by a set of word lists for each non-lexical class.[5] Inflectional information is derived from CELEX.

[5] Word lists are included for reference with our dataset.

5.3 Language Model FST

The language model provides a prior distribution over adult-form sentences. We build a trigram language model FST with Kneser-Ney smoothing using OpenGrm (Roark et al., 2012). The language model is trained on all parent speech in the CHILDES studies from which our errorful sentences are drawn.

In the language model FST, the input and output words of each arc are identical. Arcs are weighted with the probability of the n-gram beginning with some prefix associated with the source node, and ending with the arc's input/output word. In this setup, the probability of a string is the total weight of the path accepting and emitting that string.

5.4 Training

As detailed in Section 4, our noise model consists of a series of multinomial distributions which govern the transformation from adult word to child word, allowing limited insertions and deletions. We estimate parameters θ for these distributions that maximize their posterior probability given the observed training sentences {(s,t)}. Since our language model P(t) does not depend on the noise model parameters, this objective is equivalent to jointly maximizing the prior and the conditional likelihoods of child sentences given adult sentences:

    argmax_θ  P(θ) ∏_(s,t) P(s|t,θ)

To represent all possible derivations of each child sentence s from its adult translation t, we compose the sentence FSTs with the noise model, obtaining:

    FST_train = FST_t ∘ FST_noise ∘ FST_s

Each path through FST_train corresponds to a single derivation d, with path weight P(s,d|t,θ). By summing all path weights, we obtain P(s|t,θ).
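For illustration, a linear-chain sentence FST like the one described in Section 5.1 can be built with OpenFst's Python bindings roughly as follows. This is a sketch, not the authors' code; it assumes the pywrapfst API of recent OpenFst releases (older releases spell the mutable class fst.Fst) and an OpenFst SymbolTable `syms` mapping words to integer labels.

```python
import pywrapfst as fst

def linear_chain(words, syms):
    """One state per word boundary, one arc per word, all weights one."""
    chain = fst.VectorFst()
    one = fst.Weight.one(chain.weight_type())
    state = chain.add_state()
    chain.set_start(state)
    for w in words:
        nxt = chain.add_state()
        label = syms.find(w)                      # word -> integer id
        chain.add_arc(state, fst.Arc(label, label, one, nxt))
        state = nxt
    chain.set_final(state, one)
    chain.set_input_symbols(syms)
    chain.set_output_symbols(syms)
    return chain

# Training composition of Section 5.4 (noise_fst built separately, with
# arcs sorted appropriately for composition):
#   train_fst = fst.compose(fst.compose(adult_chain, noise_fst), child_chain)
```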
We use a MAP-EM algorithm to maximize our objective while summing over all possible derivations. Our training scheme relies on FSTs weighted in the V-expectation semiring (Eisner, 2001), implemented using code from fstrain (Dreyer et al., 2008). Besides carrying probabilities, arc weights are supplemented with a vector to indicate the parameter counts involved in the arc traversal. The V-expectation semiring is designed so that the total arc weight of all paths through the FST yields both the probability P(s|t,θ) and the expected parameter counts.

Our EM algorithm proceeds as follows: We start by initializing all parameters to uniform distributions with random noise. We then weight the arcs in FST_noise accordingly. For each sentence pair (s,t), we build FST_train by composition with our noise model, as described in the previous paragraph. We then compute the total arc weight of all paths through FST_train by relabeling all input and output symbols to ε and then reducing FST_train to a single state using epsilon removal (Mohri, 2008). The stopping weight of this single state is the sum of all paths through the original FST, yielding the probability P(s|t,θ), along with expected parameter counts according to our current distributions. We then reestimate θ using the expected counts plus the pseudo-counts given by our priors, and repeat this process until convergence.

Besides smoothing our estimated distributions, the pseudo-counts given by our asymmetric Dirichlet priors favor multinomials that retain the adult word form (swaps, identical lemmas, and identical inflections). Concretely, we use pseudo-counts of .5 for these favored outcomes, and pseudo-counts of .01 for all others.[6]

[6] These correspond to Dirichlet hyperparameters of 1.5 and 1.01, respectively.

In practice, 109 of the child sentences in our data set cannot be translated into a corresponding adult version using our model. This is due to a range of rare phenomena like rephrasing, lexical word swaps and word-order errors. In these cases, the composed FST has no valid paths from start to finish and the sentence is removed from training. We run EM for 100 iterations, at which time the log likelihood of all sentences generally converges to within .01.
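The expectation-semiring bookkeeping behind this EM procedure can be illustrated without any FST machinery. A weight is a pair (probability, count vector); the semiring product multiplies probabilities and carries counts along a path, and the semiring sum adds over paths, so the total weight over all derivations yields P(s|t,θ) together with unnormalized expected counts. A toy version (our illustration, not fstrain's code):

```python
import numpy as np

def w_times(a, b):   # semiring product: combine consecutive arcs
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def w_plus(a, b):    # semiring sum: combine alternative paths
    return (a[0] + b[0], a[1] + b[1])

# An arc with probability p that fires toy parameter i has weight
# (p, p * e_i), where e_i is an indicator vector over 3 parameters.
def arc(p, i):
    return (p, p * np.eye(3)[i])

# Two derivations of the same (s, t) pair, as lists of arc weights.
paths = [[arc(0.5, 0), arc(0.2, 1)],
         [arc(0.5, 0), arc(0.1, 2)]]

total = (0.0, np.zeros(3))
for path in paths:
    w = (1.0, np.zeros(3))       # multiplicative identity
    for a in path:
        w = w_times(w, a)
    total = w_plus(total, w)

prob, counts = total             # prob = P(s|t) = 0.15
print(prob, counts / prob)       # expected counts for EM: [1, 2/3, 1/3]
```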
5.5 Decoding

After training our noise model, we apply the system to translate divergent child language to adult-like speech. As in training, the noise FST is composed with the FST for each child sentence s. In place of the adult sentence, the language model FST is used, yielding:

    FST_decode = FST_lm ∘ FST_noise ∘ FST_s

Each path through FST_decode corresponds to an adult translation and derivation (t,d), with path weight P(s,d|t,θ)P(t). Thus, the highest-weight path corresponds to the most likely translation and derivation pair:

    argmax_(t,d) P(t,d|s,θ)

We use a dynamic program to find the n highest-weight paths with distinct adult sentences t. This can be viewed as finding the n most likely adult translations under the Viterbi approximation P(t|s,θ) ≈ max_d P(t,d|s,θ). In our experiments we set n = 50. A simplified FST_decode example is shown in Figure 1.

[Figure 1: A simplified decoding FST for the child sentence "That him hat.", with six states and arcs such as that:that, is:ε, him:him, his:him, hat:hat, hats:hat, and .:. In an actual decoding FST many more transduction arcs exist, including those translating "that" and "him" to any determiner and pronoun, respectively, and affording opportunities for many more deletions and insertions. Input and output strings given by FST paths correspond to possible adult-to-child translations.]

5.6 Discriminative Reranking

To more flexibly capture long-range syntactic features, we embed our noisy channel model in a discriminative reranking procedure. For each child sentence s, we take the n-best candidate translations t_1, ..., t_n from the underlying generative model, as described in the previous section. We then map each candidate translation t_i to a d-dimensional feature vector f(s,t_i). The reranking model then uses a d-dimensional weight vector λ to predict the candidate translation with the highest linear score:

    t* = argmax_{t_i} λ · f(s,t_i)

To simulate test conditions, we train the weight vector on n-best lists from 8-fold cross-validation over the training data, using the averaged perceptron reranking algorithm (Collins and Roark, 2004). Since the n-best list might not include the exact gold-standard correction, a target correction which maximizes our evaluation metric is chosen from the list. The n-best lists are not linearly separable, so perceptron training iterates for 1000 rounds, at which point it is terminated without converging.

Our feature function f(s,t_i) yields nine boolean and real-valued features derived from (i) the FST that generates child sentence s from candidate adult form t_i, and (ii) the POS sequence and dependency parse of candidate t_i obtained with the Stanford Parser (de Marneffe et al., 2006). Features were selected based on their performance in reranking held-out development data from the training set. The reranking features are given below.

Generative Model Probabilities: We first include the joint probability of the child sentence s and candidate translation t_i, given by the generative model: P_lm(t_i) P_noise(s|t_i). We also isolate the candidate translation's language model and noise model probabilities as features. Since both of these probabilities naturally favor shorter sentences, we scale them to sentence length, yielding P_lm(t_i)^(1/n) and P_noise(s|t_i)^(1/n) respectively. By not scaling the joint probability, we allow the reranker to learn its own bias towards longer or shorter corrected sentences.

Contains Noun Subject, Accusative Noun Subject: The first boolean feature indicates whether the dependency parse of candidate translation t_i contains an "nsubj" relation. The second indicates whether an "nsubj" relation exists where the dependent is an accusative pronoun (e.g. "Him ate the cookie"). These features and the one following have previously been used in classifier-based error detection (Morley and Prud'hommeaux, 2012).

Contains Finite Verb: This boolean feature is true if the POS tags of t_i include a finite verb. This feature differentiates structures like "I am going" from "I going."

Question Template Features: We define templates for wh- and yes-no questions. A sentence fits the wh-question template if it begins with a wh-word, followed by an auxiliary or copula verb (e.g. "Who did..."). A sentence fits the yes-no template when it begins with an auxiliary or copula verb, then a noun subject followed by a verb or adjective (e.g. "Are you going..."). We include one boolean feature for each of these templates indicating an inappropriate template match, i.e., when the original child utterance terminates in a period instead of a question mark. In addition to the two features for inappropriate template matches, we have a single feature that signals appropriate matches of either question template, i.e., when the original child utterance terminates in a question mark.
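A bare-bones rendering of this reranking step (a sketch of the standard averaged perceptron; `features` and `oracle` stand in for the feature extractor f(s,t_i) and the metric-maximizing target selection described above):

```python
import numpy as np

def train_reranker(nbest_lists, features, oracle, dim, rounds=1000):
    """nbest_lists: iterable of (child sentence s, list of candidates).
    Returns averaged weights after a fixed number of rounds, since the
    lists are not linearly separable and training will not converge."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    updates = 0
    for _ in range(rounds):
        for s, candidates in nbest_lists:
            target = oracle(s, candidates)
            guess = max(candidates, key=lambda t: w.dot(features(s, t)))
            if guess != target:
                w += features(s, target) - features(s, guess)
            w_sum += w
            updates += 1
    return w_sum / updates       # averaged perceptron weights

def rerank(s, candidates, w, features):
    return max(candidates, key=lambda t: w.dot(features(s, t)))
```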
Child Utterance              Human Correction                   Machine Correction
I am not put in my mouth.    I am not putting it in my mouth.   I am not going to put it in my mouth.
This one have water?         Does this one have water?          This one has water?
Want to read the book.       I want to read the book.           You want to read the book.
Why you going to get two?    Why are you going to get two?      Why are you going to have two?
You very sticky.             You are very sticky.               You are very sticky.
He no like.                  He does not like it.               He does not like that.
Yeah it looks a lady.        Yeah it looks like a lady.         Yeah it looks like a lady.
Eleanor come too.            Eleanor came too.                  Eleanor come too.
Desk in here.                The desk is in here.               Desk is in here.
Why he's doc?                Why is he called doc?              He's up doc?

Table 2: Randomly selected test output generated by our complete error correction model, along with the corresponding child utterances and human corrections.

6 Experiments and Analysis

Baselines: We compare our system's performance with two pre-existing baselines. The first is a standard phrase-based machine translation system using MOSES (Koehn et al., 2007) with GIZA++ (Och and Ney, 2003) word alignments. We hold out 9% of the training data for tuning using the MERT algorithm with a BLEU objective (Och, 2003).

The second baseline is our implementation of the ESL error correction system described by Park and Levy (2011). Like our system, this baseline trains FST noise models using EM in the V-expectation semiring. Our noise model is crafted specifically for the child language domain, and so differs from Park and Levy's in several ways. First, we capture a wider range of word swaps, with richer parameterization allowing many more translation options. As a result, our model has 6,718 parameters, many more than the ESL model's 187. These parameters correspond to learned probability distributions, whereas in the ESL model many of the distributions are fixed as uniform. We also capture a larger class of errors, including deletions, changes of auxiliary lemma, and inflectional overgeneralizations. Finally, we use a discriminative reranking step to model long-range syntactic dependencies. Although the ESL model is originally geared towards fully unsupervised training, we train this baseline in the same supervised framework as our model.

Evaluation and Performance: We train all models on 80% of our child-adult sentence pairs and test on the remaining 20%. For illustration, selected output from our model is shown in Table 2.

Predictions are evaluated with BLEU score (Papineni et al., 2002) and Word Error Rate (WER), defined as the minimum string edit distance (in words) between the reference and predicted translations, divided by the length of the reference. As a control, we compare all results against scores for the uncorrected child sentences themselves. As reported in Table 3, our model achieves the best scores on both metrics. BLEU score increases from 50 for child sentences to 62, while WER is reduced from .271 to .224. Interestingly, MOSES achieves a BLEU score of 58 (still four points below our model) but actually increases WER to .449. For both metrics, the ESL system increases error. This is not surprising given that its intended application is in an entirely different domain.

                  BLEU    WER
WER reranking     62.12   .224
BLEU reranking    60.86   .231
No reranking      60.37   .233
Moses             58.29   .449
ESL               40.76   .318
Child Sentences   49.55   .271

Table 3: WER and BLEU scores. Our system's performance using various reranking schemes (BLEU objective, WER objective and none) is contrasted with the Moses MT and ESL error correction baselines, as well as the uncorrected test sentences. Best performance under each metric is shown in bold.
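WER as defined here is straightforward to compute with word-level edit distance; a quick sketch (standard dynamic programming, not tied to any particular toolkit):

```python
def word_error_rate(reference, hypothesis):
    """Minimum word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# e.g. word_error_rate("I am going", "I going") == 1/3
```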
Error Analysis: We measured the performance of our model over the six most common categories of child divergence, including deletions of various function words and overgeneralizations of past tense forms (e.g. "maked" for "made"). We first identified the model parameters associated with each category, and then counted the number of correct and incorrect parameter firings on the test sentences. As Table 4 indicates, our model performs reasonably well on "be" verb deletions, preposition deletions, and overgeneralizations, but has difficulty correcting pronoun and auxiliary deletions.

Error Type          Count    F1     P     R
Be Deletions          63    .84   .84   .84
Pronoun Deletions     30    .15   .38   .10
Aux. Deletions        30    .21   .44   .13
Prep. Deletions       26    .65   .82   .54
Det. Deletions        22    .48   .73   .36
Overgen. Past          7    .92   1.0   .86

Table 4: Frequency of the six most common error types in test data, along with our model's corresponding F-measure, precision and recall. All counts are ±.12 at p = .05 under a binomial normal approximation interval.

In general, hypothesizing dropped words burdens the noise model by adding additional draws from multinomial distributions to the derivation. To predict a deletion, either the language model or the reranker must strongly prefer including the omitted word. A syntax-based noise model may achieve better performance in detecting and correcting child word drops.

While our model's parameterization and performance rely on the largely constrained nature of child language errors, we observe some instances in which it is overly restrictive. For 10% of the utterances in our corpus, it is impossible to recover the exact gold-standard adult sentence. These sentences feature errors like reordering or lexical lemma swaps, for example "I talk Mexican" for "I speak Spanish." While our model may correct other errors in these sentences, a perfect correction is unattainable.

Sometimes, our model produces appropriate forms which by happenstance do not conform to the annotators' decision. For example, in the second row of Table 2, the model corrects "This one have water?" to "This one has water?", instead of the more verbose correction chosen by the annotators ("Does this one have water?"). Similarly, our model sometimes produces corrections which seem appropriate in isolation, but do not preserve the meaning implied by the larger conversational context. For example, in row three of Table 2, the sentence "Want to read the book." is recognized both by our human annotators and the system as requiring a pronoun subject. Unlike the annotators, however, the model has no knowledge of conversational context, so it chooses the highest-probability pronoun, in this case "you," instead of the contextually correct "I."

[Figure 2: Performance with limited training data, plotted as BLEU (left axis, 40-65, solid line) and WER (right axis, .20-.32, dashed line) against the percentage of training data used.]

Learning Curves: In Figure 2, we see that the learning curves for our model initially rise sharply, then remain relatively flat. Using only 10% of our training data (80 sentences), we increase BLEU from 44 (using just the language model) to almost 61. We only reach our reported BLEU score of 62 when adding the final 20% of training data. This result emphasizes the specificity of our parameterization. Because our model is so tailored to the child-language scenario, only a few examples of each error type are needed to find good parameter values. We suspect that more annotated data would lead to a continued but slow increase in performance.
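The learning curves themselves require nothing more than retraining on nested subsets of the training pairs; a sketch of the harness (`train_fn` and `eval_fn` are hypothetical stand-ins for the EM training and BLEU/WER evaluation described above):

```python
import random

def learning_curve(train_pairs, test_pairs, train_fn, eval_fn,
                   fractions=(0.1, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Retrain on growing fractions of the data, as in Figure 2."""
    pairs = list(train_pairs)
    random.shuffle(pairs)
    curve = []
    for frac in fractions:
        subset = pairs[: int(frac * len(pairs))]
        model = train_fn(subset)                 # e.g. 80 pairs at 10%
        curve.append((frac, eval_fn(model, test_pairs)))
    return curve
```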
Training and Testing across Children: We use our system to investigate the hypothesis that language acquisition follows a similar path across children (Brown, 1973). To test this hypothesis, we train our model on all children excluding Adam, who alone is responsible for 21% of our sentences. We then test the learned model on the separated Adam data. These results are contrasted with the performance of 8-fold cross-validation training and testing solely on Adam's utterances. Performance statistics are given in Table 5.

Trained on:    BLEU    WER
Adam           72.58   .226
All Others     69.83   .186
Uncorrected    45.54   .278

Table 5: Performance on Adam's sentences training on other children, versus training on himself. Best performance under each metric is shown in bold.

We first note that models trained in both scenarios lead to large error reductions over the child sentences. This provides evidence that our model captures general, and not child-specific, error patterns. Although training exclusively on Adam does lead to an increased BLEU score (72.58 vs 69.83), WER is minimized when using the larger volume of training data from other children (.186 vs .226). Taken as a whole, these results suggest that training and testing on separate children does not degrade performance. This finding supports the general hypothesis of shared developmental paths.

Plotting Child Language Errors over Time: After training on annotated data, we predict divergences in all available data from the children in Roger Brown's 1973 study (Adam, Eve and Sarah), as well as Abe (Kuczaj, 1977), a child from a separate study over a similar age range. We plot each child's per-utterance frequency of preposition omissions in Figure 3. Since we evaluate over 65,000 utterances and reranking has no impact on preposition drop prediction, we skip the reranking step to save computation.

[Figure 3: Automatically detected preposition omissions in un-annotated utterances from four children (Adam, Eve, Sarah, Abe) over time, plotted as per-utterance frequency (0-0.14) against age in months (18-58). Assuming perfect model predictions, frequencies are ±.002 at p = .05 under a binomial normal approximation interval. Prediction error is given in Table 4.]

In Figure 3, we see that Adam and Sarah's preposition drops spike early, and then gradually decrease in frequency as their preposition use moves towards that of an adult. Although Eve's data covers an earlier time period, we see that her pattern of preposition drops shows a similar spike and gradual decrease. This is consistent with Eve's general language precocity. Brown's conclusion, that the language development of these three children advanced in similar stages at different times, is consistent with our predictions. However, when we examine Abe we do not observe the same pattern.[7] This points to a degree of variance across children, and suggests the use of our model as a tool for further empirical refinement of language development hypotheses.

[7] Though it is of course possible that a similar spike and drop-off occurred earlier in Abe's development.
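The intervals quoted for Table 4 and Figure 3 follow the usual binomial normal approximation: for a frequency p̂ estimated from n observations, the half-width at p = .05 is 1.96·sqrt(p̂(1−p̂)/n). A quick sketch:

```python
import math

def binomial_halfwidth(p_hat, n, z=1.96):
    """Normal-approximation interval half-width for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# A preposition-drop frequency near .04 measured over ~65,000 utterances
# gives roughly the +/-.002 reported for Figure 3:
print(binomial_halfwidth(0.04, 65000))   # ~0.0015
```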
Discussion: Our error correction system is designed to be more constrained than a full-scale MT system, focusing parameter learning on errors that are known to be common among child language learners. Reorderings are prohibited, lexical word swaps are limited to inflectional changes, and deletions are restricted to function word categories. By highly restricting our hypothesis space, we provide an inductive bias for our model that matches the child language domain. This is particularly important since the size of our training set is much smaller than that usually used in MT. Indeed, as Figure 2 shows, very little data is needed to achieve good performance.

In contrast, the ESL baseline suffers because its generative model is too restricted for the domain of transcribed child language. As shown above in Table 4, child deletions of function words are the most frequent error types in our data. Since the ESL model does not capture word deletions, and has a more restricted notion of word swaps, 88% of the child sentences in our training corpus cannot be translated to their reference adult versions. The result is that the ESL model tends to rely too heavily on the language model. For example, on the sentence "I coming to you," the ESL model improves n-gram probability by producing "I came to you" instead of the correct "I am coming to you". This increases error over the child sentence itself.

In addition to the domain-specific generative model, our approach has the advantage of long-range syntactic information encoded by the reranking features. Although the perceptron algorithm places high weight on the generative model probability, it alters the predictions in 17 out of 201 test sentences, in all cases an improvement. Three of these reranking changes add a noun subject, five enforce question structure, and nine add a main verb.

7 Conclusion and Future Work

In this paper we introduce a corpus of divergent child sentences with corresponding adult forms, enabling the systematic computational modeling of child language by relating it to adult grammar. We propose a child-to-adult translation task as a means to investigate child language development, and provide an initial model for this task.

Our model is based on a noisy-channel assumption, allowing for the deletion and corruption of individual words, and is trained using FST techniques. Despite the debatable cognitive plausibility of our setup, our results demonstrate that our model captures many standard divergences and reduces the average error of child sentences by approximately 20%, with high performance on specific frequently occurring error types.

The model allows us to chart aspects of language development over time, without the need for additional human annotation. Our experiments show that children share common developmental stages in language learning, while pointing to child-specific subtleties in preposition use.

In future work, we intend to dynamically model child language ability as it grows and shifts in response to internal processes and external stimuli. We also plan to develop and train models specializing in the detection of specific error categories. By explicitly shifting our model's objective from child-adult translation to the detection of some particular error, we hope to improve our analysis of child divergences over time.

Acknowledgments

The authors thank the reviewers and acknowledge support by the NSF (grant IIS-1116676) and a research gift from Google. Any opinions, findings, or conclusions are those of the authors, and do not necessarily reflect the views of the NSF.

References

A. Alishahi. 2010. Computational modeling of human language acquisition. Synthesis Lectures on Human Language Technologies, 3(1):1–107.

C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. Implementation and Application of Automata, pages 11–23.
R.H. Baayen, R. Piepenbrock, and L. Gulikers. 1996. CELEX2 (CD-ROM). Linguistic Data Consortium.

E. Bates, I. Bretherton, and L. Snyder. 1988. From first words to grammar: Individual differences and dissociable mechanisms. Cambridge University Press.

D.C. Bellinger and J.B. Gleason. 1982. Sex differences in parental directives to young children. Sex Roles, 8(11):1123–1139.

L. Bliss. 1988. The development of modals. Journal of Applied Developmental Psychology, 9:253–261.

L. Bloom, L. Hood, and P. Lightbown. 1974. Imitation in language development: If, when, and why. Cognitive Psychology, 6(3):380–420.

L. Bloom, P. Lightbown, L. Hood, M. Bowerman, M. Maratsos, and M.P. Maratsos. 1975. Structure and variation in child language. Monographs of the Society for Research in Child Development, pages 1–97.

L. Bloom. 1973. One word at a time: The use of single word utterances before syntax. Mouton.

P. Bloom. 1990. Subjectless sentences in child language. Linguistic Inquiry, 21(4):491–504.

J.N. Bohannon III and A.L. Marquis. 1977. Children's control of adult speech. Child Development, 48(3):1002–1008.

R. Brown. 1973. A first language: The early stages. Harvard University Press.

V. Carlson-Luden. 1979. Causal understanding in the 10-month-old. Ph.D. thesis, University of Colorado at Boulder.

E.C. Carterette and M.H. Jones. 1974. Informal speech: Alphabetic & phonemic texts with statistical analyses and tables. University of California Press.

M. Chodorow and C. Leacock. 2000. An unsupervised method for detecting grammatical errors. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 140–147.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the Association for Computational Linguistics, pages 111–118, Barcelona, Spain, July.

M. Connor, Y. Gertner, C. Fisher, and D. Roth. 2008. Baby SRL: Modeling early language acquisition. In Proceedings of the Conference on Computational Natural Language Learning, pages 81–88.

R. Dale and A. Kilgarriff. 2011. Helping our own: The HOO 2011 pilot shared task. In Proceedings of the European Workshop on Natural Language Generation, pages 242–249.

M.C. de Marneffe, B. MacCartney, and C.D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of The International Conference on Language Resources and Evaluation, volume 6, pages 449–454.

M.J. Demetras, K.N. Post, and C.E. Snow. 1986. Feedback to first language learners: The role of repetitions and clarification questions. Journal of Child Language, 13(2):275–292.

M.J. Demetras. 1989. Working parents' conversational responses to their two-year-old sons.

M. Dreyer, J.R. Smith, and J. Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1080–1089.

A. Echihabi and D. Marcu. 2003. A noisy-channel approach to question answering. In Proceedings of the Association for Computational Linguistics, pages 16–23.

J. Eisner. 2001. Expectation semirings: Flexible EM for learning finite-state transducers. In Proceedings of the ESSLLI Workshop on Finite-State Methods in NLP.

M. Gamon. 2011. High-order sequence modeling for language learner error detection. In Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pages 180–189.

L.C.G. Haggerty. 1930. What a two-and-one-half-year-old child said in one day. The Pedagogical Seminary and Journal of Genetic Psychology, 37(1):75–101.
W.S. Hall, W.C. Tirre, A.L. Brown, J.C. Campoine, P.F. Nardulli, H.O. Abdulrahman, M.A. Sozen, W.C. Schnobrich, H. Cecen, J.G. Barnitz, et al. 1979. The communicative environment of young children: Social class, ethnic, and situational differences. Bulletin of the Center for Children's Books, 32:08.

W.S. Hall, W.E. Nagy, and R.L. Linn. 1980. Spoken words: Effects of situation and social group on oral word usage and frequency. University of Illinois at Urbana-Champaign, Center for the Study of Reading.

W.S. Hall, W.E. Nagy, and G. Nottenburg. 1981. Situational variation in the use of internal state words. Technical report, University of Illinois at Urbana-Champaign, Center for the Study of Reading.

H. Hamburger and S. Crain. 1984. Acquisition of cognitive compiling. Cognition, 17(2):85–136.

R.P. Higginson. 1987. Fixing: Assimilation in language acquisition. University Microfilms International.

M.H. Jones and E.C. Carterette. 1963. Redundancy in children's free-reading choices. Journal of Verbal Learning and Verbal Behavior, 2(5-6):489–493.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics (Interactive Poster and Demonstration Sessions), pages 177–180.

S.A. Kuczaj. 1977. The acquisition of regular and irregular past tense forms. Journal of Verbal Learning and Verbal Behavior, 16(5):589–600.

J. Lee and S. Seneff. 2006. Automatic grammar correction for second-language learners. In Proceedings of the International Conference on Spoken Language Processing.

X. Lu. 2009. Automatic measurement of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics, 14(1):3–28.

B. MacWhinney. 2000. The CHILDES project: Tools for analyzing talk, volume 2. Psychology Press.

B. MacWhinney. 2007. The TalkBank project. Creating and Digitizing Language Corpora: Synchronic Databases, 1:163–180.

M. Mohri. 2008. System and method of epsilon removal of weighted automata and transducers, June 3. US Patent 7,383,185.

E. Morley and E. Prud'hommeaux. 2012. Using constituency and dependency parse features to identify errorful words in disordered language. In Proceedings of the Workshop on Child, Computer and Interaction.

A. Ninio, C.E. Snow, B.A. Pan, and P.R. Rollins. 1994. Classifying communicative acts in children's interactions. Journal of Communication Disorders, 27(2):157–187.

F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

F.J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Association for Computational Linguistics, pages 160–167.

R.E. Owens. 2008. Language development: An introduction. Pearson Education, Inc.

K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics, pages 311–318.

Y.A. Park and R. Levy. 2011. Automated whole sentence grammar correction using a noisy channel model. Proceedings of the Association for Computational Linguistics, pages 934–944.

A.M. Peters. 1987. The role of imitation in the developing syntax of a blind child in perspectives on repetition. Text, 7(3):289–311.
K. Post. 1992. The language learning environment of laterborns in a rural Florida community. Ph.D. thesis, Harvard University.

C. Quirk, C. Brockett, and W. Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 142–149.

T. Regier. 2005. The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29(6):819–865.

A. Ritter, C. Cherry, and W.B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593.

B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the Association for Computational Linguistics (System Demonstrations), pages 61–66.

A. Rozovskaya, M. Sammons, J. Gioja, and D. Roth. 2011. University of Illinois system in HOO text correction shared task. In Proceedings of the European Workshop on Natural Language Generation, pages 263–266.

J. Sachs. 1983. Talking about the there and then: The emergence of displaced reference in parent-child discourse. Children's Language, 4.

K. Sagae, A. Lavie, and B. MacWhinney. 2005. Automatic measurement of syntactic development in child language. In Proceedings of the Association for Computational Linguistics, pages 197–204.

S. Sahakian and B. Snyder. 2012. Automatically learning measures of child language development. Proceedings of the Association for Computational Linguistics (Volume 2: Short Papers), pages 95–99.

C.E. Snow, F. Shonkoff, K. Lee, and H. Levin. 1986. Learning to play doctor: Effects of sex, age, and experience in hospital. Discourse Processes, 9(4):461–473.

E.L. Stine and J.N. Bohannon. 1983. Imitations, interactions, and language acquisition. Journal of Child Language, 10(03):589–603.

X. Sun, J. Gao, D. Micol, and C. Quirk. 2010. Learning phrase-based spelling error models from clickthrough data. In Proceedings of the Association for Computational Linguistics, pages 266–274.

P. Suppes. 1974. The semantics of children's language. American Psychologist, 29(2):103.

T.Z. Tardif. 1994. Adult-to-child speech and language acquisition in Mandarin Chinese. Ph.D. thesis, Yale University.

V. Valian. 1991. Syntactic subjects in the early speech of American and Italian children. Cognition, 40(1-2):21–81.

L. Van Houten. 1986. Role of maternal input in the acquisition process: The communicative strategies of adolescent and older mothers with their language learning children. In Boston University Conference on Language Development.

A. Warren-Leubecker and J.N. Bohannon III. 1984. Intonation patterns in child-directed speech: Mother-father differences. Child Development, 55(4):1379–1385.

A. Warren. 1982. Sex differences in speech to children. Ph.D. thesis, Georgia Institute of Technology.

B. Wilson and A.M. Peters. 1988. What are you cookin' on a hot?: A three-year-old blind child's 'violation' of universal constraints on constituent movement. Language, 64:249–273.