Online Adaptor Grammars with Hybrid Inference Ke Zhai Computer Science and UMIACS University of Maryland College Park, MD USA zhaike@cs.umd.edu Jordan Boyd-Graber Computer Science University of Colorado Boulder, CO USA jordan.boyd.graber@colorado.edu Shay B. Cohen School of Informatics University of Edinburgh Edinburgh, Scotland, UK scohen@inf.ed.ac.uk Abstract Adaptor grammars are a flexible, powerful formalism for defining nonparametric, un- supervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of infer- ence through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this in- ference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks. 1 Introduction Nonparametric Bayesian models are effective tools to discover latent structure in data (Müller and Quin- tana, 2004). These models have had great success in text analysis, especially syntax (Shindo et al., 2012). Nonparametric distributions provide support over a countably infinite long-tailed distributions common in natural language (Goldwater et al., 2011). We focus on adaptor grammars (Johnson et al., 2006), syntactic nonparametric models based on probabilistic context-free grammars. Adaptor gram- mars weaken the strong statistical independence as- sumptions PCFGs make (Section 2). The weaker statistical independence assumptions that adaptor grammars make come at the cost of ex- pensive inference. Adaptor grammars are not alone in this trade-off. For example, nonparametric exten- sions of topic models (Teh et al., 2006) have substan- tially more expensive inference than their parametric counterparts (Yao et al., 2009). A common approach to address this compu- tational bottleneck is through variational infer- ence (Wainwright and Jordan, 2008). One of the advantages of variational inference is that it can be easily parallelized (Nallapati et al., 2007) or trans- formed into an online algorithm (Hoffman et al., 2010), which often converges in fewer iterations than batch variational inference. Past variational inference techniques for adap- tor grammars assume a preprocessing step that looks at all available data to establish the support of these nonparametric distributions (Cohen et al., 2010). Thus, these past approaches are not directly amenable to online inference. Markov chain Monte Carlo (MCMC) inference, an alternative to variational inference, does not have this disadvantage. MCMC is easier to implement, and it discovers the support of nonparametric mod- els during inference rather than assuming it a priori. We apply stochastic hybrid inference (Mimno et al., 2012) to adaptor grammars to get the best of both worlds. We interleave MCMC inference inside vari- ational inference. This preserves the scalability of variational inference while adding the sparse statis- tics and improved exploration MCMC provides. Our inference algorithm for adaptor grammars starts with a variational algorithm similar to Cohen et al. (2010) and adds hybrid sampling within varia- tional inference (Section 3). This obviates the need for expensive preprocessing and is a necessary step to create an online algorithm for adaptor grammars. Our online extension (Section 4) processes exam- ples in small batches taken from a stream of data. As data arrive, the algorithm dynamically extends the underlying approximate posterior distributions as more data are observed. 
This makes the algo- rithm flexible, scalable, and amenable to datasets that cannot be examined exhaustively because of their size—e.g., terabytes of social media data ap- pear every second—or their nature—e.g., speech ac- quisition, where a language learner is limited to the bandwidth of the human perceptual system and can- not acquire data in a monolithic batch (Börschinger and Johnson, 2012). We show our approach’s scalability and effective- ness by applying our inference framework in Sec- tion 5 on two tasks: unsupervised word segmenta- tion and infinite-vocabulary topic modeling. 2 Background In this section, we review probabilistic context-free grammars and adaptor grammars. 2.1 Probabilistic Context-free Grammars Probabilistic context-free grammars (PCFG) de- fine probability distributions over derivations of a context-free grammar. We define a PCFG G to be a tuple 〈W ,N,R,S,θ〉: a set of terminals W , a set of nonterminals N, productions R, start sym- bol S ∈ N and a vector of rule probabilities θ. The rules that rewrite nonterminal c is R(c). For a more complete description of PCFGs, see Manning and Schütze (1999). PCFGs typically use nonterminals with a syntactic interpretation. A sequence of terminals (the yield) is generated by recursively rewriting nonterminals as sequences of child symbols (either a nonterminal or a symbol). This builds a hierarchical phrase-tree structure for every yield. For example, a nonterminal VP represents a verb phrase, which probabilistically rewrites into a se- quence of nonterminals V, N (corresponding to verb and noun) using the production rule VP → V N. Both nonterminals can be further rewritten. Each nonterminal has a multinomial distribution over ex- pansions; for example, a multinomial for nonter- minal N would rewrite as “cake”, with probability θN→cake = 0.03. Rewriting terminates when the derivation has reached a terminal symbol such as “cake” (which does not rewrite). While PCFGs are used both in the supervised set- ting and in the unsupervised setting, in this paper we assume an unsupervised setting, in which only terminals are observed. Our goal is to predict the underlying phrase-structure tree. 2.2 Adaptor Grammars PCFGs assume that the rewriting operations are in- dependent given the nonterminal. This context- freeness assumption often is too strong for modeling natural language. Adaptor grammars break this independence as- sumption by transforming a PCFG’s distribution over Algorithm 1 Generative Process 1: For nonterminals c ∈ N, draw rule probabilities θc ∼ Dir(αc) for PCFG G. 2: for adapted nonterminal c in c1, . . . ,c|M| do 3: Draw grammaton Hc ∼ PYGEM(ac,bc,Gc) according to Equation 1, where Gc is defined by the PCFG rules R. 4: For i ∈ {1, . . . ,D}, generate a phrase-structure tree tS,i using the PCFG rules R(e) at non-adapted nonterminal e and the grammatons Hc at adapted nonterminals c. 5: The yields of trees t1, . . . , tD are observations x1, . . . ,xD. trees Gc rooted at nonterminal c into a richer distri- bution Hc over the trees headed by a nonterminal c, which is often referred to as the grammaton. 
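To ground the PCFG notation from Section 2.1 before formalizing adaptor grammars, the sketch below samples a derivation from a small PCFG by recursively rewriting nonterminals. The toy grammar, its rules, and its probabilities are hypothetical illustrations and are not part of our model or experiments.

```python
import random

# A toy PCFG: each nonterminal maps to a list of (right-hand side, probability).
# Keys of the dictionary are nonterminals; any other symbol is a terminal.
TOY_PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("she",), 0.5), (("cake",), 0.5)],
    "VP": [(("V", "NP"), 1.0)],
    "V":  [(("eats",), 0.7), (("bakes",), 0.3)],
}

def rewrite(symbol, rng=random):
    """Recursively rewrite a symbol, returning (tree, yield)."""
    if symbol not in TOY_PCFG:                     # terminal: stop rewriting
        return symbol, [symbol]
    rhs_options, probs = zip(*TOY_PCFG[symbol])
    rhs = rng.choices(rhs_options, weights=probs, k=1)[0]
    children, tokens = [], []
    for child in rhs:
        subtree, sub_yield = rewrite(child, rng)
        children.append(subtree)
        tokens.extend(sub_yield)
    return (symbol, children), tokens

tree, sentence = rewrite("S")
print(sentence)   # e.g. ['she', 'eats', 'cake']
```

An adaptor grammar wraps such a rewriting process with a cache at each adapted nonterminal: entire subtrees that are regenerated frequently can be reused wholesale instead of being rebuilt rule by rule, which is the role of the grammaton Hc defined next.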
A Pitman-Yor adaptor grammar (PYAG) forms the adapted tree distributions Hc using a Pitman-Yor process (Pitman and Yor, 1997, PY), a generalization of the Dirichlet process (Ferguson, 1973, DP).1 A draw Hc ≡ (πc, zc) is formed by the stick-breaking process (Sudderth and Jordan, 2008, PYGEM) parametrized by scale parameter a, discount factor b, and base distribution Gc:

  \pi'_k \sim \text{Beta}(1-b,\, a+kb), \qquad z_k \sim G_c,
  \pi_k \equiv \pi'_k \prod_{j=1}^{k-1} (1-\pi'_j), \qquad H \equiv \sum_k \pi_k \delta_{z_k}.    (1)

Intuitively, the distribution Hc is a discrete reconstruction of the atoms sampled from Gc—hence, it reweights Gc. Grammaton Hc assigns non-zero stick-breaking weights π to a countably infinite number of parse trees z. We describe learning these grammatons in Section 3.

More formally, a PYAG is a quintuple A = 〈G, M, a, b, α〉 with: a PCFG G; a set of adapted nonterminals M ⊆ N; Pitman-Yor process parameters ac, bc at each adaptor c ∈ M; and Dirichlet parameters αc for each nonterminal c ∈ N. We also assume an order on the adapted nonterminals, c1, . . . , c|M|, such that cj is not reachable from ci in a derivation if j > i.2 Algorithm 1 describes the generative process of an adaptor grammar on a set of D observed sentences x1, . . . , xD.

1 Adaptor grammars, in their general form, do not have to use the Pitman-Yor process, but to date they have only been used with the Pitman-Yor process.
2 This is possible because we assume that recursive nonterminals are not adapted.

Given a PYAG A, the joint probability of a set of sentences X and its collection of trees T is

  p(X, T, \pi, \theta, z \mid A) = \prod_{c \in M} p(\pi_c \mid a_c, b_c)\, p(z_c \mid G_c) \cdot \prod_{c \in N} p(\theta_c \mid \alpha_c) \prod_{x_d \in X} p(x_d, t_d \mid \theta, \pi, z),

where xd and td represent the dth observed string and its corresponding parse. The multinomial PCFG parameter θc is drawn from a Dirichlet distribution at nonterminal c ∈ N. At each adapted nonterminal c ∈ M, the stick-breaking weights πc are drawn from a PYGEM (Equation 1). Each weight has an associated atom z_{c,i} from base distribution Gc, a subtree rooted at c. The probability p(xd, td | θ, π, z) is the PCFG likelihood of yield xd with parse tree td.

Adaptor grammars require a base PCFG with no recursive adapted nonterminals, i.e., there cannot be a path in a derivation from a given adapted nonterminal to a second appearance of that adapted nonterminal.

3 Hybrid Variational-MCMC Inference

Discovering the latent variables of the model—trees, adapted probabilities, and PCFG rules—is a problem of posterior inference given observed data. Previous approaches use MCMC (Johnson et al., 2006) or variational inference (Cohen et al., 2010). MCMC discovers the support of nonparametric models during inference, but does not scale to larger datasets (due to tight coupling of variables). Variational inference, in contrast, is inherently parallel and easily amenable to online inference, but requires preprocessing to discover the adapted productions. We combine the best of both worlds and propose a hybrid variational-MCMC inference algorithm for adaptor grammars.

Variational inference posits a variational distribution over the latent variables in the model; this in turn induces an "evidence lower bound" (ELBO, L) as a function of a variational distribution q, a lower bound on the marginal log-likelihood. Variational inference optimizes this objective function with respect to the parameters that define q.

In this section, we derive coordinate-ascent updates for these variational parameters. A key mathematical component is taking expectations with respect to the variational distribution q.
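The key computational idea we rely on below can be stated compactly: when an expectation under q cannot be computed in closed form, it can be estimated by averaging over a handful of samples drawn from q. The following sketch is schematic; the function arguments are placeholders rather than parts of our implementation.

```python
def monte_carlo_expectation(sample_from_q, f, num_samples=10):
    """Estimate E_q[f(t)] by averaging f over samples from q, instead of
    enumerating every configuration t (e.g., every parse tree) explicitly."""
    return sum(f(sample_from_q()) for _ in range(num_samples)) / num_samples
```

With only a few samples, most configurations receive zero weight, which yields the sparse representations exploited throughout this section.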
We strategically use MCMC sampling to compute the expectation of q over parse trees z. Instead of explicitly computing the variational distribution for all parameters, one can sample from it. This produces a sparse approximation of the variational distribution, which improves both scalability and performance: sparse distributions are easier to store and transmit in implementations, and Mimno et al. (2012) show that sparse representations also improve performance. Moreover, because this approach can flexibly adjust its support, it is a necessary prerequisite to online inference (Section 4).

3.1 Variational Lower Bound

We posit a mean-field variational distribution:

  q(\pi, \theta, T \mid \gamma, \nu, \phi) = \prod_{c \in M} \prod_{i=1}^{\infty} q(\pi'_{c,i} \mid \nu^1_{c,i}, \nu^2_{c,i}) \cdot \prod_{c \in N} q(\theta_c \mid \gamma_c) \prod_{x_d \in X} q(t_d \mid \phi_d),    (2)

where π'_{c,i} is drawn from a variational Beta distribution parameterized by ν^1_{c,i} and ν^2_{c,i}, and θc is drawn from a variational Dirichlet prior γc ∈ R_+^{|R(c)|}. The index i ranges over a possibly infinite number of adapted rules. The parse for the dth observation, td, is modeled by a multinomial φd, where φ_{d,i} is the probability of generating the ith phrase-structure tree t_{d,i}.

The variational distribution over latent variables induces the following ELBO on the likelihood:

  L(z, \pi, \theta, T, D;\, a, b, \alpha) = H[q(\theta, \pi, T)] + \sum_{c \in N} E_q[\log p(\theta_c \mid \alpha_c)]
      + \sum_{c \in M} \sum_{i=1}^{\infty} E_q[\log p(\pi'_{c,i} \mid a_c, b_c)]
      + \sum_{c \in M} \sum_{i=1}^{\infty} E_q[\log p(z_{c,i} \mid \pi, \theta)]
      + \sum_{x_d \in X} E_q[\log p(x_d, t_d \mid \pi, \theta, z)],    (3)

where H[•] is the entropy function.

To make this lower bound tractable, we truncate the distribution over π to a finite set (Blei and Jordan, 2005) for each adapted nonterminal c ∈ M, i.e., π'_{c,K_c} ≡ 1 for some index K_c. Because the atom weights π_k are deterministically defined by Equation 1, this implies that π_{c,i} is zero beyond index K_c. Each weight π_{c,i} is associated with an atom z_{c,i}, a subtree rooted at c. We call the ordered set of z_{c,i} the truncated nonterminal grammaton (TNG). Each adapted nonterminal c ∈ M has its own TNG_c. The ith subtree in TNG_c is denoted TNG_c(i).

In the rest of this section, we describe approximate inference to maximize L. The most important update is φ_{d,i}, which we update using stochastic MCMC inference (Section 3.2). Past variational approaches for adaptor grammars (Cohen et al., 2010) rely on a preprocessing step and heuristics to define a static TNG. In contrast, our model dynamically discovers trees: the TNG grows as the model sees more data, allowing online updates (Section 4).

The remaining variational parameters are optimized using expected counts of adaptor grammar rules. These expected counts are described in Section 3.3, and the updates for the variational parameters other than φ_{d,i} are described in Section 3.4.

3.2 Stochastic MCMC Inference

Each observation xd has an associated variational multinomial distribution φd over trees td that can yield observation xd, with probability φ_{d,i}. Holding all other variational parameters fixed, the coordinate-ascent update (Mimno et al., 2012; Bishop, 2006) for φ_{d,i} is

  \phi_{d,i} \propto \exp\{ E_q^{\neg \phi_d}[\log p(t_{d,i} \mid x_d, \pi, \theta, z)] \},    (4)

where φ_{d,i} is the probability of generating the ith phrase-structure tree t_{d,i}, and E_q^{\neg \phi_d}[•] is the expectation with respect to the variational distribution q, excluding the value of φd. Instead of computing this expectation explicitly, we turn to stochastic variational inference (Mimno et al., 2012; Hoffman et al., 2013) and sample from this distribution. This produces a set of sampled trees σd ≡ {σ_{d,1}, . . . , σ_{d,k}}.
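Concretely, the sampled trees induce a sparse empirical distribution over parses, as in this simplified sketch (here `sample_tree` stands in for the sampler over the approximate PCFG described below, and trees are assumed to be hashable, e.g., nested tuples):

```python
from collections import Counter

def empirical_phi(sample_tree, num_samples=10):
    """Approximate the variational distribution over parses of one sentence
    by the empirical distribution of a few sampled derivations."""
    samples = [sample_tree() for _ in range(num_samples)]
    counts = Counter(samples)            # most parses are never sampled: sparse
    return {tree: n / num_samples for tree, n in counts.items()}
```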
From this set of trees we can approximate our variational distribution over trees φ using the empirical distribution σd, i.e.,

  \phi_{d,i} \propto \mathbb{I}[\sigma_{d,j} = t_{d,i}, \forall \sigma_{d,j} \in \sigma_d].    (5)

This leads to a sparse approximation of the variational distribution φ.3

Previous inference strategies for adaptor grammars have also used sampling (Johnson et al., 2006; Börschinger and Johnson, 2012). These methods use an approximate PCFG to emulate the marginalized Pitman-Yor distributions at each nonterminal. Given this approximate PCFG, we can then sample a derivation z for string x from the possible trees (Johnson et al., 2007).

3 In our experiments, we use ten samples.

Sampling requires a derived PCFG G′ that approximates the distribution over tree derivations conditioned on a yield. It includes the original PCFG rules R = {c → β} that define the base distribution and the new adapted productions R′ = {c ⇒ z, z ∈ TNG_c}. Under G′, the probability θ′ of adapted production c ⇒ z is

  \log \theta'_{c \Rightarrow z} =
    \begin{cases}
      E_q[\log \pi_{c,i}], & \text{if } TNG_c(i) = z, \\
      E_q[\log \pi_{c,K_c}] + E_q[\log \theta_{c \Rightarrow z}], & \text{otherwise},
    \end{cases}    (6)

where K_c is the truncation level of TNG_c and π_{c,K_c} represents the left-over stick weight in the stick-breaking process for adaptor c ∈ M. θ_{c⇒z} represents the probability of generating tree c ⇒ z under the base distribution. See also Cohen (2011).

The expectation of the Pitman-Yor multinomial π_{a,i} under the truncated variational stick-breaking distribution is

  E_q[\log \pi_{a,i}] = \Psi(\nu^1_{a,i}) - \Psi(\nu^1_{a,i} + \nu^2_{a,i}) + \sum_{j=1}^{i-1} \left( \Psi(\nu^2_{a,j}) - \Psi(\nu^1_{a,j} + \nu^2_{a,j}) \right),    (7)

and the expectation of generating the phrase-structure tree a ⇒ z from PCFG productions under the variational Dirichlet distribution is

  E_q[\log \theta_{a \Rightarrow z}] = \sum_{c \to \beta \in a \Rightarrow z} \left( \Psi(\gamma_{c \to \beta}) - \Psi\Big( \sum_{c \to \beta' \in R(c)} \gamma_{c \to \beta'} \Big) \right),    (8)

where Ψ(•) is the digamma function and c → β ∈ a ⇒ z ranges over all PCFG productions in the phrase-structure tree a ⇒ z.

This PCFG can compose arbitrary subtrees and thus discover new trees that better describe the data, even if those trees are not part of the TNG. This is equivalent to creating a "new table" in MCMC inference and provides truncation-free variational updates (Wang and Blei, 2012) by sampling an unseen subtree with adapted nonterminal c ∈ M at the root. This frees our model from the preprocessing step that initializes truncated grammatons in Cohen et al. (2010). This stochastic approach has the advantage of creating sparse distributions (Wang and Blei, 2012): few unique trees will be represented.
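The quantities in Equations 7 and 8 reduce to sums of digamma terms and are cheap to compute. A minimal sketch, assuming the variational parameters are stored in plain Python containers (the data structures are illustrative and zero-based, not those of our implementation):

```python
from scipy.special import digamma

def expected_log_stick(nu1, nu2, i):
    """E_q[log pi_{a,i}] under the truncated stick-breaking distribution
    (Equation 7); nu1[j] and nu2[j] are the Beta parameters of stick j."""
    value = digamma(nu1[i]) - digamma(nu1[i] + nu2[i])
    for j in range(i):
        value += digamma(nu2[j]) - digamma(nu1[j] + nu2[j])
    return value

def expected_log_base(tree_rules, gamma, rules_by_lhs):
    """E_q[log theta_{a => z}] for a subtree listed as (lhs, rhs) productions
    (Equation 8); gamma maps a production to its variational Dirichlet parameter."""
    total = 0.0
    for lhs, rhs in tree_rules:
        normalizer = sum(gamma[(lhs, other)] for other in rules_by_lhs[lhs])
        total += digamma(gamma[(lhs, rhs)]) - digamma(normalizer)
    return total
```

Combining these two expectations as in Equation 6 gives the rule weights of the approximate PCFG G′ used for sampling.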
[Figure 1 shows three sampled derivations under a toy grammar S → A B, A → B B, B → {a, b, c}, with A adapted, together with the seating assignments and the resulting f, g, and h count updates for the yields "ca", "ab", and "ba".]

Figure 1: Given an adaptor grammar, we sample derivations from an approximate PCFG and show how they affect counts. The sampled derivations can be understood via the Chinese restaurant metaphor (Johnson et al., 2006). Existing cached rules (elements in the TNG) can be thought of as occupied tables; this happens for the yield "ba", which increases counts for unadapted rules, g, and for entries in TNG_A, f. For the yield "ca", there is no appropriate entry in the TNG, so it must use the base distribution, which corresponds to sitting at a new table. This generates counts for g, as it uses unadapted rules, and for h, which represents entries that could be included in the TNG in the future. The final yield, "ab", shows that even when compatible entries are in the TNG, it might still create a new table, changing the underlying base distribution.

Parallelization As noted in Cohen et al. (2010), the inside-outside algorithm dominates the runtime of every iteration, both for sampling and for variational inference. However, unlike MCMC, variational inference is highly parallelizable and requires fewer synchronizations per iteration (Zhai et al., 2012). In our approach, both the inside computations and the sampling process can be distributed, and the resulting counts can be aggregated afterwards. In our implementation, we use multiple threads to parallelize tree sampling.

3.3 Calculating Expected Rule Counts

For every observation xd, the hybrid approach produces a set of sampled trees, each of which contains three types of productions: adapted rules, original PCFG rules, and potentially adapted rules. The last set is most important, as these are new rules discovered by the sampler. These are explained using the Chinese restaurant metaphor in Figure 1. The multiset of all adapted productions in tree t_{d,i} is M(t_{d,i}) and the multiset of non-adapted productions that generate tree t_{d,i} is N(t_{d,i}). We compute three counts:

1. f is the expected count of productions within the TNG. It is the sum over the probability of a tree t_{d,k} times the number of times an adapted production appears in t_{d,k},

  f_d(a \Rightarrow z_{a,i}) = \sum_k \phi_{d,k} \, |a \Rightarrow z_{a,i} : a \Rightarrow z_{a,i} \in M(t_{d,k})|,

where the second factor is the count of rule a ⇒ z_{a,i} in tree t_{d,k}.

2. g is the expected count of the PCFG productions R that define the base distribution of the adaptor grammar,

  g_d(a \to \beta) = \sum_k \phi_{d,k} \, |a \to \beta : a \to \beta \in N(t_{d,k})|.

3. Finally, a third set of productions are newly discovered by the sampler and not in the TNG. These subtrees are rules that could be adapted, with expected counts

  h_d(c \Rightarrow z_{c,i}) = \sum_k \phi_{d,k} \, |c \Rightarrow z_{c,i} : c \Rightarrow z_{c,i} \notin M(t_{d,k})|.

These subtrees—lists of PCFG rules sampled from Equation 6—correspond to adapted productions not yet present in the TNG.

3.4 Variational Updates

Given the sparse vectors φ sampled in the hybrid MCMC step, we update all variational parameters as

  \gamma_{a \to \beta} = \alpha_{a \to \beta} + \sum_{x_d \in X} g_d(a \to \beta) + \sum_{b \in M} \sum_{i=1}^{K_b} n(a \to \beta, z_{b,i}),

  \nu^1_{a,i} = 1 - b_a + \sum_{x_d \in X} f_d(a \Rightarrow z_{a,i}) + \sum_{b \in M} \sum_{k=1}^{K_b} n(a \Rightarrow z_{a,i}, z_{b,k}),

  \nu^2_{a,i} = a_a + i b_a + \sum_{x_d \in X} \sum_{j=1}^{K_a} f_d(a \Rightarrow z_{a,j}) + \sum_{b \in M} \sum_{k=1}^{K_b} \sum_{j=1}^{K_a} n(a \Rightarrow z_{a,j}, z_{b,k}),

where n(r, t) is the expected number of times production r appears in tree t, estimated during sampling.

Hyperparameter Update We update the PCFG hyperparameter α and the PYGEM hyperparameters a and b as in Cohen et al. (2010).

4 Online Variational Inference

Online inference for probabilistic models requires us to update our posterior distribution as new observations arrive. Unlike batch inference algorithms, we do not assume we always have access to the entire dataset. Instead, we assume that observations arrive in small groups called minibatches. The advantage of online inference is threefold: a) it does not require retaining the whole dataset in memory; b) each online update is fast; and c) the model usually converges faster. All of these make adaptor grammars scalable to larger datasets.

Our approach is based on stochastic variational inference for topic models (Hoffman et al., 2013).
This inference strategy uses a form of stochastic gradient descent (Bottou, 1998): using the gradient of the ELBO, it finds the sufficient statistics necessary to update the variational parameters (mostly expected counts calculated using the inside-outside algorithm) and interpolates the result with the current model.

We assume data arrive in minibatches B (sets of sentences). We accumulate expected counts

  \tilde f^{(l)}(a \Rightarrow z_{a,i}) = (1 - \epsilon) \cdot \tilde f^{(l-1)}(a \Rightarrow z_{a,i}) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} f_d(a \Rightarrow z_{a,i}),    (9)

  \tilde g^{(l)}(a \to \beta) = (1 - \epsilon) \cdot \tilde g^{(l-1)}(a \to \beta) + \epsilon \cdot \frac{|X|}{|B_l|} \sum_{x_d \in B_l} g_d(a \to \beta),    (10)

with decay factor ε ∈ (0, 1) to guarantee convergence. We set it to ε = (τ + l)^{−κ}, where l is the minibatch counter. The decay inertia τ prevents premature convergence, and the decay rate κ controls the speed of change in the sufficient statistics (Hoffman et al., 2010). We recover the batch variational approach when B = D and κ = 0.

The variables \tilde f^{(l)} and \tilde g^{(l)} are the accumulated sufficient statistics of adapted and unadapted productions after processing minibatch B_l. They update the approximate gradient. The updates for the variational parameters become

  \gamma_{a \to \beta} = \alpha_{a \to \beta} + \tilde g^{(l)}(a \to \beta) + \sum_{b \in M} \sum_{i=1}^{K_b} n(a \to \beta, z_{b,i}),    (11)

  \nu^1_{a,i} = 1 - b_a + \tilde f^{(l)}(a \Rightarrow z_{a,i}) + \sum_{b \in M} \sum_{k=1}^{K_b} n(a \Rightarrow z_{a,i}, z_{b,k}),    (12)

  \nu^2_{a,i} = a_a + i b_a + \sum_{j=1}^{K_a} \tilde f^{(l)}(a \Rightarrow z_{a,j}) + \sum_{b \in M} \sum_{k=1}^{K_b} \sum_{j=1}^{K_a} n(a \Rightarrow z_{a,j}, z_{b,k}),    (13)

where K_a is the size of the TNG at adaptor a ∈ M.

4.1 Refining the Truncation

As we observe more data during inference, our TNGs need to change: new rules should be added, useless rules should be removed, and derivations for existing rules should be updated. In this section, we describe heuristics for performing each of these operations.

Adding Productions Sampling can identify productions that are not adapted but were instead drawn from the base distribution. These are candidates for the TNG. For every nonterminal a, we add these potentially adapted productions to TNG_a after each minibatch. The count associated with a candidate production is then associated with an adapted production, i.e., the h count contributes to the relevant f count. This mechanism dynamically expands TNG_a.

Sorting and Removing Productions Our model does not require a preprocessing step to initialize the TNGs; rather, it constructs and expands all TNGs on the fly. To prevent the TNGs from growing unwieldy, we prune them every u minibatches. As a result, we need to impose an ordering over all the parse trees in the TNG. The underlying PYGEM distribution implicitly places a ranking over all the atoms according to their corresponding sufficient statistics (Kurihara et al., 2007), as shown in Equation 9. This measures the "usefulness" of every adapted production throughout the inference process.

In addition to the accumulated sufficient statistics, Cohen et al. (2010) add a secondary term to discourage short constituents (Mochihashi et al., 2009). We impose a reward for longer phrases in addition to \tilde f and sort all adapted productions in TNG_a using the ranking score

  \Lambda(a \Rightarrow z_{a,i}) = \tilde f^{(l)}(a \Rightarrow z_{a,i}) \cdot \log(\epsilon \cdot |s| + 1),

where |s| is the length of the yield of production a ⇒ z_{a,i}. Because ε decreases with each minibatch, the reward for long phrases diminishes. This is similar to an annealed version of Cohen et al. (2010), where the reward for long phrases is fixed; see also Mochihashi et al. (2009). After sorting, we remove all but the top K_a adapted productions.
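A compact sketch of this pruning step, assuming each TNG is kept as a mapping from an adapted production to its terminal yield and the accumulated counts are kept in a dictionary (both data structures are illustrative):

```python
import math

def prune_tng(tng, f_tilde, epsilon, K):
    """Rank adapted productions by Lambda = f_tilde * log(epsilon * |yield| + 1)
    (the ranking score defined above) and keep only the top K."""
    def score(production):
        return f_tilde.get(production, 0.0) * math.log(epsilon * len(tng[production]) + 1.0)
    ranked = sorted(tng, key=score, reverse=True)
    return {production: tng[production] for production in ranked[:K]}
```

Productions that fall below the cutoff lose their cached subtree, but nothing prevents the sampler from rediscovering them later and re-adding them to the TNG.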
Rederiving Adapted Productions For MCMC inference, Johnson and Goldwater (2009) observe that atoms already associated with a yield may have trees that do not explain their yield well. They propose table label resampling to rederive yields. In our approach this is equivalent to "mutating" some derivations in a TNG. After pruning rules every u minibatches, we perform table label resampling for adapted nonterminals from general to specific (i.e., in a topological sort). This provides better expected counts n(r, •) for rules used in phrase-structure subtrees. Empirically, we find that table label resampling only marginally improves the word segmentation results.

Initialization Our inference begins with random variational Dirichlets and empty TNGs, which obviates the preprocessing step of Cohen et al. (2010). Our model constructs and expands all TNGs on the fly, mimicking the incremental initialization of Johnson and Goldwater (2009). Algorithm 2 summarizes the pseudo-code of our online approach.

Algorithm 2 Online inference for adaptor grammars
1: Randomly initialize all variational parameters.
2: for minibatch l = 1, 2, . . . do
3:   Construct the approximate PCFG θ′ of A (Equation 6).
4:   for input sentence d = 1, 2, . . . , Dl do
5:     Accumulate inside probabilities under the approximate PCFG θ′.
6:     Sample phrase-structure trees σ and update the tree distribution φ (Equation 5).
7:     For every adapted nonterminal c, append adapted productions to TNG_c.
8:   Accumulate sufficient statistics (Equations 9 and 10).
9:   Update γ, ν1, and ν2 (Equations 11-13).
10:  Refine and prune the truncation every u minibatches.

  collocation:  SENT → COLLOC
                SENT → COLLOC SENT
                COLLOC → WORDS
  unigram:      WORDS → WORD
                WORDS → WORD WORDS
                WORD → CHARS
                CHARS → CHAR
                CHARS → CHAR CHARS
                CHAR → ?
  InfVoc LDA:   SENT → DOC_j           j = 1, 2, . . . , D
                DOC_j → −j TOPIC_i     i = 1, 2, . . . , K
                TOPIC_i → WORD
                WORD → CHARS
                CHARS → CHAR
                CHARS → CHAR CHARS
                CHAR → ?

Table 1: Grammars used in our experiments. The nonterminal CHAR is a non-adapted rule that expands to all characters used in the data, sometimes called pre-terminals. Adapted nonterminals are underlined in the original table: for the unigram grammar, only the nonterminal WORD is adapted, whereas for the collocation grammar both WORD and COLLOC are adapted. For the INFVOC LDA grammar, D is the total number of documents and K is the number of topics; j ranges over {1, . . . , D} and i ranges over {1, . . . , K}.

4.2 Complexity

Inside and outside calls dominate execution time for adaptor grammar inference. Variational approaches run the inside-outside algorithm and estimate the expected counts for every possible tree derivation (Cohen et al., 2010). For a dataset with D observations, variational inference requires O(DI) calls to the inside-outside algorithm, where I is the number of iterations, typically in the tens.

In contrast, MCMC only needs to accumulate inside probabilities and then sample a tree derivation (Chappelier and Rajman, 2000). The sampling step is negligible in processing time compared to the inside algorithm. MCMC inference requires O(DI) calls to the inside algorithm—hence every iteration is much faster than in the variational approach—but I is usually on the order of thousands.

Likewise, our hybrid approach also needs only the less expensive inside algorithm to sample trees. And while each iteration is less expensive, our approach can achieve reasonable results with only a single pass through the data,
and thus requires only O(D) calls to the inside algorithm.

Because the inside-outside algorithm is fundamental to each of these approaches, we use it as a common basis for comparison across different implementations. This is over-generous to variational approaches, as the full inside-outside computation is more expensive than the inside probability computation required for sampling in MCMC and in our hybrid approach.

5 Experiments and Discussion

We implement our online adaptor grammar model (ONLINE) in Python4 and compare it against both MCMC (Johnson and Goldwater, 2009, MCMC) and variational inference (Cohen et al., 2010, VARIATIONAL). We use the latest implementation of the MCMC sampler for adaptor grammars5 and simulate the variational approach using our implementation. For the MCMC approach, we use the best settings reported in Johnson and Goldwater (2009), with incremental initialization and table label resampling.

4 Available at http://www.umiacs.umd.edu/˜zhaike/.
5 http://web.science.mq.edu.au/˜mjohnson/code/py-cfg-2013-02-25.tgz

Model and Settings       ctb7                         pku                          cityu
                         unigram       collocation    unigram       collocation    unigram       collocation
MCMC    500 iter         72.70 (2.81)  50.53 (2.82)   72.01 (2.82)  49.06 (2.81)   74.19 (3.55)  63.14 (3.53)
        1000 iter        72.65 (2.83)  62.27 (2.79)   71.81 (2.81)  62.47 (2.77)   74.37 (3.54)  70.62 (3.51)
        1500 iter        72.17 (2.80)  69.65 (2.77)   71.46 (2.80)  70.20 (2.73)   74.22 (3.54)  72.33 (3.50)
        2000 iter        71.75 (2.79)  71.66 (2.76)   71.04 (2.79)  72.55 (2.70)   74.01 (3.53)  73.15 (3.48)
ONLINE  (κ, τ)           KWord = 30k, KColloc = 100k  KWord = 40k, KColloc = 120k  KWord = 50k, KColloc = 150k
        0.6, 32          70.17 (2.84)  68.43 (2.77)   69.93 (2.89)  68.09 (2.71)   72.59 (3.62)  69.27 (3.61)
        0.6, 128         72.98 (2.72)  65.20 (2.81)   72.26 (2.63)  65.57 (2.83)   74.73 (3.40)  64.83 (3.62)
        0.6, 512         72.76 (2.78)  56.05 (2.85)   71.99 (2.74)  58.94 (2.94)   73.68 (3.60)  60.40 (3.70)
        0.8, 32          71.10 (2.77)  70.84 (2.76)   70.31 (2.78)  70.91 (2.71)   73.12 (3.60)  71.89 (3.50)
        0.8, 128         72.79 (2.64)  70.93 (2.63)   72.08 (2.62)  72.02 (2.63)   74.62 (3.45)  72.28 (3.51)
        0.8, 512         72.82 (2.58)  68.53 (2.76)   72.14 (2.58)  70.07 (2.69)   74.71 (3.37)  72.58 (3.49)
        1.0, 32          69.98 (2.87)  70.71 (2.63)   69.42 (2.84)  71.45 (2.67)   73.18 (3.59)  72.42 (3.45)
        1.0, 128         71.84 (2.72)  71.29 (2.58)   71.29 (2.67)  72.56 (2.61)   73.23 (3.39)  72.61 (3.41)
        1.0, 512         72.68 (2.62)  70.67 (2.60)   71.86 (2.63)  71.39 (2.66)   74.45 (3.41)  72.88 (3.38)
VARIATIONAL              69.83 (2.85)  67.78 (2.75)   67.82 (2.80)  66.97 (2.75)   70.47 (3.72)  69.06 (3.69)

Table 2: Word segmentation accuracy measured by word token F1 scores, with negative log-likelihood on the held-out test dataset in parentheses (lower is better, on the scale of 10^6), for our ONLINE model against the MCMC approach (Johnson et al., 2006) and VARIATIONAL on various datasets using the unigram and collocation grammars.7

[Figure 2 appears here: two panels, (a) unigram grammar and (b) collocation grammar, plotting word token F1 against the number of inside-outside function calls for MCMC, ONLINE, and VARIATIONAL.]

Figure 2: Word segmentation accuracy measured by word token F1 scores on the brent corpus for the three approaches, plotted against the number of inside-outside function calls, using the unigram (upper) and collocation (lower) grammars in Table 1.6

6 Our ONLINE settings are batch size B = 20, decay inertia τ = 128, and decay rate κ = 0.6 for the unigram grammar; and minibatch size B = 5, decay inertia τ = 256, and decay rate κ = 0.8 for the collocation grammar. TNGs are refined at interval u = 50.
Truncation size is set to KWord = 1.5k and KColloc = 3k. The settings were chosen by cross-validation. We observe similar behavior under κ = {0.7, 0.9, 1.0}, τ = {32, 64, 512}, B = {10, 50}, and u = {10, 20, 100}.

7 For ONLINE inference, we parallelize each minibatch with four threads and use batch size B = 100 and TNG refinement interval u = 100. The ONLINE approach runs for two passes over the datasets. VARIATIONAL runs fifty iterations with the same truncation level as ONLINE. For the negative log-likelihood evaluation, we train the model on a random 70% of the data and hold out the rest for testing. We observe similar behavior for our model under κ = {0.7, 0.9} and τ = {64, 256}.

5.1 Word Segmentation

We evaluate our online adaptor grammar on the task of word segmentation, which focuses on identifying word boundaries in a sequence of characters. This is especially relevant for Chinese, where characters are written in sequence without word boundaries. We first evaluate all three models on the standard Brent version of the Bernstein-Ratner corpus (Bernstein-Ratner, 1987; Brent and Cartwright, 1996, brent). The dataset contains 10k sentences, 1.3k distinct words, and 72 distinct characters. We compare results with both the unigram and collocation grammars introduced by Johnson and Goldwater (2009), listed in Table 1.

Figure 2 illustrates word segmentation accuracy in terms of word token F1 scores on brent against the number of inside-outside function calls for all three approaches using the unigram and collocation grammars. In both cases, our ONLINE approach converges faster than the MCMC and VARIATIONAL approaches, yet yields comparable or better performance after seeing more data.

In addition to the brent corpus, we also evaluate the three approaches on three other Chinese datasets compiled by Xue et al. (2005) and Emerson (2005):8

• Chinese Treebank 7.0 (ctb7): 162k sentences, 57k distinct words, 4.5k distinct characters;
• Peking University (pku): 183k sentences, 53k distinct words, 4.6k distinct characters; and
• City University of Hong Kong (cityu): 207k sentences, 64k distinct words, and 5k distinct characters.

8 We use all punctuation as natural delimiters (i.e., words cannot cross punctuation).

We compare our inference method against the other approaches on F1 score. While other unsupervised word segmentation systems are available (Mochihashi et al. (2009), inter alia),9 our focus is a direct comparison of inference techniques for adaptor grammars, which achieve competitive (if not state-of-the-art) performance.

Table 2 shows the word token F1 scores and the negative log-likelihood on the held-out test dataset for our model against MCMC and VARIATIONAL. We randomly sample 30% of the data for testing and use the rest for training. We compute the held-out likelihood of the most likely sampled parse trees from each model.10 Our ONLINE approach consistently segments words better than VARIATIONAL and achieves comparable or better results than MCMC.

For MCMC, Johnson and Goldwater (2009) show that incremental initialization—or online updates in general—results in more accurate word segmentation, even though the trees have lower posterior probability. Similarly, our ONLINE approach initializes grammatons and parse trees on the fly and learns them as data arrive, instead of initializing them for all of the data upfront as VARIATIONAL does. This uniformly outperforms batch initialization on the word segmentation tasks.
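For reference, the word token F1 used throughout this section can be computed by aligning predicted and gold word spans over the same character sequence; the sketch below gives the standard token-span formulation, not necessarily the exact evaluation script we used.

```python
def word_spans(words):
    """Convert a segmentation (list of words) into character-offset spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def token_f1(predicted, gold):
    """Word token F1 between predicted and gold segmentations; each argument
    is a list of sentences, and each sentence is a list of words."""
    tp = fp = fn = 0
    for pred_sent, gold_sent in zip(predicted, gold):
        p, g = word_spans(pred_sent), word_spans(gold_sent)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```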
5.2 Infinite Vocabulary Topic Modeling

Topic models can often be replicated using a carefully crafted PCFG (Johnson, 2010). These powerful extensions can capture topical collocations and sticky topics; such embellishments could further improve NLP applications of simple unigram topic models such as word sense disambiguation (Boyd-Graber and Blei, 2007), part-of-speech tagging (Toutanova and Johnson, 2008), or dialogue modeling (Zhai and Williams, 2014). However, expressing topic models as adaptor grammars is much slower than traditional topic models, for which fast online inference (Hoffman et al., 2010) is available.

9 Their results are not directly comparable: they use different subsets and assume different preprocessing.
10 Note that this is only an approximation to the true held-out likelihood, since it is impossible to enumerate all possible parse trees and hence compute the exact likelihood of a given sentence under the model.
11 We train all models with 5 topics with settings: TNG refinement interval u = 100, truncation size KTopic = 3k, and minibatch size B = 50. We observe similar behavior under κ ∈ {0.7, 0.9} and τ ∈ {64, 256}.

[Figure 3 appears here: a grid of panels over decay rate κ ∈ {0.6, 0.8, 1.0} and decay inertia τ ∈ {32, 128, 512}, plotting topic coherence against the number of passes over the dataset for INFVOC, MCMC, ONLINE, and VARIATIONAL.]

Figure 3: The average coherence score of topics on the de-news dataset for our approach against the INFVOC approach and other inference techniques (MCMC, VARIATIONAL), under different settings of decay rate κ and decay inertia τ, using the InfVoc LDA grammar in Table 1. The horizontal axis shows the number of passes over the entire dataset.11

Zhai and Boyd-Graber (2013) argue that online inference and topic models violate a fundamental assumption of online algorithms: new words are introduced as more data are streamed to the algorithm. They introduce an inference framework, INFVOC, to discover words from a Dirichlet process with a character n-gram base distribution.

We show that their complicated model and online inference can be captured and extended via an appropriate PCFG and our online adaptor grammar inference algorithm. Our extension of INFVOC generalizes their static character n-gram model, learning the base distribution (i.e., how words are composed from characters) from data. In contrast, their base distribution was learned from a dictionary as a preprocessing step and held fixed.

This is an attractive testbed for our online inference: within a topic, we can verify that the words we discover are relevant to the topic and that new words rise in importance in the topic over time if they are relevant. For these experiments, we treat each token (with its associated document pseudo-word −j) as a single sentence, and each minibatch contains only one sentence (token).

12 The plot in Figure 4 is generated with truncation size KTopic = 2k, minibatch size B = 1, truncation pruning interval u = 50, decay inertia τ = 256, and decay rate κ = 0.8. All PY hyperparameters are optimized.
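To illustrate the input format this implies, the sketch below converts one tokenized document into the per-token "sentences" fed to the InfVoc LDA grammar; the pseudo-word string used here is a hypothetical encoding for illustration, not necessarily the one in our implementation.

```python
def tokens_to_grammar_input(doc_id, tokens):
    """Each token becomes its own one-token 'sentence': the document pseudo-word
    (so the DOC_j rule can fire) followed by the token's characters."""
    return [["_DOC%d_" % doc_id] + list(token) for token in tokens]

# tokens_to_grammar_input(3, ["tax", "reform"])
# -> [['_DOC3_', 't', 'a', 'x'], ['_DOC3_', 'r', 'e', 'f', 'o', 'r', 'm']]
```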
[Figure 4 appears here: it shows the ranked word lists of this topic at minibatches 1k, 3k, 4k, 8k, 10k, 15k, 17k, 19k, and 20k, with newly added words (e.g., "pension", "deduct", "schroeder") entering at the bottom of the lists when first encountered and climbing in rank over time.]

Figure 4: The evolution of one topic—concerning tax policy—out of five topics learned using online adaptor grammar inference on the de-news dataset. Each minibatch represents a word processed by this online algorithm; time progresses from left to right. As the algorithm encounters new words (bottom), they can make their way into the topic. The numbers next to words represent their overall rank in the topic. For example, the word "pension" first appeared in minibatch 100, was ranked 229 after minibatch 400, and became one of the top 10 words in this topic after 2000 minibatches (tokens).12

Quantitatively, we evaluate the three inference schemes and the INFVOC approach13 on a collection of English daily news snippets (de-news),14 using the InfVoc LDA grammar (Table 1). For all approaches, we train the model with five topics and evaluate topic coherence (Newman et al., 2009), which correlates well with human ratings of topic interpretability (Chang et al., 2009). We collect co-occurrence counts from Wikipedia and compute the average pairwise pointwise mutual information (PMI) score between the top 10 ranked words of every topic. Figure 3 illustrates the PMI scores for all approaches. Our approach yields comparable or better results than all other approaches under most conditions.

13 Available at http://www.umiacs.umd.edu/˜zhaike/.
14 The de-news dataset is a randomly selected subset of 2.2k English documents from http://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/. It contains 6.5k unique types and over 200k word tokens. Tokenization and stemming are provided by NLTK (Bird et al., 2009).

Qualitatively, Figure 4 shows an example of topic evolution using the online adaptor grammar on the de-news dataset. The topic is about tax policy. The topic improves over time; words like "year", "tax", and "minist(er)" become more prominent. More importantly, the online approach discovers new words and incorporates them into the topic.
For example, “schroeder” (former German chancel- lor) first appeared in minibatch 300, was success- fully picked up by our model, and became one of the top ranked words in the topic. 6 Conclusion Probabilistic modeling is a useful tool in understand- ing unstructured data or data where the structure is latent, like language. However, developing these models is often a difficult process, requiring signifi- cant machine learning expertise. Adaptor grammars offer a flexible and quick way to prototype and test new models. Despite ex- pensive inference, they have been used for topic modeling (Johnson, 2010), discovering perspec- tive (Hardisty et al., 2010), segmentation (Johnson and Goldwater, 2009), and grammar induction (Co- hen et al., 2010). We have presented a new online, hybrid inference scheme for adaptor grammars. Unlike previous ap- proaches, it does not require extensive preprocess- ing. It is also able to faster discover useful structure in text; with further development, these algorithms could further speed the development and application of new nonparametric models to large datasets. Acknowledgments We would like to thank the anonymous reviewers, Kristina Toutanova, Mark Johnson, and Ke Wu for insightful discussions. This work was supported by NSF Grant CCF-1018625. Boyd-Graber is also supported by NSF Grant IIS-1320538. Any opin- ions, findings, conclusions, or recommendations ex- pressed here are those of the authors and do not nec- essarily reflect the view of the sponsor. References Nan Bernstein-Ratner. 1987. The phonology of parent child speech. Children’s language, 6:159–174. Steven Bird, Ewan Klein, and Edward Loper. 2009. Nat- ural Language Processing with Python. O’Reilly Me- dia. Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA. David M. Blei and Michael I. Jordan. 2005. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144. Benjamin Börschinger and Mark Johnson. 2012. Using rejuvenation to improve particle filtering for bayesian word segmentation. In Proceedings of the Association for Computational Linguistics. Léon Bottou. 1998. Online algorithms and stochastic approximations. In Online Learning and Neural Net- works. Cambridge University Press, Cambridge, UK. Jordan Boyd-Graber and David M. Blei. 2007. PUTOP: Turning predominant senses into a topic model for WSD. In 4th International Workshop on Semantic Evaluations. Michael R. Brent and Timothy A. Cartwright. 1996. Dis- tributional regularity and phonotactic constraints are useful for segmentation. volume 61, pages 93–125. Jonathan Chang, Jordan Boyd-Graber, and David M. Blei. 2009. Connections between the lines: Augment- ing social networks with text. In Knowledge Discovery and Data Mining. Jean-Cédric Chappelier and Martin Rajman. 2000. Monte-Carlo sampling for NP-hard maximization problems in the framework of weighted parsing. In Natural Language Processing, pages 106–117. Shay B. Cohen, David M. Blei, and Noah A. Smith. 2010. Variational inference for adaptor grammars. In Conference of the North American Chapter of the As- sociation for Computational Linguistics. Shay B. Cohen. 2011. Computational Learning of Prob- abilistic Grammars in the Unsupervised Setting. Ph.D. thesis, Carnegie Mellon University. Thomas Emerson. 2005. The second international chi- nese word segmentation bakeoff. In Fourth SIGHAN Workshop on Chinese Language, Jeju, Korea. Thomas S. Ferguson. 1973. 
A Bayesian analysis of some nonparametric problems. The Annals of Statis- tics, 1(2). Sharon Goldwater, Thomas L. Griffiths, and Mark John- son. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, pages 2335–2382, July. Eric Hardisty, Jordan Boyd-Graber, and Philip Resnik. 2010. Modeling perspective using adaptor grammars. In Proceedings of Emperical Methods in Natural Lan- guage Processing. Matthew Hoffman, David M. Blei, and Francis Bach. 2010. Online learning for latent Dirichlet allocation. In Proceedings of Advances in Neural Information Processing Systems. Matthew Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. In Journal of Machine Learning Research. Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor gram- mars. In Conference of the North American Chapter of the Association for Computational Linguistics. Mark Johnson, Thomas L. Griffiths, and Sharon Goldwa- ter. 2006. Adaptor grammars: A framework for speci- fying compositional nonparametric Bayesian models. In Proceedings of Advances in Neural Information Processing Systems. Mark Johnson, Thomas L. Griffiths, and Sharon Goldwa- ter. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Conference of the North Ameri- can Chapter of the Association for Computational Lin- guistics. Mark Johnson. 2010. PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. In Proceedings of the As- sociation for Computational Linguistics. Kenichi Kurihara, Max Welling, and Yee Whye Teh. 2007. Collapsed variational Dirichlet process mixture models. In International Joint Conference on Artifi- cial Intelligence. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Process- ing. The MIT Press, Cambridge, MA. David Mimno, Matthew Hoffman, and David Blei. 2012. Sparse stochastic inference for latent Dirichlet alloca- tion. In Proceedings of the International Conference of Machine Learning. Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested pitman-yor language modeling. In Proceedings of the Association for Computational Linguistics. Peter Müller and Fernando A. Quintana. 2004. Non- parametric Bayesian data analysis. Statistical Science, 19(1). Ramesh Nallapati, William Cohen, and John Lafferty. 2007. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDMW. David Newman, Sarvnaz Karimi, and Lawrence Cave- don. 2009. External evaluation of topic models. In Proceedings of the Aurstralasian Document Comput- ing Symposium. J. Pitman and M. Yor. 1997. The two-parameter Poisson- Dirichlet distribution derived from a stable subordina- tor. Annals of Probability, 25(2):855–900. Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In Pro- ceedings of the Association for Computational Lin- guistics. Erik B. Sudderth and Michael I. Jordan. 2008. Shared segmentation of natural scenes using depen- dent Pitman-Yor processes. In Proceedings of Ad- vances in Neural Information Processing Systems. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet pro- cesses. 
Journal of the American Statistical Associa- tion, 101(476):1566–1581. Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part- of-speech tagging. In Proceedings of Advances in Neural Information Processing Systems, pages 1521– 1528. Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and varia- tional inference. Foundations and Trends in Machine Learning, 1(1–2):1–305. Chong Wang and David M. Blei. 2012. Truncation-free online variational inference for Bayesian nonparamet- ric models. In Proceedings of Advances in Neural In- formation Processing Systems. Naiwen Xue, Fei Xia, Fu-dong Chiou, and Marta Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engi- neering. Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Knowledge Dis- covery and Data Mining. Ke Zhai and Jordan Boyd-Graber. 2013. Online latent Dirichlet allocation with infinite vocabulary. In Pro- ceedings of the International Conference of Machine Learning. Ke Zhai and Jason D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Proceedings of the Association for Computational Linguistics. Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mo- hamad Alkhouja. 2012. Mr. LDA: A flexible large scale topic modeling package using variational infer- ence in mapreduce. In Proceedings of World Wide Web Conference.