Parsing Algebraic Word Problems into Equations

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal†, Oren Etzioni†, and Siena Dumas Ang
University of Washington, †Allen Institute for AI
{kedzior,hannaneh,sienaang}@uw.edu, {ashishs,orene}@allenai.org

Abstract

This paper formalizes the problem of solving multi-sentence algebraic word problems as that of generating and scoring equation trees. We use integer linear programming to generate equation trees and score their likelihood by learning local and global discriminative models. These models are trained on a small set of word problems and their answers, without any manual annotation, in order to choose the equation that best matches the problem text. We refer to the overall system as ALGES. We compare ALGES with previous work and show that it covers the full gamut of arithmetic operations whereas Hosseini et al. (2014) only handle addition and subtraction. In addition, ALGES overcomes the brittleness of the Kushman et al. (2014) approach on single-equation problems, yielding a 15% to 50% reduction in error.

1 Introduction

Grade-school algebra word problems are brief narratives (see Figure 1). A typical problem first describes a partial world state consisting of characters, entities, and quantities. Next it updates the condition of an entity or explicates the relationship between entities. Finally, it poses a question about a quantity in the narrative.

An ordinary child has to learn the required algebra, but will easily grasp the narrative utilizing extensive world knowledge, large vocabulary, word-sense disambiguation, coreference resolution, mastery of syntax, and the ability to combine individual sentences into a coherent mental model. In contrast, the challenge for an NLP system is to “make sense” of the narrative, which may refer to arbitrary activities like renting bikes, collecting coins, or eating cookies.

Figure 1: Example problem and solution.
  Problem: Oceanside Bike Rental Shop charges 17 dollars plus 7 dollars an hour for renting a bike. Tom paid 80 dollars to rent a bike. How many hours did he pay to have the bike checked out?
  Equation tree: the root = has children +$ and 80$; the node +$ has children 17$ and ∗$; the node ∗$ has children 7$ and x h.
  Equation: 17 + (7∗x) = 80; solution: 9.

Previous work coped with the open-domain aspect of algebraic word problems by relying on deterministic state transitions based on verb categorization (Hosseini et al., 2014) or by learning templates that cover equations of particular forms (Kushman et al., 2014). We have discovered, however, that both approaches are brittle, particularly as training data is scarce in this domain, and the space of equations grows exponentially with the number of quantities mentioned in the math problem.

We introduce ALGES,¹ which maps an unseen multi-sentence algebraic word problem into a set of possible equation trees. Figure 1 shows an equation tree alongside the word problem it represents.

¹ The code and data are publicly available at https://gitlab.cs.washington.edu/ALGES/TACL2015

ALGES generates the space of trees via Integer Linear Programming (ILP), which allows it to constrain the space of trees to represent type-consistent algebraic equations satisfying as many desirable properties as possible. ALGES learns to map spans of text to arithmetic operators, to combine them given the global context of the problem, and to choose the “best” tree corresponding to the problem. The training set for ALGES consists of unannotated algebraic word problems and their solutions. Solving the equation represented by such a tree is trivial. ALGES is described in detail in Section 4.

ALGES is able to solve word problems with single-variable equations like the ones in Figure 1. In contrast to Hosseini et al. (2014), ALGES covers +, −, ∗, and /. The work of Kushman et al. (2014) has broader scope but we show that it relies heavily on overlap between training and test data.
When that overlap is reduced, ALGES is 15% to 50% more accurate than this system.

Our contributions are as follows: (1) We formalize the problem of solving multi-sentence algebraic word problems as that of generating and ranking equation trees; (2) We show how to score the likelihood of equation trees by learning discriminative models trained from a small number of word problems and their solutions, without any manual annotation; and (3) We demonstrate empirically that ALGES has broader scope than the system of Hosseini et al. (2014), and overcomes the brittleness of the method of Kushman et al. (2014).

2 Previous Work

Our work is related to situated semantic interpretation, which aims to map natural language sentences to formal meaning representations (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Ge and Mooney, 2006; Kwiatkowski et al., 2010). More closely related is work on language grounding, whose goal is the interpretation of a sentence in the context of a world representation (Branavan et al., 2009; Liang et al., 2009; Chen et al., 2010; Bordes et al., 2010; Feng and Lapata, 2010; Hajishirzi et al., 2011; Matuszek et al., 2012; Hajishirzi et al., 2012; Artzi and Zettlemoyer, 2013; Koncel-Kedziorski et al., 2014; Yatskar et al., 2014; Seo et al., 2014; Hixon et al., 2015). However, while most previous work considered individual sentences in isolation, solving word problems often requires reasoning across the multi-sentence discourse of the problem text.

Recent efforts in the math domain have studied number word problems (Shi et al., 2015), logic puzzle problems (Mitra and Baral, 2015), arithmetic word problems (Hosseini et al., 2014; Roy et al., 2015a), algebra word problems (Kushman et al., 2014; Zhou et al., 2015), and geometry word problems (Seo et al., 2015).

Roy et al.
(2015b) introduce a system for reasoning about quantities, which they extend to arithmetic word problems that can be solved by choosing only two values from the text and applying an arithmetic operator. By comparison, our method learns to solve complex problems with many operands where the space of possible solutions is larger.

Hosseini et al. (2014) solve elementary addition and subtraction problems by learning verb categories. They ground the problem text to a semantics of entities and containers, and decide if quantities are increasing or decreasing in a container based upon the learned verb categories. While relying only on verb categories works well for + and −, modeling ∗ and / requires going beyond verbs. For instance, “Tina has 2 cats. John has 3 more cats than Tina. How many cats do they have together?” and “Tina has 2 cats. John has 3 times as many cats as Tina. How many cats do they have together?” have identical verbs, but the indicated operations (+ and ∗, respectively) differ. ALGES makes use of a richer semantic representation which facilitates deeper learning and a wider scope of application, solving problems involving the +, −, /, and ∗ operators (see Table 6).

Kushman et al. (2014) introduce a general method for solving algebra problems. This work can align a word problem to a system of equations with one or two unknowns. They learn a mapping from word problems to equation templates using global and local features from the problem text. However, the large space of equation templates makes it challenging for this model to learn to find the best equation directly, as a sufficiently similar template may not have been observed during training. Instead, our method maps word problems to equation trees, taking advantage of a richer representation of quantified nouns and their properties, as well as the recursive nature of equation trees.
These allow ALGES to use a bottom-up approach to learn the correspondence between spans of text and arithmetic operators (corresponding to intermediate nodes in the tree). ALGES then scores equations using the global structure of the problem to produce the final result.

Our work is also related to research in using ILP to enforce global constraints in NLP applications (Roth and Yih, 2004). Most previous work (Srikumar and Roth, 2011; Goldwasser and Roth, 2011; Berant et al., 2014; Liu et al., 2015) utilizes ILP as an inference procedure to find the best global prediction over initially trained local classifiers. Similarly, we use ILP to enforce global and domain-specific constraints. We, however, use ILP to form candidate equations which are then used to generate training data for our classifiers. Our work is also related to parser re-ranking (Collins, 2005; Ge and Mooney, 2005), where a re-ranker model attempts to improve the output of an existing probabilistic parser. Similarly, the global equation model designed in ALGES attempts to re-rank equations based on global problem structure.

3 Setup and Problem Definition

Given numeric quantities V and an unknown x whose value is the answer we seek, an equation over V and x is any valid mathematical expression formed by combining elements of V ∪ {x} using binary operators from O = {+, −, ∗, /, =} such that x appears exactly once. When each element of V appears at most once in the equation, it may naturally be represented as an equation tree where each operator is a node with edges to its two operands.² T denotes the set of all equation trees over V and x.

Problem Formulation. We address the problem of solving grade-school algebra word problems that map to single equations. Solving such a word problem w amounts to selecting an equation tree t representing the mathematical computation implicit in w. Figure 1 shows an example of w with quantities underlined, and the corresponding tree t.
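As a rough illustration of the equation-tree representation defined in this section, the following sketch encodes the tree of Figure 1 as nested tuples and checks that the stated solution satisfies it. The tuple encoding and the helper names (`to_infix`, `evaluate`, `satisfies`) are assumptions made for this example, not part of ALGES.

```python
from fractions import Fraction

# Hypothetical encoding: a leaf is a number or the string "x" (the unknown);
# an internal node is a tuple (operator, left_subtree, right_subtree).
FIGURE1_TREE = ("=", ("+", 17, ("*", 7, "x")), 80)


def to_infix(node):
    """Render an equation tree as a parenthesized infix string."""
    if not isinstance(node, tuple):
        return str(node)
    op, left, right = node
    text = f"{to_infix(left)} {op} {to_infix(right)}"
    return text if op == "=" else f"({text})"


def evaluate(node, x):
    """Evaluate an expression subtree with the unknown bound to x."""
    if node == "x":
        return Fraction(x)
    if not isinstance(node, tuple):
        return Fraction(node)
    op, left, right = node
    a, b = evaluate(left, x), evaluate(right, x)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a / b
    raise ValueError(f"unexpected operator inside an expression: {op}")


def satisfies(tree, x):
    """Check whether binding the unknown to x makes both sides of '=' equal."""
    op, left, right = tree
    if op != "=":
        raise ValueError("the root of an equation tree must be '='")
    return evaluate(left, x) == evaluate(right, x)


print(to_infix(FIGURE1_TREE))      # (17 + (7 * x)) = 80
print(satisfies(FIGURE1_TREE, 9))  # True
```

The check substitutes a candidate value rather than solving for x, which is enough to confirm the solution given in Figure 1.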
Formally, we use a joint probability distribution p(t,w) that defines how “well” an equation tree t ∈ T captures the mathematical computation expressed in w. Given a word problem w as input, our goal is to compute $\tilde{t} = \arg\max_{t \in T} p(t \mid w)$.

² Problems involving simultaneous equations require combining multiple equation trees, one per equation.

Figure 2: An overview of the process of learning for a word problem and its Qsets.
  Example problem w: On Monday, 375 students went on a trip to the zoo. All 7 buses were filled and 4 students had to travel in cars. How many students were in each bus?
  1. Ground text w into base Qsets (Section 5): (Qnt: 375, Ent: Student), (Qnt: 7, Ent: Bus), (Qnt: 4, Ent: Student), (Qnt: x, Ent: Student, Ctr: Bus).
  2. Use ILP to generate M equation trees T(w) (Section 6). $T_\ell(w)$, the subset of T(w) yielding the correct solution, contains 375 − (7∗x) = 4, 375 = (7∗x) + 4, and 375 = (x∗7) + 4; $T(w) \setminus T_\ell(w)$ contains 375 + (7∗x) = 4, 375 = (7/x) + 4, and 375 − (x+7) = 4.
  3. Train local model (Section 7.1). $Tr_{\text{local}}$: operator nodes in $T_\ell(w)$. Training examples and labels: (7b, xs) labeled ∗; (375s, combine(7b, xs)) labeled −; (7b, xs) labeled ∗; (combine(7b, xs), 4s) labeled +.
  4. Train global model (Section 7.2). $Tr_{\text{global}}$: problem-tree pairs, with positive examples from $T_\ell(w)$ and negative examples from $T(w) \setminus T_\ell(w)$.

An exhaustive enumeration over T quickly becomes impractical as problem complexity increases and n = |V ∪ {x}| grows. Specifically, $|T| > h(n) = n!\,(n-1)!\,(n-1)\,2^{n-4}$, with h(4) = 432, h(6) > 1.7M, h(8) > 22B, etc. This vast search space makes it challenging for a discriminative model to learn to find $\tilde{t}$ directly, as a sufficiently similar tree may not have been observed during training.
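The growth of this bound can be checked numerically. The short sketch below reads the bound as h(n) = n! (n−1)! (n−1) 2^(n−4), which reproduces the three quoted values; the restriction to n ≥ 4 is an assumption of this example, made only to keep the result an integer.

```python
from math import factorial


def h(n):
    """Lower bound h(n) = n! (n-1)! (n-1) 2^(n-4) on the number of equation trees."""
    if n < 4:
        # Assumption of this sketch: the text only quotes values for n >= 4.
        raise ValueError("h(n) is only evaluated here for n >= 4")
    return factorial(n) * factorial(n - 1) * (n - 1) * 2 ** (n - 4)


for n in (4, 6, 8):
    print(n, h(n))
# 4 432
# 6 1728000
# 8 22759833600
```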
Instead, our method first generates syntactically valid equation trees, and then uses a bottom-up approach to score equations with a local model trained to map spans of text to math operators, and a global model trained for coherence of the entire equation w.r.t. global problem text structure.

4 Overview of the Approach

Figure 2 gives an overview of our method, also detailed in Figure 3. In order to build equation trees, we use a compact representation for each node called a Quantified Set or Qset to model natural language text quantities and their properties (e.g., ‘375 students’ in ‘7 buses’). Qsets are used for tracking and combining quantities when learning the correspondence between equation trees and text.

Learning (word problems W, corresponding solutions L):
  1. For every problem-solution pair $(w_i, \ell_i)$ with $w_i \in W$, $\ell_i \in L$:
     (a) S ← base Qsets obtained by grounding text $w_i$ and reordering the resulting Qsets (Section 5)
     (b) $T_i$ ← top M type-consistent equation tree candidates generated by ILP($w_i$) (Section 6)
     (c) $T_{\ell_i}$ ← subset of $T_i$ that yields the correct numerical solution $\ell_i$
     (d) Add to $Tr_{\text{local}}$ features ⟨$s_1$, $s_2$⟩ with label op for each operator op combining Qsets $s_1$, $s_2$ in trees in $T_{\ell_i}$
     (e) Add to $Tr_{\text{global}}$ features ⟨$w_i$, t⟩ labeled positive for each $t \in T_{\ell_i}$ and labeled negative for each $t \in T_i \setminus T_{\ell_i}$
  2. $L_{\text{local}}$ ← train a local Qset relationship model on $Tr_{\text{local}}$ (Section 7.1)
  3. $G_{\text{global}}$ ← train a global equation model on $Tr_{\text{global}}$ (Section 7.2)
  4. Output local and global models ($L_{\text{local}}$, $G_{\text{global}}$)

Inference (word problem w, local set relation model $L_{\text{local}}$, global equation model $G_{\text{global}}$):
  1. S ← base Qsets obtained by grounding text w and reordering the resulting Qsets (Section 5)
  2. T ← top M type-consistent equation tree candidates generated by ILP(w) (Section 6)
  3. $t^* \leftarrow \arg\max_{t_i \in T} \big( \prod_{t_j \in t_i} L_{\text{local}}(t_j \mid w) \big) \times G_{\text{global}}(t_i \mid w)$, scoring each tree $t_i \in T$ based on Equation 1
  4. ℓ ← numeric solution to w obtained by solving equation tree $t^*$ for the unknown
  5. Output ($t^*$, ℓ)

Figure 3: Overview of our method for solving algebraic word problems.

Definition 1. Given a math word problem w, let S be the set of all possible spans of text in w, φ denote the empty span, and $S_\phi = S \cup \{\phi\}$. A Qset for w is either a base Qset or a compound Qset. A base Qset is a tuple (ent, qnt, adj, loc, vrb, syn, ctr) with:
• ent ∈ S: entity or quantity noun (e.g., ‘student’);
• qnt ∈ $\mathbb{R} \cup \{x\}$: number or quantity (e.g., 4 or x);
• adj ⊆ $S_\phi$: adjectives for ent in w;
• loc ∈ $S_\phi$: location of ent (e.g., ‘in the drawer’);
• vrb ∈ $S_\phi$: governing verb for ent (e.g., ‘fill’);
• syn: syntactic and positional information for ent (e.g., ‘buses’ is in subject position);
• ctr ⊆ $S_\phi$: containers of ent (e.g., ‘Bus’ is a container for the ‘students’ Qset).
A property equal to φ indicates that this optional property is unspecified. A compound Qset is formed by combining two Qsets with a non-equality binary operator as discussed in Section 5.

Qsets can be further combined with the equality operator to yield a semantically augmented equation tree.³ The example in Figure 2 has four base Qsets extracted from the problem text. Each possible equation tree corresponds to a different recursive combination of these four Qsets.

³ Inspired by Semantically Augmented Parse Trees (Ge and Mooney, 2005) adapted to equational logic.

Given w, ALGES first extracts a list of n base Qsets $S = \{s_1, \ldots, s_n\}$ (Section 5). It then uses an ILP-based optimization method to combine extracted Qsets into a list of type-consistent candidate equation trees (Section 6). Finally, ALGES uses discriminative models to score each candidate equation, using both local and global features (Section 7).
Specifically, the recursive nature of our representation allows us to decompose the likelihood function p(t,w) into local scoring functions for each internal node of t followed by scoring the root node:

$$p(t \mid w) \propto \Big( \prod_{t_j \in t} L_{\text{local}}(t_j \mid w) \Big) \times G_{\text{global}}(t \mid w) \qquad (1)$$

where the local function $L_{\text{local}}(t_j \mid w)$ scores the likelihood of the subtree $t_j$, modeling pairwise Qset relationships, while the global function $G_{\text{global}}(t \mid w)$ scores the likelihood of the root of t, modeling the equation in its entirety.

Learning. ALGES learns in a weakly supervised fashion, using word problems $w_i$ and only their correct answer $\ell_i$ (not the corresponding equation tree) as training data $\{(w_i, \ell_i)\}_{i \in \{1, \ldots, N\}}$. We ground each $w_i$ into ordered Qsets and generate a list of type-consistent candidate training equations $T_{\ell_i}$ that yield the correct answer $\ell_i$.

We build a local discriminative model $L_{\text{local}}$ to score the likelihood that a math operator op ∈ O can correctly combine two Qsets $s_1$ and $s_2$ based on their semantics and intertextual relationships. For example, in Figure 2 this model learns that ∗ has a high likelihood score for ‘7 buses’ and ‘x students’. The training data consists of feature vectors ⟨$s_1$, $s_2$⟩ labeled with op, derived from the equation trees that yield the correct solution.

We also build a global discriminative model that scores equation trees based on the global problem structure: $G_{\text{global}} = \psi^\top f_{\text{global}}(w, t)$, where $f_{\text{global}}$ represents global features of w and t, and ψ are parameters to be learned. The training data consists of feature vectors ⟨w, t⟩ for equation trees that yield the correct solution as positive examples, and the rest as negatives (Figure 2). The details of the learning and inference steps are described in Section 7.

5 Grounding and Combining Qsets

We discuss how word problem text is grounded into an ordered list of Qsets. A Qset is a compact representation of the properties of a quantity as described in a single sentence.
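To make the base Qset of Definition 1 concrete, the following is a minimal sketch of one possible data structure, populated with the four base Qsets shown in Figure 2. The class name, the use of `None` and empty sets for φ, and the `container_matches` helper are assumptions of this example; only the qnt, ent, and ctr values come from the figure.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass(frozen=True)
class BaseQset:
    """Hypothetical container for the tuple (ent, qnt, adj, loc, vrb, syn, ctr)."""
    ent: str                             # entity noun, e.g. "student"
    qnt: Union[int, float, str]          # a number, or "x" for the unknown
    adj: FrozenSet[str] = frozenset()    # adjectives; empty set stands for phi
    loc: Optional[str] = None            # location; None stands for phi
    vrb: Optional[str] = None            # governing verb
    syn: Optional[str] = None            # syntactic/positional information
    ctr: FrozenSet[str] = frozenset()    # containers of ent

    @property
    def is_target(self):
        """The target Qset is the one whose quantity is the unknown."""
        return self.qnt == "x"


def container_matches(inner, outer):
    """True if the entity of `outer` is a container of `inner`."""
    return outer.ent in inner.ctr


# The four base Qsets of the Figure 2 example, in text order.
FIGURE2_QSETS = [
    BaseQset(ent="student", qnt=375),
    BaseQset(ent="bus", qnt=7),
    BaseQset(ent="student", qnt=4),
    BaseQset(ent="student", qnt="x", ctr=frozenset({"bus"})),
]

target = next(q for q in FIGURE2_QSETS if q.is_target)
print(container_matches(target, FIGURE2_QSETS[1]))  # True
```

A frozen dataclass is used here so that Qsets are hashable and can serve as keys when collecting training examples; this is a design choice of the sketch, not something the paper specifies.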
The use of Qsets facilitates the building of semantically augmented equation trees. Additionally, by tracking certain properties of text quantities, ALGES can resolve pronominal references or elided nouns to properties of previous Qsets. It can also combine information about quantities referenced in different sentences into a single semantic structure for further use.

Grounding. ALGES translates the text of the problem w into interrelated base Qsets $\{s_1, \ldots, s_n\}$, each associated with a quantity in the problem text w. The properties of each Qset (Definition 1) are extracted from the dependency parse relations present in the sentence where the quantity is referred to, according to the rules described in Table 1.

Additionally, ALGES assigns a single target Qset $s_x$ corresponding to the question sentence. The properties of the target Qset are also extracted according to the rules of Table 1. In particular, the qnt property is set to unknown, the ent is set to the noun appearing after the words what, many, or much in the target sentence, and the other properties are extracted as listed in Table 1.

For each quantity mentioned in the text, properties (qnt, ent, ctr, adj, vrb, loc) of the corresponding Qset are extracted as follows:
  1. qnt (quantity) is a numerical value or determiner found in the problem text, or a variable.
  2. ent (entity) is a noun related to the qnt in the dependency parse tree. If qnt is a numerical value, ent is the noun related by the num, number, or prep_of relations. If qnt is a determiner, ent is the noun related via the det relation. When such a noun does not exist due to parse failure or pragmatic recoverability, ent is the noun that is closest to qnt in the same sentence or the ent associated with the most recent Qset.
  3. ctr (container) is the subject of the verb governing ent, except in two cases: when this subject is a pronominal reference, the ctr is set to the ctr of the closest previous Qset; if ent is related to another Qset whose qnt is one of each, every, a, an, per, or one, ctr is set to the ent of that Qset.
  4. adj (adjectives) is a list of adjectives related to ent by the amod relation.
  5. vrb (verb) is a governing verb, related to ent by either nsubj or dobj.
  6. loc (location) is a noun related to ent by the prep_on, prep_in, or prep_at relations.

Table 1: The process of forming a single Qset.

Reordering. In order to reduce the space of possible equation trees, ALGES reorders Qsets $\{s_1, \ldots, s_n\}$ according to semantic and textual information and enforces a constraint that Qsets can only combine with adjacent Qsets in the equation tree. In Figure 2, the target Qset corresponding to the unknown (x ‘students’) is moved from its textual location at the end of the problem and placed adjacent to the Qset with entity ‘buses’. This move is triggered by the relationship between the target entity ‘student’ and its container ‘bus’ that is quantified by each in the last sentence. In addition to the container match rule, we employ three other rules to move the target Qset as described in Table 2.⁴

⁴ These reordering rules are intentionally minimal, but do provide some gain over both preserving the text ordering of quantities and setting ordering as a soft constraint. See Table 7.

  1. Move Qset $s_i$ to immediately after Qset $s_j$ if the container of $s_i$ is the entity of $s_j$ and is quantified by each.
  2. Move target Qset to the front of the list if the question statement includes keywords start or begin.
  3. Move target Qset to the end of the list if the question statement includes keywords left, remain, or finish.
  4. Move target Qset to the textual location of an intermediate reference with the same ent if its num property is the determiner some.

Table 2: Rules for reordering Qsets.

Combining. Two Qsets and an arithmetic operator can be combined via the combine function to form a third Qset, alternately referred to as a compound. Because of this, we can represent intermediate nodes in the equation tree as Qsets themselves. The recursive combination of Qsets allows us to effectively decompose equation trees into a collection of local operations over identical abstractions. This enables learning features of Qsets and text that indicate particular operations from both leaf and intermediate nodes. The mechanics of c ← combine(a, b, op) are detailed below.

For op = +, the properties of either Qset a or b suffice to define c. ALGES always forms c using the properties of b in these situations. For op = −, the properties of the left operand a define the resultant set, as evidenced by the subtraction operations present in the first problem in Table 9. To determine the stickers in Luke’s possession, we need to track stickers related to the left Qset with the verb ‘got’. For op = ∗, the Qset relationship is captured by the container and entity properties: the Qset whose properties are preserved after multiplication is the one that has the other’s entity as its container. In Figure 2, the ‘bus’ Qset is the container of ‘students’. When these are combined with the ∗ operator, the result is of entity type ‘student’. For op = /, we use the properties of the left operand to encourage a distinction between division and multiplication.

6 Generating Equation Trees with ILP

We use an ILP optimization model to generate equation trees involving n base Qsets. These equation trees are used for both learning and inference steps. ALGES generates an ordered list of M of the most desirable candidate equations for a given word problem w using an ILP, which models global considerations such as type consistency and appropriately low expression complexity.
To facilitate generation of equation trees, we represent them in parenthesis-free postfix or reverse Polish notation, where a binary operator immediately follows the two operands it operates on (e.g., abc+∗x=).

Given a word problem w with n base Qsets (cf. Table 3 for notation), we build an optimization model ILP(w) over the space of postfix equations $E = e_1 e_2 \ldots e_L$ of length L involving k numeric constants, k′ = n − k unknowns, r possible binary operators, and q “types” of Qsets, where type corresponds to the entity property of Qsets and determines which binary relationships are permitted between two given Qsets. For single-variable equations over binary operators O, k′ = 1, r = |O| = 5, and L = 2n − 1. For brevity, define m = n + r and let [j] denote {1, ..., j}. Expression E can be evaluated by considering $e_1, e_2, \ldots, e_L$ in order, pushing non-operator symbols on to a stack σ, and, for operator symbols, popping the top two elements of σ, applying the operator to them, and pushing the result back on to σ. The stack depth of $e_i$ is the stack size after $e_i$ has been processed this way.

INPUT
  w          input math word problem
  n          number of base Qsets
  k          number of numeric constants
  k′         number of unknowns (1 for single-variable equations)
  r          number of binary operators (r = |O| = 5)
  m          number of possible symbols (n + r)
  $\text{type}_j$    type of j-th base Qset
  M          desired number of candidate equation trees
  L          desired length of postfix equations (2n − 1)
OUTPUT
  E          postfix equation to be generated
  $e_i$      i-th element of E; i ∈ [L]
VARIABLES (for i ∈ [L])
  $x_i$      main ILP variable for i-th symbol of E
  $c_i$      indicator variable: $e_i$ is a numeric constant
  $u_i$      indicator variable: $e_i$ is an unknown
  $o_i$      indicator variable: $e_i$ is an operator
  $d_i$      postfix stack depth of $e_i$; $d_i$ ∈ [L]
  $t_i$      type of $e_i$ (corresponds to Qset entity); $t_i$ ∈ [q]

Table 3: ILP notation for candidate equations model.

Variables. Integer variables $x_1, x_2, \ldots, x_L$ encode which symbol each $e_i$ refers to.
Their domain, [m], represents the k numeric constants in the same order as their respective Qsets, followed by the k′ unknowns, and finally operators in the order +, −, ∗, /, =. Binary variables $c_i$, $u_i$, and $o_i$ indicate whether $e_i$ is a numeric constant, unknown, or operator, respectively. Variables $d_i$ with domain [L] equal the postfix stack depth of $e_i$. Finally, variables $t_i$ with domain [q] indicate the type of $e_i$. For j ∈ [n], i.e., for the k constants and k′ unknowns, $\text{type}_j$ ∈ [q] denotes the respective Qset entity. Uncertainty in object types may be incorporated easily by treating $\text{type}_j$ as a (potentially weighted) subset of [q].

Constraints and Objective Function. Constraints in ILP(w) include syntactic validity, type consistency, and domain-specific simplicity considerations. We describe them briefly here, leaving details to the Appendix. The objective function minimizes the sum of the weights of violated soft constraints. Below, (H) denotes hard constraints, (W) weighted soft constraints, and (P) post-processing steps.

Definitional Constraints (H): Constraints over indicator variables $c_i$, $u_i$, and $o_i$ ensure they represent their intended meaning, including the invariant $c_i + u_i + o_i = 1$. For stack depth variables, we add $d_1 = 1$ and $d_i = d_{i-1} - 2o_i + 1$ for i > 1.

Syntactic Validity (H): Validity of the postfix expression is enforced easily through constraints $o_1 = 0$ and $d_L = 1$. In addition, we add $x_L = m$ and $x_i < m$ for i < L to ensure equality occurs exactly once and as the top-level operator.

Operand Access (H): The second operand of an operator symbol $e_i$ is always $e_{i-1}$. Its first operand, however, is defined instead by the stack-based evaluation process. ILP(w) encodes it using an alternative characterization: the first operand of $e_i$ is $e_j$ iff j ≤ i − 2 and j is the largest index such that $d_i = d_j$.

Type Consistency (W): Suppose $T_1$ and $T_2$ are the types of the two operands of an operator o, whose type is $T_o$.
Addition and subtraction preserve the type of their operands, i.e., if o is + or −, then $T_o = T_1 = T_2$. Multiplication inherits the type of one of its operands, and division inherits the type of its first operand. In both cases, the two operands must be of different types. Formally, if o is ∗, then $T_o \in \{T_1, T_2\}$ and $T_1 \neq T_2$; if o is /, then $T_o = T_1 \neq T_2$.

Domain Considerations (H,W): We add a few domain-specific constraints based on patterns observed in a small subset of the questions. These include an upper bound on the stack depth, which helps avoid overly complex expressions unsuitable for grade-school algebra, and reducing redundancy by, e.g., disallowing the numeric constant 0 to be an operand of + or − or the second operand of /.

Symmetry Breaking (H,W): If a commutative operator is preceded by two numeric constants (e.g., ab+), we require the constants to respect their Qset ordering. Every other pair of constants that disrespects its Qset ordering incurs a small penalty.

Negative and Fractional Answers (P): Rather than imposing non-negativity as a complex constraint in ILP(w), we filter out candidate expressions yielding a negative answer as a post-processing step. Similarly, when all numeric constants are integers, we filter out expressions yielding a fractional answer, again based on typical questions in our datasets.

7 Learning

Our goal is to learn a scoring function that identifies the best equation tree $t^*$ corresponding to an unseen word problem w. Since our dataset consists only of problem-solution pairs $\{(w_i, \ell_i)\}_{i=1,\ldots,N}$, training our scoring models requires producing equation trees matching $\ell_i$. For every training instance $(w_i, \ell_i)$, we use ILP($w_i$) to generate M type-consistent equation tree candidates $T_i$.
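The postfix representation of Section 6 makes it straightforward to sketch how candidates can be checked against a known answer (step 1(c) of Figure 3). The example below computes the stack depths via the recurrence $d_i = d_{i-1} - 2o_i + 1$, tests syntactic validity, and keeps the Figure 2 candidates that hold for the answer 53. It is an illustration only: the token encoding and function names are hypothetical, and it substitutes the known answer for x instead of solving the equation symbolically as the system described here does.

```python
from fractions import Fraction

ARITH = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
OPERATORS = set(ARITH) | {"="}


def stack_depths(tokens):
    """Depth after each symbol: +1 for an operand, -1 for a binary operator."""
    depths, depth = [], 0
    for tok in tokens:
        depth += -1 if tok in OPERATORS else 1
        depths.append(depth)
    return depths


def is_valid_postfix(tokens):
    """Hard syntactic checks: depths stay >= 1, end at 1, '=' only at the end."""
    depths = stack_depths(tokens)
    return (
        len(tokens) > 0
        and min(depths) >= 1
        and depths[-1] == 1
        and tokens[-1] == "="
        and "=" not in tokens[:-1]
    )


def satisfied_by(tokens, answer):
    """True if the postfix equation holds when the unknown "x" equals answer."""
    if not is_valid_postfix(tokens):
        return False
    stack = []
    try:
        for tok in tokens:
            if tok in OPERATORS:
                right, left = stack.pop(), stack.pop()
                if tok == "=":
                    return left == right
                stack.append(ARITH[tok](left, right))
            else:
                stack.append(Fraction(answer) if tok == "x" else Fraction(tok))
    except ZeroDivisionError:
        return False
    return False


# Candidates from Figure 2, written in postfix; the correct answer is 53.
CANDIDATES = [
    [375, 7, "x", "*", "-", 4, "="],   # 375 - (7*x) = 4
    [375, 7, "x", "*", 4, "+", "="],   # 375 = (7*x) + 4
    [375, 7, "x", "*", "+", 4, "="],   # 375 + (7*x) = 4
    [375, 7, "x", "/", 4, "+", "="],   # 375 = (7/x) + 4
    [375, "x", 7, "+", "-", 4, "="],   # 375 - (x+7) = 4
]
correct = [c for c in CANDIDATES if satisfied_by(c, 53)]
print(len(correct))  # 2
```

Exact rational arithmetic is used so that the equality test at the root is not affected by floating-point rounding.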
To train our local model (Section 7.1), we filter out trees from $T_i$ that do not evaluate to $\ell_i$, extract all ($s_1$, $s_2$, op) triples from the remaining trees, and use feature vectors capturing ($s_1$, $s_2$) and labeled with op as training data (see Figure 2). For the global model, we use for training data a subset of $T_i$ with an equal number of correct and incorrect equation trees (Section 7.2). Once trained, we use Equation 1 to combine these models to compute a score for each candidate equation tree generated for an unseen word problem at inference time (see Figure 3).

7.1 Local Qset Relationship Model

We train a local model of a probability distribution over the math operators that may be used to combine a pair of Qsets. The idea is to learn the correspondence between spans of text and math operators by examining such texts and the Qsets of the involved operands. Given Qsets $s_1$ and $s_2$, the local scoring function scores the probability of each op ∈ {+, −, ∗, /}, i.e., $L_{\text{local}} = \theta^\top f_{\text{local}}(s_1, s_2)$, where $f_{\text{local}}$ is a feature vector for $s_1$ and $s_2$. Note that either Qset may be a compound (the result of a combine procedure). The goal is to learn parameters θ by maximizing the likelihood of the operators between every two Qsets that we observe in the training data. We model this as a multi-class SVM with an RBF kernel.

Features. Given the richness of the textual possibilities for indicating a math operation, the features are designed over semantic and intertextual relationships between Qsets, as well as domain-specific lexical features. The feature vector includes three main feature categories (Figure 4).

  1. Single Qset features (repeated for B)
     • what argument of its governing verb is A?
     • is A a subset of another set?
     • is A a compound?
     • math keywords found in context of A
     • verb Lin distance from known verb categories (B only)
  2. Relational features between Qsets A and B
     • entity match
     • adjective overlap
     • location match
     • distance in text
     • Lin similarity between verbs governing A and B
     • is one a subtype of the other?
     • does one contain the other?
  3. Target quantity features
     • A/B is target Qset
     • A/B entity matches target entity
     • math keywords in target context
  4. Root node features
     • number of ILP constraints violated by equation
     • scores of left and right subtrees of root

Figure 4: Features used for local and global models, for left Qset A and right Qset B.

First, single set features include syntactic and positional features of individual Qsets. For example, they include indicator features for whether elements of a short lexicon of math-specific terms such as ‘add’ and ‘times’ appear in the vicinity of the set reference in the text. Also, following Hosseini et al. (2014), we include a vector that captures the distance between the verbs associated with each Qset and a small collection of verbs found to be useful in categorizing arithmetic operations in that work, based upon their Lin Similarity (Lin, 1998).

Second, relationships between Qsets are described w.r.t. various Qset properties described in Section 4. These include binary features like whether one Qset’s container property matches the other Qset’s entity (a strong indicator of multiplication), or the distance between the verbs associated with each set based upon their Lin Similarity.

Third, target quantity features check the matching between the target Qset and the current Qset as well as math keywords in the target sentence.

7.2 Global Equation Model

We also train a global model that scores equation trees based on the global structure of the tree and the problem text. The global model scores the compatibility of the tree with the soft constraints introduced in Section 6 as well as its correspondence with the problem text. We use a discriminative model: $G_{\text{global}} = \psi^\top f_{\text{global}}(w, t)$, where $f_{\text{global}}$ are the features capturing trees and their correspondences with the problem text.
We train a global classifier to relate these features through parameters ψ. Features fglobal are explained in Figure 4. They include the number of violated soft constraints in the ILP, the probabilities of the left and right subtrees of the root as provided by the local model, and global lexical features. Additionally, the three local feature sets are applied to the left and right Qsets.

7.3 Inference

For an unseen problem w, we first extract base Qsets from w. The goal is to find the most likely equation tree with minimum violation of hard and soft constraints. Using ILP(w) over these Qsets, we generate M candidate equation trees ordered by the sum of the weights of the constraints they violate. We compute the likelihood score given by Eqn. (1) for each candidate equation tree t, use this as an estimate of the likelihood p(t|w), and return the candidate tree t∗ with the highest score. In Eqn. (1), the score of t is the product of the likelihood scores given by the local classifier for each operator in t and the Qsets over which it operates, multiplied by the likelihood score given by the global classifier for the correctness of t. If the resulting equation provides the correct answer for w, we consider inference successful.

8 Experiments

This section reports on three experiments: a comparison of ALGES with the template-based method of Kushman et al. (2014), a comparison of ALGES with the verb-categorization method of Hosseini et al. (2014), and ablation studies. The experiments are complicated by the fact that ALGES is limited to single equations, and the verb-categorization method can only handle single equations without multiplication or division. Our main experimental result is to show an improvement over the template-based method on single-equation algebra word problems. We further show that the template-based method depends on lexical and template overlap between its training and test sets.
When these overlaps are reduced, the method’s accuracy drops sharply. In contrast, ALGES is quite robust to changes in lexical and template overlap (see Tables 4 and 5).

Experimental Setup. We use the Stanford Dependency Parser in CoreNLP 3.4 (De Marneffe et al., 2006) to obtain syntactic information used for grounding and feature computation. For the ILP model, we use CPLEX 12.6.1 (IBM ILOG, 2014) to generate the top M = 100 equation trees with a maximum stack depth of 10, aborting exploration upon hitting 10K feasible solutions or 30 seconds.5 We use Python’s SymPy package for solving equations for the unknown. For the local and global models, we use the LIBSVM package to train SVM classifiers (Chang and Lin, 2011) with RBF kernels that return likelihood estimates as the score.

5 These hyper-parameters were chosen based on experimentation with a small subset of the questions. A more systematic choice may improve overall performance.

Dataset. This work deals with grade-school algebra word problems that map to single equations with varying length. Every equation may involve multiple math operations including multiplication, division, subtraction, and addition over non-negative rational numbers and one variable. The data is gathered from the math-aids.com, k5learning.com, and ixl.com websites and a subset of the data from Kushman et al. (2014) that maps word problems to single equations. We refer to this dataset as SINGLEEQ (see Table 9 for example problems). The SINGLEEQ dataset consists of 508 problems, 1,117 sentences, and 15,292 words.

Baselines. We compare our method with the template-based method (Kushman et al., 2014) and the verb-categorization method (Hosseini et al., 2014). For the template-based method, we use the fully supervised setting, providing equations for each training example.

8.1 Comparison with Template-based Method

We first compare ALGES with the template-based method over SINGLEEQ. We evaluate both systems on the number of correct answers provided and report the average of a 5-fold cross validation. ALGES achieves 72% accuracy whereas the template-based method achieves 67% accuracy, a 15% relative reduction in errors (first columns in Tables 4 and 5). This result is statistically significant with a p-value of 0.018 under a paired t-test.

Template Overlap    10.4    7.7     6.3     2.1
ALGES               0.72    0.66    0.66    0.63
Template-based      0.67    0.60    0.46    0.26
Error reduction     15%     15%     33%     50%

Table 4: Decreasing template overlap: Accuracy of ALGES versus the template-based method on single-equation algebra word problems. The first column corresponds to the SINGLEEQ dataset, and the other columns are for subsets with decreasing template overlap.

Lexical Overlap     4.3     3.3     2.6     2.5
ALGES               0.72    0.66    0.66    0.63
Template-based      0.67    0.60    0.46    0.26
Error reduction     15%     15%     33%     50%

Table 5: Decreasing lexical overlap: Accuracy of ALGES versus the template-based method on single-equation algebra word problems. The first column corresponds to the SINGLEEQ dataset, and the other columns are for subsets with decreasing lexical overlap.

Lexical Overlap. By further analyzing SINGLEEQ, we noted that there is substantial overlap between the content words (common noun, adjective, adverb, and verb lemmas) in different problems. For example, many problems ask for the total number of seashells collected by two people on a beach, with only the names of the people and the number of seashells that each found changed. To analyze the effect of this repetition on the learning methods evaluated, we define a lexical overlap parameter as the total number of content words in a dataset divided by the number of unique content words. The two “seashell problems” have a high lexical overlap.

Template Overlap. We also noted that many problems in SINGLEEQ can be solved using the same template, or equation tree structure above the leaf nodes.
For example, a problem which corresponds to the equation (9 ∗ 3) + 7 and a different problem that maps to (4 ∗ 5) + 2 share the same template. We introduce a template overlap parameter defined as the average number of problems with the same template in a dataset.

Results. In our data, template overlap and lexical overlap co-vary. To demonstrate the brittleness of the template-based method simply, we picked three subsets of SINGLEEQ where both parameters were substantially lower than in SINGLEEQ and recorded the relative performance of the template-based method and of ALGES in Tables 4 and 5. The data used in both tables is the same, but the tables are separated for readability. The first column reports results for the SINGLEEQ dataset, and the other columns report results for the subsets with decreasing template and lexical overlaps. The subsets consist of 254, 127, and 63 questions respectively. We see that as the lexical overlap drops from 4.3 to 2.5 and as the template overlap drops from 10.4 to 2.1, the error reduction of ALGES relative to the template-based method rises from 15% to 50%.

While the template-based method is able to solve a wider range of problems than ALGES, its accuracy falls off significantly when faced with fewer repeated templates or less spurious lexical overlap between problems (from 0.67 to 0.26). The accuracy of ALGES also declines from 0.72 to 0.63 across the table, which needs to be investigated further. In future work, we also need to investigate additional settings for the two parameters and to attempt to “break” their co-variance. Nevertheless, we have uncovered an important brittleness in the template-based method and have shown that ALGES is substantially more robust.

8.2 Comparison with Verb-Categorization

The verb-categorization method learns to solve addition and subtraction problems, while ALGES is capable of solving multiplication and division problems as well.
We compare against this method over our dataset as well as the dataset provided by that work, here referred to as ADDSUB. ADDSUB consists of addition and subtraction word problems with the possibility of irrelevant distractor quantities in the problem text. The verb-categorization method uses rules for handling irrelevant information. An example rule is to remove a Qset whose adjective is not consistent with the adjective of the target Qset. We augment ALGES with rules introduced in this method for handling irrelevant information in ADDSUB.

Results, reported in Table 6, show comparable accuracy between both methods on the Hosseini et al. (2014) data. Our method shows a significant improvement over theirs on the SINGLEEQ dataset due to the presence of multiplication and division operators, as 40% of the problems in our dataset include these operators.

Method                 ADDSUB    SINGLEEQ
ALGES                  0.77      0.72
Verb-categorization    0.78      0.48
Error reduction        -         53%

Table 6: Accuracy of ALGES compared to the verb-categorization method.

8.3 Ablation Study

In order to determine the effect of various components of our system on its overall performance, we perform the following ablations:

No Local Model: Here, we test our method absent the local information (Section 7.1). That is, we generate equations using all ILP constraints, and score trees solely on information provided by the global model: p(t|w) ∝ Gglobal(w,t).

No Global Model: Here, we test our method without the global information (Section 7.2). That is, we generate equations using only the hard constraints of ILP and score trees solely on information provided by the local model: p(t|w) ∝ ∏_{ti ∈ t} Llocal(w,ti).

No Qset Reordering: We test our method without the deterministic Qset reordering rules outlined in Section 5. Instead, we allow the ILP to choose the top M equations regardless of order.

Results in Table 7 show that each component of ALGES contributes to its overall performance on the SINGLEEQ corpus.
We find that both the Global and Local models contribute significantly to the overall system, demonstrating the value of a bottom-up approach to building equation trees.

Method                Accuracy
ALGES                 0.72
No Local Model        0.50
No Global Model       0.49
No Qset Reordering    0.68

Table 7: Ablation study of each component of ALGES.

Importance of Features. We also evaluate the accuracy of the local Qset relationship model (Section 7.1) on the task of predicting the correct operator for a pair of Qsets ⟨s1,s2⟩ over the SINGLEEQ dataset using a 5-fold cross validation. Table 8 shows the value of each feature group used in the local classifier, and thus the importance of details of the Qset representation.

Method                                Accuracy
Local classifier: Full Feature set    0.84
No Single Set Features                0.81
No Set Relation Features              0.75
No Target Features                    0.79

Table 8: Accuracy of local classifier in predicting the correct operator between two Qsets and ablating feature sets.

8.4 Qualitative Examples and Error Analysis

Table 9 shows some examples of problems solved by our method. We analyzed 72 errors made by ALGES on the SINGLEEQ dataset. Table 10 summarizes five major categories of errors.

Problems and equations

Luke had 20 stickers. He bought 12 stickers from a store in the mall and got 20 stickers for his birthday. Then Luke gave 5 of the stickers to his sister and used 8 to decorate a greeting card. How many stickers does Luke have left?
((20 + ((12 + 20)−8))−5) = x

Maggie bought 4 packs of red bouncy balls, 8 packs of yellow bouncy balls, and 4 packs of green bouncy balls. There were 10 bouncy balls in each package. How many bouncy balls did Maggie buy in all?
x = (((4 + 8) + 4)∗10)

Sam had 79 dollars to spend on 9 books. After buying them he had 16 dollars. How much did each book cost?
79 = ((9∗x) + 16)

Fred loves trading cards. He bought 2 packs of football cards for $2.73 each, a pack of Pokemon cards for $4.01, and a deck of baseball cards for $8.95. How much did Fred spend on cards?
((2∗2.73) + (4.01 + 8.95)) = x

Table 9: Examples of problems solved by ALGES together with the returned equation.

Error type: Example

Parsing Issues (12%): Randy needs 53 cupcakes for a birthday party. He already has 7 chocolate cupcakes and 19 vanilla cupcakes. How many more cupcakes should Randy buy?

Grounding & Ordering (19%): There are 24 bicycles and 14 tricycles in the storage at Danny’s apartment building. Each bicycle has 2 wheels and each tricycle has 3 wheels. How many wheels are there in all?

Semantic Limitation (19%): The sum of three consecutive even numbers is 162. What is the smallest of these numbers?

Lack of Knowledge (32%): A restaurant sold 63 hamburgers last week. How many hamburgers on average were sold each day?

Inferring quantities (18%): Sara, Keith, Benny, and Alyssa each have 96 baseball cards. How many dozen baseball cards do they have in all?

Table 10: Examples of different error categories and relative frequencies. Sources of errors are underlined.

Parsing errors cause a wrong grounding into the designed representation. For example, the parser treats ‘vanilla’ as a noun modified by the number ‘19’, leading our system to treat ‘vanilla’ as the entity of a Qset rather than ‘cupcake’. Despite the improvements that come from ALGES, a portion of errors are attributed to grounding and ordering issues. For instance, the system fails to correctly distinguish between the sets of wheels, and so does not get the movement-triggering container relationships right. Semantic limitations are another source of errors. For example, ALGES does not model the semantics of ‘three consecutive numbers’. The fourth category refers to errors caused by a lack of world knowledge (e.g., ‘week’ corresponds to ‘7 days’). Finally, ALGES is not able to infer quantities when they are not explicitly mentioned in the text. For example, the number of people should be inferred by counting the proper names in the problem.
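The equations returned for the problems in Table 9 each contain a single unknown x, so recovering the answer from an equation tree is mechanical. The Experimental Setup uses SymPy for that step; the sketch below is only a dependency-free stand-in, under the assumption that the equation is linear in x. The tuple encoding of trees and the names `linear_form` and `solve_equation` are hypothetical and are not taken from the ALGES code.

```python
from fractions import Fraction

# Assumed encoding (hypothetical): a tree is a number, the string "x",
# or a tuple (op, left, right) with op in {"+", "-", "*", "/"}.

def linear_form(tree):
    """Return (a, b) such that the tree evaluates to a*x + b."""
    if tree == "x":
        return Fraction(1), Fraction(0)
    if not isinstance(tree, tuple):
        # Go through str() so that 2.73 becomes exactly 273/100.
        return Fraction(0), Fraction(str(tree))
    op, left, right = tree
    (a1, b1), (a2, b2) = linear_form(left), linear_form(right)
    if op == "+":
        return a1 + a2, b1 + b2
    if op == "-":
        return a1 - a2, b1 - b2
    if op == "*":
        if a1 != 0 and a2 != 0:
            raise ValueError("x * x is not linear")
        return a1 * b2 + a2 * b1, b1 * b2
    if op == "/":
        if a2 != 0:
            raise ValueError("division by x is not handled in this sketch")
        return a1 / b2, b1 / b2
    raise ValueError(f"unknown operator {op!r}")

def solve_equation(lhs, rhs):
    """Solve lhs = rhs for x, where both sides are linear in x."""
    a1, b1 = linear_form(lhs)
    a2, b2 = linear_form(rhs)
    return (b2 - b1) / (a1 - a2)

# Three of the equations from Table 9.
stickers = solve_equation(("-", ("+", 20, ("-", ("+", 12, 20), 8)), 5), "x")
books = solve_equation(79, ("+", ("*", 9, "x"), 16))
cards = solve_equation(("+", ("*", 2, 2.73), ("+", 4.01, 8.95)), "x")
print(stickers, books, float(cards))
```

Exact rational arithmetic is used here so that prices such as 2.73 do not pick up floating-point noise; a general symbolic solver remains the more flexible choice when the unknown appears in a denominator.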
9 Conclusion

In this work we have outlined a method for solving grade-school algebra word problems. We have empirically demonstrated the value of our approach versus state-of-the-art word problem solving techniques. Our method grounds quantity references, utilizes type-consistency constraints to prune the search space, learns which algebraic operators are indicated by text, and ranks equations according to a global objective function. ALGES is a hybrid of previous template-based and verb-categorization state-based methods for solving such problems. By learning correspondences between text and mathematical operators, we extend the method of state updates based on verb categories. By learning to re-rank equation trees using a global likelihood model, we extend the method of mapping word problems to equation templates.

Different components of ALGES can be adapted to other domains of language grounding that require cross-sentence reasoning. Future work involves extending ALGES to solve higher grade math word problems including simultaneous equations. This can be accomplished by extending the variable grounding step to allow multiple variables, and training the global equation model to recognize which quantities belong to which equation. The code and data for ALGES are publicly available.

Acknowledgments: This research was supported by the Allen Institute for AI (66-9175), Allen Distinguished Investigator Award, and NSF (IIS-1352249). We thank Regina Barzilay, Luke Zettlemoyer, Aria Haghighi, Mark Hopkins, Ali Farhadi, and the anonymous reviewers for their helpful comments.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1(1):49–62.
Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling biological processes for reading comprehension. In EMNLP.
Antoine Bordes, Nicolas Usunier, and Jason Weston. 2010. Label ranking under ambiguous supervision for learning semantic correspondences. In ICML, pages 103–110.
S. R. K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In ACL/AFNLP, pages 82–90.
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.
David L. Chen, Joohyun Kim, and Raymond J. Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. JAIR, 37:397–435.
Michael Collins. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70.
Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.
Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth? Automatic caption generation for news images. In ACL, pages 1239–1249.
Ruifang Ge and Raymond J Mooney. 2005. A statistical semantic parser that integrates syntax and semantics. In Conference on Computational Natural Language Learning, pages 9–16.
Ruifang Ge and Raymond J. Mooney. 2006. Discriminative reranking for semantic parsing. In ACL.
D. Goldwasser and D. Roth. 2011. Learning from natural instructions. In IJCAI.
Hannaneh Hajishirzi, Julia Hockenmaier, Erik T. Mueller, and Eyal Amir. 2011. Reasoning about RoboCup soccer narratives. In UAI, pages 291–300.
Hannaneh Hajishirzi, Mohammad Rastegari, Ali Farhadi, and Jessica K Hodgins. 2012. Semantic understanding of professional soccer commentaries. In UAI.
Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning knowledge graphs for question answering through conversational dialog. In NAACL.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533.
IBM ILOG. 2014. IBM ILOG CPLEX Optimization Studio 12.6.
R Koncel-Kedziorski, Hannaneh Hajishirzi, and Ali Farhadi. 2014. Multi-resolution language grounding with weak supervision. In EMNLP.
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In ACL, pages 271–281.
Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP, pages 1223–1233.
Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL/AFNLP, pages 91–99.
Dekang Lin. 1998. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304.
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In NAACL.
Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In Proc. of the 13th International Symposium on Experimental Robotics (ISER), June.
Arindam Mitra and Chitta Baral. 2015. Learning to automatically solve logic grid puzzles. In EMNLP.
D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, CoNLL, pages 1–8. Association for Computational Linguistics.
Subhro Roy and Dan Roth. 2015a. Solving general arithmetic word problems. In EMNLP.
Subhro Roy, Tim Vieira, and Dan Roth. 2015b. Reasoning about quantities in natural language. TACL.
Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, and Oren Etzioni. 2014. Diagram understanding in geometry questions. In AAAI.
Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In EMNLP.
Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In EMNLP.
V. Srikumar and D. Roth. 2011. A joint model for extended semantic role labeling. In EMNLP, Edinburgh, Scotland.
Mark Yatskar, Lucy Vanderwende, and Luke Zettlemoyer. 2014. See no evil, say no evil: Description generation from densely labeled images. Lexical and Computational Semantics (*SEM 2014), page 110.
John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI, pages 1050–1055.
Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, pages 658–666.
Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In EMNLP.

Appendix: ILP Model Details

Figure 5 summarizes various constraints of our ILP model for generating candidate equations. $\mathit{op1\_idx}_i$ is an auxiliary variable whose value, when $x_i$ is an operator, is the index in the postfix expression of the first operand of $x_i$. If $\mathit{op1\_idx}_i = j$, auxiliary variables $\mathit{op1}^x_i$, $\mathit{op1}^t_i$, $\mathit{op1}^o_i$, and $\mathit{op1}^u_i$ mirror $x_j$, $t_j$, $o_j$, and $u_j$, respectively. $s_e$ denotes the corresponding constant or operator symbol $e$ (e.g., ‘+’, ‘=’, ‘0’, etc.) in the postfix expression being constructed. H and W, as before, represent hard and weighted soft constraints.
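Definitional Constraints (H):
$c_i = I(x_i \le k), \quad i \in [L]$
$o_i = I(x_i > n), \quad i \in [L]$
$c_i + u_i + o_i = 1, \quad i \in [L]$
$d_1 = 1; \quad d_i = d_{i-1} - 2 o_i + 1, \quad 2 \le i \le L$
$\mathit{op1\_idx}_i = \max_{j \le i-2} \{ j \mid d_j = d_i \}, \quad 3 \le i \le L$
$I(\mathit{op1\_idx}_i = j) \Rightarrow I(\mathit{op1}^x_i = x_j),\ I(\mathit{op1}^t_i = t_j),\ I(\mathit{op1}^o_i = o_j),\ I(\mathit{op1}^u_i = u_j), \quad i, j \in [L]$
$I(x_i = j) \Rightarrow I(t_i = \mathit{type}_j), \quad i \in [L],\ j \in [q]$

Postfix validity (H):
$o_1 = 0; \quad d_L = 1$

Equation tree structure (H):
$x_L = m; \quad x_i < m, \quad 1 \le i < L$

Single use of constants (H):
$I(x_i = x_j) \le o_i, \quad 1 \le i < j < L$

Preserve text ordering (W):
$c_i \Rightarrow I(x_i < x_j), \quad 1 \le i < j < L$

Single unknown (H):
$\sum_{i \in [L]} u_i = 1$

Type consistency (W):
$I(x_i \in \{s_{+}, s_{-}\}) \Rightarrow I(t_i = t_{i-1} = \mathit{op1}^t_i), \quad i \in [L]$
$I(x_i = s_{*}) \Rightarrow I(t_i \in \{t_{i-1}, \mathit{op1}^t_i\}), \quad i \in [L]$
$I(x_i = s_{/}) \Rightarrow I(t_i = \mathit{op1}^t_i), \quad i \in [L]$
$I(x_i \in \{s_{*}, s_{/}\}) \Rightarrow I(t_{i-1} \ne \mathit{op1}^t_i), \quad i \in [L]$

Non-redundancy (H), Symmetry breaking (H):
$I(x_i \in \{s_{+}, s_{-}\}) \Rightarrow I(x_{i-1} \ne s_0,\ \mathit{op1}^x_i \ne s_0), \quad i \in [L]$
$I(x_i = s_{/}) \Rightarrow I(x_{i-1} \notin \{s_0, s_1\}), \quad i \in [L]$
$I(x_i \in \{s_{+}, s_{-}\},\ c_{i-1} = c_{i-2} = 1) \Rightarrow I(x_{i-2} < x_{i-1}), \quad 3 \le i \le L$

Simplicity (H), Equality/Unknown first or last (W):
$d_i \le \mathit{maxStackDepth}, \quad i \in [L]$
$\mathit{op1}^o_L + o_{L-1} \le 1$
$u_{L-1} + I(u_1 = 1,\ \forall i \in \{2, \ldots, L-1\} : d_i \ge 2) \ge 1$

Equality next to unknown (W):
$I(x_i = s_{=}) \le u_{i-1} + \mathit{op1}^u_i, \quad 3 \le i \le L$

Figure 5: ILP model for generating candidate equations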
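The stack-depth recurrence and the first-operand index in Figure 5 can be checked by hand on a small postfix expression. The sketch below is an illustration only, not the ILP encoding itself: it evaluates d_i, the postfix validity conditions, the stack depth bound, and op1_idx_i directly in Python for a given token sequence. The names `OPERATORS`, `stack_depths`, and `analyze_postfix` are hypothetical, and the extra check that the depth never drops below 1 is an added assumption that is not listed in the figure.

```python
OPERATORS = {"+", "-", "*", "/", "="}  # binary operators, including equality

def stack_depths(is_operator):
    """d_1 = 1; d_i = d_{i-1} - 2*o_i + 1, where o_i = 1 for an operator."""
    depths = []
    for o in is_operator:
        previous = depths[-1] if depths else 0
        depths.append(previous - 2 * o + 1)
    return depths

def analyze_postfix(tokens, max_stack_depth=10):
    """Return (depths, valid, op1_idx) for a postfix token sequence."""
    o = [int(tok in OPERATORS) for tok in tokens]
    d = stack_depths(o)
    valid = (
        o[0] == 0                      # postfix validity: o_1 = 0
        and d[-1] == 1                 # postfix validity: d_L = 1
        and max(d) <= max_stack_depth  # simplicity: d_i <= maxStackDepth
        and min(d) >= 1                # assumption: never pop an empty stack
    )
    op1_idx = {}
    if valid:
        # Positions are 1-based to match the indices used in Figure 5.
        for i in range(3, len(tokens) + 1):
            if o[i - 1]:
                matches = [j for j in range(1, i - 1) if d[j - 1] == d[i - 1]]
                op1_idx[i] = max(matches)
    return d, valid, op1_idx

# Postfix form of the Figure 1 equation 17 + (7 * x) = 80.
tokens = ["17", "7", "x", "*", "+", "80", "="]
depths, valid, op1_idx = analyze_postfix(tokens)
print(depths, valid, op1_idx)
```

For the operator at position 4 (the multiplication), the first operand ends at position 2 (the constant 7), while the second operand is always the token at position i − 1, which is why the constraints refer to t_{i−1} and x_{i−1} directly.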