Learning to translate with products of novices: a suite of open-ended challenge problems for teaching MT

Adam Lopez1, Matt Post1, Chris Callison-Burch1,2, Jonathan Weese, Juri Ganitkevitch, Narges Ahmidi, Olivia Buzek, Leah Hanson, Beenish Jamil, Matthias Lee, Ya-Ting Lin, Henry Pao, Fatima Rivera, Leili Shahriyari, Debu Sinha, Adam Teichert, Stephen Wampler, Michael Weinberger, Daguang Xu, Lin Yang, and Shang Zhao∗

Department of Computer Science, Johns Hopkins University
1Human Language Technology Center of Excellence, Johns Hopkins University
2Computer and Information Science Department, University of Pennsylvania

∗ The first five authors were instructors and the remaining authors were students in the work described here. This research was conducted while Chris Callison-Burch was at Johns Hopkins University.

Abstract

Machine translation (MT) draws from several different disciplines, making it a complex subject to teach. There are excellent pedagogical texts, but problems in MT and current algorithms for solving them are best learned by doing. As a centerpiece of our MT course, we devised a series of open-ended challenges for students in which the goal was to improve performance on carefully constrained instances of four key MT tasks: alignment, decoding, evaluation, and reranking. Students brought a diverse set of techniques to the problems, including some novel solutions which performed remarkably well. A surprising and exciting outcome was that student solutions or their combinations fared competitively on some tasks, demonstrating that even newcomers to the field can help improve the state-of-the-art on hard NLP problems while simultaneously learning a great deal. The problems, baseline code, and results are freely available.

1 Introduction

A decade ago, students interested in natural language processing arrived at universities having been exposed to the idea of machine translation (MT) primarily through science fiction. Today, incoming students have been exposed to services like Google Translate since they were in secondary school or earlier. For them, MT is science fact. So it makes sense to teach statistical MT, either on its own or as a unit in a class on natural language processing (NLP), machine learning (ML), or artificial intelligence (AI). A course that promises to show students how Google Translate works and teach them how to build something like it is especially appealing, and several universities and summer schools now offer such classes. There are excellent introductory texts—depending on the level of detail required, instructors can choose from a comprehensive MT textbook (Koehn, 2010), a chapter of a popular NLP textbook (Jurafsky and Martin, 2009), a tutorial survey (Lopez, 2008), or an intuitive tutorial on the IBM Models (Knight, 1999b), among many others.

But MT is not just an object of academic study. It's a real application that isn't fully perfected, and the best way to learn about it is to build an MT system. This can be done with open-source toolkits such as Moses (Koehn et al., 2007), cdec (Dyer et al., 2010), or Joshua (Ganitkevitch et al., 2012), but these systems are not designed for pedagogy. They are mature codebases featuring tens of thousands of source code lines, making it difficult to focus on their core algorithms. Most tutorials present them as black boxes. But our goal is for students to learn the key techniques in MT, and ideally to learn by doing. Black boxes are incompatible with this goal.
We solve this dilemma by presenting students with concise, fully-functioning, self-contained components of a statistical MT system: word alignment, decoding, evaluation, and reranking. Each implementation consists of a naïve baseline algorithm in less than 150 lines of Python code. We assign them to students as open-ended challenges in which the goal is to improve performance on objective evaluation metrics as much as possible. This setting mirrors evaluations conducted by the NLP research community and by the engineering teams behind high-profile NLP projects such as Google Translate and IBM's Watson. While we designate specific algorithms as benchmarks for each task, we encourage creativity by awarding more points for the best systems. As additional incentive, we provide a web-based leaderboard to display standings in real time.

In our graduate class on MT, students took a variety of different approaches to the tasks, in some cases devising novel algorithms. A more exciting result is that some student systems or combinations of systems rivaled the state of the art on some datasets.

2 Designing MT Challenge Problems

Our goal was for students to freely experiment with different ways of solving MT problems on real data, and our approach consisted of two separable components. First, we provided a framework that strips key MT problems down to their essence so students could focus on understanding classic algorithms or invent new ones. Second, we designed incentives that motivated them to improve their solutions as much as possible, encouraging experimentation with approaches beyond what we taught in class.

2.1 Decoding, Reranking, Evaluation, and Alignment for MT (DREAMT)

We designed four assignments, each corresponding to a real subproblem in MT: alignment, decoding, evaluation, and reranking.1 From the more general perspective of AI, they emphasize the key problems of unsupervised learning, search, evaluation design, and supervised learning, respectively. In real MT systems, these problems are highly interdependent, a point we emphasized in class and at the end of each assignment—for example, that alignment is an exercise in parameter estimation for translation models, that model choice is a tradeoff between expressivity and efficient inference, and that optimal search does not guarantee optimal accuracy. However, presenting each problem independently and holding all else constant enables more focused exploration.

For each problem we provided data, a naïve solution, and an evaluation program. Following Bird et al. (2008) and Madnani and Dorr (2008), we implemented the challenges in Python, a high-level programming language that can be used to write very concise programs resembling pseudocode.2,3 By default, each baseline system reads the test data and generates output in the evaluation format, so setup required zero configuration, and students could begin experimenting immediately. For example, on receipt of the alignment code, aligning data and evaluating results required only typing:

> align | grade

Students could then run experiments within minutes of beginning the assignment.

Three of the four challenges also included unlabeled test data (except the decoding assignment, as explained in §4). We evaluated test results against a hidden key when assignments were submitted.

1 http://alopez.github.io/dreamt
2 http://python.org
3 Some well-known MT systems have been implemented in Python (Chiang, 2007; Huang and Chiang, 2007).
2.2 Incentive Design

We wanted to balance several pedagogical goals: understanding of classic algorithms, free exploration of alternatives, experience with typical experimental design, and unhindered collaboration.

Machine translation is far from solved, so we expected more than reimplementation of prescribed algorithms; we wanted students to really explore the problems. To motivate exploration, we made the assignments competitive. Competition is a powerful force, but must be applied with care in an educational setting.4 We did not want the consequences of ambitious but failed experiments to be too dire, and we did not want to discourage collaboration.

For each assignment, we guaranteed a passing grade for matching the performance of a specific target algorithm. Typically, the target was important but not state-of-the-art: we left substantial room for improvement, and thus competition. We told students the exact algorithm that produced the target accuracy (though we expected them to derive it themselves based on lectures, notes, or literature). We did not specifically require them to implement it, but the guarantee of a passing grade provided a powerful incentive for this to be the first step of each assignment. Submissions that beat this target received additional credit. The top five submissions received full credit, while the top three received extra credit. This scheme provided strong incentive to continue experimentation beyond the target algorithm.5

For each assignment, students could form teams of any size, under three rules: each team had to publicize its formation to the class, all team members agreed to receive the same grade, and teams could not drop members. Our hope was that these requirements would balance the perceived competitive advantage of collaboration against a reluctance to take (and thus support) teammates who did not contribute to the competitive effort.6 This strategy worked: out of sixteen students, ten opted to work collaboratively on at least one assignment, always in pairs.

We provided a web-based leaderboard that displayed standings on the test data in real time, identifying each submission by a pseudonymous handle known only to the team and instructors. Teams could upload solutions as often as they liked before the assignment deadline. The leaderboard displayed scores of the default and target algorithms. This incentivized an early start, since teams could verify for themselves when they met the threshold for a passing grade. Though effective, it also detracted from realism in one important way: it enabled hill-climbing on the evaluation metric. In early assignments, we observed a few cases of this behavior, so for the remaining assignments, we modified the leaderboard so that changes in score would only be reflected once every twelve hours. This trades some amount of scientific realism for some measure of incentive, a strategy that has proven effective in other pedagogical tools with real-time feedback (Spacco et al., 2006).

To obtain a grade, teams were required to submit their results, share their code privately with the instructors, and publicly describe their experimental process to the class so that everyone could learn from their collective effort. Teams were free (but not required) to share their code publicly at any time. Some did so after the assignment deadline.

4 Thanks to an anonymous reviewer for this turn of phrase.
5 Grades depend on institutional norms. In our case, high grades in the rest of class combined with matching all assignment target algorithms would earn a B+; beating two target algorithms would earn an A-; top five placement on any assignment would earn an A; and top three placement compensated for weaker grades in other course criteria. Everyone who completed all four assignments placed in the top five at least once.
6 The equilibrium point is a single team, though this team would still need to decide on a division of labor. One student contemplated organizing this team, but decided against it.
3 The Alignment Challenge

The first challenge was word alignment: given a parallel text, students were challenged to produce word-to-word alignments with low alignment error rate (AER; Och and Ney, 2000). This is a variant of a classic assignment not just in MT, but in NLP generally. Klein (2005) describes a version of it, and we know several other instructors who use it.7 In most of these, the object is to implement IBM Model 1 or 2, or a hidden Markov model. Our version makes it open-ended by asking students to match or beat an IBM Model 1 baseline.

7 Among them, Jordan Boyd-Graber, John DeNero, Philipp Koehn, and Slav Petrov (personal communication).

3.1 Data

We provided 100,000 sentences of parallel data from the Canadian Hansards, totaling around two million words.8 This dataset is small enough to align in a few minutes with our implementation—enabling rapid experimentation—yet large enough to obtain reasonable results. In fact, Liang et al. (2006) report alignment accuracy on data of this size that is within a fraction of a point of their accuracy on the complete Hansards data. To evaluate, we used manual alignments of a small fraction of sentences, developed by Och and Ney (2000), which we obtained from the shared task resources organized by Mihalcea and Pedersen (2003). The first 37 sentences of the corpus were development data, with manual alignments provided in a separate file. Test data consisted of an additional 447 sentences, for which we did not provide alignments.9

8 http://www.isi.edu/natural-language/download/hansard/
9 This invited the possibility of cheating, since alignments of the test data are publicly available on the web. We did not advertise this, but as an added safeguard we obfuscated the data by distributing the test sentences randomly throughout the file.

3.2 Implementation

We distributed three Python programs with the data. The first, align, computes Dice's coefficient (1945) for every pair of French and English words, then aligns every pair for which its value is above an adjustable threshold. Our implementation (most of which is shown in Listing 1) is quite close to pseudocode, making it easy to focus on the algorithm, one of our pedagogical goals.

Listing 1 The default aligner in DREAMT: thresholding Dice's coefficient.

for (f, e) in bitext:
  for f_i in set(f):
    f_count[f_i] += 1
    for e_j in set(e):
      fe_count[(f_i,e_j)] += 1
  for e_j in set(e):
    e_count[e_j] += 1
for (f_i, e_j) in fe_count.keys():
  dice[(f_i,e_j)] = \
      2.0 * fe_count[(f_i, e_j)] / \
      (f_count[f_i] + e_count[e_j])
for (f, e) in bitext:
  for (i, f_i) in enumerate(f):
    for (j, e_j) in enumerate(e):
      if dice[(f_i,e_j)] >= cutoff:
        print "%i-%i " % (i,j)

The grade program computes AER and optionally prints an alignment grid for sentences in the development data, showing both human and automatic alignments. Finally, the check program verifies that the results represent a valid solution, reporting an error if not—enabling students to diagnose bugs in their submissions.
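For concreteness, the quantity reported by grade is alignment error rate, computed from the sure (S) and possible (P) gold links as AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|). The following is a minimal corpus-level sketch of that computation; the released grading script may differ in details such as I/O handling, so treat this as an illustration rather than the actual grader.

def corpus_aer(gold, predicted):
    # gold: list of (sure, possible) pairs of sets of (i, j) links, with sure a subset of possible
    # predicted: list of sets of (i, j) links, one set per sentence
    a_s = a_p = a = s = 0
    for (sure, possible), alignment in zip(gold, predicted):
        a_s += len(alignment & sure)      # predicted links that are sure
        a_p += len(alignment & possible)  # predicted links that are at least possible
        a += len(alignment)
        s += len(sure)
    return 1.0 - float(a_s + a_p) / (a + s)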
The default implementation enabled immediate experimentation. On receipt of the code, students were instructed to align the first 1,000 sentences and compute AER using a simple command.

> align -n 1000 | grade

By varying the number of input sentences and the threshold for an alignment, students could immediately see the effect of various parameters on alignment quality.

We privately implemented IBM Model 1 (Brown et al., 1993) as the target algorithm for a passing grade. We ran it for five iterations with English as the target language and French as the source. Our implementation did not use null alignment or symmetrization—leaving out these common improvements offered students the possibility of discovering them independently, and being rewarded for doing so.

Figure 1: Submission history for the alignment challenge (AER × 100 against days until the deadline). Dashed lines represent the default and baseline system performance. Each colored line represents a student, and each dot represents a submission. For clarity, we show only submissions that improved the student's AER.

3.3 Challenge Results

We received 209 submissions from 11 teams over a period of two weeks (Figure 1). Everyone eventually matched or exceeded the IBM Model 1 AER of 31.26. Most students implemented IBM Model 1, but we saw many other solutions, indicating that many truly experimented with the problem:

• Implementing heuristic constraints to require alignment of proper names and punctuation.
• Running the algorithm on stems rather than surface words.
• Initializing the first iteration of Model 1 with parameters estimated on the observed alignments in the development data.
• Running Model 1 for many iterations. Most researchers typically run Model 1 for five iterations or fewer, and there are few experiments in the literature on its behavior over many iterations, as there are for hidden Markov model taggers (Johnson, 2007). Our students carried out these experiments, reporting runs of 5, 20, 100, and even 2000 iterations. No improvement was observed after 20 iterations.
• Implementing various alternative approaches from the literature, including IBM Model 2 (Brown et al., 1993), competitive linking (Melamed, 2000), and smoothing (Moore, 2004).

One of the best solutions was competitive linking with Dice's coefficient, modified to incorporate the observation that alignments tend to be monotonic by restricting possible alignment points to a window of eight words around the diagonal. Although simple, it achieved an AER of 18.41, an error reduction over Model 1 of more than 40%.

The best score compares unfavorably against a state-of-the-art AER of 3.6 (Liu et al., 2010). But under a different view, it still represents a significant amount of progress for an effort taking just over two weeks: on the original challenge from which we obtained the data (Mihalcea and Pedersen, 2003), the best student system would have placed fifth out of fifteen systems. Consider also the combined effort of all the students: when we trained a perceptron classifier on the development data, taking each student's prediction as a feature, we obtained an AER of 15.4, which would have placed fourth on the original challenge. This is notable since none of the systems incorporated first-order dependencies on the alignments of adjacent words, long noted as an important feature of the best alignment models (Och and Ney, 2003). Yet a simple system combination of student assignments is as effective as a hidden Markov model trained on a comparable amount of data (Och and Ney, 2003).
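The target algorithm itself fits comfortably in the same style as Listing 1. Below is a minimal sketch of IBM Model 1 EM training followed by greedy alignment; it is an illustration only, and our private implementation may differ in direction, initialization, and other details (it used five iterations and no null word or symmetrization).

from collections import defaultdict

def train_model1(bitext, iterations=5):
    # EM for IBM Model 1 without a null word: t[(f_i, e_j)] approximates p(f_i | e_j),
    # initialized uniformly and re-estimated from expected counts each iteration.
    f_vocab = set(f_i for (f, e) in bitext for f_i in f)
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (f_i, e_j)
        total = defaultdict(float)   # expected counts of e_j
        for (f, e) in bitext:
            for f_i in f:
                z = sum(t[(f_i, e_j)] for e_j in e)   # normalizer for this f_i
                for e_j in e:
                    c = t[(f_i, e_j)] / z
                    count[(f_i, e_j)] += c
                    total[e_j] += c
        for (f_i, e_j) in count:
            t[(f_i, e_j)] = count[(f_i, e_j)] / total[e_j]
    return t

def align_model1(bitext, t):
    # align each French word to its most probable English word, in the i-j
    # output format of Listing 1
    for (f, e) in bitext:
        links = []
        for (i, f_i) in enumerate(f):
            j = max(range(len(e)), key=lambda j: t[(f_i, e[j])])
            links.append("%d-%d" % (i, j))
        print(" ".join(links))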
It is important to note that AER does not necessarily correlate with downstream performance, particularly on the Hansards dataset (Fraser and Marcu, 2007). We used the conclusion of the assignment as an opportunity to emphasize this point.

4 The Decoding Challenge

The second challenge was decoding: given a fixed translation model and a set of input sentences, students were challenged to produce translations with the highest model score. This challenge introduced the difficulties of combinatorial optimization under a deceptively simple setup: the model we provided was a simple phrase-based translation model (Koehn et al., 2003) consisting only of a phrase table and trigram language model. Under this simple model, for a French sentence f of length I, English sentence e of length J, and alignment a where each element consists of a span in both e and f such that every word in both e and f is aligned exactly once, the conditional probability of e and a given f is as follows.10

p(e, a \mid f) = \prod_{\langle i, i', j, j' \rangle \in a} p(f_i^{i'} \mid e_j^{j'}) \prod_{j=1}^{J+1} p(e_j \mid e_{j-1}, e_{j-2})    (1)

10 For simplicity, this formula assumes that e is padded with two sentence-initial symbols and one sentence-final symbol, and ignores the probability of sentence segmentation, which we take to be uniform.

To evaluate output, we compute the conditional probability of e as follows.

p(e \mid f) = \sum_a p(e, a \mid f)    (2)

Note that this formulation is different from the typical Viterbi objective of standard beam search decoders, which do not sum over all alignments, but approximate p(e|f) by max_a p(e, a|f). Though the computation in Equation 2 is intractable (DeNero and Klein, 2008), it can be computed in a few minutes via dynamic programming on reasonably short sentences. We ensured that our data met this criterion. The corpus-level probability is then the product of all sentence-level probabilities in the data.

The model includes no distortion limit or distortion model, for two reasons. First, leaving out the distortion model slightly simplifies the implementation, since it is not necessary to keep track of the last word translated in a beam decoder; we felt that this detail was secondary to understanding the difficulty of search over phrase permutations. Second, it actually makes the problem more difficult, since a simple distance-based distortion model prefers translations with fewer permutations; without it, the model may easily prefer any permutation of the target phrases, making even the Viterbi search problem exhibit its true NP-hardness (Knight, 1999a; Zaslavskiy et al., 2009).

Since the goal was to find the translation with the highest probability, we did not provide a held-out test set; with access to both the input sentences and the model, students had enough information to compute the evaluation score on any dataset themselves. The difficulty of the challenge lies simply in finding the translation that maximizes the evaluation. Indeed, since the problem is intractable, even the instructors did not know the true solution.11

11 We implemented a version of the Lagrangian relaxation algorithm of Chang and Collins (2011), but found it difficult to obtain tight (optimal) solutions without iteratively reintroducing all of the original constraints. We suspect this is due to the lack of a distortion penalty, which enforces a strong preference towards translations with little reordering. However, the solution found by this algorithm only approximates the objective implied by Equation 2, which sums over alignments.

4.1 Data

We chose 48 French sentences totaling 716 words from the Canadian Hansards to serve as test data. To create a simple translation model, we used the Berkeley aligner to align the parallel text from the first assignment, and extracted a phrase table using the method of Lopez (2007), as implemented in cdec (Dyer et al., 2010). To create a simple language model, we used SRILM (Stolcke, 2002).
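Although the sum in Equation 2 looks intractable, for sentences this short it can be computed exactly: because e is fixed, the language model term factors out of the sum, and the remaining translation-model sum can be handled by dynamic programming over the set of covered source words. The sketch below illustrates one such dynamic program; it is not necessarily how the released grading code is written, and it assumes the phrase-table and language-model interfaces used by the provided decoder (cf. Listing 2) and natural-log model scores (use 10**x and log10 for base-10 models).

from math import exp, log

def log_p_e_given_f(e, f, tm, lm):
    # Equation 2 for one sentence pair, assuming at least one alignment exists.
    # e, f: tuples of words; tm: dict from French word tuples to phrase objects
    # with .english and .logprob; lm: object with begin/score/end as in Listing 2.

    # language model score of e (with sentence end), exactly as in Listing 2
    lm_state, lm_logprob = lm.begin(), 0.0
    for word in e:
        (lm_state, word_logprob) = lm.score(lm_state, word)
        lm_logprob += word_logprob
    lm_logprob += lm.end(lm_state)

    # sum of translation probabilities over all segmentations and permutations,
    # by dynamic programming over (position in e, set of covered words of f)
    memo = {}
    def total(i, covered):
        if i == len(e):
            return 1.0 if len(covered) == len(f) else 0.0
        if (i, covered) in memo:
            return memo[(i, covered)]
        t = 0.0
        for j in range(i + 1, len(e) + 1):              # next English phrase e[i:j]
            english = " ".join(e[i:j])
            for k in range(len(f)):                     # any uncovered French span f[k:l]
                for l in range(k + 1, len(f) + 1):
                    span = frozenset(range(k, l))
                    if span & covered:
                        continue
                    for phrase in tm.get(f[k:l], []):
                        if phrase.english == english:
                            t += exp(phrase.logprob) * total(j, covered | span)
        memo[(i, covered)] = t
        return t

    return lm_logprob + log(total(0, frozenset()))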
4.2 Implementation

We distributed two Python programs. The first, decode, decodes the test data monotonically—using both the language model and translation model, but without permuting phrases. The implementation is completely self-contained with no external dependencies: it implements both models and a simple stack decoding algorithm for monotonic translation. It contains only 122 lines of Python—orders of magnitude fewer than most full-featured decoders. To see its similarity to pseudocode, compare the decoding algorithm (Listing 2) with the pseudocode in Koehn's (2010) popular textbook (reproduced here as Algorithm 1). The second program, grade, computes the log-probability of a set of translations, as outlined above.

Listing 2 The default decoder in DREAMT: a stack decoder for monotonic translation.

stacks = [{} for _ in f] + [{}]
stacks[0][lm.begin()] = initial_hypothesis
for i, stack in enumerate(stacks[:-1]):
  for h in sorted(stack.itervalues(), key=lambda h: -h.logprob)[:alpha]:
    for j in xrange(i+1, len(f)+1):
      if f[i:j] in tm:
        for phrase in tm[f[i:j]]:
          logprob = h.logprob + phrase.logprob
          lm_state = h.lm_state
          for word in phrase.english.split():
            (lm_state, word_logprob) = lm.score(lm_state, word)
            logprob += word_logprob
          logprob += lm.end(lm_state) if j == len(f) else 0.0
          new_hypothesis = hypothesis(logprob, lm_state, h, phrase)
          if lm_state not in stacks[j] or \
             stacks[j][lm_state].logprob < logprob:
            stacks[j][lm_state] = new_hypothesis
winner = max(stacks[-1].itervalues(), key=lambda h: h.logprob)
def extract_english(h):
  return "" if h.predecessor is None else \
      "%s%s " % (extract_english(h.predecessor), h.phrase.english)
print extract_english(winner)

Algorithm 1 Basic stack decoding algorithm, adapted from Koehn (2010), p. 165.

place empty hypothesis into stack 0
for all stacks 0...n−1 do
  for all hypotheses in stack do
    for all translation options do
      if applicable then
        create new hypothesis
        place in stack
        recombine with existing hypothesis
        prune stack if too big

We privately implemented a simple stack decoder that searched over permutations of phrases, similar to Koehn (2004). Our implementation increased the codebase by 44 lines of code and included parameters for beam size, distortion limit, and the maximum number of translations considered for each input phrase. We posted a baseline to the leaderboard using values of 50, 3, and 20 for these, respectively. We also posted an oracle containing the most probable output for each sentence, selected from among all submissions received so far. The intent of this oracle was to provide a lower bound on the best possible output, giving students additional incentive to continue improving their systems.

Figure 2: Submission history for the decoding challenge (corpus log10 p(e|f), offset by a constant, against days until the deadline). The dotted green line represents the oracle over submissions.

4.3 Challenge Results

We received 71 submissions from 10 teams (Figure 2), again exhibiting a variety of solutions:

• Implementation of a greedy decoder that at each step chooses the most probable translation from among those reachable by a single swap or retranslation (Germann et al., 2001; Langlais et al., 2007).
• Inclusion of heuristic estimates of future cost.
• Implementation of a private oracle. Some students observed that the ideal beam setting was not uniform across the corpus. They ran their decoder under different settings, and then selected the most probable translation of each sentence.

Many teams who implemented the standard stack decoding algorithm experimented heavily with its pruning parameters. The best submission used extremely wide beam settings in conjunction with a reimplementation of the future cost estimate used in Moses (Koehn et al., 2007). Five of the submissions beat Moses using its standard beam settings after it had been configured to decode with our model.

We used this assignment to emphasize the importance of good models: the model score of the submissions was generally inversely correlated with BLEU, possibly because our simple model had no distortion limits. We used this to illustrate the difference between model error and search error, including fortuitous search error (Germann et al., 2001) made by decoders with less accurate search.
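The future-cost estimate mentioned above can be precomputed with a small dynamic program over source spans. The following is a minimal sketch in the spirit of the heuristic described by Koehn (2010), not any team's submission; it assumes the tm/lm interface of Listing 2 and deliberately scores each phrase's English words without the context of neighbouring phrases.

def future_cost(f, tm, lm):
    # cost[i][j] = optimistic estimate of the best log-probability of
    # translating f[i:j] in isolation (phrase score plus context-free LM score)
    n = len(f)
    cost = [[float('-inf')] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            for phrase in tm.get(f[i:j], []):
                # conditioning on the sentence-start state is a crude
                # approximation of "no context"
                lm_state, estimate = lm.begin(), phrase.logprob
                for word in phrase.english.split():
                    (lm_state, word_logprob) = lm.score(lm_state, word)
                    estimate += word_logprob
                cost[i][j] = max(cost[i][j], estimate)
    # combine adjacent spans: best segmentation of each longer span
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                cost[i][j] = max(cost[i][j], cost[i][k] + cost[k][j])
    return cost

During pruning, a hypothesis that still has f[i:j] uncovered can then be compared to others by adding cost[i][j] to its current score, so that partial hypotheses covering easy and hard spans compete fairly.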
5 The Evaluation Challenge

The third challenge was evaluation: given a test corpus with reference translations and the output of several MT systems, students were challenged to produce a ranking of the systems that closely correlated with a human ranking.

5.1 Data

We chose the English-to-German translation systems from the 2009 and 2011 shared tasks at the annual Workshop on Statistical Machine Translation (Callison-Burch et al., 2009; Callison-Burch et al., 2011), providing the first as development data and the second as test data. We chose these sets because BLEU (Papineni et al., 2002), our baseline metric, performed particularly poorly on them; this left room for improvement in addition to highlighting some deficiencies of BLEU. For each dataset we provided the source and reference sentences along with anonymized system outputs. For the development data we also provided the human ranking of the systems, computed from pairwise human judgements according to a formula recommended by Bojar et al. (2011).12

12 This ranking has been disputed over a series of papers (Lopez, 2012; Callison-Burch et al., 2012; Koehn, 2012). The paper which initiated the dispute, written by the first author, was directly inspired by the experience of designing this assignment.

5.2 Implementation

We provided three simple Python programs: evaluate implements a simple ranking of the systems based on position-independent word error rate (PER; Tillmann et al., 1997), which computes a bag-of-words overlap between the system translations and the reference. The grade program computes Spearman's ρ between the human ranking and an output ranking. The check program simply ensures that a submission contains a valid ranking.

We were concerned about hill-climbing on the test data, so we modified the leaderboard to report new results only twice a day. This encouraged students to experiment on the development data before posting new submissions, while still providing intermittent feedback.
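For reference, the ranking computed by the evaluate baseline can be sketched as follows. This uses one common formulation of PER; the released code may compute the overlap slightly differently, so treat the details as assumptions.

from collections import Counter

def per(hyp, ref):
    # position-independent word error rate over tokenized sentences:
    # bag-of-words matches, penalized for length mismatch
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return 1.0 - float(matches - max(0, len(hyp) - len(ref))) / len(ref)

def rank_systems(systems, references):
    # systems: dict mapping system name -> list of tokenized hypotheses
    # references: list of tokenized reference sentences
    score = {name: sum(per(h, r) for h, r in zip(hyps, references)) / len(references)
             for name, hyps in systems.items()}
    return sorted(score, key=score.get)   # lowest (best) average PER first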
We privately implemented a version of BLEU, which obtained a correlation of 38.6 with the human rankings, a modest improvement over the baseline of 34.0. Our implementation underperforms the one reported in Callison-Burch et al. (2011) since it performs no tokenization or normalization of the data. This also left room for improvement.

Figure 3: Submission history for the evaluation challenge (Spearman's ρ against days until the deadline).

5.3 Evaluation Challenge Results

We received 212 submissions from 12 teams (Figure 3), again demonstrating a wide range of techniques:

• Experimentation with the maximum n-gram length and weights in BLEU.
• Implementation of smoothed versions of BLEU (Lin and Och, 2004), as sketched at the end of this section.
• Implementation of weighted F-measure to balance both precision and recall.
• Careful normalization of the reference and machine translations, including lowercasing and punctuation-stripping.
• Implementation of several techniques used in AMBER (Chen and Kuhn, 2005).

The best submission, obtaining a correlation of 83.5, relied on the idea that the reference and machine translation should be good paraphrases of each other (Owczarzak et al., 2006; Kauchak and Barzilay, 2006). It employed a simple paraphrase system trained on the alignment challenge data, using the pivot technique of Bannard and Callison-Burch (2005), and computing the optimal alignment between machine translation and reference under a simple model in which words could align if they were paraphrases. When compared with the 20 systems submitted to the original task from which the data was obtained (Callison-Burch et al., 2011), this system would have ranked fifth, quite near the top-scoring competitors, whose correlations ranged from 88 to 94.
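As an illustration of one of the simpler strategies above, sentence-level BLEU can be smoothed so that it never collapses to zero on short segments. The sketch below follows the spirit of add-one smoothing of n-gram precisions (Lin and Och, 2004); it is not any particular student's submission, and a metric built on it would typically average the score over the corpus to rank systems.

from collections import Counter
from math import exp, log

def ngrams(words, n):
    return Counter(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

def smoothed_bleu(hyp, ref, max_n=4):
    # sentence-level BLEU with add-one smoothed n-gram precisions and the
    # usual brevity penalty; hyp and ref are lists of tokens
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum((h & r).values())
        total = max(len(hyp) - n + 1, 0)
        log_prec += log((matches + 1.0) / (total + 1.0)) / max_n
    bp = min(1.0, exp(1.0 - float(len(ref)) / len(hyp))) if len(hyp) > 0 else 0.0
    return bp * exp(log_prec)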
6 The Reranking Challenge

The fourth challenge was reranking: given a test corpus and a large N-best list of candidate translations for each sentence, students were challenged to select a candidate translation for each sentence to produce a high corpus-level BLEU score. Due to an error in our data preparation, this assignment had a simple solution that was very difficult to improve on. Nevertheless, it featured several elements that may be useful for future courses.

6.1 Data

We obtained 300-best lists from a Spanish-English translation system built with the Joshua toolkit (Ganitkevitch et al., 2012) using data and resources from the 2011 Workshop on Statistical Machine Translation (Callison-Burch et al., 2011). We provided 1989 training sentences, consisting of source and reference sentences along with the candidate translations. We also included a test set of 250 sentences, for which we provided only the source and candidate translations. Each candidate translation included six features from the underlying translation system, out of an original 21; our hope was that students might rediscover some features through experimentation.

6.2 Implementation

We conceived of the assignment as one in which students could apply machine learning or feature engineering to the task of reranking the candidate translations, so we provided several tools. The first of these, learn, was a simple program that produced a vector of feature weights using pairwise ranking optimization (PRO; Hopkins and May, 2011), with a perceptron as the underlying learning algorithm. A second, rerank, takes a weight vector as input and reranks the sentences; both programs were designed to work with arbitrary numbers of features. The grade program computes the BLEU score on development data, while check ensures that a test submission is valid. Finally, we provided an oracle program, which computes a lower bound on the achievable BLEU score on the development data using a greedy approximation (Och et al., 2004). The leaderboard likewise displayed an oracle on test data. We did not assign a target algorithm, but left the assignment fully open-ended.

6.3 Reranking Challenge Outcome

For each assignment, we made an effort to create room for competition above the target algorithm. However, we did not accomplish this in the reranking challenge: we had removed most of the features from the candidate translations, in hopes that students might reinvent some of them, but we left one highly predictive implicit feature in the data: the rank order of the underlying translation system. Students discovered that simply returning the first candidate earned a very high score, and most of them quickly converged to this solution. Unfortunately, the high accuracy of this baseline left little room for additional competition. Nevertheless, we were encouraged that most students discovered this by accident while attempting other strategies to rerank the translations, including:

• Experimentation with parameters of the PRO algorithm.
• Substitution of alternative learning algorithms.
• Implementation of a simplified minimum Bayes risk reranker (Kumar and Byrne, 2004), sketched below.

Over a baseline of 24.02, the last of these approaches obtained a BLEU score of 27.08, nearly matching the score of 27.39 from the underlying system despite an impoverished feature set.
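The minimum Bayes risk reranker in the last bullet can be sketched in a few lines. A full MBR reranker weights each candidate by its posterior probability under the model; the uniform-posterior "consensus" version below already captures the idea, and is an illustration rather than the students' actual submission. The names similarity and nbest_lists are hypothetical.

def mbr_select(candidates, similarity):
    # Simplified minimum Bayes risk selection over an N-best list, in the
    # spirit of Kumar and Byrne (2004): with a uniform posterior, minimizing
    # expected loss (1 - similarity) is equivalent to picking the candidate
    # with the highest total similarity to the other candidates. The
    # similarity function could be a sentence-level BLEU, such as the
    # smoothed variant sketched in Section 5.
    def consensus(candidate):
        return sum(similarity(candidate, other)
                   for other in candidates if other is not candidate)
    return max(candidates, key=consensus)

# usage sketch: pick one translation per source sentence from its 300-best list
# selected = [mbr_select(nbest, smoothed_bleu) for nbest in nbest_lists]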
7 Pedagogical Outcomes

Could our students have obtained similar results by running standard toolkits? Undoubtedly. However, our goal was for students to learn by doing: they obtained these results by implementing key MT algorithms, observing their behavior on real data, and improving them. This left them with much more insight into how MT systems actually work, and in this sense, DREAMT was a success.

At the end of class, we requested written feedback on the design of the assignments. Many commented positively on the motivation provided by the challenge problems:

• The immediate feedback of the automatic grading was really nice.
• Fast feedback on my submissions and my relative position on the leaderboard kept me both motivated to start the assignments early and to constantly improve them. Also knowing how well others were doing was a good way to gauge whether I was completely off track or not when I got bad results.
• The homework assignments were very engaging thanks to the clear yet open-ended setup and their competitive aspects.

Students also commented that they learned a lot about MT and even research in general:

• I learned the most from the assignments.
• The assignments always pushed me one step more towards thinking out loud how the particular task can be completed.
• I appreciated the setup of the homework problems. I think it has helped me learn how to set up and attack research questions in an organized way. I have a much better sense for what goes into an MT system and what problems aren't solved.

We also received feedback through an anonymous survey conducted at the end of the course before posting final grades. Each student rated aspects of the course on a five-point Likert scale, from 1 (strongly disagree) to 5 (strongly agree). Several questions pertained to the assignments (Table 1), and the responses allay two possible concerns about competition: most students felt that the assignments enhanced their collaborative skills, and that their open-endedness did not result in an overload of work. For all survey questions, student satisfaction was higher than average for courses in our department.

Question                                                                        1   2   3   4   5   N/A
Feedback on my work for this course is useful                                   -   -   -   4   9   3
This course enhanced my ability to work effectively in a team                   1   -   5   8   2   -
Compared to other courses at this level, the workload for this course is high   -   1   7   6   1   1

Table 1: Response to student survey questions on a Likert scale from 1 (strongly disagree) to 5 (strongly agree).

8 Discussion

DREAMT is inspired by several different approaches to teaching NLP, AI, and computer science. Eisner and Smith (2008) teach NLP using a competitive game in which students aim to write fragments of English grammar. Charniak et al. (2000) improve the state of the art in a reading comprehension task as part of a group project. Christopher et al. (1993) use NACHOS, a classic tool for teaching operating systems by providing a rudimentary system that students then augment. DeNero and Klein (2010) devise a series of assignments based on Pac-Man, for which students implement several classic AI techniques. A crucial element in such approaches is a highly functional but simple scaffolding. The DREAMT codebase, including grading and validation scripts, consists of only 656 lines of code (LOC) over four assignments: 141 LOC for alignment, 237 LOC for decoding, 86 LOC for evaluation, and 192 LOC for reranking. To simplify implementation further, the optional leaderboard could be delegated to Kaggle.com, a company that organizes machine learning competitions using a model similar to the Netflix Prize (Bennet and Lanning, 2007), and offers pro bono use of its services for educational challenge problems. A recent machine learning class at Oxford hosted its assignments on Kaggle (Phil Blunsom, personal communication).

We imagine other uses of DREAMT. It could be used in an inverted classroom, where students view lecture material outside of class and work on practical problems in class. It might also be useful in massive open online courses (MOOCs). In this format, course material (primarily lectures and quizzes) is distributed over the internet to an arbitrarily large number of interested students through sites such as coursera.org, udacity.com, and khanacademy.org. In many cases, material and problem sets focus on specific techniques. Although this is important, there is also a place for open-ended problems on which students apply a full range of problem-solving skills. Automatic grading enables them to scale easily to large numbers of students.

On the scientific side, the scale of MOOCs might make it possible to empirically measure the effectiveness of hands-on or competitive assignments, by comparing the course performance of students who work on them against that of those who do not.
Though there is some empirical work on competitive assignments in the computer science education literature (Lawrence, 2004; Garlick and Akl, 2006; Regueras et al., 2008; Ribeiro et al., 2009), these studies generally measure student satisfaction and retention rather than the more difficult question of whether such assignments actually improve student learning. However, it might be feasible to answer such questions in large, data-rich virtual classrooms offered by MOOCs. This is an interesting potential avenue for future work.

Because our class came within reach of the state of the art on each problem within a matter of weeks, we wonder what might happen with a very large body of competitors. Could real innovation occur? Could we solve large-scale problems? It may be interesting to adopt a different incentive structure, such as one posed by Abernethy and Frongillo (2011) for crowdsourcing machine learning problems: rather than competing, everyone works together to solve a shared task, with credit awarded proportional to the contribution that each individual makes. In this setting, everyone stands to gain: students learn to solve problems as they are found in the real world, instructors learn new insights into the problems they pose, and, in the long run, users of AI technology benefit from overall improvements. Hence it is possible that posing open-ended, real-world problems to students might be a small piece of the puzzle of providing high-quality NLP technologies.

Acknowledgments

We are grateful to Colin Cherry and Chris Dyer for testing the assignments in different settings and providing valuable feedback, and to Jessie Young for implementing a dual decomposition solution to the decoding assignment. We thank Jason Eisner, Frank Ferraro, Yoav Goldberg, Matt Gormley, Ann Irvine, Rebecca Knowles, Ben Mitchell, Courtney Napoles, Michael Rushanan, Joanne Selinski, Svitlana Volkova, and the anonymous reviewers for lively discussion and helpful comments on previous drafts of this paper. Any errors are our own.

References

J. Abernethy and R. M. Frongillo. 2011. A collaborative mechanism for crowdsourcing prediction problems. In Proc. of NIPS.
C. Bannard and C. Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proc. of ACL.
J. Bennet and S. Lanning. 2007. The Netflix Prize. In Proc. of the KDD Cup and Workshop.
S. Bird, E. Klein, E. Loper, and J. Baldridge. 2008. Multidisciplinary instruction with the Natural Language Toolkit. In Proc. of Workshop on Issues in Teaching Computational Linguistics.
O. Bojar, M. Ercegovčević, M. Popel, and O. Zaidan. 2011. A grain of salt for the WMT manual evaluation. In Proc. of WMT.
P. E. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).
C. Callison-Burch, P. Koehn, C. Monz, and J. Schroeder. 2009. Findings of the 2009 workshop on statistical machine translation. In Proc. of WMT.
C. Callison-Burch, P. Koehn, C. Monz, and O. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proc. of WMT.
C. Callison-Burch, P. Koehn, C. Monz, M. Post, R. Soricut, and L. Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proc. of WMT.
Y.-W. Chang and M. Collins. 2011. Exact decoding of phrase-based translation models through Lagrangian relaxation. In Proc. of EMNLP.
E. Charniak, Y. Altun, R. de Salvo Braz, B. Garrett, M. Kosmala, T. Moscovich, L. Pang, C. Pyo, Y. Sun, W. Wy, Z. Yang, S. Zeiler, and L. Zorn. 2000. Reading comprehension programs in a statistical-language-processing class. In Proc. of Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems.
B. Chen and R. Kuhn. 2005. AMBER: A modified BLEU, enhanced ranking metric. In Proc. of WMT.
D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2).
W. A. Christopher, S. J. Procter, and T. E. Anderson. 1993. The Nachos instructional operating system. In Proc. of USENIX.
J. DeNero and D. Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL.
J. DeNero and D. Klein. 2010. Teaching introductory artificial intelligence with Pac-Man. In Proc. of Symposium on Educational Advances in Artificial Intelligence.
L. R. Dice. 1945. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302.
C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL.
J. Eisner and N. A. Smith. 2008. Competitive grammar writing. In Proc. of Workshop on Issues in Teaching Computational Linguistics.
A. Fraser and D. Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3).
J. Ganitkevitch, Y. Cao, J. Weese, M. Post, and C. Callison-Burch. 2012. Joshua 4.0: Packing, PRO, and paraphrases. In Proc. of WMT.
R. Garlick and R. Akl. 2006. Intra-class competitive assignments in CS2: A one-year study. In Proc. of International Conference on Engineering Education.
U. Germann, M. Jahr, K. Knight, D. Marcu, and K. Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of ACL.
L. Huang and D. Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proc. of ACL.
M. Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP.
D. Jurafsky and J. H. Martin. 2009. Speech and Language Processing. Prentice Hall, 2nd edition.
D. Kauchak and R. Barzilay. 2006. Paraphrasing for automatic evaluation. In Proc. of HLT-NAACL.
D. Klein. 2005. A core-tools statistical NLP course. In Proc. of Workshop on Effective Tools and Methodologies for Teaching NLP and CL.
K. Knight. 1999a. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4).
K. Knight. 1999b. A statistical MT tutorial workbook.
P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL.
P. Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA.
P. Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
P. Koehn. 2012. Simulating human judgment in machine translation evaluation campaigns. In Proc. of IWSLT.
S. Kumar and W. Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proc. of HLT-NAACL.
P. Langlais, A. Patry, and F. Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In Proc. of TMI.
R. Lawrence. 2004. Teaching data structures using competitive games. IEEE Transactions on Education, 47(4).
P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In Proc. of NAACL.
C.-Y. Lin and F. J. Och. 2004. ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proc. of COLING.
Y. Liu, Q. Liu, and S. Lin. 2010. Discriminative word alignment by linear modeling. Computational Linguistics, 36(3).
A. Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In Proc. of EMNLP.
A. Lopez. 2008. Statistical machine translation. ACM Computing Surveys, 40(3).
A. Lopez. 2012. Putting human assessments of machine translation systems in order. In Proc. of WMT.
N. Madnani and B. Dorr. 2008. Combining open-source with research to re-engineer a hands-on introductory NLP course. In Proc. of Workshop on Issues in Teaching Computational Linguistics.
I. D. Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2).
R. Mihalcea and T. Pedersen. 2003. An evaluation exercise for word alignment. In Proc. of Workshop on Building and Using Parallel Texts.
R. C. Moore. 2004. Improving IBM word alignment model 1. In Proc. of ACL.
F. J. Och and H. Ney. 2000. Improved statistical alignment models. In Proc. of ACL.
F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29.
F. J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev. 2004. A smorgasbord of features for statistical machine translation. In Proc. of NAACL.
K. Owczarzak, D. Groves, J. V. Genabith, and A. Way. 2006. Contextual bitext-derived paraphrases in automatic MT evaluation. In Proc. of WMT.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL.
L. Regueras, E. Verdú, M. Verdú, M. Pérez, J. de Castro, and M. Muñoz. 2008. Motivating students through on-line competition: An analysis of satisfaction and learning styles.
P. Ribeiro, M. Ferreira, and H. Simões. 2009. Teaching artificial intelligence and logic programming in a competitive environment. Informatics in Education, 8(1):85.
J. Spacco, D. Hovemeyer, W. Pugh, J. Hollingsworth, N. Padua-Perez, and F. Emad. 2006. Experiences with Marmoset: Designing and using an advanced submission and testing system for programming courses. In Proc. of Innovation and Technology in Computer Science Education.
A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proc. of ICSLP.
C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, and H. Sawaf. 1997. Accelerated DP based search for statistical translation. In Proc. of European Conf. on Speech Communication and Technology.
M. Zaslavskiy, M. Dymetman, and N. Cancedda. 2009. Phrase-based statistical machine translation as a traveling salesman problem. In Proc. of ACL.