Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation Benjamin Marie Atsushi Fujita National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan {bmarie, atsushi.fujita}@nict.go.jp Abstract We present a new framework to induce an in- domain phrase table from in-domain monolin- gual data that can be used to adapt a general- domain statistical machine translation system to the targeted domain. Our method first compiles sets of phrases in source and target languages separately and generates candidate phrase pairs by taking the Cartesian product of the two phrase sets. It then computes in- expensive features for each candidate phrase pair and filters them using a supervised clas- sifier in order to induce an in-domain phrase table. We experimented on the language pair English–French, both translation directions, in two domains and obtained consistently better results than a strong baseline system that uses an in-domain bilingual lexicon. We also con- ducted an error analysis that showed the in- duced phrase tables proposed useful transla- tions, especially for words and phrases unseen in the parallel data used to train the general- domain baseline system. 1 Introduction In phrase-based statistical machine translation (SMT), translation models are estimated over a large amount of parallel data. In general, using more data leads to a better translation model. When no specific domain is targeted, general-domain1 par- allel data from various domains may be used to 1As in Axelrod et al. (2011), in this paper, we use the term general-domain instead of the commonly used out-of-domain because we assume that the parallel data may contain some in- domain sentence pairs. train a general-purpose SMT system. However, it is well-known that, in training a system to trans- late texts from a specific domain, using in-domain parallel data can lead to a significantly better trans- lation quality (Carpuat et al., 2012). Indeed, when only general-domain parallel data are used, it is un- likely that the translation model can learn expres- sions and their translations specific to the targeted domain. Such expressions will then remain untrans- lated in the in-domain texts to translate. So far, in-domain parallel data have been har- nessed to cover domain-specific expressions and their translations in the translation model. However, even if we can assume the availability of a large quantity of general-domain parallel data, at least for resource-rich language pairs, finding in-domain par- allel data specific to a particular domain remains challenging. In-domain parallel data may not exist for the targeted language pairs or may not be avail- able at hand to train a good translation model. In order to circumvent the lack of in-domain par- allel data, this paper presents a new method to adapt an existing SMT system to a specific domain by in- ducing an in-domain phrase table, i.e., a set of phrase pairs associated with features for decoding, from in- domain monolingual data. As we review in Sec- tion 2, most of the existing methods for inducing phrase tables are not designed, and may not perform as expected, to induce a phrase table for a specific domain for which only limited resources are avail- able. Instead of relying on large quantity of parallel data or highly comparable corpora, our method in- duces an in-domain phrase table from unaligned in- domain monolingual data through a three-step pro- 487 Transactions of the Association for Computational Linguistics, vol. 5, pp. 487–500, 2017. Action Editor: Chris Quirk. Submission batch: 3/2017; Published 11/2017. c©2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. cedure: phrase collection, phrase pair scoring, and phrase pair filtering. Incorporating our induced in- domain phrase table into an SMT system achieves substantial improvements in translating in-domain texts over a strong baseline system, which uses an in-domain bilingual lexicon. To achieve this improvement, our proposed method for inducing an in-domain phrase table ad- dresses several limitations of previous work by: • dealing with source and target phrases of arbi- trary length collected from in-domain monolin- gual data, • proposing translations for not only unseen source phrases, but also those already seen in the general-domain parallel data, and • making use of potentially many features com- puted from the monolingual data, as well as from the parallel data, in order to score and fil- ter the candidate phrase pairs. In the remainder of this paper, we first review previous work in Section 2, highlighting the main weaknesses of existing methods for inducing a phrase table for domain adaptation, and our moti- vation. In Section 3, we then present our phrase ta- ble induction method with all the necessary steps: phrase collection (Section 3.1), computing features of each phrase pair (Section 3.2), and pruning the induced phrase tables to keep their size manageable (Section 3.3). In Section 4, we describe our exper- iments to evaluate the impact of the induced phrase tables in translating in-domain texts. Following the description of the data (Section 4.1), we explain the tools and parameters used to induce the phrase tables (Section 4.2), our SMT systems (Section 4.3), and present additional baseline systems (Section 4.4). Our experimental results are given in Section 4.5. Section 5.1 analyzes the error distribution of the translations produced by an SMT system using our induced phrase table, followed by translation exam- ples to further illustrate its impact in Section 5.2. Fi- nally, Section 6 concludes this work and proposes some possible improvements to our approach. 2 Motivation In machine translation (MT), words and phrases that do not appear in the training parallel data, i.e., out- of-vocabulary (OOV) tokens, have been recognized as one of the fundamental issues, regardless of the scenario, such as adapting existing SMT systems to a new specific domain. One straightforward way to find translations of OOV words and phrases consists in enlarging the parallel data used to train the translation model. This can be done by retrieving parallel sentences from comparable corpora. However, these methods heav- ily rely on document-level information (Zhao and Vogel, 2002; Utiyama and Isahara, 2003; Fung and Cheung, 2004; Munteanu and Marcu, 2005) to re- duce their search space by scoring only sentence pairs extracted from each pair of documents. In- deed, scoring all possible sentence pairs from two large monolingual corpora using costly features and a classifier, as proposed by Munteanu and Marcu (2005) for instance, is computationally too expen- sive.2 In many cases, we may not have access to document-level information in the given monolin- gual data for the targeted domain. Furthermore, even without considering computational cost, it is unlikely that a large number of parallel sentences can be retrieved from non-comparable monolingual corpora. Hewavitharana and Vogel (2016) proposed to directly extract phrase pairs from comparable sen- tences. However, the number of retrievable phrase pairs is strongly limited, because one can collect such comparable sentences only on a relatively small scale for the targeted language pairs and domains. When in-domain parallel or comparable sentences can not be easily retrieved, another possibility to find translations for OOV words is bilingual word lexi- con induction using comparable or unaligned mono- lingual corpora (Fung, 1995; Rapp, 1995; Koehn and Knight, 2002; Haghighi et al., 2008; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013). This approach is especially useful in finding words and their translations specific to the given cor- pus. A recent and completely different trend of work uses an unsupervised method regarding translation as a decipherment problem to learn a bilingual word lexicon and use it as a translation model (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012). However, all these methods deal only with 2For instance, using these approaches on source and target monolingual data containing both 5 millions sentences means that we have to evaluate 25×1012 candidate sentence pairs. 488 words, mainly owing to the computational complex- ity of dealing with arbitrary lengths of phrases. Translations of phrases can be induced using bilingual word lexicons and considering permuta- tions of word ordering (Zhang and Zong, 2013; Irvine and Callison-Burch, 2014). However, it is costly to thoroughly investigate all combinations of a large number of word-level translation candi- dates and possible permutations of word ordering. To retain only appropriate phrase pairs, Irvine and Callison-Burch (2014) proposed to exploit a set of features. Some of them, including temporal, contex- tual, and topic similarity features, strongly relied on the comparability of Wikipedia articles and on the availability of news articles annotated with a times- tamp (Klementiev et al., 2012). We may not have such useful resources in large quantity for the tar- geted language pairs and domains. Saluja et al. (2014) and Zhao et al. (2015) also proposed methods to induce a phrase table, focus- ing only on the OOV words and phrases: unigrams and bigrams in the source side of their development and test data that are unseen in the training data. In their approach, no new translation options are proposed for known source phrases. To generate candidate phrase pairs, for a given source phrase, Saluja et al. (2014) uses only phrases from the tar- get side of their parallel data and their morphologi- cal variants ranked and pruned according to the for- ward lexical translation probabilities given by their baseline system’s translation model. Their approach thus strongly relies on the accuracy of the exist- ing translation model. For instance, if the given source phrase contains only OOV tokens, as it may happen when translating a text from a different do- main, their approach cannot retrieve candidate tar- get phrases. Furthermore, they do not make use of external monolingual data to explore unseen target phrases. Their method is consequently inadequate to produce translations for phrases from a different domain than the one of the parallel data. While Saluja et al. (2014) used a costly graph propagation strategy to score the candidate phrase pairs, Zhao et al. (2015) used a method with a much lower computational cost and reported higher BLEU scores using only word embeddings to score and rank many phrase pairs generated from tar- get phrases, unigrams and bigrams, collected from monolingual corpora. The main contribution of Zhao et al. (2015) is the use of a local linear projec- tion strategy (LLP) to obtain a cross-lingual seman- tic similarity score for each phrase pair. It makes the projection of source embeddings to the target em- beddings space by learning a translation matrix for each source phrase embedding, trained on m gold phrase pairs with source phrase embeddings simi- lar to the one to project. After the projection, based only on the similarity over embeddings, the k near- est target phrases of the projected source phrase are retrieved. If the projection for a given source phrase is not accurate enough, very noisy phrase pairs are generated. This may be a problem especially when the given source phrase does not need to be trans- lated (i.e., numbers, dates, molecule names, etc.). The system will translate it, because this source phrase previously OOV is now registered in its in- duced phrase table, but has only wrong translations available (see Section 4.5 for empirical evidences). 3 In-domain phrase table induction To induce an in-domain phrase table, our approach assumes the availability of large general-domain parallel data and in-domain monolingual data of both source and target languages. For some of our configurations, we also assume the availability of an in-domain bilingual lexicon to compute features as- sociated with each candidate phrase pair and to com- pute a reliability score to filter appropriate ones. 3.1 In-domain phrase collection In a standard configuration, SMT systems extract phrases of a length up to six or seven tokens. Col- lecting all the n-grams of such a length from a given large monolingual corpus is feasible, but will pro- vide a large set of source and target phrases, re- sulting in an enormous number of candidate phrase pairs. In the next step, we evaluate each candidate in a given set of phrase pairs; it is thus crucial to get a reasonably small set of phrases. In contrast with previous work, we collect more meaningful phrases than arbitrary short n-grams, us- ing the following formula presented by Mikolov et al. (2013a): score(wiwj) = freq(wiwj)−δ freq(wi)× freq(wj) 489 where wi and wj are two consecutive tokens, freq(·) the frequency of a given word or phrase in the given monolingual corpus, and δ a discounting coefficient that prevents the retrieval of many phrases composed from infrequent words. Each bigram wiwj in the monolingual corpus is scored with this formula and only the bigrams with a score above a predefined threshold θ are regarded as phrases. All the iden- tified phrases are transformed into one token,3 and a new pass is performed over the monolingual corpus to obtain new phrases also using the phrases identi- fied in the previous passes. To further limit the num- ber of collected phrases, we consider only phrases containing words that appear at least K times in the monolingual data. After T passes, we compile a set of phrases with (a) all the single words and (b) all the phrases with a length of up to L tokens identi- fied during each pass. Standard SMT systems for close languages di- rectly output OOV tokens in the translation. To be as good as such systems, our approach must be able to retrieve the right translation, especially for the many domain-specific words and phrases that are identical in both source and target languages. To ensure that a source phrase that must remain untranslated has its identity in the target phrase set, we explicitly add in the target phrase set all the source phrases that also appear in the target monolingual data. 3.2 Feature engineering Given two sets of phrases, for the source and target languages, respectively, we regard all possible com- binations of source and target phrases as candidate phrase pairs. This naive coupling imperatively gen- erates a large number of pairs that are mostly noise. Thus, the challenge here is to effectively estimate the reliability of each pair. This section describes sev- eral features to characterize each phrase pair; they are used for evaluating phrase pairs and also added in the induced phrase table to guide the decoder. 3.2.1 Cross-lingual semantic similarity Many researchers tackled the problem of esti- mating cross-lingual semantic similarity between pairs of words or phrases by using their embeddings (Mikolov et al., 2013a; Chandar et al., 2014; Faruqui 3This transformation is performed by simply replacing the space between the two tokens with an underscore. and Dyer, 2014; Coulmance et al., 2015; Gouws et al., 2015; Duong et al., 2016) in combination with either a seed bilingual lexicon or a set of parallel sentence pairs. We estimate monolingual phrase embeddings via the element-wise addition of the word embeddings composing the phrase. This method performs well to estimate phrase embeddings (Mitchell and Lap- ata, 2010; Mikolov et al., 2013a), despite its simplic- ity and relatively low computational cost compared to state-of-the-art methods based on neural networks (Socher et al., 2013a; Socher et al., 2013b) or rich features (Lazaridou et al., 2015). This low computa- tional cost is crucial in our case, as we need to eval- uate a large number of candidate phrase pairs. In order to make source and target phrase em- beddings comparable, we perform a linear projec- tion (Mikolov et al., 2013a) of the embeddings of source phrases to the target embedding space. To learn the projection, we use the method of Mikolov et al. (2013a) with the only exception that we deal with not only words but also phrases. Given train- ing data, i.e., a gold bilingual lexicon, we obtain a translation matrix Ŵ by solving the following opti- mization problem with stochastic gradient descent: Ŵ = arg min W ∑ i ||Wxi −zi||2 where xi is the source phrase embedding of the i-th training data, zi the target phrase embedding of the corresponding gold translation, and W the transla- tion matrix used to project xi such that Wxi is as close as possible to zi in the target embedding space. One important parameter here is the number of di- mensions of word/phrase embeddings. This can be different for the source and target embeddings, but must be smaller than the number of phrase pairs in the training data; otherwise the equation is not solv- able. See Section 4.1 for the details about the bilin- gual lexicon used in our experiment. Given a phrase pair to evaluate, the source phrase embedding is projected to the target embedding space, using Ŵ . Then, we compute the cosine simi- larity between the projected source phrase embed- ding and the target phrase embedding to evaluate the semantic similarity between these phrases; this seems to give satisfying results in this cross-lingual scenario as shown by Mikolov et al. (2013a). A 490 translation matrix is trained for each translation di- rection f → e and e → f, respectively, so that we have two cross-lingual semantic similarity features for each phrase pair. 3.2.2 Lexical translation probabilities We assume the existence of a large amount of general-domain parallel data, and train a regular translation model with lexical translation proba- bilities in an ordinary way. Although in-domain phrases are likely to contain tokens that are unseen in the general-domain parallel data, lexical transla- tion probabilities may be useful to score candidate pair of source and target phrases that contain tokens seen in the general-domain parallel data. To com- pute a phrase-level score, for a target phrase e given a source phrase f, we consider all possible word alignments as follows: Plex(e|f) = 1 I I∑ i=1 log (1 J J∑ j=1 p(ei|fj) ) where I and J are the lengths of e and f, respec- tively, and p(ei|fj) the lexical translation probability of the i-th target word ei of e given the j-th source word fj of f. Such phrase-level lexical translation probabilities are computed for both translation di- rections giving us two features. 3.2.3 Other features As demonstrated by previous work (Irvine and Callison-Burch, 2014; Irvine and Callison-Burch, 2016), features based on the frequency of the phrases in the monolingual data may help us to bet- ter score a phrase pair. We add as features the in- versed frequency of the source and target phrases in the in-domain monolingual data, along with their relative difference given by the following formula: simf(e,f) = ∣∣∣∣∣log (freq(e) Ne ) − log (freq(f) Nf )∣∣∣∣∣ where Nx stands for the number of tokens in the in- domain monolingual data of the corresponding lan- guage. The surface-level similarity of source and target phrases can also be a strong clue when considering the translation between two languages that are rela- tively close. We investigate two features concerning this: the first feature is the Levenshtein distance be- tween the two phrases calculated regarding words as units,4 while the other is a binary feature that fires if the two phrases are identical. We shall ex- pect both features to be very useful in cases where many domain-specific words and phrases are writ- ten in the same way in two languages; for instance, drug and molecule names in the medical domain in French and English. We also add as features the lengths of the source and target phrases, i.e., I and J, and their ratio. Using all the above 12 features, the overall score for each pair is given by a classifier as described in Section 3.3; this score is also added as a feature in the induced phrase table for decoding. 3.3 Phrase pair filtering As mentioned above, phrase pairs so far generated are mostly noise. To reduce the decoder’s search space when using our induced phrase table, we rad- ically filter out inappropriate pairs. Each candidate phrase pair is assessed by the method proposed in Irvine and Callison-Burch (2013), which predicts whether a pair of words are translations of one an- other using a classifier. As training examples, we use a bilingual lexicon as positive examples and ran- domly associated phrase pairs from our phrase sets as negative examples. For classification, we use all the features presented in Section 3.2. We use the score given by the classifier to rank the target phrases for each source phrase. Only the target phrases with the top n scores are kept in the final induced phrase table. 4 Experiments This section demonstrates the impact of the in- duced phrase tables in translating in-domain texts in three configurations. In the first configuration (Conf. 1), we evaluated whether our induced phrase table improves the translation of in-domain texts over the vanilla SMT system which used only one phrase table trained from general-domain parallel 4Here we did not use the character-level edit distance to measure the orthographic similarity between phrases. Even though such a feature may be useful (Koehn and Knight, 2002), its computational cost is too high to deal efficiently with billions of phrase pairs. 491 data. We then evaluated, in the second configura- tion (Conf. 2), whether our induced phrase table is also beneficial when used in an SMT system that already incorporates an in-domain bilingual lexicon that could be created manually or induced by some of the methods mentioned in Section 2. Finally, we evaluated in complementary experiments (Conf. 3) whether our induced phrase table can also offer use- ful information to improve translation quality even when used in combination with another standard phrase table generated from in-domain parallel data. 4.1 Data Since our approach assumes the availability of large- scale general-domain parallel and monolingual cor- pora, we considered the French–English language pair and both translation directions for our experi- ments. The French–English version of the Europarl parallel corpus5 was regarded as a general-domain, and not strictly out-of-domain, corpus because many debates can be associated to a specific domain and can contain phrases specific to particular domains. As general-domain monolingual data, we used the concatenation of one side of Europarl and the 2007– 2014 editions of News Crawl corpora6 in the same language. We focused on two domains: medical (EMEA) and science (Science). For both domains, we used the development and test sets provided for a work- shop on domain adaptation of MT (Carpuat et al., 2012).7 We also used the provided in-domain par- allel data for training but regarded only the target side as monolingual data. Since our primary ob- jective is the induction of a phrase table without using in-domain parallel data, the source side of the in-domain parallel data was not used as a part of the source in-domain monolingual data, except when training an ordinary in-domain phrase table in Conf. 3. As medical domain monolingual data for the EMEA translation task, we used the French and English monolingual medical data provided for the WMT’14 medical translation task.8 None of 5http://statmt.org/europarl/, release 7 6http://statmt.org/wmt15/ translation-task.html 7http://hal3.name/damt/ 8http://www.statmt.org/wmt14/ medical-task/ Domain Data # sent. # tok. (En-Fr) EMEA development 2,022 28k-32k test 2,045 25k-29k parallel 472k 6M-7M monolingual 275M-255M Science development 1,990 52k-65k test 1,982 52k-65k parallel 66k 2M-2M monolingual 82M-2M General parallel 2M 54M-60M monolingual 2.8B-1.1B Table 1: Statistics on train, development, and test data. the parallel corpora provided for the WMT’14 med- ical translation task was used. As science domain monolingual data for the Science translation task, we used the English side of the ASPEC parallel cor- pus (Nakazawa et al., 2016).9 Unfortunately, we did not find any French monolingual corpora pub- licly available for the Science domain that were suf- ficiently large enough for our experiments. Statistics on the data we used are presented in Table 1. To induce the phrase tables from the monolin- gual data, we compared two bilingual lexicons: a general-domain and an in-domain lexicons. These lexicons are used to train the translation matrices (see Section 3.2.1) and to train the classifier (see Section 3.3). The general-domain lexicon (hence- forth, gen-lex) is a phrase-based one extracted from the phrase table built on the general-domain parallel data (see Section 4.3). We extracted the 5,000 most frequent source phrases and their most probable translation according to the forward trans- lation probability, p(e|f). We adopted this size as it had been proven optimal to learn the mapping between two monolingual embedding spaces (Vulić and Korhonen, 2016). For some experiments, we also simulated the availability of an in-domain bilin- gual lexicon. We automatically generated a lexicon for each domain (henceforth, in-lex) using the entire in-domain parallel data, in the same manner as compiling gen-lex, except that we selected the 5,000 most frequent source words in the in-domain parallel data that were not in the 5,000 most frequent words in the general-domain parallel data in order 9http://orchid.kuee.kyoto-u.ac.jp/ ASPEC/ 492 Side Domain Data w2p w2v source general monolingual √ in-domain monolingual √ √ parallel ( √ ) ( √ ) development √ test √ target general monolingual √ in-domain monolingual √ √ parallel √ √ Table 2: Corpora used for extracting phrases and comput- ing word embeddings: w2p indicates word2phrase, while w2v for word2vec. ( √ ) denotes that the data are used in Conf. 3 only. to ensure that we obtained mostly in-domain word pairs. Note that we did not use phrases but words for in-lex, assuming that humans are not able to manually construct a lexicon comprising phrase pairs similar to those in phrase tables for SMT sys- tems. For Conf. 3, as we assume the availabil- ity of in-domain parallel data, the bilingual lexi- con (para-lex) used was 5,000 phrase pairs ex- tracted from the in-domain phrase table, excluding the source phrases of gen-lex. 4.2 Tools and parameters A summary of the data used to collect phrases and estimate word embeddings is presented by Table 2. For each pair of domain and translation direc- tion, sets of source and target phrases were extracted from the in-domain monolingual data, as described in Section 3.1. As in previous work (Irvine and Callison-Burch, 2014; Saluja et al., 2014; Zhao et al., 2015), we focus on source phrases appearing in the development and test sets in order to maximize the coverage of our induced phrase table for them.10 More precisely, source phrases were collected from 10We are aware that this may not be practical because it requires the knowledge of the development and test sets be- forehand. For instance for the Fr→En EMEA translation task, inducing a phrase table given all the 4.5M collected source phrase would required approximately 3 months using 100 CPU threads. Increasing the value of K to collect less source phrases can be a reasonable alternative to significantly decrease this computation time, even though it will also necessarily decrease the coverage of the phrase table. We leave for our future work the study of a phrase table induction with source phrases ex- tracted from source monolingual data without referring to the development and test sets. Task source target # phrase all dev+test pairs EMEA Fr→En 4.5M 20k 437k 8.7B En→Fr 5.1M 11k 469k 5.2B Science Fr→En 1.1M 28k 216k 6.0B En→Fr 2.3M 24k 18k 432M Table 3: Size of the phrase sets collected from the source and target in-domain monolingual data and the number of phrases appearing only in the concatenation of the source side of the development and test sets (dev+test). “# phrase pairs” denotes the number of phrase pairs as- sessed by the classifier. the concatenation of the development and test sets and the in-domain monolingual data with reliable statistics, and then only phrases appearing in the de- velopment and test sets were filtered.11 We removed phrases containing tokens unseen in the in-domain monolingual data, because we are unable to com- pute all our features for them. On the other hand, target phrases were collected from the in-domain monolingual data, including the target side of in- domain parallel data. To identify phrases, we used the word2phrase tool included in the word2vec package,12 with the default values for δ and θ. We set K = 1 for the source language to ensure that most of the tokens would be translated, and K = 25 for the target language to limit the number of result- ing phrases. We set L = 6 as this is the same max- imal phrase length that we set for the phrase tables trained from the parallel data. We stopped at T = 4 passes as the fifth pass retrieved only a very small number of new phrases compared to the fourth pass. Statistics of the collected phrases for each task are presented in Table 3. To train the word embeddings, we used word2vec with the following parameters: -cbow 1 -window 10 -negative 15 -sample 1e-4 -iter 15 -min-count 1. Mikolov et al. (2013a) observed that better results for cross- lingual semantic similarity were obtained when using word embeddings with higher dimensions 11As we had no French monolingual corpus for the Science domain, the development and test sets for the Science Fr→En task were concatenated with one million sentences randomly extracted from the general-domain monolingual data. 12https://code.google.com/archive/p/ word2vec/ 493 Data LM1 LM2 LM3 Target side of in-domain parallel data √ √ √ In-domain monolingual data √ √ General-domain monolingual data √ Table 4: Source of our three language models. on the source side than on the target side. We therefore chose 800 and 300 dimensions for the source and target embeddings, respectively. The embeddings were trained on the concatenation of all the general-domain and in-domain monolingual data as presented by Table 2. Consequently, for each pair of domain and translation direction, we have four word embedding spaces: those with 300 or 800 dimensions for source and target languages. The reliability of each phrase pair was estimated as described in Section 3.3 to compile phrase tables of reasonable size and quality. We used Vowpal Wabbit13 to perform logistic regression with one pass, default parameters, and --link logistic option to obtain a classification score for each phrase pair. In the final induced phrase table, we kept the 300 best target phrases14 for each source phrase ac- cording to this score. 4.3 SMT systems The Moses toolkit (Koehn et al., 2007)15 was used for training SMT models, parameter tuning, and de- coding. The phrase tables were trained on the par- allel corpus using SyMGIZA++ (Junczys-Dowmunt and Szał, 2012)16 with IBM-2 word alignment and the grow-diag-final-and heuristics. To ob- tain strong baseline systems, all SMT systems used three language models17 built on different sets of corpora as shown in Table 4; each language model is a 4-gram modified Kneser-Ney smoothed one 13https://github.com/JohnLangford/ vowpal_wabbit/ 14As in Irvine and Callison-Burch (2014), we obtained bet- ter results when favoring recall over precision. We chose 300 empirically since we did not observe any further improvements when keeping more target phrases. 15http://statmt.org/moses/, version 2.1.1 16https://github.com/emjotde/symgiza-pp/ 17The one exception is the system for the Science En→Fr task, which uses only two language models as we do not have any in-domain monolingual data in addition to the target side of the in-domain parallel data. Phrase table Conf. 1 Conf. 2 Conf. 3 Phrase table trained from √ √ √ general-domain parallel data Phrase table trained from √ in-domain parallel data In-domain bilingual lexicon √ Phrase table induced from √ √ √ in-domain monolingual data Table 5: Multiple phrase table configurations. trained using lmplz (Heafield et al., 2013).18 To concentrate on the translation model, we did not use the lexical reordering model throughout the experi- ments, while we enabled distance-based reordering up to six words. Our systems used the multiple decoding paths ability of Moses; we used up to three phrase tables in one system, as summarized in Table 5. We did not add the features presented in Section 3.2 to the phrase pairs directly derived from the parallel data.19 Weights of the features were optimized with kb-mira (Cherry and Foster, 2012) using 200-best hypotheses on 15 iterations. The translation out- puts were evaluated with BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014). The results were averaged over three tuning runs. The statistical significance was measured by ap- proximate randomization (Clark et al., 2011) using MultEval.20 4.4 Additional baseline systems To compare our work with a state-of-the-art phrase table induction method, we implemented the work of Zhao et al. (2015). Even though they did not pro- pose their method to perform domain adaptation of an SMT system, their work is the closest to ours and does not require other external resources than those we used, i.e., parallel data and monolingual data not necessarily comparable. We implemented both global (GLP) and local (LLP) linear projection strategies and collected source and target phrases as they did. The source phrase set contains all uni- 18https://kheafield.com/code/kenlm/ estimation 19As in Irvine and Callison-Burch (2014), we got a drop of up to 0.5 BLEU points when we added our features, derived from monolingual data, to the original phrase table. 20https://github.com/jhclark/multeval/ 494 grams and bigrams in the development and test sets, while the target phrase set contains unigrams and bigrams collected from the in-domain monolingual data. They did not mention any filtering of their phrase sets, but we chose to remove all phrases con- taining digits or punctuation marks, since trying to retrieve the translation of numbers or punctuation marks relying only on word embeddings seems in- appropriate and in fact produced worse results in our preliminary experiments. To highlight the impact of the phrase sets used, we also experimented LLP us- ing our phrase sets collected with word2phrase. Furthermore, to get the best possible results, we did not use the search approximations presented in Zhao et al. (2015), i.e., local sensitive hashing and redun- dant bit vector, and used instead linear search. For the GLP configuration, the translation ma- trix was trained on gen-lex, i.e., 5,000 phrase pairs extracted from the general-domain phrase ta- ble trained on parallel data. For the LLP configura- tions, as in Zhao et al. (2015), we trained the trans- lation matrix for each source phrase on the 500 most similar source phrases, retrieved from the general- domain phrase table, associated to their most prob- able translation. For both GLP and LLP config- urations, we kept the 300 best target phrases for each source phrase. Four features, phrase and lex- ical translation probabilities for both translation di- rections, were approximated using the similarity be- tween source and target phrase embeddings for each phrase pair and included in the induced phrase table as described by Zhao et al. (2015). Since this approach proposes to translate all OOV unigrams and bigrams, it is likely in our scenario that some medical terms, for instance, will have no correct translations in the induced phrase table. For a comparison, we added one more baseline system, which merely uses a vanilla Moses with the -du op- tion of Moses activated to drop all unknown words instead of copying them into the translation. 4.5 Results The experimental results are given in Table 6. In Conf. 1, our results show that both GLP and LLP configurations performed much worse than the vanilla Moses when using phrases naively collected. This is due to the fact that the induced phrase ta- ble contains translations for every OOV unigrams and bigrams, even for those who do not need to be translated, such as molecule names or place names. Word embeddings are well-known to be inaccurate for very infrequent words (Mikolov et al., 2013b); consequently, for some rare source phrases, even if the right translation is in the target phrase set, it is not guaranteed that it will be registered in the in- duced phrase table as one of the 300 best translations for the source phrase, relying only on word embed- dings. The significant improvements over a vanilla Moses observed by Zhao et al. (2015) would poten- tially be because they translated from Arabic, and Urdu, to English. For such language pairs, one can safely try to translate every OOV token of a general- domain text, and it is unlikely to do worse than a vanilla Moses system that will leave the OOV to- kens as is in the translation. As shown by the Moses du configurations, dropping them led to a drop of up to 4.2 BLEU points for the EMEA Fr→En trans- lation task. This suggests that OOV tokens must be carefully translated only when necessary. Many OOV tokens in our translation tasks do not need to be translated into different forms. Hence, we regard the vanilla Moses that copies the OOV tokens in the translation a strong baseline system. Interestingly, using the phrases collected by our method for LLP produced much better translations, even slightly better than the one produced by the vanilla Moses system for the EMEA En→Fr transla- tion task with an improvement of 0.2 BLEU points. This may be due to the fact that our source phrase set is not only made from OOV phrases, mean- ing that new useful translations may be proposed for source phrases that are already registered in the general-domain phrase table. Moreover, with our phrase sets, the decoder also has the possibility to leave some tokens untranslated since we added each source phrase in the target phrase set if it appeared in the target monolingual data. Instead of relying only on word embeddings, the features used in our approach helped significantly to improve the translation quality. When we added our induced phrase table to a vanilla Moses system, we observed consistent and significant improvements in translation quality, with up to 2.1 BLEU and 2.2 METEOR points of improvement for the Science En→Fr translation task. Compared to the LLP method proposed by Zhao 495 Configuration EMEA Science Fr→En En→Fr Fr→En En→Fr BLEU METEOR BLEU METEOR BLEU METEOR BLEU METEOR vanilla Moses du 24.2 27.4 21.7 40.0 22.3 29.1 20.4 42.7 vanilla Moses (Conf. 1) 28.4 30.1 25.4 44.8 24.1 30.4 22.7 45.1 + GLP IPT naive 24.3 27.0 22.3 41.0 22.4 29.0 20.6 42.6 + LLP IPT naive 24.7 27.4 22.0 40.4 22.5 29.3 21.1 43.4 + LLP IPT 27.9 29.6 25.6 45.1 22.7 29.3 21.3 43.5 + our IPT (gen-lex) 30.2 32.1 27.1 46.6 25.4 32.0 24.8 47.3 + in-domain bilingual lexicon (Conf. 2) 32.4 32.8 28.3 48.2 26.6 32.4 24.9 48.0 + our IPT (gen-lex) 33.5 32.6 28.8 48.6 28.5 33.8 25.2 48.4 + our IPT (in-lex) 33.8 32.9 29.2 48.9 26.9 32.7 25.9 49.0 + in-domain phrase table (Conf. 3) 39.1 36.1 33.8 53.1 32.1 36.1 31.0 53.9 + our IPT (para-lex) 39.1 36.1 34.0 53.2 32.1 36.1 31.2 54.1 Table 6: Results (BLEU and METEOR) with an induced phrase table (IPT). The Moses du and vanilla Moses systems use only one phrase table trained from the general-domain parallel data. The translation matrices and the classifiers have been trained with a bilingual lexicon: gen-lex, in-lex, or para-lex. The configurations denoted as “naive” use a phrase table induced from phrases collected as described in Section 4.4. Bold scores indicate the statistical significance (p < 0.01) of the gain over the baseline system (Conf. X) in each configuration. et al. (2015), our approach includes more features and an additional classification step. Thus, the in- duction of a phrase table is much slower. For in- stance, for the EMEA Fr→En translation task, using the phrase sets extracted with word2phrase, our induction method (excluding phrase collection) was nearly 14 times slower (9 hours vs. 38 minutes).21 Phrase collection using word2phrase was much faster than feature computation and phrase pair clas- sification. For instance, it took 72 minutes to col- lect target phrases for the EMEA Fr→En transla- tion task, using four iterations of word2phrase on the English in-domain monolingual data with 1 CPU thread. In Conf. 2, adding an in-domain bilingual lexi- con as a phrase table to the vanilla baseline sys- tem significantly boosted the performance, mainly by reducing the number of OOV tokens. Our in- duced phrase tables had less impact, probably due to the overlap between useful word pairs contained in both the induced phrase table and the added bilin- gual lexicon. However, we still observed significant improvements, which support the usefulness of the 21The experiments were performed with 20 CPU threads. Note also that computational speed was not our primary fo- cus when implementing our approach. Optimizing our imple- mentation may lead to significant gains in speed, while Zhao et al. (2015) have presented a search approximation able to make their approach 18 times faster than linear search. induced phrase table, with up to 1.4 and 1.0 BLEU points of improvements, respectively, for the EMEA Fr→En and Science En→Fr translation tasks for in- stance. In this configuration, the in-lex phrase table led to slight but consistent improvements. It helped more than the gen-lex phrase table, except in the Science Fr→En task, for which the use of the gen-lex phrase table yielded significantly better results than the use of the in-lex phrase table. We can expect such differences when the classifier and the translation matrices are trained using infrequent words. Embeddings for such words are typically not as well estimated as those for frequent words, mean- ing that the features based on the word embeddings are less reliable and thus mislead both the classifier for pruning and the decoder. In Conf. 3, where the baseline system even used a phrase table trained on in-domain parallel data, we obtained contrasted results, with only slight im- provements for the En→Fr translation direction and no improvements for the Fr→En translation direc- tion. This lack of improvement may be due to the more reliable features and more accurate phrase pairs contained in the phrase table directly learned from the parallel data. This may lead the decoder to prefer this table to the induced one and give higher weights of its features according to this preference during tuning. 496 EMEA Science Fr→En En→Fr Fr→En En→Fr w/o w/ w/o w/ w/o w/ w/o w/ correct 53.1 55.1 52.8 54.2 54.0 57.1 55.3 57.2 SEEN 6.6 3.9 7.8 6.1 5.9 2.3 8.5 2.2 SENSE 18.1 13.3 15.6 11.5 18.9 14.0 12.9 12.9 SCORE 22.2 27.7 23.8 28.2 21.2 26.6 23.3 27.7 Table 7: Percentage of the source tokens: comparison of the translations generated with (w/) or without (w/o) our gen-lex induced phrase table (Conf. 1). 5 Error analysis In Section 5.1, we first present an analysis of the dis- tribution of translation errors that our systems pro- duced, using the S4 taxonomy (Irvine et al., 2013). Then, in Section 5.2, we illustrate some translation examples for which our induced phrase tables have produced a better translation. 5.1 Analysis with the S4 taxonomy The S4 taxonomy comprises the following four error types: • SEEN: attempt to translate a word never seen before • SENSE: attempt to translate a word with the wrong sense • SCORE: a good translation for the word is available but another one, giving a better score to the hypothesis, is chosen by the system • SEARCH: a good translation is available for the word but is pruned during the search for the best hypothesis We considered the SEEN, SENSE, and SCORE er- rors as in Irvine et al. (2013), but not the SEARCH errors, assuming that the recent phrase-based SMT systems rarely make this type of errors and with- out impact on the translation quality (Wisniewski et al., 2010; Aziz et al., 2014). We performed a Word Alignment Driven Evaluation (WADE) (Irvine et al., 2013) to count the word-level errors. Table 7 compares the results with and without our gen-lex induced phrase tables (Conf. 1). For the four tasks, more than half of the source tokens were correctly translated according to the translation ref- erence. Our analysis reveals that our induced phrase table helps to obtain more correct translations, as higher percentages of source words were correctly translated, despite the significant increase of SCORE errors (around 5% for all the tasks). This means that the correct translation for the source word is available, but the features associated to this trans- lation were not informative enough for the decoder to choose it. The percentage of SEEN errors in the translations decreased significantly with the induced phrase table for all the tasks, as a result of many words and phrases unseen in the general-domain parallel data being covered by using the in-domain monolingual data. However, our method does not guarantee to find appropriate translations for these words. It is even possible that all the proposed trans- lations are inappropriate. Nonetheless, we can see a noticeable decrease of the SENSE errors, except in the Science En→Fr task, for which we have used only a small amount of in-domain French mono- lingual data. As reported in Table 3, fewer target phrases were collected for this task, leading to only a small chance of obtaining the right translation for a given source phrase. The percentage of SENSE errors still remains higher than 10% for all tasks, in- dicating that the correct translation is not available in our phrase set or is pruned by the classifier during the phrase table induction. From this analysis, we draw the conclusion that our approach has significantly increased the reach- ability of the translation reference along with the quality of the translation produced by the decoder. We expect that more informative or better estimated features can further improve our results. Improv- ing our method to collect the target phrases or using a larger in-domain monolingual corpus would also help to reduce SENSE errors. 5.2 Translation examples Table 8 presents examples of source phrase and their translations chosen by the decoder in the EMEA Fr→En translation task. As shown by Example #1, both LLP and gen-lex configurations can find a good translation in their induced phrase ta- ble for the phrase “au point d’injection” while the general-domain phrase table does not contain this source phrase. As a result, the vanilla Moses system produced a wrong translation using general-domain word translations. 497 System #1 #2 #3 #4 source au point d’ injection glaucome aigu contient du lactose monohydraté le lansoprazole n’ est pas vanilla Moses at injection acute glaucome monohydraté contains lactose the lansoprazole is not LLP IPT at the point of injection acute contains lactose the , is not our IPT (gen-lex) at the site of injection acute glaucoma contains lactose monohydrate the lansoprazole is not reference at the injection site acute glaucoma contains lactose monohydrate lansoprazole is not Table 8: Examples of source phrase and their translation, from the test set of the EMEA Fr→En translation task, produced by the decoder using different configurations: vanilla Moses (Conf. 1) and Moses using a phrase table induced with LLP or with our method (gen-lex). Example #2 shows a typical error made by the LLP configuration. In this example, “glaucome” is OOV, no translation is proposed for this token in the general-domain phrase table. The LLP IPT con- tains the source phrase “glaucome aigu” but none of the 300 best corresponding target phrases contain the token “glaucoma”. However, most of them con- tain the meaning of “acute”. This can be explained by the much higher frequency of “aigu” while the word “glaucome” is very rare, even in the in-domain monolingual data. Consequently, “aigu” has an em- bedding more accurate than the one of “glaucome” which is then much more difficult to project cor- rectly across languages. In contrast, our gen-lex IPT contains the translation reference for “glaucome aigu” and this translation has been used correctly by the decoder, guided by our feature set. Example #3 is similar to Example #2, the embed- ding of the rare word “monohydraté” is probably not accurate enough to be correctly projected, the cor- rect translation is not in the LLP IPT, while our ap- proach succeeded to translate it correctly. Finally, Example #4 presents another common sit- uation where an OOV token, here “lansoprazole” has to be preserved as is and is correctly reported in the translation by the vanilla Moses system. The LLP IPT proposes translations for “lansoprazole”, most of them semantically unrelated, like the one chosen by the decoder in this configuration. We assume that the surface-level similarity fea- tures of our method helped the decoder to identify the right translation in this situation. Nonetheless, even when using our gen-lex IPT, we still ob- served some situations where tokens that should be preserved were actually wrongly translated, produc- ing outputs worse than those produced by the vanilla Moses system. 6 Conclusion and future work We presented a framework to induce a phrase ta- ble from unaligned monolingual data of specific do- mains. We showed that such a phrase table, when integrated to the decoder, consistently and signifi- cantly improved the translation quality for texts in the targeted domain. Our approach uses only sim- ple features without requiring strongly comparable or annotated texts in the targeted domain. Our method could further be improved in several ways. First, we expect better improvements by using more in-domain monolingual data or by being more careful in collecting the target phrases to use for the phrase table induction as opposed to simply pruning them according to the word frequency. Moreover, as we saw in Section 5, scoring the phrase pairs is one of the most important issues. We need more in- formative features to better score the pairs of source and target phrases. Despite their high computational cost, including features based on orthographic simi- larity or using better estimated cross-lingual embed- dings may help for this purpose. Acknowledgments We would like to thank the anonymous reviewers and the action editor, Chris Quirk, for their insight- ful comments. References Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain Adaptation via Pseudo In-Domain Data Se- lection. In Proceedings of EMNLP, Edinburgh, Scot- land, UK. Wilker Aziz, Marc Dymetman, and Lucia Specia. 2014. Exact Decoding for Phrase-Based Statistical Machine Translation. In Proceedings of EMNLP, Doha, Qatar. 498 Marine Carpuat, Hal Daumé III, Alexander Fraser, Chris Quirk, Fabienne Braune, Ann Clifton, et al. 2012. Do- main adaptation in machine translation: Final report. In 2012 Johns Hopkins summer workshop final report. Baltimore, MD: Johns Hopkins University. A. P. Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2014. An Autoencoder Ap- proach to Learning Bilingual Word Representations. In Proceedings of NIPS, Montréal, Canada. Colin Cherry and George Foster. 2012. Batch Tuning Strategies for Statistical Machine Translation. In Pro- ceedings of NAACL-HLT, Montréal, Canada. Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statis- tical Machine Translation: Controlling for Optimizer Instability. In Proceedings of ACL-HLT, Portland, OR, USA. Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wen- zek, and Amine Benhalloum. 2015. Trans-gram, Fast Cross-lingual Word-embeddings. In Proceedings of EMNLP, Lisbon, Portugal. Hal Daumé, III and Jagadeesh Jagarlamudi. 2011. Do- main Adaptation for Machine Translation by Mining Unseen Words. In Proceedings of ACL-HLT, Portland, OR, USA. Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of EACL, Gothenburg, Sweden. Qing Dou and Kevin Knight. 2012. Large Scale Deci- pherment for Out-of-domain Machine Translation. In Proceedings of EMNLP-CoNLL, Jeju Island, Korea. Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In Pro- ceedings of EMNLP, Austin, TX, USA. Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Cor- relation. In Proceedings of EACL, Gothenburg, Swe- den. Pascale Fung and Percy Cheung. 2004. Mining Very- Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM. In Proceedings of EMNLP, Barcelona, Spain. Pascale Fung. 1995. Compiling Bilingual Lexicon En- tries From a Non-Parallel English-Chinese Corpus. In Proceedings of the 3rd Workshop on Very Large Cor- pora, Cambridge, MA, USA. Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Repre- sentations without Word Alignments. In Proceedings of ICML, Lille, France. Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. In Proceedings of ACL- HLT, Colombus, OH, USA. Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser- Ney Language Model Estimation. In Proceedings of ACL, Sofia, Bulgaria. Sanjika Hewavitharana and Stephan Vogel. 2016. Ex- tracting parallel phrases from comparable data for ma- chine translation. Natural Language Engineering, 22(4):549–573. Ann Irvine and Chris Callison-Burch. 2013. Supervised Bilingual Lexicon Induction with Multiple Monolin- gual Signals. In Proceedings of HLT-NAACL, Atlanta, GA, USA. Ann Irvine and Chris Callison-Burch. 2014. Halluci- nating Phrase Translations for Low Resource MT. In Proceedings of CoNLL, Baltimore, MD, USA. Ann Irvine and Chris Callison-Burch. 2016. End- to-End Statistical Machine Translation with Zero or Small Parallel Texts. Natural Language Engineering, 22(4):517–548. Ann Irvine, John Morgan, Marine Carpuat, Hal Daumé III, and Dragos Munteanu. 2013. Measuring Machine Translation Errors in New Domains. Trans- actions of the Association for Computational Linguis- tics, 1. Marcin Junczys-Dowmunt and Arkadiusz Szał. 2012. SyMGiza++: Symmetrized Word Alignment Mod- els for Machine Translation. In Security and Intel- ligent Information Systems (SIIS), volume 7053 of Lecture Notes in Computer Science. Springer-Verlag, Berlin/Heidelberg, Germany. Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward Statistical Machine Translation without Parallel Corpora. In Proceedings of EACL, Avignon, France. Philipp Koehn and Kevin Knight. 2002. Learning a Translation Lexicon from Monolingual Corpora. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, Philadelphia, PA, USA. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Con- stantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Pro- ceedings of ACL, Prague, Czech Republic. Angeliki Lazaridou, Georgiana Dinu, Adam Liska, and Marco Baroni. 2015. From Visual Attributes to Ad- jectives through Decompositional Distributional Se- mantics. Transactions of the Association for Compu- tational Linguistics, 3. 499 Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013b. Distributed Representa- tions of Words and Phrases and their Compositionality. In Proceedings of NIPS, Lake Tahoe, NV, USA. Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Sci- ence, 34(8). Dragos Stefan Munteanu and Daniel Marcu. 2005. Im- proving Machine Translation Performance by Exploit- ing Non-Parallel Corpora. Computational Linguistics, 31(4):477–504. Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchi- moto, Masao Utiyama, Eiichiro Sumita, Sadao Kuro- hashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of LREC, Portorož, Slovenia. Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. De- ciphering Foreign Language by Combining Language Models and Context Vectors. In Proceedings of ACL, Jeju Island, Korea. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL, Philadelphia, PA, USA. Reinhard Rapp. 1995. Identifying Word Translations in Non-parallel Texts. In Proceedings of ACL, Cam- bridge, MA, USA. Sujith Ravi and Kevin Knight. 2011. Deciphering For- eign Language. In Proceedings of ACL-HLT, Portland, OR, USA. Avneesh Saluja, Hany Hassan, Kristina Toutanova, and Chris Quirk. 2014. Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data. In Proceedings of ACL, Baltimore, MD, USA. Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with Compo- sitional Vector Grammars. In Proceedings of ACL, Sofia, Bulgaria. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Pro- ceedings of EMNLP, Seattle, WA, USA. Masao Utiyama and Hitoshi Isahara. 2003. Reliable Measures for Aligning Japanese-English News Arti- cles and Sentences. In Proceedings of ACL, Sapporo, Japan. Ivan Vulić and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embed- dings. In Proceedings of ACL, Berlin, Germany. Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing Phrase-Based Trans- lation Models with Oracle Decoding. In Proceedings of EMNLP, Cambridge, MA, USA. Jiajun Zhang and Chengqing Zong. 2013. Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation. In Pro- ceedings of ACL, Sofia, Bulgaria. Bing Zhao and Stephan Vogel. 2002. Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of IEEE ICDM, Maebashi, Japan. Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learn- ing Translation Models from Monolingual Continu- ous Representations. In Proceedings of NAACL-HLT, Denver, CO, USA. 500