Modeling Past and Future for Neural Machine Translation

Zaixiang Zheng∗, Nanjing University, zhengzx@nlp.nju.edu.cn
Hao Zhou∗, Toutiao AI Lab, zhouhao.nlp@bytedance.com
Shujian Huang, Nanjing University, huangsj@nlp.nju.edu.cn
Lili Mou, University of Waterloo, doublepower.mou@gmail.com
Xinyu Dai, Nanjing University, dxy@nlp.nju.edu.cn
Jiajun Chen, Nanjing University, chenjj@nlp.nju.edu.cn
Zhaopeng Tu, Tencent AI Lab, zptu@tencent.com

∗Equal contributions.

Abstract

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated PAST contents and untranslated FUTURE contents, which are modeled by two additional recurrent layers. The PAST and FUTURE contents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with knowledge of the translated and untranslated contents. Experimental results show that the proposed approach significantly improves performance on Chinese-English, German-English, and English-German translation tasks. Specifically, the proposed model outperforms the conventional coverage model in terms of both translation quality and alignment error rate.†

†Our code can be downloaded from https://github.com/zhengzx-nlp/past-and-future-nmt.

1 Introduction

Neural machine translation (NMT) generally adopts an encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), where the encoder summarizes the source sentence into a source context vector, and the decoder generates the target sentence word by word based on the given source. During translation, the decoder implicitly serves several functionalities at the same time:

1. Building a language model over the target sentence for translation fluency (LM).

2. Acquiring the most relevant source-side information to generate the current target word (PRESENT).

3. Maintaining which parts of the source have been translated (PAST) and which have not (FUTURE).

However, it may be difficult for a single recurrent neural network (RNN) decoder to accomplish these functionalities simultaneously. A recent successful extension of NMT models is the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which makes a soft selection over source words and yields an attentive vector to represent the most relevant source parts for the current decoding state. In this sense, the attention mechanism separates the PRESENT functionality from the decoder RNN, achieving significant performance improvements.

In addition to PRESENT, we address the importance of modeling PAST and FUTURE contents in machine translation. The PAST contents indicate translated information, whereas the FUTURE contents indicate untranslated information; both are crucial to NMT models, especially for avoiding under-translation and over-translation (Tu et al., 2016). Ideally, PAST grows and FUTURE declines during the translation process. However, it may be difficult for a single RNN to model these processes explicitly.

In this paper, we propose a novel neural machine translation system that explicitly models PAST and FUTURE contents with two additional RNN layers. The RNN modeling the PAST contents (called the PAST layer) starts from scratch and accumulates the information that is being translated at each decoding step (i.e., the PRESENT information yielded by attention). The RNN modeling the FUTURE contents (called the FUTURE layer) begins with a holistic source summarization and subtracts the PRESENT information at each step. The two processes are guided by proposed auxiliary objectives. Intuitively, the RNN state of the PAST layer corresponds to source contents that have been translated up to a particular step, and the RNN state of the FUTURE layer corresponds to source contents of untranslated words. At each decoding step, PAST and FUTURE together provide a full summarization of the source information. We then feed the PAST and FUTURE information to both the attention model and the decoder states. In this way, our proposed mechanism not only provides coverage information for the attention model, but also gives a holistic view of the source information at each step.
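To make the two update processes concrete, the following is a minimal sketch of the PAST and FUTURE layers in plain NumPy. It assumes GRU-style cells for both layers, a shared hidden size, and random stand-ins for the attention contexts and the source summarization; it does not reproduce the paper's actual parameterization, auxiliary objectives, or subtraction-style update rules.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hidden size shared by the attention contexts and both layers (a simplifying assumption)

def make_gru():
    # Pack the update-gate, reset-gate, and candidate transforms together (3*d rows each).
    return (rng.standard_normal((3 * d, d)) * 0.1,   # input weights W
            rng.standard_normal((3 * d, d)) * 0.1,   # recurrent weights U
            np.zeros(3 * d))                         # biases b

def gru_step(x, h, params):
    W, U, b = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W[:d] @ x + U[:d] @ h + b[:d])                   # update gate
    r = sigmoid(W[d:2*d] @ x + U[d:2*d] @ h + b[d:2*d])          # reset gate
    h_cand = np.tanh(W[2*d:] @ x + U[2*d:] @ (r * h) + b[2*d:])  # candidate state
    return (1.0 - z) * h + z * h_cand

past_gru, future_gru = make_gru(), make_gru()

# Stand-ins for the attention context vectors c_t over five decoding steps,
# and for the holistic source summarization (here simply their mean).
contexts = [rng.standard_normal(d) for _ in range(5)]
source_summary = np.mean(contexts, axis=0)

past = np.zeros(d)               # PAST layer starts from scratch
future = source_summary.copy()   # FUTURE layer starts from the full source summary

for c_t in contexts:
    past = gru_step(c_t, past, past_gru)        # accumulate what was just translated
    future = gru_step(c_t, future, future_gru)  # discount it from what remains untranslated
    # At each step, [past; future] would be exposed to the attention model and the
    # decoder state as the translated/untranslated view of the source.
```

The sketch mirrors the asymmetry described above: the PAST state begins empty and only accumulates consumed contexts, while the FUTURE state begins with the full source summary and is progressively updated away from it.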
We conducted experiments on Chinese-English, German-English, and English-German benchmarks. Experiments show that the proposed mechanism yields improvements of 2.7, 1.7, and 1.1 BLEU points on the three tasks, respectively. In addition, it obtains an alignment error rate of 35.90%, significantly lower than that of the baseline (39.73%) and of the coverage model (38.73%) of Tu et al. (2016). We observe that in traditional attention-based NMT, most errors occur due to over- and under-translation, probably because the decoder RNN fails to keep track of what has been translated and what has not. Our model can alleviate such problems by explicitly modeling the PAST and FUTURE contents.

2 Motivation

In this section, we first introduce the standard attention-based NMT, and then motivate our model with several empirical findings.

The attention mechanism, proposed by Bahdanau et al. (2015), yields a dynamic source context vector for the translation at a particular decoding step, modeling the PRESENT information as described in Section 1. This process is illustrated in Figure 1.

[Figure 1: Architecture of attention-based NMT. The encoder annotations $h_1, \dots, h_I$ of the source words are weighted by the attention weights $\alpha_{t,i}$ to form the source context vector $c_t$ for the present translation; the decoder is initialized with a source summarization.]

Formally, let $\mathbf{x} = \{x_1, \dots, x_I\}$ be a given input sentence. The encoder RNN, generally implemented as a bi-directional RNN (Schuster and Paliwal, 1997), transforms the sentence into a sequence of annotations, with $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ being the annotation of $x_i$, where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ denote the RNN hidden states in the forward and backward directions. Based on the source annotations, another decoder RNN generates the translation word by word, predicting a target word $y_t$ at each time step $t$ from the conditional distribution $P(y_t \mid y_{<t}, \mathbf{x})$.
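For readers who want the attention step spelled out, below is a small NumPy sketch of the standard Bahdanau-style (additive) attention that this section describes: encoder annotations are scored against the previous decoder state, normalized into weights $\alpha_{t,i}$, and summed into the context vector $c_t$ that feeds the prediction of $y_t$. In the standard formulation, $P(y_t \mid y_{<t}, \mathbf{x})$ is then computed from the previous word, the decoder state, and $c_t$ via a softmax layer. Weight names, dimensions, and the random inputs below are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_s = 6, 4          # annotation size (forward + backward halves) and decoder state size
I = 5                    # source sentence length

annotations = rng.standard_normal((I, d_h))   # h_i = [forward h_i ; backward h_i]
s_prev = rng.standard_normal(d_s)             # previous decoder state s_{t-1}

# Additive scoring parameters (illustrative shapes).
W_a = rng.standard_normal((d_s, d_s)) * 0.1
U_a = rng.standard_normal((d_s, d_h)) * 0.1
v_a = rng.standard_normal(d_s) * 0.1

# e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i)
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in annotations])

# alpha_{t,i}: softmax over source positions
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# c_t = sum_i alpha_{t,i} h_i : the PRESENT information for this decoding step
c_t = alpha @ annotations

# c_t would then be combined with y_{t-1} and the decoder state s_t to produce
# the distribution P(y_t | y_<t, x) through a softmax output layer.
print(alpha.round(3), c_t.shape)
```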