Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee* (ETH Zürich), jasonlee@inf.ethz.ch
Kyunghyun Cho (New York University), kyunghyun.cho@nyu.edu
Thomas Hofmann (ETH Zürich), thomas.hofmann@inf.ethz.ch

* The majority of this work was completed while the author was visiting New York University.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses that of the models trained specifically on each language pair alone, both in terms of BLEU score and human judgment.

1 Introduction

Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of the word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problems of data sparsity and of modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, which limits them in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with the target vocabulary size, leading to a vicious cycle.

To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves comparable results on FI-EN and RU-EN.
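To make the length-reduction idea from the abstract concrete, the sketch below applies a character-level convolution followed by strided max-pooling to a sequence of character embeddings. It is only an illustration, not the model described in this paper: the filter width, pooling stride, dimensions, ReLU nonlinearity and the omission of the highway layers are all assumptions.

# Illustrative sketch (not the authors' implementation): shortening a character
# sequence with a 1-D convolution over character embeddings followed by strided
# max-pooling. Filter width, stride and dimensions are made-up values.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim, n_filters = 300, 128, 200   # assumed sizes
filter_width, pool_stride = 5, 5                 # the pooling stride controls the length reduction

char_embeddings = rng.normal(size=(vocab_size, emb_dim))
conv_filters = rng.normal(size=(n_filters, filter_width, emb_dim))

def encode_and_shorten(char_ids):
    """Map a character-id sequence of length T to roughly T / pool_stride segment vectors."""
    X = char_embeddings[char_ids]                            # (T, emb_dim)
    T = X.shape[0]
    # "Same" padding so the convolution preserves the sequence length.
    pad = filter_width // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    # 1-D convolution: one n_filters-dimensional feature vector per character position.
    conv = np.stack([
        np.tensordot(Xp[t:t + filter_width], conv_filters, axes=([0, 1], [1, 2]))
        for t in range(T)
    ])                                                       # (T, n_filters)
    conv = np.maximum(conv, 0.0)                             # ReLU nonlinearity (assumed)
    # Strided max-pooling: keep the strongest activation in each window,
    # reducing the sequence length by a factor of pool_stride.
    n_segments = int(np.ceil(T / pool_stride))
    pooled = np.stack([
        conv[s * pool_stride : (s + 1) * pool_stride].max(axis=0)
        for s in range(n_segments)
    ])                                                       # (ceil(T / pool_stride), n_filters)
    return pooled

shortened = encode_and_shorten(rng.integers(0, vocab_size, size=73))
print(shortened.shape)   # (15, 200): 73 characters reduced to 15 segment representations

With a pooling stride of 5, a 73-character input is reduced to 15 segment representations, so any layers that follow operate on a much shorter sequence.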
A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation with a stack of convolutional, pooling and highway layers.

One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) to English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, in both BLEU score and human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.

The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.

2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence X = (x_1, ..., x_{T_X}) and generates its translation Y = (y_1, ..., y_{T_Y}), where x_t and y_{t'} are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.

Encoder. Given a source sentence X, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often used, as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right:

    \overrightarrow{h}_t = \overrightarrow{f}_{\text{enc}} \big( E_x(x_t), \overrightarrow{h}_{t-1} \big),

and a backward encoder reads it from right to left:

    \overleftarrow{h}_t = \overleftarrow{f}_{\text{enc}} \big( E_x(x_t), \overleftarrow{h}_{t+1} \big),

where E_x is the source embedding lookup table, and \overrightarrow{f}_{\text{enc}} and \overleftarrow{f}_{\text{enc}} are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations C by concatenating the forward and backward hidden states at each timestep:

    C = \{ h_1, \ldots, h_{T_X} \}, \quad \text{where} \quad h_t = \big[ \overrightarrow{h}_t ; \overleftarrow{h}_t \big].

Attention. First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol.
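As a concrete illustration of the bidirectional encoder just described, the following NumPy sketch runs a forward and a backward GRU over the embedded source symbols E_x(x_t) and concatenates the two hidden states at each timestep to form C. The GRU parameterization, random initialization and dimensions are illustrative assumptions, not the configuration used in this paper.

# Minimal sketch of a bidirectional GRU encoder; weights and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
src_vocab, emb_dim, hid_dim = 300, 64, 128       # assumed sizes

E_x = rng.normal(scale=0.1, size=(src_vocab, emb_dim))   # source embedding lookup table

def make_gru_params():
    """One set of GRU parameters: update gate z, reset gate r, candidate state c."""
    return {name: (rng.normal(scale=0.1, size=(emb_dim, hid_dim)),
                   rng.normal(scale=0.1, size=(hid_dim, hid_dim)),
                   np.zeros(hid_dim))
            for name in ("z", "r", "c")}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(params, x, h_prev):
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wc, Uc, bc = params["c"]
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)                 # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur + br)                 # reset gate
    c = np.tanh(x @ Wc + (r * h_prev) @ Uc + bc)           # candidate state
    return (1.0 - z) * h_prev + z * c

fwd_params, bwd_params = make_gru_params(), make_gru_params()

def encode(src_ids):
    """Return C: for each timestep, the concatenation of forward and backward hidden states."""
    X = E_x[src_ids]                                       # (T_X, emb_dim)
    T = len(src_ids)
    fwd = np.zeros((T, hid_dim))
    bwd = np.zeros((T, hid_dim))
    h = np.zeros(hid_dim)
    for t in range(T):                                     # left-to-right pass
        h = gru_step(fwd_params, X[t], h)
        fwd[t] = h
    h = np.zeros(hid_dim)
    for t in reversed(range(T)):                           # right-to-left pass
        h = gru_step(bwd_params, X[t], h)
        bwd[t] = h
    return np.concatenate([fwd, bwd], axis=1)              # (T_X, 2 * hid_dim)

C = encode(rng.integers(0, src_vocab, size=20))
print(C.shape)   # (20, 256)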
More concretely, it computes the context vector c_{t'} at each decoding time step t' as a weighted sum of the source hidden states:

    c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t.

Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight \alpha_{t't} represents how relevant the t-th source token x_t is to the t'-th target token y_{t'}, and is computed as:

    \alpha_{t't} = \frac{1}{Z} \exp \Big( \text{score} \big( E_y(y_{t'-1}), s_{t'-1}, h_t \big) \Big),    (1)

where Z = \sum_{k=1}^{T_X} \exp \big( \text{score}(E_y(y_{t'-1}), s_{t'-1}, h_k) \big) is the normalization constant. score() is a feed-forward neural network with a single hidden layer that scores how well the source symbol x_t and the target symbol y_{t'} match. E_y is the target embedding lookup table and s_{t'} is the target hidden state at time t'.

Decoder. Given a source context vector c_{t'}, the decoder computes its hidden state at time t' as:

    s_{t'} = f_{\text{dec}} \big( E_y(y_{t'-1}), s_{t'-1}, c_{t'} \big).

Then, a parametric function out_k() returns the conditional probability of the next target symbol being k:

    p(y_{t'} = k \mid y_{<t'}, X) = \frac{1}{Z} \exp \Big( \text{out}_k \big( E_y(y_{t'-1}), s_{t'}, c_{t'} \big) \Big).
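The attention and decoder equations above can be traced in the following single-step sketch: score() is a one-hidden-layer feed-forward network, the weights \alpha_{t't} come from normalizing its outputs over the source positions, c_{t'} is the corresponding weighted sum of source states, and the output distribution is a softmax over a linear readout standing in for out_k(). A plain tanh recurrence is used in place of f_dec; all weights, sizes and these particular parameterizations are assumptions for illustration only.

# Illustrative single decoding step: attention weights, context vector, decoder state, softmax.
import numpy as np

rng = np.random.default_rng(1)
tgt_vocab, emb_dim, hid_dim, ctx_dim, att_dim = 300, 64, 128, 256, 128   # assumed sizes

E_y = rng.normal(scale=0.1, size=(tgt_vocab, emb_dim))    # target embedding lookup table

# score(): single-hidden-layer feed-forward network over [E_y(y_{t'-1}); s_{t'-1}; h_t].
W_att = rng.normal(scale=0.1, size=(emb_dim + hid_dim + ctx_dim, att_dim))
v_att = rng.normal(scale=0.1, size=att_dim)

# Decoder recurrence and readout weights (illustrative stand-ins).
W_dec = rng.normal(scale=0.1, size=(emb_dim + ctx_dim, hid_dim))
U_dec = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
W_out = rng.normal(scale=0.1, size=(emb_dim + hid_dim + ctx_dim, tgt_vocab))

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def score(e_prev, s_prev, h_t):
    # How well the source state h_t matches the decoder's current state (cf. Eq. 1).
    return np.tanh(np.concatenate([e_prev, s_prev, h_t]) @ W_att) @ v_att

def f_dec(e_prev, s_prev, c):
    # Simple tanh recurrence standing in for f_dec (an assumption; a gated unit could be used instead).
    return np.tanh(np.concatenate([e_prev, c]) @ W_dec + s_prev @ U_dec)

def decode_step(C, y_prev, s_prev):
    """One decoding step: attention weights, context vector, new state, next-symbol distribution."""
    e_prev = E_y[y_prev]
    # alpha_{t't}: score() normalized over all T_X source states.
    alpha = softmax(np.array([score(e_prev, s_prev, h_t) for h_t in C]))
    c = alpha @ C                                           # c_{t'} = sum_t alpha_{t't} h_t
    s = f_dec(e_prev, s_prev, c)                            # s_{t'}
    p = softmax(np.concatenate([e_prev, s, c]) @ W_out)     # p(y_{t'} = k | y_{<t'}, X)
    return p, s, alpha

# Toy usage: 20 source states of dimension ctx_dim (e.g., the encoder output C).
C = rng.normal(size=(20, ctx_dim))
p, s, alpha = decode_step(C, y_prev=3, s_prev=np.zeros(hid_dim))
print(p.shape, alpha.shape)   # (300,) (20,)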