Leveraging Orthographic Similarity for Multilingual Neural Transliteration

Anoop Kunchukuttan1, Mitesh Khapra2, Gurneet Singh2, Pushpak Bhattacharyya1
{anoopk,pb}@cse.iitb.ac.in, {miteshk,garry}@cse.iitm.ac.in
1Department of Computer Science & Engineering, Indian Institute of Technology Bombay, Mumbai, India.
2Department of Computer Science & Engineering, Indian Institute of Technology Madras, Chennai, India.

Abstract

We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration). This is an instance of multitask learning, where individual tasks (language pairs) benefit from sharing knowledge with related tasks. We focus on transliteration involving related tasks i.e., languages sharing writing systems and phonetic properties (orthographically similar languages). We propose a modified neural encoder-decoder model that maximizes parameter sharing across language pairs in order to effectively leverage orthographic similarity. We show that multilingual transliteration significantly outperforms bilingual transliteration in different scenarios (average increase of 58% across a variety of languages we experimented with). We also show that multilingual transliteration models can generalize well to languages/language pairs not encountered during training and hence perform well on the zeroshot transliteration task. We show that further improvements can be achieved by using phonetic feature input.

1 Introduction

Transliteration is a key building block for multilingual and cross-lingual NLP since it is essential for (i) handling of names in applications like machine translation (MT) and cross-lingual information retrieval (CLIR), and (ii) user-friendly input methods. The transliteration problem has been extensively studied in literature for a variety of language pairs (Karimi et al., 2011). Previous work has looked at the most natural setup - training on a single language pair.
However, no prior work exists on jointly training multiple language pairs (referred to as multilingual transliteration henceforth).

Multilingual transliteration can be seen as an instance of multi-task learning, where training each language pair constitutes a task. Multi-task learning works best when the tasks are related to each other, so sharing of knowledge across tasks is beneficial. Thus, multilingual transliteration can be beneficial if the languages involved are related. We identify such a natural and practically useful scenario: multilingual transliteration involving languages that are related on account of sharing writing systems and phonetic properties. We refer to such languages as orthographically similar languages. We say that two languages are orthographically similar if they have: (i) highly overlapping phoneme sets, (ii) mutually compatible orthographic systems, and (iii) similar grapheme to phoneme mappings.

For instance, Indo-Aryan languages largely share the same set of phonemes. They use different Indic scripts, but correspondences can be established between equivalent characters across scripts. For example, the Hindi (Devanagari script) character क (ka) maps to the Bengali ক (ka) which stands for the consonant sound (IPA: k). The grapheme to phoneme mapping is also consistent for equivalent characters.

We can identify two major sources of orthographic similarity: (a) genetic relationship between languages (groups like Romance, Slavic, Indo-Aryan and Turkic languages) (b) prolonged contact between languages, e.g. convergence in phonological properties of the Indo-Aryan and Dravidian languages in the Indian subcontinent, most strikingly retroflex consonants (Subbārāo, 2012). Dravidian and Indo-Aryan languages use compatible Indic scripts. Another example is the Nigerian linguistic area comprising Niger-Congo languages like Yoruba, Fula, Igbo and Afro-Asiatic languages like Hausa (the most widely spoken language in Nigeria). Most languages use the Latin script (some use Ajami, a modified Arabic script).

Transactions of the Association for Computational Linguistics, vol. 6, pp. 303–316, 2018. Action Editor: Mona Diab. Submission batch: 8/2017; Revision batch: 12/2017; Published 5/2018. ©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

In this work, we explore multilingual transliteration involving orthographically similar languages. To the best of our knowledge, ours is the first work to address the task of multilingual transliteration. We propose that transliteration involving orthographically similar languages is a scenario where multilingual training can be very beneficial. Since these languages share phonological properties, the transliteration tasks are clearly related. We can utilize this relatedness by sharing the vocabulary across all related languages. The grapheme-to-grapheme correspondences enable vocabulary sharing, which helps transfer knowledge across languages while training. For instance, if the network learns that the English character l maps to the Hindi character ल (la), it would also learn that l maps to the corresponding Kannada character ಲ (la). Data from both Kannada and Hindi datasets will reinforce the evidence for this mapping. A similar argument can be made when both the source and target languages are related. The grapheme-grapheme correspondences arise from the underlying phoneme-phoneme correspondences. The consistent grapheme-phoneme mappings help establish the grapheme-grapheme correspondences.

Due to the utilization of language relatedness, the benefits that are typically ascribed to multi-task learning (Caruana, 1997) may also apply to multilingual transliteration. Since related languages share characters, it is possible to share representations across languages.
This may help to generalize transliteration models since joint training provides an inductive bias which prefers models that are better at transliterating multiple language pairs. The training may also benefit from implicit data augmentation since training data from multiple language pairs is available. From the perspective of a single language pair, data from other language pairs can be seen as additional (noisy) training data. This is particularly beneficial in low-resource scenarios.

Our work adds to the increasing body of work investigating multilingual training for various NLP tasks like POS tagging (Gillick et al., 2016), NER (Yang et al., 2016; Rudramurthy et al., 2016) and machine translation (Dong et al., 2015; Firat et al., 2016; Lee et al., 2017; Zoph et al., 2016; Johnson et al., 2017) with a view to learn models that generalize across languages and make effective use of scarce training data.

The following are the contributions of our work:
(1) We propose a compact neural encoder-decoder model for multilingual transliteration, that is designed to ensure maximum sharing of parameters across languages while providing room for learning language-specific parameters. This allows greater sharing of knowledge across language pairs by leveraging orthographic similarity. We empirically show that models with maximal parameter sharing are beneficial, without increasing the model size.
(2) We show that multilingual transliteration exhibits significant improvement in transliteration accuracy over bilingual transliteration in different scenarios (average improvement of 58%). Our results are backed by extensive experiments on 8 languages across 2 orthographically similar language groups.
(3) We perform an error analysis which suggests that representations learnt by the encoder in multilingual transliteration can reduce transliteration ambiguities.
Multilingual transliteration also seems better at learning canonical transliterations instead of alternative, phonetically equivalent transliterations. These could explain the improved performance of multilingual transliteration.
(4) We explore the zeroshot transliteration task (i.e., transliteration between languages/language pairs not seen during training) and show that our multilingual model can generalize well to unseen languages/language pairs. Notably, the zeroshot transliteration results mostly outperform the direct bilingual transliteration model.
(5) We have richer phonetic information at our disposal for some related languages. We propose a novel method to incorporate phonetic input in the model and show that it provides modest gains for multilingual transliteration.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 formalizes the multilingual transliteration task and describes our proposed solution. Sections 4 and 5 discuss the experimental setup, results and analysis. Section 6 discusses various zeroshot transliteration scenarios, our solutions and the results of experiments. Section 7 discusses incorporation of phonetic information for multilingual transliteration. Section 8 concludes the work and discusses future directions.

2 Related Work

General Transliteration Methods: Previous work on transliteration has focused on the scenario of bilingual training. Until recently, the best-performing solutions were discriminative statistical transliteration methods based on phrase-based statistical machine translation (Bisani and Ney, 2008; Jiampojamarn et al., 2008; Jiampojamarn et al., 2009; Finch and Sumita, 2010). Recent work has explored bilingual neural transliteration using the standard neural encoder-decoder architecture (with attention mechanism) (Bahdanau et al., 2015) or its adaptations (Finch et al., 2015; Finch et al., 2016).
Using target bidirectional LSTM with model ensembling, Finch et al. (2016) have outperformed the state-of-the-art phrase-based systems on the NEWS shared task datasets. On the other hand, we focus on multilingual transliteration with the encoder-decoder architecture or its adaptations. The two strands of work are obviously complementary.

Multilinguality and Transliteration: To the best of our knowledge, ours is the first work on multilingual transliteration. Jagarlamudi and Daumé III (2012) have proposed a method for transliteration mining (given a name and candidate transliterations, identify the correct transliteration) across multiple languages using grapheme to IPA mappings. Note that their model cannot generate transliterations; it can only rank candidates. Some literature mentions multilingual transliteration (Surana and Singh, 2008; He et al., 2017; Prakash, 2012; Pouliquen et al., 2005) or multilingual transliteration mining (Klementiev and Roth, 2006; Yoon et al., 2007). In these cases, however, multilingual refers to methods which work with multiple languages (as opposed to joint training, which is the sense in which we use the word multilingual).

Multilingual Translation: Our work on multilingual transliteration is motivated by recently proposed multilingual neural machine translation architectures (Firat et al., 2016). Broadly, these proposals can be categorized into two groups. One group consists of architectures that specialize parts of the network for particular languages: specialized encoders (Zoph et al., 2016), decoders (Dong et al., 2015) or both (Firat et al., 2016). The other group tries to learn more compact networks with little specialization across languages by using a joint vocabulary (Johnson et al., 2017; Lee et al., 2017). For multilingual transliteration, we adopt an approach that is closer to the latter group since the languages under consideration use compatible scripts resulting in a shared vocabulary.
We specialize just the output layer for target languages, but share the encoder, decoder and character embeddings across languages. In this respect, we differ from Johnson et al. (2017). They share all network components across languages, but add an artificial token at the beginning of the input sequence to indicate the target language.

Zeroshot Transliteration: We use the multilingual models to address zeroshot transliteration. Zeroshot transliteration using a bridge/pivot language has been explored for statistical machine transliteration (Khapra et al., 2010) as well as neural machine transliteration (Saha et al., 2016). Unlike previous approaches which pivot over bilingual transliteration models, we propose zeroshot transliteration that pivots over multilingual transliteration models. We also propose a direct zeroshot transliteration method, a scenario which has been explored for machine translation by Johnson et al. (2017), but not investigated previously for transliteration. In our zeroshot model, sequences from multiple source languages are mapped to a common encoder representation without the need for a parallel corpus between the source languages. Another work, the correlational encoder-decoder architecture (Saha et al., 2016), maps source and pivot languages to a common space but requires a source-pivot parallel transliteration corpus.

3 Multilingual Transliteration Learning

We first formalize the multilingual transliteration task and then describe our proposed solution.
Figure 1: Multilingual Neural Transliteration Architecture. (a) Network Architecture: shared character embedding layer, shared CNN encoder, shared attention network, shared LSTM decoder and per-language output layers (L1, L2), illustrated with the input TENDULKAR. (b) CNN Encoder: character embeddings, convolution (with same padding, stride=1) + ReLU, max pooling (with same padding, stride=1) and the resulting annotation vectors, illustrated with the input ERNAKULAM.

3.1 Task Definition

The multilingual transliteration task involves learning transliteration models for $l$ language pairs $(s_i, t_i) \in L$ ($i = 1$ to $l$), where $L \subset S \times T$, and $S$, $T$ are sets of source and target languages respectively. The languages in each set are orthographically similar. $S$ and $T$ need not be mutually exclusive.

We are provided with parallel transliteration corpora for these $l$ language pairs ($D_i$, $\forall i = 1$ to $l$). The goal is to learn a joint transliteration model for all language pairs which minimizes an appropriate loss function over all the transliteration corpora.

$$M^{*} = \underset{M}{\operatorname{argmin}} \; \mathcal{L}(M, D) \qquad (1)$$

where $M$ is the candidate joint transliteration model, $D = (D_1, D_2, \ldots, D_l)$ is the training data for all language pairs, and $\mathcal{L}$ is the training loss function given the model and the training data.

We focus on 3 practical training scenarios:
Similar source languages: We have multiple orthographically similar source languages and a single target language which is not similar to the source languages. This is an instance of many-to-one learning, e.g., Indic languages to English.
Similar target languages: We have multiple orthographically similar target languages and a single source language which is not similar to the target languages. This is an instance of one-to-many learning, e.g., English to Indic languages.
All similar languages: We have multiple source languages as well as target languages, which are all orthographically similar. This is an instance of many-to-many learning, e.g., Indic-Indic languages.

3.2 Proposed Solution

We propose a neural encoder-decoder model for multilingual transliteration. For each source-target language pair $(s, t)$, the network models $P_{s,t} = p(y^t_j \mid y^t_{j-1}, \ldots, y^t_1, x^s)$, where $x^s$ is the input character sequence and $y^t_j$ is the $j$-th element of the output character sequence $y^t$. Note that we design a single network to represent all the $P_{s,t}$ distributions corresponding to the set of language pairs $L$.

Our network is an adaptation of the standard encoder-decoder model with attention (Bahdanau et al., 2015). We describe only the salient aspects of our network and refer the reader to Bahdanau et al. (2015) for the basic encoder-decoder architecture. Figure 1a shows the network architecture of our multilingual transliteration system.

Encoder & Decoder: We used a CNN encoder to encode the character sequence. It consists of a single convolutional layer (stride size = 1 and SAME padding), followed by ReLU units and max pooling. We use filters of different sizes and concatenate their output to produce the encoder output. Figure 1b shows a schematic of the encoder. We chose CNN over the conventional bidirectional LSTM layer since the temporal dependencies for transliteration are mostly local, which can be handled by the CNN encoder. We observed that training and decoding are significantly faster, with little impact on accuracy. The decoder contains a layer of LSTM cells and their start state is the average of the encoder's output vectors (Sennrich et al., 2017).

Parameter Sharing: The vocabulary of the orthographically similar languages (at input and/or output) is the union of the character sets of all these languages. Since the character sets of these languages overlap to a large extent, we share their character embeddings too.
The encoder is shared across all source languages and the decoder is shared across all target languages. The network uses a shared attention mechanism. The attention network consists of a single feedforward layer, which predicts the attention score given the previous decoder output, previous decoder state and encoder annotation vector.

The output layer (a fully connected feedforward layer) transforms the decoder LSTM layer's output to the size of the output vocabulary, and a softmax function is applied to convert the output scores to probabilities. Each target language has its own set of output layer parameters.

Barring the output layer, all network parameters (input embeddings, output embeddings, attention layer, encoder and decoder) are shared across all similar languages. This allows maximum transfer of information for multilingual learning, while the output layer alone specializes for the specific target language. Compared to using a target language tag in the input sequence (Johnson et al., 2017), we believe our approach allows the language-specific parameters to directly influence the output characters.

Training Objective and Model Selection: We minimize the average negative log-likelihood of parallel training corpora across all language pairs. We determined the hyperparameters which gave best results on a validation set. After training the model for a fixed number of iterations (sufficient for convergence), we select the model with the maximum accuracy on the validation set for each language pair. For instance, if the model corresponding to the 32nd epoch reported maximum accuracy on the validation set for English-Hindi, this model was used for reporting test set results for English-Hindi. We observed that this criterion performed better than choosing the model with least validation set loss over all language pairs.

4 Experimental Setup

We describe our experimental setup.
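The per-language-pair model selection described under Training Objective and Model Selection in Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout and the accuracy values are hypothetical, and resolving ties in favour of the earliest epoch is an assumption the text does not specify.

```python
from typing import Dict, List


def select_best_epochs(val_accuracy: Dict[str, List[float]]) -> Dict[str, int]:
    """Pick, for every language pair, the 1-based epoch whose checkpoint
    has the highest validation accuracy for that pair."""
    best = {}
    for pair, accuracies in val_accuracy.items():
        # max() returns the first maximal index, so ties go to the
        # earliest epoch (an assumption, not stated in the paper).
        best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
        best[pair] = best_index + 1
    return best


# Hypothetical validation accuracies for 4 epochs (made-up numbers).
val_accuracy = {
    "en-hi": [41.0, 55.5, 58.2, 57.9],
    "en-bn": [30.1, 42.7, 42.0, 43.3],
}
best_epochs = select_best_epochs(val_accuracy)
print(best_epochs)  # {'en-hi': 3, 'en-bn': 4}
```

Note that a single training run of the joint model can therefore contribute a different checkpoint for each language pair at test time.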
Network Details: The CNN encoder has 4 filters (widths 1 to 4) of 128 hidden units each in the convolutional layer (encoder output size=512). We use a stride size of 1 and the SAME padding for the convolutional and max-pooling layers. The decoder is a single layer of 512 LSTM units. For convenience, we used the same configuration for both bilingual and multilingual experiments across all datasets, after exploration on some language pairs. We apply dropout (Srivastava et al., 2014) (probability=0.5) at the output of the encoder and decoder, and optimize with ADAM (Kingma and Ba, 2014) (learning rate=0.001). We trained our models for a maximum of 40 epochs (which we found sufficient for our models to converge) with a batch size of 32. In each training epoch, we cycle through the parallel corpora of each language pair. The parallel corpora are roughly of the same size. Better training schedules could be explored in future.

Languages: We experimented with two sets of orthographically similar languages:

Indian languages: (i) Hindi (hi), Bengali (bn) from the Indo-Aryan branch of the Indo-European family (ii) Kannada (kn), Tamil (ta) from the Dravidian family. We studied Indic-Indic transliteration and transliteration involving a non-Indian language (English↔Indic). We mapped equivalent characters in different Indic scripts in order to build a common vocabulary based on the common offsets of the Unicode codepoints (Kunchukuttan et al., 2015).

Slavic languages: Czech (cs), Polish (pl), Slovenian (sl) and Slovak (sk). We studied Arabic↔Slavic transliteration. Arabic is a non-Slavic language (Semitic branch of Afro-Asiatic) and uses an abjad script in which vowel diacritics are omitted in general usage.
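The common-offset mapping used above for the Indic scripts can be illustrated with a short sketch. It assumes only what the Unicode code charts state, namely that the Devanagari, Bengali, Tamil and Kannada blocks start at U+0900, U+0980, U+0B80 and U+0C80 and place equivalent characters at the same offset; the function names are hypothetical, and the preprocessing actually used for the experiments may differ (for example, not every offset is assigned in every script).

```python
# Start of the Unicode block for each script (from the Unicode code charts).
BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "ta": 0x0B80,  # Tamil
    "kn": 0x0C80,  # Kannada
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 codepoints


def common_offset(ch, lang):
    """Offset of `ch` inside the block of `lang`, or None if it lies outside."""
    delta = ord(ch) - BLOCK_START[lang]
    return delta if 0 <= delta < BLOCK_SIZE else None


def convert_script(word, src_lang, tgt_lang):
    """Re-encode `word` in the script of `tgt_lang` by preserving offsets."""
    out = []
    for ch in word:
        offset = common_offset(ch, src_lang)
        # Characters outside the source block (digits, Latin, ...) pass through.
        out.append(ch if offset is None else chr(BLOCK_START[tgt_lang] + offset))
    return "".join(out)


ka_hi, ka_bn = "\u0915", "\u0995"  # Devanagari ka, Bengali ka
la_hi, la_kn = "\u0932", "\u0CB2"  # Devanagari la, Kannada la
print(common_offset(ka_hi, "hi") == common_offset(ka_bn, "bn"))  # True
print(convert_script(la_hi, "hi", "kn") == la_kn)                # True
```

Using the offset (rather than the raw codepoint) as the vocabulary entry is what lets one embedding be shared by equivalent characters of different scripts.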
The languages chosen are representative of languages spoken by some major groups of peoples which exhibit orthographic similarity: Indic, Romance, Germanic, Slavic, etc. These languages are spoken by around 2 billion people. Our approach is therefore relevant to a large share of the world's population.

en-Indic      Indic-en      ar-Slavic
en-hi  12K    hi-en  18K    ar-cs  15K
en-bn  13K    bn-en  12K    ar-pl  15K
en-kn  10K    kn-en  15K    ar-sl  10K
en-ta  10K    ta-en  15K    ar-sk  10K

Indic-Indic   bn     kn     ta
hi            3620   5085   5290
bn                   2720   2901
kn                          4216

Table 1: Training set statistics for different datasets (number of word pairs). Validation set: 1K (en→Indic & ar↔Slavic), 500 (Indic→en, Indic-Indic). Test set: 1K (all pairs).

(a) English-Indic
Pair    Src          Tgt
en-hi   KANAKLATA    कनकलता (kanakalatA)
en-kn   LEHMANN      ಹಮ (l.ehaman)

(b) Slavic-Arabic
Pair    Src          Tgt
pl-ar   DUMITRESCU   دومیرتسكو (dwmytrskw)
cs-ar   MAURICE      موریس (mwrys)

Table 2: Examples of transliteration pairs from our datasets

Datasets: (See Table 1 for statistics of datasets). We used the official NEWS 2015 shared task dataset (Banchs et al., 2015) for English to Indic transliteration. This dataset has been used for many editions of the NEWS shared tasks. We split the NEWS 2015 training dataset into train and validation data for Indic-English transliteration. For testing, we used the NEWS 2015 dev-test set. We created the Indian-Indian parallel transliteration corpora from the English to Indian language training corpora of the NEWS 2015 dataset by mining name pairs which have English names in common.

We mined the Arabic-Slavic dataset from Wikidata (Vrandečić and Krötzsch, 2014), a structured knowledge base containing items (roughly entities of interest). Each item has a label (title of item page) which is available in multiple languages. We extracted labels from selected items referring to named entities (persons, organizations and locations) to ensure that we extract parallel transliterations (as opposed to translations).
Pair    P       B       M        Pair    P       B       M

Similar Source and Target Languages
Indic-Indic (45.5%)
bn-hi   29.74*  19.08   27.69    kn-bn   28.59   24.04   37.47*
bn-kn   17.62   18.14   27.74*   kn-ta   34.89   30.85   38.30*
hi-bn   29.92   25.46   39.15*   ta-hi   29.07*  19.24   28.97
hi-ta   25.15   28.62   38.70*   ta-kn   26.99   19.86   29.06*

Similar Source Languages
Slavic-Arabic (55.8%)            Indic-English (24.2%)
cs-ar   38.91   37.10   59.17*   bn-en   55.23*  48.93   54.01
pl-ar   34.70   34.80   44.83*   hi-en   49.19   38.26   51.11*
sk-ar   43.26   37.49   62.21*   kn-en   42.79   33.77   47.70*
sl-ar   41.90   36.74   62.04*   ta-en   33.93*  23.22   25.93

Similar Target Languages
Arabic-Slavic (176.8%)           English-Indic (1.1%)
ar-cs   15.41   12.08   36.76*   en-bn   42.90   41.70   46.10*
ar-pl   13.68   12.26   24.21*   en-hi   60.50   64.10*  60.70
ar-sk   15.24   13.82   38.72*   en-kn   48.70   52.00   53.90*
ar-sl   18.31   13.63   44.35*   en-ta   52.90   57.80*  55.30

Table 3: Comparison of bilingual (B) and multilingual (M) neural models as well as bilingual PBSMT (P) models (top-1 accuracy %). The figure in brackets for each dataset shows the average increase in transliteration accuracy for the multilingual neural model over the bilingual neural model. The best accuracy for each language pair is marked with *.

Evaluation: We use top-1 exact match accuracy as the evaluation metric (Banchs et al., 2015). This is one of the metrics in the NEWS shared tasks on transliteration.

5 Results and Discussion

We discuss and analyze the results of our experiments.

5.1 Quantitative Observations

Table 3 compares results of bilingual (B) and multilingual (M) neural models as well as a bilingual transliteration system (P) based on phrase-based statistical machine transliteration (PBSMT). The PBSMT system was trained using Moses (Koehn et al., 2007) with no lexicalized reordering and uses monotonic decoding. We used a 5-gram character language model trained with Witten-Bell smoothing.

We observe that multilingual training substantially improves the accuracy over bilingual training in all datasets (an average increase of 58.2% over all language pairs).
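The bracketed averages in Table 3 appear to be the mean, over the language pairs of a dataset, of the relative increase of M over B. The sketch below works through that reading for two datasets, using the B and M values copied from Table 3; the function names are hypothetical and the averaging scheme is an assumption that happens to reproduce the reported figures.

```python
def relative_increase(bilingual, multilingual):
    """Relative increase (in %) of multilingual over bilingual accuracy."""
    return 100.0 * (multilingual - bilingual) / bilingual


def mean_relative_increase(rows):
    """Average the per-pair relative increases of a dataset."""
    gains = [relative_increase(b, m) for b, m in rows]
    return sum(gains) / len(gains)


# (B, M) top-1 accuracies from Table 3.
ARABIC_SLAVIC = [(12.08, 36.76), (12.26, 24.21), (13.82, 38.72), (13.63, 44.35)]
ENGLISH_INDIC = [(41.70, 46.10), (64.10, 60.70), (52.00, 53.90), (57.80, 55.30)]

print(round(mean_relative_increase(ARABIC_SLAVIC), 1))  # 176.8
print(round(mean_relative_increase(ENGLISH_INDIC), 1))  # 1.1
```

The English-Indic average is small because two of the four pairs (en-hi and en-ta) are slightly worse under multilingual training, which offsets the gains on en-bn and en-kn.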
Transliteration accuracy increases in all scenarios: (i) similar sources (ii) similar targets and (iii) similar sources & targets.

If we look at results for various language groups, transliteration involving Slavic languages and Arabic benefits more than transliteration involving Indic languages and English. Arabic→Slavic transliteration shows maximum improvement (average: 176.8%) while English→Indic pairs show minimum improvement (average: 1.1%).

We also see that the multilingual model shows significant improvements over a bilingual transliteration system based on phrase-based SMT. The PBSMT system is better than the bilingual neural transliteration system in most cases. This is consistent with previous work (Finch et al., 2015), where the standard encoder-decoder architecture (with attention mechanism) (Bahdanau et al., 2015) could not outperform PBSMT approaches. However, a model using target bidirectional LSTM with model ensembling (Finch et al., 2016) outperforms PBSMT models. These improvements are orthogonal to our work and could be used to further improve the bilingual as well as multilingual systems. The bilingual neural network models are not able to outperform the PBSMT models possibly due to the small size of the datasets and the limited depth of the network (single layer encoder and decoder).

5.2 Qualitative Observations

We see that multilingual transliteration is better than bilingual transliteration in the following aspects:
• Vowels are generally a major source of transliteration errors (Kumaran et al., 2010; Kunchukuttan and Bhattacharyya, 2015) because of ambiguities in vowel mappings. We see a major decrease in vowel errors due to multilingual training (average decrease of ∼20%). We observe substantial decrease in long-short vowel confusion errors (Indic languages as target languages) and insertion/deletion of A (English/Slavic as target). We also see a major improvement in Arabic→Slavic transliteration.
The Arabic script does not represent vowels, hence the transliteration system needs to correctly generate vowels. The multilingual model is better at generating vowels compared to the bilingual model.
• We also observe that consonants with similar phonetic properties are a major source of transliteration errors, and these show a substantial decrease with multilingual training. For Indic-English transliteration, we see substantial error reduction in the following character pairs: K-C, T-D, P-B. We also observe a decrease in confusion between aspirated and unaspirated sounds. For Arabic→Slavic transliteration, we see substantial error reduction for the following character pairs: K-C, F-V and P-B.
• For Slavic→Arabic, we observed a significant reduction in the number of errors related to characters representing fricative sounds like j, s, z, g (Buckwalter romanization).
• The multilingual system seems to prefer the canonical spellings, even when other alternative spellings seem faithful to the source language phonetics. The system is thus able to learn conventional usage better than the bilingual models, e.g. morisa (Hindi, romanized text shown) is transliterated incorrectly to the phonetically acceptable English word moris by the bilingual model. The multilingual model generates the correct transliteration Maurice.
• Since Indic scripts are very phonetic, very few non-canonical spellings are possible. As a consequence, vowel error reduction was also minimal for English-Indic transliteration (10%). This may partly explain why multilingual training provides minimal benefit for English-Indic transliteration.

(a) Arabic-Czech examples
Source              B         M
باستور (bAstwr)     bastor    pastor
كیلني (kylyn)       kelen     kailin

(b) Hindi-English examples
Source              B         M
वर्जिल (varjila)     vergill   virgil
एलिसन (elisana)     elissan   ellison

Table 4: Examples of bi- vs. multi-lingual outputs. ar and hi text are also shown using Buckwalter and ITRANS romanization schemes respectively.
Table 4 shows some examples where multilingual output is better than the bilingual output.

Figure 2: Visualization of contextual representations of vowels for hi-en transliteration. Each colour represents a different vowel.

5.3 Analysis

We investigated a few hypotheses to understand why multilingual models are better:

Better contextual representations for vowels: We hypothesize that the encoder learns better contextual representations for vowels. To test this hypothesis, we studied 3-character-long sequences from the test set with a vowel in the middle (i.e., 1-char window around vowel). We processed these sequences through the encoder of the bilingual and multilingual transliteration systems to generate the encoder output corresponding to the vowels. For instance, for the vowel a in the word part, we encode the 3 character sequence par using the encoder. The encoder output corresponding to the character a is considered the contextual representation of the character a in this word. Figure 2 shows a visualization of these contextual representations of the vowels using t-SNE (van der Maaten and Hinton, 2008). For the bilingual model, we observe that the contextual representations of same vowels tend to cluster together. For the multilingual model, the clustering is more specialized. The representations are grouped by the vowel along with the context. For instance, the region highlighted in the plot shows representations of Hindi vowels e (yellow) and i (blue) followed by the consonant v. Other vowels with the same context are seen in the same region too. This suggests that the multilingual model is able to learn specialized representations of vowels in different contexts and this helps the decoder generate correct transliterations.

More monolingual data: In the many-one scenario, more monolingual data is available for the target language since target words from all training language pairs are available.
We hypothesize that this may help the decoder to better model the target language sequence. To test this, we decoded the test data using the bilingual models along with a larger target RNN LM (with LSTM units) using shallow fusion (Gulcehre et al., 2017). The RNN LM was trained on all the target language words across all parallel corpora. These experiments were performed for Indic-English and Slavic-Arabic pairs. We did not observe any major change in the transliteration accuracies of bilingual models due to integration of a larger LM. Thus, larger target side data does not explain the improvement in transliteration accuracy due to multilingual transliteration.

More parallel data: Multilingual training pools together parallel corpora from multiple orthographically similar languages, which effectively increases the data available for training. To test if this explains the improved performance, we compared multilingual and bilingual models under similar data size conditions, i.e., the total parallel corpora size across all language pairs used to train the multilingual model is equivalent to the size of the bilingual corpus used to train the bilingual model. Specifically we compared the following under similar data size conditions: (a) {bn,hi,kn,ta}-en multilingual model (50% hi-en training pairs) with hi-en bilingual model (b) {cs,pl,sk,sl}-ar multilingual model (30% pl-ar training pairs) with pl-ar bilingual model. Table 5 shows the results of these experiments. We observed that the multilingual system showed significantly higher transliteration accuracy compared to the bilingual model for Polish-Arabic. For Hindi-English, the bilingual model was better than the multilingual model. So, we cannot decisively conclude that the performance improvement of multilingual transliteration can be attributed to the effective increase in training data size.
In a multilingual training scenario, data from other languages acts as a noisy version of data from the original language and supplements the available bilingual data, but it cannot necessarily substitute for data from the original language pair.

Pair     B       M
pl-ar    14.64   44.83
hi-en    38.26   35.93

Table 5: Results of experiments under balanced data conditions (B: bilingual, M: multilingual)

Pair     sepdec   sepout   sepnone
Indic-Indic
bn-hi    27.28    27.69    28.72
bn-kn    25.86    27.74    27.11
hi-bn    37.22    39.15    39.35
hi-ta    35.74    38.70    35.54
kn-bn    32.22    37.47    35.76
kn-ta    40.06    38.30    40.37
ta-hi    27.74    28.97    28.45
ta-kn    28.13    29.06    28.65
Arabic-Slavic
ar-cs    39.68    36.76    35.95
ar-pl    27.25    24.21    26.85
ar-sk    41.36    38.72    40.14
ar-sl    49.14    44.35    45.88
English-Indic
en-bn    41.00    46.10    44.10
en-hi    56.00    60.70    61.60
en-kn    49.30    53.90    54.30
en-ta    53.00    55.30    52.90

Table 6: Comparison of multilingual architectures

5.4 Comparison with variant architectures

We compare three variants of the encoder-decoder architecture: (a) our model: target language specific output layer (and parameters) (sepout), (b) every target language has its own decoder and output layer (sepdec) (Lee et al., 2017), and (c) all languages share the same decoder and output layer, but the first token of the input sequence is a special token to specify the target language (sepnone) (Johnson et al., 2017). These architectures differ in the degree of parameter sharing: sepdec has fewer shared parameters than our model, and sepnone has more shared parameters than our model. In all three cases, the encoder is shared across all source languages. Table 6 shows the results of this comparison. We cannot definitively conclude that one model is better than the others. Except for Arabic-Slavic transliteration, the trend seems to indicate that models with greater parameter sharing (sepout/sepnone) may perform better. In any case, given the comparable results, we prefer models with fewer model parameters (sepout/sepnone).
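To make the difference in parameter sharing between the three variants of Section 5.4 concrete, the following is a minimal sketch, not the authors' implementation. It only tracks which decoder-side modules each variant would use for a given target language. The function names, the module naming scheme and the `<2xx>` target-token format are all hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple


def route(variant: str, src_chars: List[str], tgt_lang: str) -> Tuple[List[str], str, str]:
    """Return (encoder input, decoder name, output-layer name) for one example.

    All module names and the "<2xx>" token format are illustrative assumptions.
    """
    if variant == "sepdec":   # separate decoder and output layer per target language
        return src_chars, f"decoder.{tgt_lang}", f"output.{tgt_lang}"
    if variant == "sepout":   # shared decoder, separate output layer per target language
        return src_chars, "decoder.shared", f"output.{tgt_lang}"
    if variant == "sepnone":  # fully shared; a leading token names the target language
        return [f"<2{tgt_lang}>"] + src_chars, "decoder.shared", "output.shared"
    raise ValueError(f"unknown variant: {variant}")


def decoder_side_modules(variant: str, targets: List[str]) -> Set[str]:
    """Collect the distinct decoder-side modules needed to serve all target languages."""
    modules: Set[str] = set()
    for tgt in targets:
        _, decoder, output = route(variant, [], tgt)
        modules.update({decoder, output})
    return modules


targets = ["bn", "hi", "kn", "ta"]
# Number of distinct decoder-side modules per variant for four target languages.
counts = {v: len(decoder_side_modules(v, targets)) for v in ("sepdec", "sepout", "sepnone")}
```

For four target languages, the sketch needs eight decoder-side modules for sepdec, five for sepout and two for sepnone, which mirrors the ordering of parameter sharing described above. The encoder is left out because it is shared in all three variants.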
6 Zeroshot Transliteration

In the previous sections, we have shown that multilingual training is beneficial for language pairs observed during training. In addition, the encoder-decoder architecture opens up the possibility of zeroshot transliteration, i.e., transliteration between language pairs that have not been seen during training. The encoder-decoder architecture decouples the source and target language network components and makes the architecture more modular. As a consequence, we can consider the encoder output (for the source language) to be embedded in a language-neutral, common subspace - a sort of interlingua. The decoder proceeds to generate the target word from the language-neutral representation of the source word. Hence, training on just a few language pairs is sufficient to learn all language-specific parameters - making zeroshot transliteration possible.

Before describing different zeroshot transliteration scenarios, we introduce a few terms. A language that is the source in any language pair seen during training is said to be source-covered. We can define target-covered languages analogously. Now, we can envisage the following zeroshot transliteration scenarios: (a) unseen language pairs: both the source and target languages are covered, but the pair was not observed during training, (b) unseen source language: the source language is not covered, but it is orthographically similar to other source-covered languages, (c) unseen target language: can be defined analogously. Next, we describe our proposed solutions to these scenarios.

6.1 Unseen Language Pair

We investigated the following solutions:

Multilingual Zeroshot-Direct: Since source and target languages are covered, we use the trained multilingual model discussed in previous sections for source-target transliteration. Model selection can be an issue for this approach.
As discussed earlier, model selection using validation set accuracy for each language pair is better than average validation set loss. For an unseen language pair, we cannot use validation set accuracy/loss for model selection (since validation data is not available). So we explored the following model selection criterion: maximum average validation set accuracy across all the trained language pairs (sc_acc). We also compared sc_acc to the model with the least average validation set loss over all training language pairs (sc_loss).

Nevertheless, there are inherent limitations to model selection by averaging validation accuracy or loss for trained language pairs. Irrespective of the model selection method used, the chosen model may still be suboptimal for unseen language pairs since the network is not optimized for such pairs.

Multilingual Zeroshot-Pivot: To address the limitations of zeroshot-direct transliteration, we propose transliteration from source to target using a pivot language, pipelining the best source-pivot and pivot-target transliteration models. We choose a pivot language such that the network has been trained for the source-pivot and pivot-target pairs. Since the network has been trained for both of these pairs, we can expect optimal performance in each stage of the pipeline. Note that we use the multilingual model for source-pivot and pivot-target transliteration (we found it better than using the bilingual models). To reduce cascading errors due to pipelining, we consider the top-k source-pivot transliterations in the next stage of the pipeline. The probability of a target word y given the source word x is computed as:

$$p(y \mid x) = \sum_{i=1}^{k} p(y \mid z_i)\, p(z_i \mid x) \qquad (2)$$

where $z_i$ is the $i$-th best source-pivot transliteration. We used k=10.

6.2 Unseen Source Language

An unseen source language can be easily handled in our architecture.
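As an aside on the pivoting approach of Section 6.1, the top-k marginalization in Equation 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the two stage models are assumed to be callables that return candidate lists of (word, probability) pairs sorted by decreasing probability, and the names `pivot_transliterate`, `toy_src_to_pivot` and `toy_pivot_to_tgt`, as well as the toy probabilities, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Candidates = List[Tuple[str, float]]  # (word, probability), best first


def pivot_transliterate(
    x: str,
    src_to_pivot: Callable[[str], Candidates],
    pivot_to_tgt: Callable[[str], Candidates],
    k: int = 10,
) -> Candidates:
    """Score target words via Equation 2: p(y|x) = sum_i p(y|z_i) * p(z_i|x)."""
    scores: Dict[str, float] = defaultdict(float)
    # Keep only the k best source-pivot candidates z_1..z_k.
    for z, p_z_given_x in src_to_pivot(x)[:k]:
        # Accumulate the contribution of this pivot candidate to every target word.
        for y, p_y_given_z in pivot_to_tgt(z):
            scores[y] += p_y_given_z * p_z_given_x
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for the two stages of the pipeline (made-up words and probabilities).
def toy_src_to_pivot(x: str) -> Candidates:
    return [("z1", 0.6), ("z2", 0.4)]


def toy_pivot_to_tgt(z: str) -> Candidates:
    return {"z1": [("y1", 0.5), ("y2", 0.5)], "z2": [("y2", 1.0)]}[z]


ranked = pivot_transliterate("x", toy_src_to_pivot, toy_pivot_to_tgt, k=10)
```

In the toy example, the target y2 is reachable through both pivot candidates, so its score sums both paths and it outranks y1. With k=1 only the single best pivot candidate would be used, and the second path would be lost, which is the cascading-error effect that the top-k list is meant to reduce.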
Though the language has not been source-covered, a source word can be directly processed through the network since all source languages share the encoder and character embeddings.

6.3 Unseen Target Language

Handling an unseen target language is tricky since the output layer is specific to each target language. Hence, parameters for the unseen language's output layer cannot be learned during training. Note that even architectures where the entire network is completely shared between all language pairs (Johnson et al., 2017) cannot handle an unseen target language, since the embeddings for target language indicator tokens are not learned during training for unseen target languages. We found that simple approaches like assigning parameters for unseen target languages by averaging the parameters of the trained target languages do not work.

(a) Unseen language pair

Pair     Biling   Zeroshot     Zeroshot Direct
                  Pivoting †   sc_acc   sc_loss   sc_ora
bn-ta    16.20    36.12 (hi)   31.79    27.45     34.47
ta-bn    18.48    47.01 (hi)   24.87    17.97     25.28
hi-kn    34.66    47.14 (ta)   44.48    42.13     43.05
kn-hi    34.12    39.02 (ta)   40.45    37.28     40.96

† best pivot for multilingual pivoting in brackets

(b) Unseen source language

Method         Slavic-ar (cs-ar)   Indic-en (hi-en)
Bilingual      37.10               38.26
Zeroshot       60.48               45.24
Multilingual   59.17               51.11

(c) Unseen target language

Method              ar-Slavic (ar-cs)   en-Indic (en-hi)
                    proxy   acc         proxy   acc
Bilingual           none    12.08       none    64.10
Proxy               sk      42.09       kn       9.00
                    sl      41.89       bn      18.30
                    pl      38.27       ta       2.90
Proxy + LMFusion    sk      42.20       kn       9.70
                    sl      40.99       bn      18.10
                    pl      36.35       ta       3.10

Table 7: Results for zeroshot transliteration

Hence, we use a target-covered language as a proxy for the unseen target language. The simplest approach considers the output of the source to proxy language transliteration system as the target language's output. However, this doesn't take into account phonotactic characteristics of the target language.
We propose to incorporate this information by using an RNN (with LSTM units) character-level language model of the target language words. While predicting the next output character during decoding, we combine the scores from the multilingual transliteration model and the target LM using shallow fusion (Gulcehre et al., 2017).

6.4 Results and Discussion

We discuss the results for each scenario below.

6.4.1 Unseen language pair

We experimented with transliteration between four Indic languages, viz., hi, bn, ta, kn. We trained a multilingual model on 8 out of the 12 possible language pairs, covering all 4 languages. The remaining 4 language pairs (ta-bn, bn-ta, hi-kn and kn-hi) are the unseen language pairs (results in Table 7a).

Zeroshot vs. Direct Bilingual: For all unseen language pairs, all zeroshot systems (pivot as well as the different direct configurations) are better than the direct bilingual system. Note that unlike the zeroshot systems, the bilingual systems were directly trained on the unseen language pairs. Yet the zeroshot systems outperform the direct bilingual systems since the underlying multilingual models are significantly better than the bilingual models (as seen in Section 5). These results show that the multilingual model generalizes well to unseen language pairs.

Direct vs. Pivot Zeroshot: We also observe that the pivot zeroshot system is better than both of the direct zeroshot systems (sc_acc and sc_loss). To verify whether the limitations of the model selection criterion explain the direct system's lower accuracy, we also considered an oracle direct zeroshot system (sc_ora). The oracle system selects the model with the best accuracy using a parallel validation corpus for the unseen language pair. This oracle system is also inferior to the pivot transliteration model.
So, we can conclude that the network is better tuned for transliteration of language pairs it has been directly trained on. Hence, multilingual pivoting works better than direct transliteration in spite of the cascading errors involved in pipelining.

Model Selection Criteria: For the direct zeroshot system, the average accuracy (sc_acc) is a better model selection criterion than average loss (sc_loss).

6.4.2 Unseen source language

We conducted 2 experiments: (a) train on (bn,kn,ta)-en pairs and test on the hi-en pair, and (b) train on (pl,sk,sl)-ar pairs and test on the cs-ar pair. See Table 7b for results. In this scenario, too, we observe that zeroshot transliteration performs better than direct bilingual transliteration. In fact, zeroshot transliteration is competitive with multilingual transliteration too (accuracy is > 90% of multilingual transliteration). Though the network has not seen the source language, the encoder is able to generate source language representations that are useful for decoding.

6.4.3 Unseen target language

We conducted 2 experiments: (a) train on ar-(pl,sk,sl) pairs and test on the ar-cs pair, and (b) train on en-(bn,kn,ta) pairs and test on the en-hi pair. The target languages used in training were the proxy languages. See Table 7c for results.

We observe contradictory results for the two experiments. For ar-cs, the use of a proxy language gives transliteration results better than bilingual transliteration. But the proxy language is not a good substitute for Hindi. Shallow fusion of the target LM with the transliteration model makes little difference.

We see that the transliteration performance of a proxy language is correlated with its orthographic similarity to the target language. Thus, it is preferable to choose a proxy with high orthographic similarity to the target language. We see one anomaly to this trend. Kannada as proxy performs badly compared to Bengali, though Kannada is orthographically more similar to Hindi.
One reason could be the orthographic convention in Hindi and Bengali that the terminal vowel is automatically suppressed. In Kannada, the vowel has to be explicitly suppressed with a terminal halanta character. Simply deleting the terminal halanta in the Kannada output to conform to Hindi conventions increases accuracy to 24.1% (better than Bengali). Clearly, shallow fusion is not sufficient to adapt a proxy language's output to the target language, and further investigations are required. If a proxy-target corpus is available, we can generate better transliterations via pivoting.

7 Incorporating Phonetic Information

So far, we considered characters as atomic units. We have thus relied on correspondences between characters for multilingual learning. In addition, for some languages, we can find an almost one-one correspondence from the characters to phonemes (a basic unit of sound). Each phoneme can be factorized into a set of articulatory properties like place of articulation, nasalization, voicing, aspiration, etc. If the input for transliteration incorporates these phonetic properties, the model may learn better character representations across languages by bringing together similar characters. For example, the Kannada character ಳ (La) has no equivalent Hindi character, but the Hindi character ल (la) is the closest character. The two characters differ in terms of one phonetic feature (the retroflex property), which can be represented in the phonetic input and can serve to indicate the similarity between the two characters.

Group           Pair    O       Ph
Indic-English   bn-en   54.01   55.53
                hi-en   51.11   54.86
                kn-en   47.70   53.31
                ta-en   25.93   29.63
Indic-Indic     bn-hi   27.69   28.51
                bn-kn   27.74   29.72
                hi-bn   39.15   37.73
                hi-ta   38.70   39.00
                kn-bn   37.47   39.19
                kn-ta   38.30   41.93
                ta-hi   28.97   29.17
                ta-kn   29.06   31.54

Table 8: Onehot (O) vs. phonetic (Ph) input
We incorporated phonetic features in our model by using feature-rich input vectors instead of the conventional onehot vector input for characters. Our phonetic feature input vector is a bitvector encoding the phonetic properties of the character, with one bit for each value of every property. The multiplication of the phonetic feature vector with the weight matrix in the first layer generates phonetic embeddings for each character. These are the inputs to the encoder. Apart from this input change, the rest of the network architecture is the same as described in Section 3.2.

Experiments: We experimented with Indian languages (Indic→English and Indic-Indic transliteration). Indic scripts generally have a one-one correspondence from characters to phonemes. Hence, we use the phonetic features described by Kunchukuttan et al. (2016) to generate phonetic feature vectors for characters (available via the IndicNLPLibrary¹). These Indic languages are spoken by nearly a billion people, and hence the use of phonetic features is useful for many of the world's most widely spoken languages.

¹ https://github.com/anoopkunchukuttan/indic_nlp_library

Results and Discussion: Table 8 shows the results. We observe that phonetic feature input improves transliteration accuracy for Indic-English transliteration. The improvements are primarily due to a reduction in errors related to similar consonants like (T,D), (P,B), (C,K) and the use of H for aspiration.

For Indic-Indic transliteration, we see moderate improvement in transliteration accuracy due to phonetic feature input. Since the source as well as target scripts are largely phonetic, phonetic representation may not be useful in resolving ambiguities (unlike Indic-English transliteration). Again, we see improvements due to a reduction in errors related to similar consonants.
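The phonetic bitvector input described in this section can be illustrated with a small sketch. It assumes a toy feature inventory: the property names, their values, the embedding size and the random weights below are hypothetical stand-ins, not the feature set of Kunchukuttan et al. (2016) or the IndicNLPLibrary API. The sketch encodes one bit per value of every property and obtains an embedding by multiplying the bitvector with a first-layer weight matrix.

```python
import random
from typing import Dict, List

# Hypothetical, heavily simplified inventory: one bit per value of every property.
PROPERTIES: Dict[str, List[str]] = {
    "type": ["vowel", "consonant"],
    "place": ["velar", "palatal", "retroflex", "dental", "labial"],
    "manner": ["stop", "nasal", "lateral", "fricative"],
    "voicing": ["unvoiced", "voiced"],
    "aspiration": ["unaspirated", "aspirated"],
}
NUM_BITS = sum(len(values) for values in PROPERTIES.values())
EMBED_DIM = 4  # illustrative embedding size


def encode(features: Dict[str, str]) -> List[int]:
    """Build the bitvector: a 1 for the value a character takes for each property."""
    bits: List[int] = []
    for prop, values in PROPERTIES.items():
        bits.extend(1 if features.get(prop) == value else 0 for value in values)
    return bits


def embed(bits: List[int], weights: List[List[float]]) -> List[float]:
    """Multiply the bitvector with the first-layer weight matrix (NUM_BITS x EMBED_DIM)."""
    return [sum(b * row[j] for b, row in zip(bits, weights)) for j in range(len(weights[0]))]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of bit positions in which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # random weights stand in for learned parameters
W = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_BITS)]

# Illustrative feature assignments for Hindi la and Kannada La (retroflex lateral).
la_hi = {"type": "consonant", "place": "dental", "manner": "lateral",
         "voicing": "voiced", "aspiration": "unaspirated"}
La_kn = dict(la_hi, place="retroflex")

la_bits, La_bits = encode(la_hi), encode(La_kn)
la_vec, La_vec = embed(la_bits, W), embed(La_bits, W)
```

Under these assumptions the two characters differ only in the place property, so their bitvectors differ in two bit positions and their embeddings share the contributions of every other feature row. This is the sense in which phonetic input can bring similar characters from different scripts together.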
8 Conclusion and Future Work

We show that multilingual training using a neural encoder-decoder architecture significantly improves transliteration involving orthographically similar languages compared to bilingual training. Our key idea is maximal sharing of network components in order to utilize high task relatedness on account of orthographic similarity. The primary reasons for the improvements could be: (a) learning of specialized representations by the shared encoder, and (b) the ability to learn canonical spellings. We also show that the multilingual transliteration models can generalize well to language pairs not encountered during training and observe that zeroshot transliteration can outperform direct bilingual transliteration in many cases. Moreover, multilingual transliteration can be further improved by shared phonetic input.

Transliteration is an example of a sequence to sequence task which is characterized by the following properties: (a) small vocabulary, (b) short sequences, (c) monotonic transformation, (d) unequal source and target sequence length, (e) significant vocabulary overlap across languages. Given the benefits we have shown for multilingual transliteration, other NLP tasks that can be characterized similarly (viz., grapheme to phoneme conversion, translation of short text like tweets and headlines between related languages at subword level, and possibly speech recognition as well as TTS) could also benefit from multilingual training.

9 Acknowledgements

We would like to thank Rudramurthy V for making available code for parsing Wikidata and extracting multilingual named entities. We would also like to thank the action editor and reviewers for their valuable comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Rafael E. Banchs, Min Zhang, Xiangyu Duan, Haizhou Li, and A. Kumaran.
2015. Report of NEWS 2015 Machine Transliteration Shared Task. In Proceedings of the Fifth Named Entities Workshop.
Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication.
Rich Caruana. 1997. Multitask learning. Machine learning.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task learning for multiple language translation. In Annual Meeting of the Association for Computational Linguistics.
Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration. In International Workshop on Spoken Language Translation.
Andrew Finch, Lemao Liu, Xiaolin Wang, and Eiichiro Sumita. 2015. Neural network transduction models in transliteration generation. In Proceedings of the Fifth Named Entities Workshop.
Andrew Finch, Lemao Liu, Xiaolin Wang, and Eiichiro Sumita. 2016. Target-Bidirectional neural models for machine transliteration. In Proceedings of The Sixth Named Entities Workshop.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Conference of the North American Chapter of the Association for Computational Linguistics.
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Conference of the North American Chapter of the Association for Computational Linguistics.
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into Neural Machine Translation. Computer Speech and Language.
Junqing He, Long Wu, Xuemin Zhao, and Yonghong Yan. 2017. HCCL at SemEval-2017 Task 2: Combining multilingual word embeddings and transliteration model for semantic similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation.
Jagadeesh Jagarlamudi and Hal Daumé III. 2012.
Regularized interlingual projections: Evaluation on multilingual transliteration. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion. In Annual Meeting of the Association for Computational Linguistics.
Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, and Grzegorz Kondrak. 2009. DirecTL: A language-independent approach to transliteration. In Proceedings of the 2009 Named Entities Workshop: Shared task on transliteration.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics.
Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. 2011. Machine transliteration survey. ACM Computing Surveys.
Mitesh M. Khapra, A. Kumaran, and Pushpak Bhattacharyya. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions.
A. Kumaran, Mitesh M. Khapra, and Pushpak Bhattacharyya. 2010. Compositional machine transliteration. ACM Transactions on Asian Language Information Processing.
Anoop Kunchukuttan and Pushpak Bhattacharyya. 2015. Data representation methods and use of mined corpora for Indian language transliteration. In Named Entities Workshop.
Anoop Kunchukuttan, Ratish Puduppully, and Pushpak Bhattacharyya. 2015. Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent. In Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies: System Demonstrations.
Anoop Kunchukuttan, Pushpak Bhattacharyya, and Mitesh M. Khapra. 2016. Substring-based unsupervised transliteration with phonetic and contextual knowledge. In SIGNLL Conference on Computational Natural Language Learning.
Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level Neural Machine Translation without explicit segmentation. Transactions of the Association for Computational Linguistics.
Bruno Pouliquen, Ralf Steinberger, Camelia Ignat, Temnikova Irina, and Anna Widiger. 2005. Multilingual person name recognition and transliteration. Corela.
Ram Prakash. 2012. Quillpad multilingual predictive transliteration system. In Proceedings of the Second Workshop on Advances in Text Input Methods.
V. Rudramurthy, Mitesh M. Khapra, and Pushpak Bhattacharyya. 2016. Sharing network parameters for crosslingual Named Entity Recognition. arXiv preprint arXiv:1607.00198.
Amrita Saha, Mitesh M.
Khapra, Sarath Chandar, Janarthanan Rajendran, and Kyunghyun Cho. 2016. A correlational encoder-decoder architecture for pivot-based sequence generation. In International Conference on Computational Linguistics.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: A toolkit for neural machine translation. In Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.
Kārumūri V Subbārāo. 2012. South Asian languages: A syntactic typology. Cambridge University Press.
Harshit Surana and Anil Kumar Singh. 2008. A more discerning and adaptable multilingual transliteration mechanism for Indian languages. In Proceedings of the Third International Joint Conference on Natural Language Processing.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.
Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM.
Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
Su-Youn Yoon, Kyoung-Young Kim, and Richard Sproat. 2007. Multilingual transliteration using feature based phonetic method. In Annual Meeting-Association for Computational Linguistics.
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource Neural Machine Translation.