key: cord-0554805-mz6cw9fe authors: Mahdieh, Mahdis; Chen, Mia Xu; Cao, Yuan; Firat, Orhan title: Rapid Domain Adaptation for Machine Translation with Monolingual Data date: 2020-10-23 journal: nan DOI: nan sha: 1db066c7b87f297aaa340254d4b91082592a830b doc_id: 554805 cord_uid: mz6cw9fe

One challenge of machine translation is how to quickly adapt to unseen domains in the face of surging events like COVID-19, where timely and accurate translation of in-domain information into multiple languages is critical but little parallel data is yet available. In this paper, we propose an approach that enables rapid domain adaptation from the perspective of unsupervised translation. Our proposed approach only requires in-domain monolingual data and can be quickly applied to a pre-existing translation system trained on the general domain, achieving significant gains in in-domain translation quality with little or no drop on the general domain. We also propose an effective procedure for simultaneous adaptation to multiple domains and languages. To the best of our knowledge, this is the first attempt to address unsupervised multilingual domain adaptation.

COVID-19 is an unexpected major worldwide event that has hit almost all aspects of human life. Facing such an unprecedented pandemic, how to timely and accurately communicate and share the latest authoritative information and medical knowledge across the world in multiple languages is critical to the well-being of human society. This naturally raises the question of how an existing translation system, usually trained on data from general domains, can rapidly adapt to emerging domains like COVID-19 before any parallel training data becomes available.

Domain adaptation is a traditional research topic in machine translation for which many approaches have been proposed (Chu and Wang, 2018). Nevertheless, most of them are not suitable for rapid adaptation to emerging events. A large body of existing adaptation approaches are supervised, requiring a time-consuming data collection procedure, and while there has been some recent progress in unsupervised domain adaptation (for example, Jin et al., 2020; Dou et al., 2019, 2020), these methods are not designed specifically to fulfil the requirement of rapidity, often involving costly algorithmic steps like lexicon induction, pseudo-sample selection, or building models from scratch.

In this paper, we propose a novel approach for rapid domain adaptation for NMT, with the goal of enabling the development and deployment of a domain-adapted model as quickly as possible. For this purpose, we keep the following principles in mind when designing the procedure:
Simplicity: The procedure should be as simple as possible, requiring only in-domain monolingual data and avoiding excessive auxiliary algorithmic steps.
Scalability: The procedure should be easy to scale up to multiple languages and multiple domains simultaneously.
Quality: The adapted model should not sacrifice quality on general domains for improvements on new domains.

Our approach casts domain adaptation as an unsupervised translation problem and organically integrates unsupervised NMT techniques with a pre-existing model trained on the general domain. Specifically, we employ MASS (Song et al., 2019), an effective unsupervised MT procedure, to induce translations from in-domain monolingual data. It is mingled with supervised general-domain training to form a composite objective in a continual learning setup.
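Concretely, the composite objective can be summarized as a weighted combination of the supervised general-domain loss and the two unsupervised in-domain losses. The notation below is a schematic illustration only: in practice the mixture is controlled through data sampling ratios rather than fixed weights.

$$
\mathcal{L}(\theta) = \lambda_{\mathrm{sup}}\,\mathcal{L}_{\mathrm{sup}}\!\left(\theta; D^{\parallel}_{\mathrm{general}}\right)
+ \lambda_{\mathrm{MASS}}\,\mathcal{L}_{\mathrm{MASS}}\!\left(\theta; D^{\mathrm{mono}}_{\mathrm{in}}\right)
+ \lambda_{\mathrm{BT}}\,\mathcal{L}_{\mathrm{BT}}\!\left(\theta; D^{\mathrm{mono}}_{\mathrm{in}}\right)
$$

Here $D^{\parallel}_{\mathrm{general}}$ denotes general-domain parallel data, $D^{\mathrm{mono}}_{\mathrm{in}}$ denotes in-domain monolingual data, and the $\lambda$ terms stand for the relative sampling weights of the supervised, MASS, and online back-translation tasks.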
We demonstrate the efficacy of our approach on multiple adaptation tasks including COVID-19 (Anastasopoulos et al., 2020) and OPUS medical (Tiedemann, 2012), as well as an in-house sports/travel adaptation challenge. What is more, we show that this procedure can be effectively extended to multiple languages and domains simultaneously; to the best of our knowledge, this is the first attempt at unsupervised domain adaptation for multilingual MT.

One of the most intriguing research topics in MT is how to enable translation without parallel data, whose collection is costly. Throughout the history of MT research, many approaches to unsupervised MT have been proposed, but it was not until recent years that significant progress was made on this topic (Artetxe et al., 2018; Lample et al., 2018a,b; Conneau and Lample, 2019; Artetxe et al., 2019; Song et al., 2019; Zhu et al., 2020), together with the rapid advancement of neural translation models. For example, the BLEU score on WMT14 English-French improved from 15 (Artetxe et al., 2018) to 38 within just two years.

The approach we propose in this paper, detailed in Sec 3.1, engages unsupervised MT methods for the purpose of domain adaptation. The specific technique we build on is MASS (Song et al., 2019), of which we give a brief account as follows. In a nutshell, MASS is an encoder-decoder version of the popular BERT (Devlin et al., 2019) pre-training procedure, in which blocks of the encoder input are masked and must be predicted on the decoder side with only the remaining context available. This procedure is applied to monolingual data from both source and target languages, which forces the representations learned for the two languages through this denoising auto-encoding process to live in the same space. As a result, even with only monolingual inputs, the model's translation ability already starts to emerge by the end of the MASS training procedure. To further boost translation quality, it is common practice to continue training with online back-translation, which translates target-side inputs back into the source language to form pseudo-parallel data that guides model training. Overall, the MASS algorithm is simple and elegant while demonstrating strong performance, almost comparable to supervised approaches. It naturally fits the encoder-decoder framework and can be easily extended to the rapid continual domain adaptation scenario. We therefore adopt this approach as the backbone of our proposed method.
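To make the masking operation concrete, the following minimal sketch builds a MASS-style training example from a single monolingual sentence; the masking ratio, span selection, and token handling are simplified assumptions rather than the exact settings of Song et al. (2019).

```python
# Minimal sketch of MASS-style example construction: mask a contiguous span
# of the encoder input and ask the decoder to reconstruct exactly that span.
import random

MASK = "[MASK]"

def mass_example(tokens, mask_ratio=0.5):
    """Return (encoder_input, decoder_target) for one monolingual sentence."""
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target

# No parallel data is needed; any in-domain monolingual sentence works.
enc_in, dec_out = mass_example("the patient tested positive for the novel virus".split())
print(enc_in)
print(dec_out)
```

The same construction is applied to monolingual text from every language involved, and online back-translation is layered on top of it to generate pseudo-parallel training pairs.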
When directly applying an existing NMT system to translation tasks for emerging events like COVID-19, the results often contain numerous errors, as the model was never trained on data from this novel domain. The challenging part of this adaptation scenario is that at the beginning of such events, no in-domain parallel corpus is yet available, but the NMT system is required to respond properly in time. Therefore, an unsupervised and rapid adaptation procedure needs to be in place to fulfil such requirements. Although domain adaptation has been a traditional research area of MT, most of the existing approaches assume the availability of parallel in-domain data (Freitag and Al-Onaizan, 2016; Wang et al., 2017; Zhang et al., 2019; Thompson et al., 2019; Saunders et al., 2019). While there are also approaches that require only monolingual data (Farajian et al., 2017; Dou et al., 2019; Jin et al., 2020), their adaptation procedures are often heavy-weight (for example, training data selection or retraining the model from scratch) and not suitable for the purpose of rapid adaptation. What is more, existing approaches usually only consider adaptation towards a single domain for a single language pair. How to rapidly adapt to multiple domains across multiple language pairs remains an under-explored topic. To address the aforementioned problems, we develop a light-weight, unsupervised continual adaptation procedure that effectively handles multiple domains and languages simultaneously. We now detail our methodology in the following section.

We treat unsupervised domain adaptation as unsupervised learning of a new language and leverage MASS, introduced in Sec 2.1, as a central building block in our procedure. In order to find the most suitable configuration for domain adaptation tasks, we start by investigating the different training procedures outlined in Fig 1. Our training procedures consist of three main components that can be trained sequentially or jointly:
1. Supervised training with general parallel data.
2. MASS pre-training on monolingual data.
3. Online back-translation using monolingual data.
The monolingual data used for training these components can be either general or in-domain data. Components trained using in-domain data are represented with a dark orange color in Fig 1. In this paper, we focus on the S4 configuration, as it achieves the highest quality improvement on the adapted domain. It also provides faster domain adaptation than the other approaches, since it only requires in-domain data in the last step of the training process. In Section 4.3, we compare these approaches in more detail. S4 consists of three training steps, as shown in Fig 1. The first two steps rely on general parallel and monolingual data, while the third step makes use of in-domain monolingual data. This final step allows us to adapt the model to a new domain rapidly without suffering quality loss on the general domain.

It has become common for a neural machine translation system to handle multiple languages simultaneously. However, efficiently adapting a multilingual translation model to new domains is still an under-explored topic. We show that our approaches outlined in Sec. 3.1 can be easily extended to multilingual settings. Almost all existing work focuses on adapting an existing model to a single domain. We explore novel setups where the model is adapted to multiple domains in an unsupervised manner. This provides insight into the model's ability to retain previously acquired knowledge while absorbing new information. Given a general model G, trained using the first two steps of the S4 training procedure, we explore three different setups to adapt G to two new domains A and B: (1) G → A → B, (2) G → B → A, and (3) G → {A, B}, i.e., simultaneous adaptation to both domains. Each → indicates an adaptation process that jointly trains on general parallel data and domain monolingual data, following the third step of the S4 configuration.
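Schematically, the three setups can be organized as below. This is an illustrative sketch only: joint_adapt is a hypothetical stand-in for the third S4 step (joint training on general parallel data plus MASS and online back-translation on the listed in-domain monolingual corpora), and the corpus names are placeholders.

```python
# Sketch of the three two-domain adaptation setups (A = medical, B = law).
from copy import deepcopy

def joint_adapt(model, general_parallel, domain_mono):
    """Stand-in for one adaptation run; records which domains were used."""
    adapted = deepcopy(model)
    adapted["adaptation_history"].append(sorted(domain_mono))
    return adapted

G = {"adaptation_history": []}  # general model after S4 steps 1 and 2

# Setup 1 (sequential): G -> A -> B
m_ab = joint_adapt(joint_adapt(G, "general_parallel", ["medical"]),
                   "general_parallel", ["law"])
# Setup 2 (sequential): G -> B -> A
m_ba = joint_adapt(joint_adapt(G, "general_parallel", ["law"]),
                   "general_parallel", ["medical"])
# Setup 3 (simultaneous): G -> {A, B}
m_sim = joint_adapt(G, "general_parallel", ["medical", "law"])
```

In every setup, general parallel data remains part of the training mixture, which is what prevents catastrophic forgetting of the general domain.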
We conduct our experiments on OPUS (Tiedemann, 2012) (law and medical domains) and COVID-19 (Anastasopoulos et al., 2020), as well as an in-house dataset in the sports/travel domain. For the OPUS and COVID-19 experiments, the general-domain parallel and monolingual data come from WMT, the same corpus as in (Siddhant et al., 2020). Detailed dataset statistics can be found in Table 1 and Table 2. Our in-house datasets are collected from the web. The general-domain parallel data sizes range from 130M to 800M, and the sports/travel-domain monolingual data sizes are between 13K and 2M.

We evaluate our approaches with both bilingual and multilingual tasks on each dataset. For the OPUS medical and law domains, the bilingual tasks are en→de, en→fr, en→ro, and the multilingual task is en→{de, fr, ro}. For COVID-19, they are en→fr, en→es, en→zh and en→{fr, es, zh}. For the in-house sports/travel domain data, we report results on zh→ja and on a 12-language-pair ({en, ja, zh, ko}→{en, ja, zh, ko}) multilingual model setup.

All experiments are performed with the Transformer architecture (Vaswani et al., 2017) using the TensorFlow Lingvo implementation (Shen et al., 2019). We use the Transformer Big (Chen et al., 2018) model with 375M parameters and a shared source-target SentencePiece model (SPM) (Kudo and Richardson, 2018) with a vocabulary size of 32k.
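As one concrete piece of this setup, the shared source-target subword vocabulary can be built roughly as follows; the file paths and all SentencePiece options other than the 32k vocabulary size are assumptions for illustration.

```python
# Sketch of building a shared source-target SentencePiece model (SPM).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="general_domain_source_and_target.txt",  # assumed combined text file
    model_prefix="shared_spm_32k",
    vocab_size=32000,       # shared 32k vocabulary, as used in our experiments
    model_type="unigram",   # SentencePiece default; an assumption here
)

sp = spm.SentencePieceProcessor(model_file="shared_spm_32k.model")
print(sp.encode("timely translation of emerging-domain content", out_type=str))
```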
Baselines: We compare the results of our proposed unsupervised domain adaptation approach with the corresponding bilingual and multilingual models trained only with general-domain parallel data, without any adaptation. For datasets that have in-domain parallel data available, such as OPUS and COVID-19, we also compare our performance against supervised domain adaptation results, which are produced by experimenting with both continued and simultaneous training using different mixing strategies of in-domain and general parallel data and selecting the best results for each task. In all cases, we report BLEU scores on both general and in-domain test sets.

Single-domain adaptation: Our bilingual results are shown in Table 3. Compared with the unadapted baseline models, our unsupervised approach achieves significant quality gains on the in-domain test sets with almost always no quality loss on the general test sets (i.e., learning without forgetting). This improvement is consistent across all three datasets and all languages, with BLEU gains of +13 to +24 on the OPUS medical domain, +8 to +15 on the OPUS law domain (with the exception of en-fr), +2.3 to +2.8 on COVID-19, and +3.5 on the sports/travel domain. Moreover, our method is able to almost match or even surpass the best supervised adaptation performance on a few tasks (e.g., COVID-19 en-fr, en-es, en-zh, OPUS medical en-fr, OPUS law en-ro).

Table 4 and Figure 2 show our multilingual results. We can see that our approach can be effectively extended to multilingual models. There is a large quality improvement across all supported language pairs on the adapted new domains, while there is almost no quality regression on the general domains. The improvement ranges from +5 to +9 on the OPUS medical domain, +3 to +10 on the OPUS law domain, +0.4 to +2.3 on COVID-19, and up to +3 BLEU on the sports/travel domain.

We demonstrate our multi-domain adaptation approaches with a two-domain setup on the OPUS medical and law domains. We report the results of the three different setups described in Section 3.3 for both the bilingual and multilingual scenarios, shown in Table 5 and Table 6 respectively. Our results suggest that the two-domain simultaneous adaptation approach is able to match the quality of individual single-domain adaptation, with a gap of less than 1.5 BLEU points on both domains and all language pairs for the bilingual models. For the multilingual model, our two-domain adaptation approach matches or outperforms the single-domain adaptation method on the medical domain, while there is a gap of between 0.9 and 4.1 BLEU points on the law domain. Since multi-domain adaptation with a multilingual model requires joint training with both general and in-domain data from all supported languages, the data mixing/sampling strategy becomes more important in order to achieve balanced quality improvements across multiple domains as well as multiple language pairs. We further observe that among the three multi-domain adaptation setups, simultaneous adaptation to all domains is the most effective approach. In the sequential setups, there is almost always some quality regression on the previous domain when the model is being adapted to the second domain.

In this section, we compare the different training procedure configurations described in Section 3.1 on the in-house zh→ja task in the sports/travel domain. Table 7 shows the best results we were able to obtain for each configuration after experimenting with different data sampling ratios and training parameters. Our main observations are the following:
• Compared with the baseline model, initializing the supervised training stage with a model pretrained on domain monolingual data, either with MASS (S1) or with both MASS and online back-translation (S2), can result in a slight quality improvement (less than 1 BLEU) on the adapted domain.
• Comparing {S1, S2} vs. {S3, S4, S5, S6}, joint MASS, online back-translation and supervised training (with both parallel and monolingual data) always seems more effective in boosting the model quality on the adapted domain than purely pipelined procedures.
• It is always helpful to initialize the joint training phase with pretrained models (e.g., S3, S4, S5). Otherwise, it can be hard to find the right sampling ratios among the MASS, online back-translation and supervised tasks during a single training process such that the model improves on the adapted domain without any quality regression on the general domain.
• Among all the pretraining procedures, it is better to include both MASS and supervised training phases, instead of only supervised training. This way the model can also pick up the language-dependent components of the architecture during pretraining, which is beneficial for the subsequent joint training phase.
Overall, we find that S4 is our most preferable setup. It also offers the advantage of "rapid" adaptation, as the MASS and supervised training phases only require general-domain data and can thus be prepared in advance.
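The sampling ratios mentioned above can be realized as a simple task-level mixing schedule during the joint training phase. The sketch below is illustrative only, and the ratios are placeholders rather than the tuned values behind Table 7.

```python
# Illustrative task sampling for the joint adaptation phase: each step draws
# one task, which determines whether the batch comes from general parallel
# data (supervised), in-domain monolingual data with MASS masking, or
# in-domain monolingual data routed through online back-translation.
import random

TASK_RATIOS = {
    "supervised_general": 0.5,      # guards against forgetting the general domain
    "mass_in_domain": 0.25,
    "online_bt_in_domain": 0.25,
}

def sample_task(ratios=TASK_RATIOS):
    tasks, weights = zip(*ratios.items())
    return random.choices(tasks, weights=weights, k=1)[0]

for step in range(5):
    print(step, sample_task())
```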
Domain adaptation is an active topic in MT research (Chu and Wang, 2018) and has been considered one of the major challenges for NMT (Koehn and Knowles, 2017), especially when little or no in-domain parallel data is available. Perhaps most closely related to our work is (Jin et al., 2020), which also relies on a denoising auto-encoder, iterative back-translation, and supervision from general-domain data for unsupervised domain adaptation. Our work differs from theirs in the following ways. First, our work is motivated by rapid adaptation of existing models via continual learning, whereas their work builds an in-domain model from scratch; we therefore pay close attention to preventing catastrophic forgetting. What is more, we also investigate simultaneous unsupervised domain adaptation across multiple languages and domains, topics rarely studied before.

While our work is inspired by recent progress in unsupervised MT, other approaches that use monolingual data for domain adaptation exist. Dou et al. (2020) present an approach that selects examples from general-domain data that are representative of the target domain and simple enough for back-translation. Dou et al. (2019) propose to use both in- and out-of-domain monolingual data to learn domain-specific features, which allow the model to build domain-specific representations of words and sentences. Lexicon induction has also been used to create pseudo-parallel training data from both general-domain parallel data and in-domain monolingual data. Farajian et al. (2017) adapt to arbitrary in-domain inputs by selecting a subset of out-of-domain training samples most similar to the new inputs, and then fine-tuning the model on this subset only. Besides unsupervised domain adaptation, many approaches have traditionally been proposed for supervised domain adaptation, for example, model ensembling between in- and out-of-domain models (Freitag and Al-Onaizan, 2016; Saunders et al., 2019), regularization that prevents catastrophic forgetting (Thompson et al., 2019), training data selection based on in- and out-of-domain sample similarity (Wang et al., 2017; Zhang et al., 2019), and meta-learning of domain-specific model parameters. We also note that our approach is closely related to techniques for improving NMT quality on low-resource language pairs by making use of monolingual data. For example, Siddhant et al. (2020) proposed to improve low-resource translation quality by mingling the MASS objective on monolingual data with supervised objectives for high-resource languages during training, and observed significant gains.

We presented an unsupervised rapid domain adaptation approach for machine translation inspired by unsupervised NMT techniques. Our approach continually adapts an existing model to novel domains using only monolingual data through a MASS-inspired procedure, which is shown to significantly boost quality on unseen domains without quality drops on existing ones. We further demonstrate that this approach is flexible enough to accommodate multiple domains and languages simultaneously with almost equal efficacy. While domain adaptation, unsupervised translation, and multilingual translation are usually treated as separate research topics, our study finds that the boundaries between them can be blurred so that a unified procedure can serve all purposes.

Alp Oktem, Eric Paquin, Grace Tang, and Sylwia Tur.
An effective approach to unsupervised machine translation.
Unsupervised neural machine translation.
The best of both worlds: Combining recent advances in neural machine translation.
A survey of domain adaptation for neural machine translation.
Cross-lingual language model pretraining.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Antonios Anastasopoulos, and Graham Neubig. 2020. Dynamic data selection and weighting for iterative back-translation.
Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings.
Multi-domain neural machine translation through unsupervised adaptation.
Fast domain adaptation for neural machine translation.
Domain adaptation of neural machine translation by lexicon induction.
A simple baseline to semi-supervised domain adaptation for machine translation.
Six challenges for neural machine translation.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.
Unsupervised machine translation using monolingual corpora only.
Phrase-based & neural unsupervised machine translation.
MetaMT, a meta-learning method leveraging multiple domain data for low resource machine translation.
Multilingual denoising pre-training for neural machine translation.
Domain adaptive inference for neural machine translation.
Lingvo: a modular and scalable framework for sequence-to-sequence modeling.
Leveraging monolingual data with self-supervision for multilingual neural machine translation.
MASS: Masked sequence to sequence pre-training for language generation.
Overcoming catastrophic forgetting during domain adaptation of neural machine translation.
Parallel data, tools and interfaces in OPUS.
Attention is all you need.
Sentence embedding for neural machine translation domain adaptation.
Curriculum learning for domain adaptation in neural machine translation.
Incorporating BERT into neural machine translation.