Sentence Alignment with Parallel Documents Facilitates Biomedical Machine Translation
Luo, Shengxuan; Ying, Huaiyuan; Li, Jiao; Yu, Sheng
2021-04-17

Objective: Today's neural machine translation (NMT) can achieve near human-level translation quality and greatly facilitates international communication, but the lack of parallel corpora poses a key problem for the development of translation systems in highly specialized domains such as biomedicine. This work presents an unsupervised algorithm for deriving parallel corpora from document-level translations via sentence alignment and explores how training materials affect the performance of biomedical NMT systems.

Materials and Methods: Document-level translations are mixed to train bilingual word embeddings (BWEs) for the evaluation of cross-lingual word similarity, and sentence distance is defined by combining semantic and positional similarities of the sentences. The alignment of sentences is formulated as an extended earth mover's distance problem. A Chinese-English biomedical parallel corpus is derived with the proposed algorithm using bilingual articles from UpToDate and translations of PubMed abstracts, and is then used for the training and evaluation of NMT.

Results: On two manually aligned translation datasets, the proposed algorithm achieved accurate sentence alignment in the 1-to-1 cases and outperformed competing algorithms in the many-to-many cases. The NMT model fine-tuned on biomedical data significantly improved in-domain translation quality (zh-en: +17.72 BLEU; en-zh: +17.02 BLEU). Both the size of the training data and the combination of different corpora can significantly affect the model's performance.

Conclusion: The proposed algorithm relaxes the assumptions commonly made in sentence alignment and effectively generates accurate translation pairs that facilitate training high-quality biomedical NMT models.

In recent years, neural machine translation (NMT) systems have achieved near human-level performance on several language pairs in the general domain. [1-3] However, the reliance on large parallel corpora is the major bottleneck in training an NMT system, and in many domains, the vast majority of language pairs have very little, if any, parallel data. [4,5] Manually building such parallel corpora is expensive and labor intensive.

In the biomedical domain, machine translation is essential for international collaboration. The outbreak of the COVID-19 pandemic showed the importance of international medical communication and cross-language medical data sharing in helping people around the world respond to public health emergencies. [6,7] A biomedical machine translation system can support the timely and efficient exchange of treatment experience and research findings across countries with different languages. Beyond communication, biomedical NMT is needed for many other applications, such as adapting statistical or machine learning models developed in one country to countries where other languages are spoken, or unifying the terminology systems and ontologies of different countries for global collaboration in medical knowledge and artificial intelligence.

The architecture of machine translation systems has evolved over a long period.
Statistical machine translation [8-10] was long the mainstream approach, but NMT has become the de facto standard for large-scale machine translation in the past few years. [11-14] Currently, the most popular machine translation architecture is the Transformer. [13]

The first and most crucial step in building a translation model is to obtain a parallel corpus, since the essence of machine translation (MT) is to use the patterns mined from existing corpora to translate new text. [15] Most current work follows a pipeline for constructing a parallel corpus: text collection, text preprocessing, and sentence alignment. [15-18] For some dominant languages, there are many large-scale publicly available general-domain parallel corpora containing tens of millions of sentence pairs. In contrast, parallel corpora in the biomedical domain are significantly smaller and scarcer. Moreover, the biomedical domain is well known for its complex nomenclature, [19] massive terminologies, and a language style that differs from that of the general domain. As a result, translation models trained on the general domain perform poorly on biomedical text [20,21] and yield lower translation quality. A biomedical NMT model needs to learn to translate biomedical text precisely from domain corpora. Existing work has provided bilingual biomedical corpora for a few major language pairs, including de/en, es/en, fr/en, it/en, pt/en, and zh/en. Their sizes range from thousands to hundreds of thousands of sentence pairs, [19,22,23] which is insufficient to train NMT models.

Noting that document-level translations (parallel documents) are available from research papers and multilanguage websites, in this paper we propose a novel unsupervised sentence alignment method that extracts biomedical sentence pairs from parallel documents to develop a high-quality biomedical NMT model. The method first mixes each pair of documents in two languages into a pseudodocument according to the relative positions of the words in the documents. These pseudodocuments are then used to train bilingual word embeddings (BWEs) for evaluating bilingual word similarity. [24] To align the sentences of a parallel document pair, we define a sentence distance that combines word similarity and the relative positions of the sentences in the documents. Finally, sentence alignment is formulated as an extended earth mover's distance (EMD) optimization problem that transfers information from the source language to the target language. To establish a biomedical parallel corpus for English-Chinese machine translation, we apply the proposed sentence alignment method to bilingual articles from UpToDate [25] and translations of abstracts from PubMed. We also explore various settings for training biomedical NMT models by mixing general-domain and biomedical corpora and evaluate their performance and generalizability.

This section reviews related work on BWEs, sentence alignment, and the EMD problem.

BWEs aim to produce a common vector space for the words of two languages. BWEs usually preserve the semantic information of both languages, so that cross-lingual word similarity can be computed as the cosine between embedding vectors. [21]
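As a minimal illustration (not from the paper) of this kind of cosine-based cross-lingual lookup, nearest neighbors can be retrieved from a BWE matrix as sketched below; the function name, the embedding matrix layout, and the vocabulary list are hypothetical assumptions:

    import numpy as np

    def most_similar(query_vec, vocab, emb, k=10):
        """Return the k words whose embeddings have the highest cosine
        similarity to query_vec; emb is a (|vocab|, dim) matrix whose
        rows are aligned with the vocab list."""
        sims = emb @ query_vec / (
            np.linalg.norm(emb, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        top = np.argsort(-sims)[:k]
        return [(vocab[i], float(sims[i])) for i in top]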
Current approaches to constructing BWEs can be classified into two categories: (1) methods that train embeddings for the two languages separately and learn a mapping to align the two embedding spaces, using either bilingual supervision [26-28] or unsupervised approaches, [29,30] and (2) methods that directly learn a shared embedding space for both languages. [24,31] Our method for BWEs is similar to that of Vulić et al. [24] in that aligned documents in two languages are merged into one document to train word embeddings. Since our data comprise whole-document translations, we additionally exploit word positions when merging the documents, instead of shuffling as in Vulić et al., [24] to generate better BWEs.

Sentence alignment refers to the task of aligning the sentences in a document pair so that aligned sentence pairs express the same meaning in different languages. The most common alignment is 1-to-1. However, complex alignment relationships, including 1-to-0, 0-to-1, and many-to-many cases such as 1-to-2, 1-to-3, and 2-to-2, occur frequently due to the characteristics of the languages, translators' choices, or errors in sentence segmentation. Most previous work on sentence alignment falls into three categories: length-based, [32,33] word-based, [34,35] and translation-based. [16,36] Some of these models are supervised and depend on dictionaries or existing sentence pairs. Furthermore, these models commonly make strong assumptions about the alignment types, e.g., that only specific types of alignment occur or that the alignment must be ordinal, and most are weak in the many-to-many case. We propose a novel unsupervised sentence alignment method that achieves high accuracy in both the 1-to-1 and many-to-many cases and requires neither external information nor unnecessary restrictions on the form of the alignment.

EMD is the optimization problem of minimizing the cost of moving all mounds of earth into holes, where the volumes of the mounds and holes are known and their totals are equal. The cost of moving earth is the product of the moving distance and the amount moved. The EMD problem is solved as a linear program (see Section 3.3), and the solution gives a transport matrix (the volume moved from each mound to each hole). EMD has been used for bilingual lexicon induction: [27,37,38] given word weights and pairwise word dissimilarities, the transport matrix obtained by solving the EMD infers the translation relations between the words of the two languages. Our method is inspired by these works and applies an extended EMD to the sentence alignment scenario.

This section introduces unsupervised sentence alignment with parallel documents for building a biomedical machine translation system. Figure 1 illustrates the four steps of the process: training BWEs from parallel documents, defining sentence distance, solving sentence alignment as an EMD optimization problem, and training the translation model.

Parallel documents can be used to develop BWEs that measure word similarity across two languages. Because translation generally proceeds sentence by sentence, it is reasonable to assume that bilingual word pairs usually appear at similar relative positions in parallel documents. Based on this assumption, we mix the two documents of each parallel pair into a single pseudodocument, interleaving the words of both languages according to their relative positions in their own documents, and train word embeddings on the resulting pseudodocuments so that translation pairs obtain similar embeddings.
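A minimal sketch of this position-based mixing, assuming each document is a list of tokens and using gensim's skip-gram Word2Vec as the embedding trainer; the toy corpus and the hyperparameters are illustrative assumptions, not the paper's released code:

    from gensim.models import Word2Vec

    def mix_documents(doc_x, doc_y):
        """Merge a parallel document pair into one pseudodocument by
        sorting the tokens of both languages by relative position."""
        tagged = [(i / len(doc_x), w) for i, w in enumerate(doc_x)]
        tagged += [(j / len(doc_y), w) for j, w in enumerate(doc_y)]
        tagged.sort(key=lambda t: t[0])
        return [w for _, w in tagged]

    # toy example; the real input is one (Chinese, English) token-list
    # pair per parallel document
    parallel_corpus = [(["心脏", "跳动"], ["the", "heart", "beats"])]
    pseudodocs = [mix_documents(dx, dy) for dx, dy in parallel_corpus]
    bwe = Word2Vec(sentences=pseudodocs, vector_size=100, window=10,
                   sg=1, min_count=1)  # skip-gram over mixed documents

Because translation pairs land near each other in the pseudodocument, they frequently share context windows and therefore receive similar embeddings.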
The distance between two sentences of a parallel document pair is defined by combining word similarity and the relative positions of the sentences. The word-based distance $d_1(s_x, s_y)$ is defined as

$d_1(s_x, s_y) = 1 - \frac{1}{|s_x|} \sum_{w_x \in s_x} \max_{w_y \in s_y} \cos(w_x, w_y),$

where $w_x$ and $w_y$ are words in $s_x$ and $s_y$, respectively. $d_1(s_x, s_y)$ searches for the most similar word in $s_y$ for each word in $s_x$ and averages the similarities to measure the word-based distance between the sentences. The distance based on relative position is defined as

$d_2(s_x, s_y) = |p(s_x) - p(s_y)|,$

where the relative positions $p(s_x)$ and $p(s_y)$ are defined as in Section 3.1. Finally, the distance between $s_x$ and $s_y$ is defined as

$d(s_x, s_y) = d_1(s_x, s_y) + \lambda d_2(s_x, s_y).$

In practice, we find $\lambda = 1$ generally works well.

To formulate sentence alignment as an optimization problem, we temporarily assume that in a pair of parallel documents $(X, Y)$, $X$ and $Y$ contain equal amounts of information, where the amount of information in a sentence is proportional to the sentence's relative length. To cast sentence alignment as an EMD, information is treated as earth: the sentences $x_i$ in $X$ are mounds, the sentences $y_j$ in $Y$ are holes, and the distance from a mound to a hole is the sentence distance defined in Section 3.2. Let the volumes of the mounds and holes be the relative lengths $l(x_i)$ and $l(y_j)$, respectively, and let $d_{ij}$ be the distance between sentences $x_i$ and $y_j$. Then sentence alignment can be formulated as solving the EMD from $X$ to $Y$:

$\min_T \sum_{i,j} T_{ij} d_{ij}$
subject to $\sum_j T_{ij} \le l(x_i)$ for all $i$, $\sum_i T_{ij} = l(y_j)$ for all $j$, and $T_{ij} \ge 0$,

where $T$ is the transport matrix and $T_{ij}$ denotes the volume transported from $x_i$ to $y_j$. The first constraint guarantees that the amount of earth moved from each mound does not exceed the volume of the mound, while the second means that the information is completely transported. Since $\sum_{i,j} T_{ij} = \sum_i l(x_i) = 1 = \sum_j l(y_j)$, the "$\le$" in the first constraint is effectively equivalent to "$=$".

However, in real data these strict constraints cause some elements of $T$ corresponding to unaligned sentences to be nonzero, because the amount of information in a sentence is not strictly proportional to its length. These nonzero elements lead to incorrect many-to-many alignments. We therefore introduce a relaxation factor and transform the problem into the following:

$\min_T \sum_{i,j} T_{ij} d_{ij}$
subject to $\sum_j T_{ij} \le l(x_i)$ for all $i$, $\sum_i T_{ij} \le l(y_j)$ for all $j$, $\sum_{i,j} T_{ij} \ge 1 - \varepsilon$, and $T_{ij} \ge 0$,

where $\varepsilon \ge 0$ is the relaxation factor, which can effectively remove nonzero elements that should not appear in $T$.

The selection of the value of $\varepsilon$ is a new problem. Clearly, $\varepsilon = 0$ reduces the problem to the previous one. Increasing $\varepsilon$ tends to reduce the number of nonzero elements, and with a sufficiently large $\varepsilon$, transport occurs only between the single pair of sentences with the smallest distance. To address this, we propose a practical selection approach. Since having many completely nonzero $2 \times 2$ submatrices in $T$ is an indicator of incorrect many-to-many alignments, we define $\tilde{\varepsilon}(T)$ as the sum of the smallest elements over all completely nonzero $2 \times 2$ submatrices of $T$ and grid search for the $\varepsilon$ that minimizes $\tilde{\varepsilon}(T) + \gamma\varepsilon$, yielding a small $\varepsilon$ that effectively reduces alignment errors. In our experiments, $\gamma = 1$ is a good choice.

Each nonzero element in the solution $T$ implies an alignment between sentences. In real data, the method occasionally bundles several alignments into one alignment group; we split alignment groups containing at least three sentences in both languages with the length-based Gale-Church algorithm [33] to obtain the final alignments.
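A minimal sketch of the sentence distance and the relaxed EMD above, solved as a dense linear program with scipy.optimize.linprog (HiGHS backend). It follows the formulation reconstructed in this section; the names are illustrative, and a production implementation would use a sparse constraint matrix:

    import numpy as np
    from scipy.optimize import linprog

    def sentence_distance(sx, sy, px, py, lam=1.0):
        """sx, sy: (n_words, dim) BWE matrices of the two sentences;
        px, py: relative positions of the sentences in their documents."""
        sims = (sx @ sy.T) / (
            np.linalg.norm(sx, axis=1, keepdims=True)
            * np.linalg.norm(sy, axis=1) + 1e-12)
        d1 = 1.0 - sims.max(axis=1).mean()  # best match in sy per word of sx
        d2 = abs(px - py)                   # positional distance
        return d1 + lam * d2

    def relaxed_emd(dist, lx, ly, eps=0.0):
        """Minimize sum_ij T_ij * dist_ij with row sums <= lx, column
        sums <= ly, and total transport >= 1 - eps; lx and ly are the
        normalized relative sentence lengths. Returns the transport matrix."""
        m, n = dist.shape
        a_row = np.kron(np.eye(m), np.ones((1, n)))  # sum_j T_ij <= lx_i
        a_col = np.kron(np.ones((1, m)), np.eye(n))  # sum_i T_ij <= ly_j
        a_tot = -np.ones((1, m * n))                 # sum_ij T_ij >= 1 - eps
        a_ub = np.vstack([a_row, a_col, a_tot])
        b_ub = np.concatenate([lx, ly, [-(1.0 - eps)]])
        res = linprog(dist.ravel(), A_ub=a_ub, b_ub=b_ub,
                      bounds=(0, None), method="highs")
        return res.x.reshape(m, n)

The dense constraint matrix has (m + n + 1) rows and m * n columns, which is feasible here because m and n are sentence counts of a single document pair.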
We followed the traditional domain-adaptation recipe for NMT: pretrain the NMT model on general-domain corpora and then fine-tune it on in-domain parallel corpora. We use the base Transformer as our NMT model. [39]

We collected 7,137 pairs of documents in both Chinese and English from the UpToDate website. We also obtained Chinese translations of 60,553 PubMed abstracts contributed by volunteers on a public website. We applied the same preprocessing to both datasets, including punctuation standardization, sentence boundary detection, truecasing, and Chinese word segmentation.

We used the UpToDate and PubMed abstract corpora to train BWEs as described in Section 3.1. To test the quality of the word similarities produced by the BWEs, we retrieved all medical terms in the Xiangya Medical Dictionary and randomly selected 200 terms whose English and Chinese forms both appear in a pair of parallel documents with a relative-position difference of less than 0.1. For each selected Chinese term, we manually checked whether the most similar English term found by the BWEs was correct. The top-1 and top-10 accuracies under cosine similarity were 66.8% and 86.0%, respectively. The 200 terms were divided into 4 quarters by word frequency in UpToDate (Figure 2). We found that errors were mainly due to inconsistent morphological variation and words that are close in meaning. For example, the most similar term in the BWEs for "角膜 (cornea)" is "corneal". In addition, frequent terms (in the first quarter) are often common words with many meanings, which leads to lower accuracy in that quarter.

We compared the proposed method with three sentence alignment methods, including the length-based Gale-Church algorithm [33] and the translation-based Bleualign. [16] We randomly selected parallel documents from UpToDate and the PubMed abstracts and manually aligned them as test sets. Figure 3 shows the sizes of the test sets. The majority of alignments are 1-to-1, although the proportion of n-to-m alignments with $\min(n, m) \ge 1$ and $\max(n, m) > 1$ is not negligible. During manual alignment, we also found that the UpToDate data were of better translation quality than the PubMed abstracts. The precision, recall, and F1 scores are shown in Table 1, with the metrics calculated as

$\text{precision} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{predicted}|}, \quad \text{recall} = \frac{|\text{predicted} \cap \text{gold}|}{|\text{gold}|}, \quad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$

where predicted and gold denote the sets of predicted and manually annotated alignments.

We established a biomedical domain corpus (BioMed) by merging the resulting 1.4 million sentence pairs from UpToDate and the PubMed abstracts (Table 2). We found that even when the domain is limited to biomedicine, stylistic differences between subdomains are not negligible. To probe this phenomenon, we compared the performance of en-zh models trained with different approaches on a small cancer corpus, ECCParaCorp, which covers cancer prevention, screening, and treatment, whereas UpToDate covers comprehensive clinical topics and is much larger; there are clear differences in language patterns between the two corpora. Table 3 lists the translation performance on the subdomain data ECCParaCorp for models trained with various approaches. It shows that fine-tuning on comprehensive clinical topics (WMT18/UpToDate) does not guarantee good generalizability to a subdomain; indeed, its performance can be inferior to fine-tuning only on the small subdomain data (WMT18/ECCParaCorp). It is also worth noting how training order significantly affects performance: (1) fine-tuning on the mixture of UpToDate and ECCParaCorp (WMT18/BioMed) did not perform as well as fine-tuning on the target subdomain alone (WMT18/ECCParaCorp); (2) however, fine-tuning first on UpToDate and then on ECCParaCorp (WMT18/UpToDate(first)+ECCParaCorp(later)) improved the BLEU score significantly, by 10.78. These results suggest that a well-trained biomedical base model can contribute to training a better subdomain model, but training must finish on the target subdomain.
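For completeness, a minimal sketch of the precision/recall/F1 computation behind Table 1, assuming each alignment is represented as a pair of frozensets of sentence indices and that a predicted alignment counts as correct only when it exactly matches a gold alignment group (this representation and matching criterion are assumptions, not taken from the paper):

    def alignment_prf(predicted, gold):
        """predicted, gold: sets of alignments; e.g., the 2-to-1 alignment
        of source sentences 3 and 4 to target sentence 5 is written as
        (frozenset({3, 4}), frozenset({5}))."""
        correct = len(predicted & gold)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1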
As the preceding results demonstrate, the proposed method obtains a large number of high-quality 1-to-1 and n-to-m sentence pairs, and the resulting biomedical NMT model performs well in terms of BLEU. In this section, we discuss the roles of the amount of in-domain data and of the n-to-m sentence pairs in improving model performance, and we present examples showing how fine-tuning improves translation quality and what an external in-domain dictionary adds.

We investigated the relationship between model performance and the number of sentence pairs used in fine-tuning. As Figure 4 shows, a linear increase in the BLEU score requires exponential growth in the size of the fine-tuning data. Moreover, Figure 4 indicates that adding n-to-m sentence pairs consistently improves model performance for the same number of parallel documents, even though their quality is not as good as that of 1-to-1 sentence pairs.

A case study illustrates the effect of fine-tuning on in-domain data. Example 1 in Figure 5 is in the zh-en direction. The source sentence contains medical terms such as "盐皮质激素", "螺内酯", "依普利酮", "罗格列酮", and "袢利尿剂". The fine-tuned model correctly translated these terms as "mineralocorticoid", "spironolactone", "eplerenone", "rosiglitazone", and "loop diuretic", whereas the pretrained model generated incorrect translations. The common word "给予" is usually translated as "give", while "administration" (the act of giving a drug to someone) is more appropriate in the medical domain. Example 2 in Figure 5 demonstrates similar behavior. The model learned the translation of "thiazolidinediones" after fine-tuning and correctly translated the term as "噻唑烷二酮类药物" despite decoding different subwords than the target sentence. In terms of language preference, the word "retention" is better translated as the medical term "潴留" than as the common word "滞留". These examples show that in-domain fine-tuning helps the model learn terminology and fit the in-domain language style during decoding.

Furthermore, in-domain sentence pairs do not cover all biomedical terms. In Example 3 in Figure 5, "aegyptianellosis", "eperythrozoonosis", "grahamellosis", and "haemobartonellosis" never appeared in the sentence pairs of the training data (BioMed), and the model was unable to translate them. However, the last three terms appeared in an in-domain dictionary that we added to the training set as a set of bilingual pairs (BioMed*), and the model fine-tuned on BioMed* with the dictionary terms oversampled correctly translated "grahamellosis" and "haemobartonellosis" as "格雷汉体病" and "血巴尔通体病", showing that an additional in-domain dictionary can be helpful.

In this paper, we proposed a new unsupervised sentence alignment method, formulated as a linear program, that utilizes bilingual word alignment information to evaluate word similarity and, in turn, sentence distance. The proposed method relaxes the assumptions about the types of alignment and performs better on n-to-m alignment. We used all obtained data to build a Chinese-English biomedical parallel corpus and to train and evaluate biomedical NMT models.
References
Google's neural machine translation system: bridging the gap between human and machine translation.
Achieving human parity on automatic Chinese to English news translation.
Has machine translation achieved human parity? A case for document-level evaluation.
Unsupervised statistical machine translation.
Six challenges for neural machine translation.
Handbook of COVID-19 Prevention and Treatment.
Data sharing for novel coronavirus (COVID-19).
The mathematics of statistical machine translation: parameter estimation.
Neural machine translation: a review.
UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation.
MT-based sentence alignment for OCR-generated parallel texts.
Parallel corpora for the biomedical domain.
Can multilingual machine translation help make medical record content more comprehensible to patients? Stud Health Technol Inform.
Don't forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction.
BVS corpus: a multilingual parallel corpus of biomedical scientific texts.
NEJM-enzh: a parallel corpus for English-Chinese translation in the biomedical domain.
Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction.
Unsupervised bilingual lexicon induction via latent variable models.
Multilingual models for compositional distributed semantics.
Aligning sentences in parallel corpora.
A program for aligning sentences in bilingual corpora.
Parallel corpora for medium density languages.
Finding similar sentences across multiple languages in Wikipedia.
Building earth mover's distance on bilingual word embeddings for machine translation.
Earth mover's distance minimization for unsupervised bilingual lexicon induction.
THUMT: an open-source toolkit for neural machine translation.
ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application.
CUNI NMT system for WAT 2017 translation tasks.