Dynamic Terminology Integration for COVID-19 and other Emerging Domains
Toms Bergmanis and Marcis Pinnis*
2021-09-10

Abstract: The majority of language domains require prudent use of terminology to ensure the clarity and adequacy of the information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified recently by COVID-19, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding the pandemic and infection prevention. This work is part of the WMT2021 Shared Task: Machine Translation using Terminologies, in which we describe Tilde MT systems that are capable of dynamic terminology integration at the time of translation. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training. We conclude our work with a broader discussion considering the Shared Task itself and terminology translation in MT.

This work is part of the WMT2021 Shared Task: Machine Translation using Terminologies, which is concerned with improving machine translation (MT) accuracy and consistency in newly developed domains by utilising word and phrase-level terms. We describe Tilde MT systems that are capable of dynamic terminology integration at inference time. Our submissions consist of translations by terminology-enabled general-purpose MT systems for the EN-RU, EN-FR, and CS-DE translation directions.

*Both authors have contributed equally.
Our systems are deliberately trained without consideration for the test domain to follow the spirit of the Shared Task: MT for emerging domains. Despite the term collections being noisy, our MT systems with dynamic terminology integration improve term translation accuracy, proving their usefulness for dynamic adaptation to novel domains where training-time domain adaptation methods are not feasible.

The remainder of this work describes the methods used for dynamic terminology integration (Section 2), covering the tasks of terminology filtering, term recognition, and dynamic terminology integration in the translation process. The bulk of Section 2 describes problems caused by low-quality term collections and terminology mismanagement, and our solutions to them. We hope that the examples provided will not only illustrate the self-imposed problems of the Shared Task but also motivate a reconsideration of the purpose and the desired qualities of a term collection in the context of MT. We then briefly describe the experimental setting and results in Sections 3 and 4, respectively. We conclude our work with a broader discussion considering the Shared Task and terminology translation in MT in Section 5.

This section describes the three tasks necessary for successful MT with terminology: terminology filtering, term recognition, and, finally, integration of terminology constraints in the translation process.

To guarantee terminology translation correctness and consistency, which are two quality aspects of terminology translation, term collections must provide unambiguous information about the preferred translation equivalent for each source term's type (full form, short form, or acronym) when listing multiple possible translation equivalents in the target language (see examples in Table 1).
Besides, unlike the common custom of providing terms and their translations in their dictionary forms (e.g., see examples of terms in EuroTermBank 1, the InterActive Terminology for Europe 2, the United Nations Terminology Database 3, and other authoritative term banks), the terminologies provided for the Shared Task often contain entries where the source or the target language form is already inflected (examples 1, 2, and 6 in Table 1).

To reduce the noise present in the provided term collections, we performed filtering by discarding:

1. Term pairs that feature terms consisting of symbols other than digits, letters, apostrophes, white-spaces, and hyphens. This filter allows us to identify and discard expressions that do not represent terminology (e.g., full sentences, complete clauses, formulas, expressions consisting of terms and their acronyms within one term entry, etc.; see examples 1-6 and 8 in Table 2).

2. Term pairs where the source term is longer than the target term and contains the target term as a sub-string (or vice versa). This filter is intended to discard term entries representing named entities that are written identically in both the source and target languages, but for which one of the sides is incomplete (see examples 7 and 9 in Table 2).

3. Term pairs that represent general language (i.e., are too common). General language phrases are typically ambiguous and may require different translations based on the surrounding context as well as external knowledge, which may not be available when translating. Therefore, it may be safer to let the NMT model handle the translation of general language phrases. We also do not want to burden the MT model with excessively annotated input data, since longer segments are typically handled worse by NMT models than shorter segments (Neishi and Yoshinaga, 2019).
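Filters 1 and 2 above can be sketched in a few lines. This is a rough illustration only, not the paper's actual implementation; the function names are ours:

```python
def _chars_ok(term: str) -> bool:
    # Filter 1: only digits, letters, apostrophes, white-space, and hyphens.
    return all(c.isalnum() or c.isspace() or c in "'’-" for c in term)

def is_noisy(src: str, tgt: str) -> bool:
    """Return True if a term pair should be discarded by filter 1 or 2."""
    if not (_chars_ok(src) and _chars_ok(tgt)):
        return True  # formulas, clauses, term+acronym entries, etc.
    # Filter 2: one side is an incomplete copy of the other
    # (named entities written identically in both languages).
    shorter, longer = sorted((src, tgt), key=len)
    return len(shorter) < len(longer) and shorter in longer
```

For example, `is_noisy("COVID-19 (disease)", "COVID-19")` discards the pair (the parentheses fail filter 1, and the substring check would also fire), while a clean pair such as `("spike protein", "protéine de spicule")` is kept.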
To identify term entries that are too general, we apply an inverse document frequency (IDF) (Jones, 1972) filter (Pinnis, 2015a). As an example, this filter discarded all term entries for the English term "spread" from the term collections, as it is a highly ambiguous word: according to the Collins EN-FR dictionary 4, it may have at least 20 distinct translations. Since the term collection features just one possible translation without any added meta-data, it is safer not to use such terms (considering also the limitations of term recognition when working with emerging domains with scarce or no parallel data).

Besides filtering out noisy term entries, the type and quality of the term collections provided in the Shared Task also require selecting one among potentially many term translation equivalents. As noted before, this is typically done by a human, possibly a domain expert. Nevertheless, we opt for two different strategies. If more than one translation equivalent is provided, it is fair to assume that they are all equally applicable. Thus we propose to select the first translation equivalent in the list; we refer to this as the 1st Trg Term selection strategy. After analysing the term collections, however, we conclude that not all translation equivalents provided in the Shared Task are of equal quality. Therefore, we also employ a statistical word alignment-based strategy to select the translation equivalent with the highest alignment score. To compute word alignments, we use eflomal 5 (Östling and Tiedemann, 2016). We refer to this as the Alignment-based term selection strategy. Table 3 gives examples of terms selected by either term selection strategy. The examples illustrate that some translation equivalents are of equal quality (examples 1, 3, 6, and 7). However, selecting the first translation equivalent can sometimes yield long (examples 6 and 8) or inadequate (example 4) translation equivalents.
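The IDF-based generality filter can be sketched as follows. This is a simplified, single-token version under our own naming; the paper's exact counting scheme and threshold handling may differ:

```python
import math
from collections import Counter

def idf_scores(documents, terms):
    """IDF (Jones, 1972): log(N / df), with add-one smoothing in the
    denominator so unseen terms do not divide by zero."""
    df = Counter()
    for doc in documents:
        tokens = set(doc)
        for term in terms:
            if term in tokens:
                df[term] += 1
    n_docs = len(documents)
    return {t: math.log(n_docs / (1 + df[t])) for t in terms}

def filter_general_terms(term_pairs, documents, threshold):
    """Keep only term pairs whose source term is specific enough,
    e.g., the IDF>5 setting used for the CS-DE collection."""
    scores = idf_scores(documents, [src for src, _ in term_pairs])
    return [(s, t) for s, t in term_pairs if scores[s] > threshold]
```

A frequent, ambiguous word such as "spread" occurs in many documents, receives a low IDF, and falls below the threshold, while a rare domain term survives.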
The Alignment-based term selection strategy also tends to select translation equivalents that are dictionary forms (example 2) instead of inflections.

Having a term collection, the next task in the MT workflow is term recognition in running text. Depending on the morphological typology of the source language and the nature of the domain, term recognition can prove to be a complex task. Recognition involves identifying a term in its surface form, which for morphologically complex languages may be hindered by the many surface forms a single word can take, or by the level of form ambiguity in the case of morphologically impoverished languages (Bergmanis and Goldwater, 2018). To overcome issues posed by the morphology of natural language, one can use one of the many off-the-shelf morphological taggers to obtain contextually correct part-of-speech and lemma pairs for each token and perform term recognition on lemmatized collections and texts. We, however, opt for an alternative, more rudimentary method utilising language-specific stemmers to normalise the surface forms and perform term recognition on the stemmed running text and term collections (Pinnis, 2015a,b). We opt for the stemmer-based approach because, in a production setting, stemming is faster than morphological tagging and has broader coverage for low-resource languages. Besides, to take full advantage of the morpho-syntactic information provided by morphological taggers, similar information must be provided by the term collection; however, as the term collections of this Shared Task exemplify, expecting any meta-data is naive. Last but not least, recognised forms must be word-sense disambiguated if more than one sense (i.e., term entry per source-side lexical form) is available. Word-sense disambiguation tools are typically lexicalised classifiers that are trained using large amounts of parallel data.
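The stemmer-based recognition described above can be sketched as follows. The toy suffix-stripping stemmer stands in for the language-specific stemmers we actually use, and matching is restricted to single-token terms for brevity:

```python
# Illustrative suffix list only; real stemmers (e.g., Snowball) are richer.
SUFFIXES = ("ations", "ation", "ions", "ion", "ing", "ed", "es", "s")

def stem(token: str) -> str:
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def recognise_terms(tokens, term_collection):
    """Match stemmed running text against a stemmed term collection.
    Returns (position, surface form, target term) triples."""
    stemmed = {stem(src): tgt for src, tgt in term_collection}
    return [(i, tok, stemmed[stem(tok)])
            for i, tok in enumerate(tokens)
            if stem(tok) in stemmed]
```

With the collection `[("infection", "инфекция")]`, the sentence "Infections result in mild symptoms" yields a hit on "Infections", since both "infection" and "infections" stem to the same normalised form.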
However, the spirit of this Shared Task is MT using terminology for emerging domains where "parallel data are hard to come by" 6. Thus we skip word-sense disambiguation and use just one word sense per word form.

In the day-to-day work of professional translators, terminologies are glossaries containing source language terms and their corresponding target language translations in their dictionary forms. Some of the previous work on terminology translation assumes that term entries are given in forms already inflected as required by the target morpho-syntactic context. Thus, such work either focuses on morphologically impoverished languages or is concerned with terminology translation in unrealistic scenarios. Either way, such methods are not relevant for the languages of the Shared Task because all of the target languages, with the exception of Chinese, are to some degree inflective languages. There is, however, another body of work that addresses translation with terminology while accounting for the morphological complexity of the target language (Exel et al., 2020; Niehues, 2021; Bergmanis and Pinnis, 2021). We base our submission on Bergmanis and Pinnis (2021) and employ target lemma annotations (TLA) to augment MT training data. An example of a sentence fragment annotated with TLA is "infections|s инфекция|t result|w in|w mild|w symptoms|w", where |s, |t, and |w are factors indicating whether a token is a source language term, a lemma (the dictionary form) of a target language term, or an ordinary source language word, respectively. Systems trained on such data are equipped with a mechanism for passing soft terminology constraints at inference time. An essential property of MT systems trained using TLA is that they learn not just to copy but also to inflect the provided terminology constraints according to the target morpho-syntactic context.
Therefore, the translation of the sentence above, for example, "инфекции приводят к легким симптомам", contains the plural noun "инфекции" and not just the annotated singular form "инфекция".

Data. We use all parallel data provided for the Shared Task for training, except for the development data, which we use to choose the best model for the final submission. Although back-translated monolingual data could, in theory, improve the overall translation quality, we do not use it to train our systems because, typically, the monolingual target data is selected based on its similarity to the target domain data. However, the scenario proposed for the Shared Task assumes that the domain is novel; thus, we aim to explore the merits of terminology translation and do not look for extra synthetic target domain data.

MT Model and Training. For system training, we use the Marian toolkit (Junczys-Dowmunt et al., 2018) because of its factored model functionality developed within the scope of the User-Focused Marian project 7 8. In this Shared Task, we train standard MT systems that mostly follow the Transformer (Vaswani et al., 2017) base model configuration. The only deviations from the standard configuration are 1) the use of source-side factors (we use factor embeddings of dimensionality 8 and concatenate them with word embeddings), 2) an increased --optimizer-delay (from 16 to 24), and 3) an increased maximum sequence length (from 128 to 196 tokens). These changes are necessary purely for TLA support during training and inference: the increased sequence length accounts for longer input sequences due to TLA and terminology constraints, while the increased optimizer delay compensates for fewer sentences fitting into a workspace-memory-based batch due to their increased maximum length. We trained one NMT system per translation direction and evaluated translation quality on the development sets using the terminology translation evaluation tool provided by the Shared Task 9 (ibn Alam et al., 2021).
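The TLA input format described in Section 2 can be produced with a simple annotator once terms have been recognised. This is a sketch; the function name and the hit format are ours:

```python
def annotate_tla(tokens, hits):
    """Factor-annotate a tokenised source sentence: |s marks a source
    term, |t marks the target lemma inserted after it, and |w marks
    ordinary words, following the input format of Bergmanis and
    Pinnis (2021). `hits` maps token positions to target lemmas."""
    annotated = []
    for i, token in enumerate(tokens):
        if i in hits:
            annotated.append(f"{token}|s")
            annotated.append(f"{hits[i]}|t")
        else:
            annotated.append(f"{token}|w")
    return " ".join(annotated)
```

For the running example, `annotate_tla("infections result in mild symptoms".split(), {0: "инфекция"})` reproduces the annotated fragment shown above.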
We compare the baseline translation scenario, where no terms are annotated in the source text, with improved scenarios where terms are annotated using term collections acquired with the different filtering and term translation equivalent selection strategies.

7 https://marian-project.eu/
8 https://github.com/marian-cef/marian-examples/blob/forced-translation/forced-translation/docs/Experiments.md
9 https://github.com/mahfuzibnalam/terminology_evaluation

When analysing the lemmatized exact match accuracy, we must bear in mind that the evaluation data, similarly to the term collections, features term entries 1) with more than one allowed synonymous translation equivalent (not counting different inflected forms), and 2) where different terms are merged into one entry (see examples in Section 2.1). This consequently means that the evaluation procedure 1) allows terminological ambiguity on the target side, 2) does not allow analysing terminology translation consistency, and 3) may give only a rough estimate of terminology translation accuracy. Therefore, we believe that the lemmatized exact match accuracy results should be taken with a grain of salt.

That being said, the results in Table 4 show that the metric improves when using a term collection in all but one experiment. The fact that, overall, the Alignment-based term collections yield better translation results (in terms of BLEU) and also reach the highest terminology translation accuracy shows that relying on the first translation equivalent in a term entry is not a good idea. We also see that the overall terminology translation quality is already relatively high for the baseline systems, ranging from 76% for EN-RU to 88.5% for EN-FR. This makes us wonder whether the evaluated domain can be considered emerging, as it features few novel terms and the majority are well handled by the baseline systems.
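As a rough stand-in for the official evaluation tool, lemmatized exact match accuracy can be approximated as the fraction of annotated terms whose target equivalent occurs, after lemmatisation, in the hypothesis. The lowercasing "lemmatiser" below is a placeholder for a real one:

```python
def term_use_accuracy(hypotheses, term_annotations, lemmatise=str.lower):
    """Fraction of expected target terms found in the hypotheses after
    lemmatisation. A simplification of the Shared Task metric, which
    additionally handles synonymous equivalents and inflected forms."""
    total = found = 0
    for hyp, terms in zip(hypotheses, term_annotations):
        hyp_lemmas = [lemmatise(tok) for tok in hyp.split()]
        for term in terms:
            total += 1
            term_lemmas = [lemmatise(tok) for tok in term.split()]
            n = len(term_lemmas)
            if any(hyp_lemmas[i:i + n] == term_lemmas
                   for i in range(len(hyp_lemmas) - n + 1)):
                found += 1
    return found / total if total else 0.0
```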
To investigate further, we analysed whether the bilingual terminology found in the development sets is also present in the training data of the NMT systems. We found that for CS-DE, 97.9% and 92.5% of such (unique) bilingual terms are featured in the training data at least once or at least 10 times, respectively. The numbers are even higher if we analyse running terms (tokens): 99.8% and 98.7%, respectively. Since the terminology for EN-RU was human-created and not extracted from parallel data, it shows slightly lower results when analysing unique terms: 93.5% and 88.3%, respectively. However, the situation is similar to CS-DE when analysing running terms: 99.1% and 97.8% of bilingual terms found in the EN-RU development data are also found in the training data at least once and at least 10 times, respectively. Based on these findings, we believe that the validation data does not depict an emerging domain and does not help in analysing terminology translation quality for emerging domains.

When analysing the overall translation quality (in terms of BLEU), we see that term filtering using the IDF-based filter is crucial when relying on very noisy, automatically acquired term collections (as was the case with the CS-DE term collection). The results show that translation quality drops by 3 BLEU points when using the unfiltered term collections. This shows that too general (and ambiguous) terminology can be harmful and lower translation quality. The overall translation quality change is marginal for the translation directions that featured human-created term collections (EN-FR and EN-RU); however, we do see an increase in terminology translation accuracy.

Our final submission consists of machine translations of the Shared Task test sets produced by general-purpose MT systems that use dynamic terminology integration via TLA (Bergmanis and Pinnis, 2021).
To translate our final submissions, term collections are filtered with the basic filters (see Section 2.1) for the EN-FR and EN-RU language pairs, while for the CS-DE language pair, we also use IDF>5 filtering. We use the statistical word alignment term selection strategy for term entries with multiple translation equivalents for all language pairs. The development set results for the corresponding systems are marked in bold in Table 4.

Shared Task. The results of the automatic metrics show that our baseline systems are already well equipped to translate the development and test sets regardless of their seemingly novel domain. Indeed, we found no statistically significant differences in scores measuring general translation quality between the baseline systems and the systems with terminology integration. Preliminary test results suggest a similar pattern in other submissions (cf. the results of submissions by Prompt). The only seemingly meaningful differences are in metrics specifically targeting terminology integration. These results are in stark contrast with previous work (Exel et al., 2020; Niehues, 2021; Bergmanis and Pinnis, 2021), which reports significant improvements not only on terminology-use-targeted metrics but also on metrics measuring general translation quality. This disparity suggests that the test data is not from an emerging or novel domain, at least as far as MT systems trained on the provided training data are concerned. Considering this shortcoming, together with the visibility of WMT Shared Tasks, these results pose a risk of misrepresenting the problem the Shared Task set out to research. The outcome might be an unintended downplaying of the role of terminology translation for technical domains, which could lead to diminishing interest in terminology translation within the MT research community.

Term Collections in MT. Tables 1, 2, and 3 of Section 2 provide numerous examples of problems present in the provided term collections.
We believe that these examples illustrate the understanding of the purpose and desired qualities of a term collection not just of the individuals involved in preparing the term collections for the Shared Task but also of the broader community of translation professionals. Many of the problematic examples suggest that the shift from human-readable to machine-readable term collections has not happened yet, or that it has happened rather formally, by merely reformatting the for-human-made term collections into neater TSV-formatted files. While a TSV-formatted file is machine-readable, that alone does not make its content machine-usable. The standards for machine-oriented term collections have to be higher than for those made for humans, at least as long as there is no sophisticated intelligence in the MT workflow that is on par with humans in recovering from the irregularities and noise present in term collections typically made for humans. Likewise, encyclopedia-style entries that explain a concisely coined source language concept using a whole sentence in the target language are still present in for-human-made term collections, but they are of no use to current MT systems. If translation with terminology is supposed to improve MT for novel domains, then the term collections, being the supposed source of the expected improvement, have to be of higher quality than the MT systems they are intended to improve.

References

Md Mahfuz ibn Alam, Antonios Anastasopoulos, Laurent Besacier, James Cross, Matthias Gallé, Philipp Koehn, and Vassilina Nikoulina. 2021. On the evaluation of machine translation for terminology consistency.
Bergmanis and Goldwater. 2018. Context sensitive neural lemmatization with Lematus.
Bergmanis and Pinnis. 2021. Facilitating terminology translation with target lemma annotations.
Exel et al. 2020. Terminology-constrained neural machine translation at SAP.
Jones. 1972. A statistical interpretation of term specificity and its application in retrieval.
Junczys-Dowmunt et al. 2018. Marian: Fast neural machine translation in C++.
Neishi and Yoshinaga. 2019. On the relation between position information and sentence length in neural machine translation.
Niehues. 2021. Continuous learning in neural machine translation using bilingual dictionaries.
Östling and Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo.
Pinnis. 2015a. Dynamic terminology integration methods in statistical machine translation.
Pinnis. 2015b. Terminology integration in statistical machine translation.
Vaswani et al. 2017. Attention is all you need.

Acknowledgements

This research has been supported by the User-focused Marian project, which is co-financed by the European Union's Connecting Europe Facility programme (Action number: 2019-EU-IA-0045).