Congolese Swahili Machine Translation for Humanitarian Response

Alp Öktem, Eric DeLuca, Rodrigue Bashizi, Eric Paquin, Grace Tang

2021-03-19

Abstract: In this paper we describe our efforts to build a bidirectional Congolese Swahili (SWC) to French (FRA) neural machine translation system with the motivation of improving humanitarian translation workflows. For training, we created a 25,302-sentence general-domain parallel corpus and combined it with publicly available data. Experimenting with low-resource methodologies like cross-dialect transfer and semi-supervised learning, we recorded improvements of up to 2.4 and 3.5 BLEU points in the SWC-FRA and FRA-SWC directions, respectively. We performed human evaluations to assess the usability of our models in a COVID-domain chatbot that operates in the Democratic Republic of Congo (DRC). Direct assessment in the SWC-FRA direction demonstrated an average quality ranking of 6.3 out of 10, with 75% of the target strings conveying the main message of the source text. For the FRA-SWC direction, our preliminary tests on post-editing assessment showed its potential usefulness for machine-assisted translation. We make our models, datasets containing up to 1 million sentences, our development pipeline, and a translator web app available for public use.

Swahili (Kiswahili among its speakers) is a macrolanguage spoken widely in East Africa with an estimated 100 to 150 million speakers. It is the official language in Tanzania and Kenya, where it is referred to as coastal Swahili, and one of four national languages in the Democratic Republic of Congo (DRC), where it is referred to as Congolese Swahili. Coastal and Congolese Swahili differ substantially in terms of vocabulary, grammar, and structure. This is largely due to education policies and differences in the colonial languages of the countries where they are spoken. For example, compared to coastal Swahili, Congolese Swahili shows considerably more French and Lingala influence. To illustrate, take the translations of "Once the test is negative, the family can take care of the funeral themselves":

Coastal Swahili: Mara tu upimaji ukiwa hasi, familia inaweza kushughulikia mazishi wao wenyewe.

Congolese Swahili: Ikiwa tu vipimo vinaonesha kama ni mtu ambaye hakuhakikishwa ku kuwa na ugonjwa, familia inaweza kufanya mazishi yenyewe.

The translations differ in vocabulary and structure. In the Congolese dialect, the first clause needs to be expressed as "If the test shows that someone is not sick," since a commonly used word for "negative" (Hasi in the coastal dialect) does not exist. The conditionality is expressed with Ikiwa ("if") instead of Mara ("once"), as the latter is very rarely used in DRC. Finally, the Congolese Swahili translation uses the colloquial word kufanya for "dealing with a funeral" instead of kushughulikia ("care"), since "taking care of a funeral" does not mean anything in common speech.

This example suggests that machine translation (MT) engines based solely on the coastal dialect could be ineffective in delivering or receiving sensitive information. That is why dialect-specific MT development is needed to aid humanitarian relief efforts in DRC.

Translators without Borders (TWB) specializes in helping people affected by crisis get information in their language, in a format they understand.
Communicating effectively in the languages and formats people understand is central to ensuring that people understand health risks and know how to keep themselves and their families safe. TWB translates vital messages into local languages and works with responders to develop tools and language capacity that give communities better access to information and services that meet their needs. TWB has been active in eastern DRC since early 2019, beginning with the Ebola response. In October 2020, TWB launched a multilingual COVID-19 chatbot that leverages natural language understanding to allow users to ask questions in their own words and receive relevant answers in the same language.

Language technology such as machine translation plays an important role in crisis response, increasing the capacity to communicate critical information and key messages in the languages people understand, at speed and at scale (Lewis et al., 2011). Crisis-affected people can access content in local languages firsthand through various channels such as websites, news sources, and social media. In the reverse direction, their questions and feedback can inform aid programs, helping them analyze trends and better meet affected people's needs (Öktem et al., 2020). For marginalized languages with limited translation resources, MT can also help standardize terms and improve translation quality.

The task of creating MT specialized in the Congolese dialect is especially challenging due to the low-resource status of the language. In Joshi et al. (2020)'s index, which ranks languages in terms of resources, it is listed as "Left Behind": by their definition, it is substantially difficult to establish a digital foundation for the language with the existing data resources. If the bottleneck of data scarcity is not addressed, it is impossible to develop useful tools such as MT that could greatly serve the translators of the language. This not only puts those translators at a great disadvantage, but can also contribute to the deprioritization of the language by its speakers, both digitally and socially.

The objectives of this work are to:

• Curate research and public data sources for use in MT development for Congolese Swahili (Section 2).
• Describe the datasets that we created and compiled for this work (Section 3).
• Investigate the use of low-resource methodologies to deliver optimal bidirectional MT models (Section 4).
• Assess the usability of the systems for humanitarian response (Section 5).

The only MT-related work that specifically includes the Congolese Swahili dialect is by the grassroots research initiative Masakhane (∀ et al., 2020). Their paper brings to light how much, and why, African languages are left behind in terms of natural language processing (NLP) research. Their work addresses this by publicly releasing models and benchmarks for many African languages, including Congolese Swahili. The main data resource that makes this breadth of benchmarks possible is the JW300 collection (Agić and Vulić, 2019), a parallel corpus covering more than 300 languages, including many marginalized ones.

According to Ethnologue, Swahili is a Bantu language spoken by up to 150 million people throughout eastern Africa, including in Tanzania, Kenya, Uganda, Rwanda, Burundi, Mozambique, Somalia, and DRC. It is comparatively well represented by commercial MT service providers such as Google Translate and Microsoft Translator.
Research on machine translation of Swahili has been taking place for over 50 years. Woodhouse (1968) analysed the morphological structure of the language with the aim of building a mechanical translation system. Mechanical translation consists of parsing the source text and mapping each morphological unit to its translation using dictionaries. Even though the mechanical approach is outdated in the era of neural machine translation, it still sheds light on the difficulties that any computational translation setup might face. The paper points out the extensive use of prefixes and suffixes in Swahili: subject and object, tense, and negation are all embedded in one word as prefixes, whereas passive, causative, prepositional, reciprocal, subjunctive, plural imperative, and some singular imperative forms are all formed using suffixes.

Pauw et al. (2011) pioneered statistical machine translation for Swahili. Their work presents a 2-million-word parallel corpus paired with English, together with part-of-speech annotations projected from English onto Swahili. Their SAWA corpus is not openly distributed; however, it is available on request (Sánchez-Martínez et al., 2020).

Owing to its large web presence and status as a lingua franca in many countries, Swahili is listed as a "Rising Star" in Joshi et al. (2020)'s index. Recent years have seen an expansion in language coverage for neural machine translation in many low-resource languages. Sánchez-Martínez et al. (2020) present bidirectional English-Swahili MT systems with the motivation of supporting international media outlets that publish in the language. They also address the data scarcity problem by openly publishing crawled monolingual and parallel data. Another recent work, by Lakew et al. (2020), investigates methodologies such as semi-supervised learning, transfer learning, and multilingual modeling for building NMT benchmarks for five east African languages: Swahili, Amharic, Tigrigna, Oromo, and Somali, all paired with English.

Parallel data is the main ingredient for training MT systems. Also referred to as bitext, it is essentially a set of sentences with their translations. To build a bidirectional Congolese Swahili and French model, we needed access to translated data in these languages. We sourced hand-crafted parallel data in this language pair from three sources.

Gamayun kits are a starting point for developing audio and text corpora for languages with few or no pre-existing language data resources. Source sentences for these corpora are selected from the Tatoeba corpus (https://tatoeba.org/), ensuring they represent everyday language without any domain specificity (Öktem et al., 2020). The portion published with this work is a set of 25,302 French sentences translated into Congolese Swahili.

TICO-19 translation memories were collected to assist translators and build MT benchmarks in 36 languages. Each set consists of 3,071 sentences in the COVID-19 domain with quality-checked translations (Anastasopoulos et al., 2020).

TWB in-house translation memories were collected from various translations made within TWB. Most of the source content comes from translation requests by NGOs that assist in humanitarian relief in DRC.

The only large external open resource we found specifically for Congolese Swahili is the JW300 corpus (Agić and Vulić, 2019). Additionally, we decided to collect data in non-dialect Swahili, since it would enable us to do transfer learning (Zoph et al., 2016). We sourced sentences paired with French and English.
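Bitext of this kind is conventionally stored as a pair of line-aligned plain-text files, one sentence per line. As a minimal illustrative sketch of how such sources can be merged before training (not the authors' actual pipeline; file names are hypothetical):

```python
from pathlib import Path

def load_bitext(src_path, tgt_path):
    """Read a sentence-aligned file pair (one sentence per line)."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), "bitext files must be aligned line by line"
    return list(zip(src, tgt))

def combine(corpora):
    """Merge several corpora, dropping empty and duplicate sentence pairs."""
    seen, merged = set(), []
    for pairs in corpora:
        for s, t in pairs:
            s, t = s.strip(), t.strip()
            if s and t and (s, t) not in seen:
                seen.add((s, t))
                merged.append((s, t))
    return merged

# Hypothetical file names for the hand-crafted sources described above
corpus = combine([
    load_bitext("gamayun_kit.fra", "gamayun_kit.swc"),
    load_bitext("tico19.fra", "tico19.swc"),
    load_bitext("twb_tm.fra", "twb_tm.swc"),
])
```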
All internally sourced and publicly available parallel data sources are listed in Table 1. The TWBkits.sw, ELRC, and GoURMET datasets are in coastal Swahili paired with English. With a significant total of 161,668 sentences, we decided to include them in our experiments. To use them for training our French-paired models, we translated their English sentences into French with an off-the-shelf machine translation model: an open-source model by the Helsinki-NLP research group, provided through the Hugging Face library (Tiedemann and Thottingal, 2020).

Another technique to improve MT is semi-supervised training (Sennrich et al., 2016). This process involves expanding the training data with monolingual data paired with its back-translation. For our experiments, we automatically translated 766,398 monolingual Swahili sentences sourced from Wikipedia and news sites (Barrault et al., 2019) into French, using a model trained during this work on the rest of the available data. All monolingual data sources are also listed in Table 1.

We used five test sets to perform automatic and human evaluations of our models: 1,000 sentences from the Gamayun kits, 500 sentences from the TICO-19 set, 2,478 sentences from the JW300 test set used in ∀ et al. (2020), 100 user-submitted messages from chatbot conversations, and 10 chatbot response strings. The first two sets were randomly sampled from the datasets they belong to. Details on the test datasets are listed in Table 2.

We are publishing both the hand-crafted and synthetically generated parallel data used in this work, together with the model weights created from them. They are accessible through our project portal: https://gamayun.translatorswb.org/data/. Researchers who would like to reproduce our work can also find our development pipeline and test sets there.
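Both the pair-conversion and back-translation steps above amount to batch-translating one side of a corpus with an existing model. A minimal sketch of the pair-conversion step using the published Helsinki-NLP English-to-French model through the Hugging Face transformers library follows; the batching and I/O choices are ours, not necessarily those of the development pipeline:

```python
from transformers import MarianMTModel, MarianTokenizer

# Published OPUS-MT English-to-French model (Tiedemann and Thottingal, 2020)
MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate(sentences, batch_size=16):
    """Translate English sentences to French in small batches."""
    outputs = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size],
                          return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        outputs.extend(tokenizer.decode(g, skip_special_tokens=True)
                       for g in generated)
    return outputs

# Convert the English side of an EN-SWC corpus, yielding FRA-SWC pairs
english_side = ["The test result is negative."]
print(translate(english_side))
```

Back-translation works the same way, except the model translating the monolingual Swahili sentences into French was trained during this work rather than taken off the shelf.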
We followed a three-stage approach for training our models. Each stage is associated with a different dataset mixture (listed in Table 3). In the first stage, we obtain a base model by pre-training on all available data: authentic and synthetic, in-domain and out-of-domain, in-dialect and non-dialect. To assess the effect of adding different datasets, we prepared three intermediate data mixtures: mix.sw, containing non-dialect parallel data; mix.mted, containing data pair-converted from SW-EN datasets; and mix.mono, containing back-translated data generated from monolingual Swahili corpora. In the second stage, we fine-tune the model towards the Congolese dialect with both hand-crafted and crawled authentic data (mix.swc). In the third stage, we fine-tune the model using our datasets consisting only of hand-crafted translations (mix.in).

Figure 1 illustrates our training procedures. In procedure A (baseline) we train only on the Congolese Swahili data (mix.swc) and then fine-tune on mix.in. Procedure B performs cross-dialect transfer by pre-training on non-dialect Swahili data (mix.sw). Procedure C is similar to procedure B, with the addition of pair-converted synthetic data (mix.mted). Finally, procedure D utilizes synthetic data generated from back-translated monolingual data (mix.mono). In all procedures, we used a validation set of 1,000 sentences allocated from mix.in.

We used the OpenNMT-py toolkit (Klein et al., 2018) to train the models. The model is an eight-head Transformer (Vaswani et al., 2017) with six layers and a hidden size of 512. A token batch size of 2,048 was used for the multi-dialect and in-dialect training stages, and 512 for the final fine-tuning stage. The Adam optimizer (Kingma and Ba, 2015) was used with 4,000 warm-up steps. Training continued until no further improvement in development-set perplexity was recorded over the last five validations.

We report test set BLEU scores (Papineni et al., 2002) for all training procedures and each stage in Table 4.

Table 4: Automatic evaluation results on the three test sets. The highest-scoring model in each column is marked in bold. BLEU scores were calculated with the SACREBLEU toolkit with tokenize "intl" (Post, 2018).

The results show that cross-dialect transfer learning improves model quality in both directions and in all domains. A pre-training step with non-dialect Swahili data resulted in increases of 1.5/1.8 BLEU points on the TWBkits test set and 1.7/2.0 BLEU points on the TICO-19 test set in the SWC-FRA/FRA-SWC directions, respectively. On the same test sets, augmenting the pre-training data with synthetically created training data also generally helped. We recorded a 0.4-point increase on the TWBkits set in the FRA-SWC direction. On the TICO-19 test set, we initially saw increases of 0.4 BLEU points in the SWC-FRA direction and 0.7 BLEU points in the FRA-SWC direction with the addition of the pair-converted corpus (mix.mted). These improvements grew to 0.7 and 1.5 points when we used the back-translated monolingual set (mix.mono). Combined with the improvement from cross-dialect transfer, the TICO-19 set shows the greatest performance boost, with increases of 2.4 and 3.5 BLEU points in the SWC-FRA and FRA-SWC directions.

For the JW300 test set, we consistently saw a quality decrease in the third training stage, because the final fine-tuning stage used data from other sets and domains. We therefore evaluate improvements between procedures using the penultimate-stage results. Using cross-dialect transfer, we saw a 1.8 BLEU point increase in the SWC-FRA direction and 1.4 in the FRA-SWC direction. Synthetic data contributed only in the SWC-FRA direction, with a 1.0 BLEU point increase between the best-performing models of procedures B and C.

The best-scoring models for the three sets, given in the order SWC-FRA/FRA-SWC, are: 29.2/16.6 BLEU points for the TWBkits set, 20.1/16.5 for the TICO-19 set, and 30.7/23.5 for the JW300 set. The only result comparable to previous literature among these is the FRA-SWC direction on the JW300 set: ∀ et al. (2020) reported a benchmark of 33.7 BLEU points using the JW300 dataset for both training and testing. The difference in results shows how much multi-domain pre-training affects model performance on domain-specific test sets.
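As noted in the Table 4 caption, the automatic scores were computed with the SACREBLEU toolkit. A minimal sketch of such a scoring call (file names are hypothetical; recent versions of the toolkit also expose the chrF and TER metrics used later in Table 5):

```python
import sacrebleu

# One hypothesis and one reference per line, aligned
hyps = open("system_output.fra", encoding="utf-8").read().splitlines()
refs = open("reference.fra", encoding="utf-8").read().splitlines()

# BLEU with the international tokenizer, as reported in Table 4
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="intl")
# chrF and TER, as used in the post-editing comparison
chrf = sacrebleu.corpus_chrf(hyps, [refs])
ter = sacrebleu.corpus_ter(hyps, [refs])

print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}, TER = {ter.score:.2f}")
```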
It is essential to evaluate the quality of an MT system in order to assess its usability in real-world scenarios. We designed our manual evaluation setups to give us insight into how our MT models would perform in the context of a COVID-domain chatbot. Uji, TWB's chatbot deployed in DRC, is an artificial intelligence-based virtual assistant that allows people to interact in conversations. Uji answers questions and records concerns and feedback in French, Swahili, and Lingala. Two main translation tasks are involved in the project:

• Translation of professionally curated responses to common queries related to COVID-19. This content is originally prepared in French and needs to be translated into Swahili.
• Analysis of unclassified queries, complaints, and other actionable feedback from Swahili-speaking chatbot users.

For the first use case, we emulated a translation setup where a linguist would post-edit MT output to generate correct translations. For the second use case, we assessed machine translation output directly.

In the post-edited translation scenario, translators are given the option to use a machine translation system to translate a document segment by segment. Since MT is prone to error, these translations are then post-edited to ensure quality. One way of evaluating the quality of an MT system is to assess the amount of post-editing effort needed to obtain correct translations (Bentivogli et al., 2018).

Data. We selected 10 recently prepared response strings originally written in French. The responses covered various topics such as vaccinations, denialism, myths, and new virus variants. The strings contain an average of 2.3 sentences and 45 words.

Results. Table 5 compares the raw machine translations with their post-edited versions in terms of BLEU, HTER (Snover et al., 2006), and chrF (Popović, 2015).

Table 5: BLEU: 55.92 | TER: 0.37 | ChrF: 74.81

We found higher-than-average agreement; our evaluator reported the principal errors as missing words, untranslated French words appearing in the output, and confusion between plural and singular forms. An example segment with machine-translated (Raw-MT) and post-edited (PE-MT) versions is given below. Corrections made by the translator are marked in red.

French: Des cas de réinfection du COVID-19 ont été signalés mais sont rares. En général, la réinfection signifie qu'une personne a été infectée (est tombée malade) une fois, s'est rétablie, puis est redevenue infectée plus tard. Sur la base de ce que nous savons de virus similaires, certaines réinfections sont attendues.

For the direct assessment, we extracted 100 Congolese Swahili strings from user-submitted chatbot conversations. These strings were translated into French using our MT models. We then randomly separated the strings into four unique 25-question surveys, and each survey was shared with two translators. Each translator responded to only one survey. Translators were shown the source text along with the target text. The strings were shown independently of each other, not within the context of a wider document or corpus. Translators were then asked three simple questions, of which the first asked whether the main message of the source was conveyed and the second asked for a general quality rating.

We chose to omit the commonly used accuracy and fluency metrics (Graham et al., 2013), as we feel they are overly academic for our purposes. As our use case focuses on eliciting a general understanding of user-submitted content, it is less important for us to have target translations with perfect grammar or strong linguistic fluency. Instead, we focused on comprehension of the key message (Question 1) and general quality (Question 2) to evaluate the model's fitness for purpose.

To control for variations in evaluation scales, the results for each string were averaged between the two translators. Generally, there was strong agreement between the translators' rankings, with a median difference of 2 points on the 11-point scale. To control for individual differences, any string where the translators' scores for Question 2 differed by more than 3 points was omitted from the final analysis. In other words, if one translator provided a score of 3 and the other provided a score of 7, results for that string were omitted. We also removed seven strings that contained significant source content in Lingala. The final analysis was based on 71 strings.

Results for Question 1 were also averaged, but using a number-based conversion from the three-tier ranking to break the tie in the 28% of strings where the two translators responded differently. Each response was assigned a numeric value (No = 0, Kind of = 2, Yes = 4), and the values were averaged between the two translators. A score of 3 thus corresponds to a situation where one translator responded "Kind of" and the other responded "Yes."
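The reconciliation rules described above (averaging paired scores, discarding strings where Question 2 scores differ by more than 3 points, and mapping Question 1 answers to numbers) are straightforward to express in code. A short illustrative sketch, with function and variable names of our own choosing:

```python
Q1_VALUES = {"No": 0, "Kind of": 2, "Yes": 4}

def reconcile(judgments):
    """Apply the filtering and averaging rules described above.

    `judgments` is a list of per-string tuples:
    ((q2_score_a, q2_score_b), (q1_answer_a, q1_answer_b)).
    """
    kept = []
    for (qa, qb), (ma, mb) in judgments:
        # Omit strings where Question 2 scores differ by more than 3 points
        if abs(qa - qb) > 3:
            continue
        quality = (qa + qb) / 2                        # Question 2: 0-10 scale
        message = (Q1_VALUES[ma] + Q1_VALUES[mb]) / 2  # Question 1 conversion
        kept.append({"quality": quality, "message": message})
    return kept

# First string is omitted (scores 3 vs. 7 differ by 4); the second averages
# "Kind of" and "Yes" to a message score of 3, matching the example above.
print(reconcile([((3, 7), ("No", "Yes")), ((6, 7), ("Kind of", "Yes"))]))
```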
Data. All of the strings were real, user-submitted questions related to COVID-19. As these questions were submitted primarily through WhatsApp, the strings were short, averaging 11 words and 63 characters.

Results. On average, our strings were rated 6.3 out of 10, with a standard deviation of 3.0. It is important to note that we only defined the bottom and top of the scale, so some interpretation was required, and different translators could have applied different personal criteria for what counts as "good." In general, we did not notice a relationship between quality and string length, but that can partly be explained by the small range between our shortest (37 characters) and longest (144 characters) strings. We did notice a stronger relationship between the quality of the translation and whether the main message was conveyed. This is not surprising, as we would expect the main message to come across better when the translations are better. Still, the relationship suggests there may be utility in focusing basic human evaluations on simple questions (such as whether the main message was conveyed), as they are easier and quicker for translators to answer. In terms of comprehension, our evaluators reported that at least some of the key message of the source text came across 75% of the time (see Figure 2). For 48% of the strings, both evaluators agreed that the key message was being conveyed.

In this work, we have provided a case study of machine translation development in a low-resource setting, Congolese Swahili. We provided detailed experimental results on the use of cross-dialect transfer learning and semi-supervised learning methodologies. Automatic evaluation across various test sets showed improvements of 1.5 to 2.4 BLEU points in the SWC-FRA direction and 1.4 to 3.5 BLEU points in the FRA-SWC direction through the use of these methods. The results of evaluation tests by human translators demonstrate a potential for improving translation workflows in humanitarian relief scenarios.

To serve both the general population and translators, we provide a demo translator application serving our models at http://gamayun.translatorswb.org/. With the motivation of promoting NLP research and language technology development for Congolese Swahili, we open-source the resources created in this work: 25,302 clean, human-translated general-domain sentences; 928,065 synthetic parallel sentences; and the weights of the best-performing models. To ensure the replicability of our work, we also share our development pipeline and test sets at http://github.com/translatorswb/TWB-MT/tree/swc-fra-bidirectional.

Our current and future work involves integrating our models into our workflows. Our primary objective is to put this tool in the hands of TWB's translator community. We hope to further develop our models with the feedback we receive from linguists and non-professional translators of diverse backgrounds.
We also aim to achieve impact in humanitarian data collection, for example by surveying crisis-affected communities with machine-assisted translation.

References

∀ et al. 2020. Participatory research for low-resourced machine translation: A case study in African languages.
Agić and Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages.
Anastasopoulos et al. 2020. TICO-19: The Translation Initiative for COVID-19.
Barrault et al. 2019. Findings of the 2019 Conference on Machine Translation (WMT19).
Bentivogli et al. 2018. Machine Translation Human Evaluation: An investigation of evaluation based on post-editing and its relation with direct assessment.
Christodouloupoulos and Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation.
Graham et al. 2013. Continuous measurement scales in human evaluation of machine translation.
Joshi et al. 2020. The state and fate of linguistic diversity and inclusion in the NLP world.
Kingma and Ba. 2015. Adam: A method for stochastic optimization.
Klein et al. 2018. OpenNMT: Neural machine translation toolkit.
Lakew et al. 2020. Low resource neural machine translation: A benchmark for five African languages. CoRR.
Lewis et al. 2011. Crisis MT: Developing a cookbook for MT in crisis situations.
Öktem et al. 2020. Gamayun – Language technology for humanitarian response.
Papineni et al. 2002. BLEU: A method for automatic evaluation of machine translation.
Pauw et al. 2011. Exploring the SAWA corpus: Collection and deployment of a parallel corpus English–Swahili.
Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation.
Post. 2018. A call for clarity in reporting BLEU scores.
Sánchez-Martínez et al. 2020. An English–Swahili parallel corpus and its use for neural machine translation in the news domain.
Schwenk et al. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia.
Sennrich et al. 2016. Neural machine translation of rare words with subword units.
Snover et al. 2006. A study of translation edit rate with targeted human annotation.
Tiedemann and Thottingal. 2020. OPUS-MT – Building open translation services for the world.
Tiedemann. 2012. Parallel data, tools and interfaces in OPUS.
Vaswani et al. 2017. Attention is all you need.
Woodhouse. 1968. A note on the translation of Swahili into English.
Zoph et al. 2016. Transfer learning for low-resource neural machine translation.