title: Småprat: DialoGPT for Natural Language Generation of Swedish Dialogue by Transfer Learning
authors: Adewumi, Tosin; Brännvall, Rickard; Abid, Nosheen; Pahlavan, Maryam; Sabry, Sana Sabah; Liwicki, Foteini; Liwicki, Marcus
date: 2021-10-12

Building open-domain conversational systems (or chatbots) that produce convincing responses is a recognized challenge. Recent state-of-the-art (SoTA) transformer-based models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English. This work investigates, by an empirical study, the potential for transfer learning of such models to the Swedish language. DialoGPT, an English-language pre-trained model, is adapted by training on three different Swedish conversational datasets obtained from publicly available sources. Perplexity score (an automated intrinsic language model metric) and surveys by human evaluation were used to assess the performance of the fine-tuned models, with results that indicate that the capacity for transfer learning can be exploited with considerable success. Human evaluators asked to score the simulated dialogue judged over 57% of the chatbot responses to be human-like for the model trained on the largest (Swedish) dataset. We provide the demos and model checkpoints of our English and Swedish chatbots on the HuggingFace platform for public use.

The introduction of the chatbot Eliza in the 1960s marked an epoch in the area of conversational systems [34]. Since then, open-domain conversational systems have evolved [1, 2]. Advances in deep neural networks, such as transformer-based architectures, have brought improvements to the field [9, 24, 14]. These models have demonstrated SoTA performance in natural language understanding (NLU) and natural language generation (NLG) tasks [32, 13]. The advancements notwithstanding, challenges still exist in building conversational systems [18, 36], both technical and ethical [17]. This is all the more so because many of the models are originally pre-trained on English data [36], though researchers have recently been producing multilingual versions of some of the models [10, 8, 35]. Some of these multilingual models, however, have been shown to perform poorly compared to models trained entirely on the target-language data [31, 28].

In this work, we perform an empirical study of the performance of one of the recent SoTA models, DialoGPT (medium), on various Swedish datasets of different sizes. Their perplexity results are compared to that of the model we trained on the English MultiWOZ benchmark dataset. We further compare these experiments with an LSTM-based seq2seq baseline model with the attention mechanism, trained on the Gothenburg Dialogue Corpus (GDC) dataset. DialoGPT is an English pre-trained model for open-domain chatbots [36]. We are not aware of any previously published work that investigates an English pre-trained dialogue model fine-tuned to produce a model for a different target language. There are, however, multilingual models, which are pre-trained on unstructured text in several languages [10, 35]. We therefore investigate how the English pre-trained model performs in NLG (of dialogues) when fine-tuned on a foreign target language.
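For background on the single-turn setup used throughout, the sketch below shows how a DialoGPT-style checkpoint generates one reply with the HuggingFace transformers library. It is illustrative only: it loads the public English checkpoint, and the prompt is invented; a fine-tuned Swedish checkpoint can be loaded the same way.

```python
# A minimal, illustrative sketch of single-turn generation with a
# DialoGPT-style checkpoint via the HuggingFace transformers library.
# "microsoft/DialoGPT-medium" is the public English model; a fine-tuned
# Swedish checkpoint can be loaded the same way. The prompt is invented.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

prompt = "Does money buy happiness?"
# DialoGPT separates dialogue turns with the end-of-sequence token.
input_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=100,
                            pad_token_id=tokenizer.eos_token_id)
# Decode only the tokens generated after the prompt.
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:],
                         skip_special_tokens=True)
print(reply)
```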
We contribute the code (github.com/tosingithub/gemdesk), the datasets (or, where applicable, the APIs for extracting them) and the model checkpoints for public use, and we host the demos on the HuggingFace platform (huggingface.co/tosin/dialogpt_mwoz and huggingface.co/tosin/dialogpt_sv). The Swedish models are fine-tuned on extracted/crawled datasets. Swedish is the official language of Sweden and is spoken by more than 8.5 million people [26]. We show that generation of dialogues is possible, with reasonable performance, for a foreign target language even though the pre-training was in English.

There are a number of pre-trained models for open-domain conversational systems, including Texar [16], DLGnet [21], Meena [3] and BlenderBot [27]. These are pre-trained on dialogue datasets. There also exist models pre-trained on large text corpora and adapted for conversational systems; examples include T5 [25] and BART [19]. Another model pre-trained on conversational data, DialoGPT (dialogue generative pre-trained transformer), was trained on 147M exchanges of Reddit conversations [36]. In single-turn conversations, it achieved performance close to human in open-domain dialogues. DialoGPT is based on GPT-2 [24], an autoregressive model that achieved SoTA results on different NLP tasks [24].

In recent work on cross-lingual transferability, Artetxe et al. (2020) [5] suggest that deep monolingual models learn abstractions that generalize across languages. This is in contrast to past hypotheses that attribute the generalization ability of multilingual models to the shared subword vocabulary used across the languages and to joint training, as demonstrated for mBERT [23]. The performance of such multilingual models on low-resource and unseen languages is known to be poor [22, 33].

In evaluating the performance of open-domain chatbots, it has been shown that automatic metrics such as the BLEU score can be very poor, though they are still used in some cases [20]. Conversation turns per session is another metric of interest [37]. Perplexity is widely used for intrinsic evaluation of language models in pilot experiments; its theoretical minimum, which is its best value, is 1 [3]. Probably the best evaluation is carried out by human evaluators (or annotators), although it can be subjective. The judgment of human evaluators is seen as very important, especially since humans are usually the end-users of such systems [36].

We used the DialoGPT medium model, with 345M parameters and 24 transformer layers, in this work. We chose it because it was reported to have the best performance (compared to the small and large versions) across a set of related tasks [36]. The experiments were carried out on several Tesla V100 GPUs on an Nvidia DGX-1 server running Ubuntu 18. The datasets were split in the ratio 80:10:10 for the training, dev and test sets. Multiple runs (5) per experiment were conducted, and the average perplexity is reported in section 4. Although only one automatic metric (perplexity) was used to evaluate the models, it has been shown to correlate with a proposed human-evaluation metric called Sensibleness and Specificity Average (SSA) [3]. The conversation context was set to 7 turns during training; larger contexts bring memory challenges, hence 7 appears to be a good balance for training [3]. The fine-tuning process involved adjusting all the parameters of the pre-trained model.
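To make the setup concrete, the following is a minimal sketch of such full-parameter fine-tuning with the HuggingFace Trainer, assuming dialogue turns joined by the EOS token and a context of up to 7 turns. The sample data, paths and hyperparameters are illustrative, not the exact values used in our experiments.

```python
# A minimal sketch of full-parameter fine-tuning of DialoGPT-medium on
# conversational data, assuming each example is up to 7 consecutive
# dialogue turns joined by the EOS token (the DialoGPT convention).
# Sample data, paths and hyperparameters are illustrative only.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def encode_dialogue(turns, max_context=7):
    """Join up to `max_context` turns with EOS and tokenize."""
    text = tokenizer.eos_token.join(turns[:max_context]) + tokenizer.eos_token
    return {"input_ids": tokenizer.encode(text, truncation=True, max_length=512)}

# `dialogues` would be read from the extracted Swedish corpora.
dialogues = [["Hej!", "Hej, hur mår du?", "Bara bra, tack."]]
train_dataset = [encode_dialogue(t) for t in dialogues]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dialogpt-sv",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=train_dataset,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```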
Furthermore, we compare the DialoGPT experiments with a reasonable baseline: a seq2seq model trained on the Swedish GDC dataset. The seq2seq model is based on the LSTM architecture [15] and uses the attention mechanism [6]. The model has 6M trainable parameters, and a batch size of 64 is used.

Perplexity models the average predictability of a text (i.e. it quantifies the uncertainty in predicting the next token); the lower the perplexity, the better the model performs [3]. It is often used to evaluate language models built with n-grams of a text dataset [29]. Perplexity, $PP$, computes the probability $P$ of the test corpus, normalized by the total number of words, $N$, in the test corpus $W_{test}$. The normalization is done by taking the $N$th root of the inverse of the calculated probability, as in Equation 1:

$$PP(W_{test}) = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \quad (1)$$

In addition to the intrinsic evaluation, the best model in each category of datasets was evaluated on single-turn conversations by native/near-native Swedish speakers. The authors decided to test single-turn conversations, as practiced in the original paper, because the lack of long-term contextual information is still an open problem in conversational systems [36]. A scale of clearly human-like (4.0), somewhat human-like (3.0), not very human-like (2.0) and clearly not human (1.0) was provided. As in the original DialoGPT work [36], we drew 30 input sentences randomly from the test set for each model and recorded their corresponding responses. Seven human annotators then scored each conversation turn online, where the turns were assigned randomly from one of the three models or the human ground truth. The annotators were to use their best judgment to decide what they felt was human-like or otherwise on the four-grade scale for each conversation. For further analysis, the annotator scores were also collapsed into a binary score with the classes not human-like (0) and human-like (1), such that inter-annotator agreement could be defined simply as the mean of class agreement over all annotator pairs that scored the same model-question combinations. The English translation of part of the single-turn conversations of the familjeliv 1M+ model is available in the appendix. They were translated using Google Translate and reviewed by a Swedish native speaker. The original Swedish conversations are also available in the project repository (github.com/tosingithub/gemdesk).

A token vocabulary that includes all common words can become quite large and requires the use of an additional special token for unknown words. Byte-Pair Encoding (BPE) was introduced [30] to address both these shortcomings: it first identifies a base vocabulary of tokens consisting of all symbols that occur in the text, and then defines merge rules based on the frequencies of compound symbols, forming new tokens from two tokens of the base vocabulary. The merge process proceeds until the vocabulary has attained a desired fixed size. GPT-2 uses bytes as the base vocabulary, which forces the base vocabulary to be of size 256 while allowing all unicode characters, as well as higher-level subword components, basic words and common compounds, to be expressed in compressed form. With some additional rules to deal with punctuation, the tokenizer of GPT-2 can handle any text without the need for an unknown-word symbol. For GPT-2 the merge set consists of 50,000 tokens which, combined with the 256 base byte tokens and a special end-of-text token, yields a total vocabulary size of 50,257.
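To illustrate byte-level BPE in practice, the short sketch below (not from our codebase) tokenizes English and Swedish strings with the pretrained GPT-2 tokenizer; the example strings are arbitrary. It previews the lower compression rate on Swedish discussed next.

```python
# A small illustration (not from the paper's code) of GPT-2's byte-level
# BPE falling back to shorter subword/byte pieces on Swedish text.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

for text in ["family life", "familjeliv", "Smörgåsbord"]:
    ids = tok.encode(text)
    print(text, "->", len(ids), "tokens:", tok.convert_ids_to_tokens(ids))
# Common English words typically map to one or two tokens, while Swedish
# words, especially those with å/ä/ö (two bytes each in UTF-8), split
# into more pieces, lowering the compression rate.
```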
GPT-2's tokenizer, trained on English text, can thus also express words in Swedish (including words with the special Swedish characters å, ä and ö) because it relies on byte-level BPE. However, one can expect the compression rate to be lower, as the frequency and composition of subwords differ between the two languages; more often, the tokenizer will have to fall back on character-level tokens. Indeed, a comparison of the merge vocabularies of the (English) GPT-2 tokenizer and one constructed from the Swedish-language Gothenburg Dialogue Corpus (GDC) shows that only about 8,000 of the compound tokens are shared.

The authors experimented with various Swedish datasets and the English MultiWOZ. The Swedish datasets are conversational data from Reddit (2 sizes), Familjeliv (3 sizes) and the GDC [4]. The extracted data were pre-processed by removing emails, URLs, numbers and some special characters. Table 1 summarizes the datasets. Example conversation lines from all the Swedish datasets are available in appendix A.

Reddit (reddit.com) is a social discussion website with various communities, or subreddits. The discussions can be very informal and contain slang, offensive text and emojis. The Reddit data was constructed by breadth-first-search traversal using the applicable API wrapper, PRAW. Two sizes of the data were created and experimented with: the hot 800 topics (with about 60K conversation lines) and the hot 50 topics (with over 4K conversation lines) of the Swedish subreddits. The four Swedish subreddits from which data were extracted are sweden, svenskpolitik, swedishproblems and stockholm. The sentences in the conversations were given a maximum length of 500 characters.

Familjeliv (www.familjeliv.se) is a popular Swedish web forum. The name literally translates as "family life", and the site mainly targets adults with questions concerning pregnancy, parenthood and domestic life hacks. It has more than a million daily visitors (2016) with lively user forum participation. The fora were scraped by automated software over the course of a month to provide a dataset with over a million conversation turns.

The Gothenburg Dialogue Corpus (GDC) comprises 360 individual dialogues transcribed from recordings of about 25 different social activities, including debates, academic seminars and other situations [4]. The content is somewhat different from the other Swedish corpora used in this study, as it originates from real speech conversations and contains casual language, dialect, slang and sometimes broken grammar, typical of spoken language. It counts over 108K dialogue turns with over 1.3M tokens.

Among the many English conversational datasets available is MultiWOZ [7], a large, multi-domain, multi-task conversational dataset that has been used extensively since its creation. It consists of more than 10,000 dialogues, distributed between 70% multi-domain and 30% single-domain dialogues, and has been a standard benchmark for different dialogue problems. There are several versions of the dataset, with each new one bringing improvements [12].

Table 2 shows the mean perplexity results for the various datasets. We observe a trend of decreasing perplexity with increasing dataset size, as expected: more data implies an improved score. The best perplexity on the test set is obtained with MultiWOZ, followed by the Familjeliv set of over 1M turns, while Reddit 4K had the worst perplexity among the models trained with DialoGPT.
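For reference, perplexity for a causal language model such as DialoGPT is commonly computed as the exponential of the mean token-level cross-entropy over the test set, which is what the following minimal sketch does. The test lines are placeholders, and the exact evaluation script used in this work may differ.

```python
# A minimal sketch of computing test-set perplexity for a causal LM as
# exp(mean cross-entropy); the actual evaluation script may differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

test_lines = ["Hello, how are you?", "Fine, thanks. And you?"]  # placeholders
nll, n_tokens = 0.0, 0
with torch.no_grad():
    for line in test_lines:
        ids = tokenizer.encode(line + tokenizer.eos_token, return_tensors="pt")
        # With labels == input_ids, the model returns the mean
        # next-token cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
        n = ids.shape[-1] - 1  # number of predicted tokens
        nll += loss.item() * n
        n_tokens += n

print("perplexity:", math.exp(nll / n_tokens))
```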
The English MultiWOZ model has a better perplexity than any of the Swedish models, even though some of the latter have more conversation turns. This observation is not surprising, as the model was pre-trained on English, albeit Reddit-style, conversation. The seq2seq model had the worst perplexity overall. This is likely partly because it was not pre-trained; its architecture is also not directly comparable to that of DialoGPT.

Table 3 shows the results of the human evaluation of the presented Swedish single-turn conversations for three of the models, with the last row including scores for the original dialogue (i.e. replies by real humans). The scale is as discussed in the previous section. The overarching aim of the human evaluation was to collect a subjective opinion of how natural the simulated conversations appear to humans. The aggregate human-like ratio across the survey is presented in Table 4, where we see that the model trained on the large Familjeliv dataset was considered human-like in 57.3% of the assessments. This is when the scores on the four-level scale are collapsed into binary classes for simplicity: human-like (assigned value 1.0), which aggregates the clearly human-like and somewhat human-like assessments into one category, with the other two assessments placed in the class not human-like (value 0.0). The inter-annotator agreement scores in Table 4 can then be calculated as the mean of binary class agreement across all annotator pairs that scored the same model-question combinations.

We also note that the lower percentile of the human dialogue scores lies above the upper percentile of the best-performing chatbot (also visualised in Figure 1). We must therefore conclude that none of the models achieved human performance. However, this investigation seemingly agrees with the hypothesis that deep monolingual models learn abstractions that generalize across languages, as demonstrated by [5], even though their experiments differ from those carried out in this work. Indeed, in this work, less computational effort was needed to demonstrate this.

Demos of the English and Swedish chatbots, including the model checkpoints, are hosted on the HuggingFace platform (huggingface.co/tosin/dialogpt_mwoz and huggingface.co/tosin/dialogpt_sv). It should be pointed out that there are risks in using the models, such as producing dialogue texts that contain unwanted bias, stereotypes or offensive language. This is a well-known but hard-to-avoid problem, owing to the difficulty of removing such material from the underlying data sources on which the models are trained [36].

In this work, we show through an empirical study, using the recent SoTA model DialoGPT-medium [36], that a model pre-trained in English can be adapted by fine-tuning on a foreign-language corpus such that its NLG performance for dialogues in the target language is good. The foreign target language in this study is Swedish, and the investigation was carried out using three differently-sourced Swedish datasets. Human evaluators asked to score the simulated single-turn conversations judged about 57% of the chatbot responses to be human-like for the model trained on the largest Swedish dataset. Despite the advances in deep neural networks and the SoTA performance demonstrated in NLG tasks [32, 13] by models such as DialoGPT, challenges still exist [36]. Hence, future work may involve attempts to understand the mathematics of languages/linguistics and their relatedness, a comparison of the performance observed here with that of a model pre-trained on Swedish data, and the possibility of using a Swedish tokenizer with a model pre-trained on Swedish data. It may also involve finding better decoding schemes than the existing ones.
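As a supplement to the human-evaluation description above, the sketch below shows one way to collapse the four-level scores into binary classes and compute the mean pairwise inter-annotator agreement reported in Table 4. The survey scores shown are invented for illustration; they are not the actual survey data.

```python
# A minimal sketch of collapsing the four-level human-likeness scores
# into binary classes and computing mean pairwise annotator agreement,
# as described above. The survey data here is illustrative only.
from itertools import combinations

# scores[(model, question)] -> list of annotator scores on the 1-4 scale
scores = {
    ("familjeliv_1M", 1): [4.0, 3.0, 2.0],
    ("familjeliv_1M", 2): [3.0, 3.0, 4.0],
}

def to_binary(score):
    # 3.0 and 4.0 collapse to human-like (1); 1.0 and 2.0 to not human-like (0).
    return 1 if score >= 3.0 else 0

agreements = []
for annotator_scores in scores.values():
    binary = [to_binary(s) for s in annotator_scores]
    # Class agreement over all annotator pairs for this model-question item.
    agreements.extend(1 if a == b else 0
                      for a, b in combinations(binary, 2))

print("mean pairwise agreement:", sum(agreements) / len(agreements))

human_like_ratio = (sum(to_binary(s) for v in scores.values() for s in v)
                    / sum(len(v) for v in scores.values()))
print("human-like ratio:", human_like_ratio)
```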
Appendix: English translations of sample single-turn conversations.

Prompt: It turns out that the girl smeared powder into her son's hair and painted his face with mascara.
BOT: Then he got angry and shouted that he was going to report this to the police.

Prompt: All rear-facing chairs are safer than turning one over for a young child. And then they've gone through the same tests to be sold.
BOT: The only thing they do anyway is drive a chair. Is it really worth it?

Prompt: Probably wouldn't have been able to keep your mouth shut in front of them either. It's okay to get angry with both the parents and their children when she hurts your child.
BOT: I love my children and do everything I can to make them feel good and have a good time.

Prompt: So I think merino wool can work. I think the usual wool itched a little, because the baby wasn't completely satisfied before she fell asleep...
BOT: I'm also afraid it's going to be too much mess on the clothes, having a hard time buying second-hand and can't really afford to castrate them.

Prompt: I have worked in healthcare for 10 years and have worked 7 Christmases so far. So now it's my turn to be free.
BOT: The first year I have only been on sick leave for 3 weeks and it was changed to 3 days a week. But I don't have a job to go back to.

Prompt: I think it's perfectly OK to let your child become a child model provided the child enjoys the camera and thinks it's fun.
BOT: We've had cameras that have been known to be good, and it's worked great.

Prompt: Double-check with the social insurance fund so you don't get caught for benefit fraud.
BOT: Thanks for the tip! I have already been in contact with fk and they would send me forms that I will submit.

Prompt: I'm afraid if he goes over, he's going to be that flamboyant again who's going to assert himself with new classmates, and then he's safe in the class he's in. Anyone have any wise thoughts on this?
BOT: He loves being in kindergarten and therefore enjoys it. However, he eats poorly and does not want to eat anything other than water and juice.

Prompt: In many countries, babies are given to e.g. day mother, grandmother, aunt... because parental insurance is not as generous as ours.
BOT: Then even if it were still good ownership, it increases the cost of a single person to make ends meet.

YOU: so we can go through it together ...
BOT: but I still want to look around a little before I get into it, to see what to expect.

Prompt: there are no parking facilities where as much as I know
BOT: You have no obligation to report anything. It is not illegal to park in a family car park.

Prompt: no I don't think we can make ourselves independent of nature or more independent of that nature
BOT: What exactly do you think? That there are also plants that can produce man in this way?

Prompt: but you won't get into the teacher's college later
BOT: No, but I can go in because of the training.

Prompt: but she can get evil too
BOT: And what should I do about it?
Prompt: uh we try flying first and we'll see
BOT: I've never been to the airport and don't know how it works.

Prompt: well then it counts the different assets as natural products then what about manure should it count you would count it as a natural resource
BOT: Natural products are eaten to increase the greenhouse gas in the body, which in turn causes the burning of natural tissues in nuclear power.

Reddit examples: "Om man tycker att munskydd är fånigt, tycker man då att ansiktsmask är ansiktslarv?" ("If you think face masks are silly, do you then think a facial mask is facial nonsense?")

References
[1] Conversational systems in machine learning from the point of view of the philosophy of science - using AliMe chat and related studies.
[2] Vector representations of idioms in chatbots.
[3] Towards a human-like open-domain chatbot.
[4] Annotations and tools for an activity based spoken language corpus.
[5] On the cross-lingual transferability of monolingual representations. Association for Computational Linguistics, July 2020.
[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, ICLR 2015, 2015.
[7] MultiWOZ: a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling.
[8] Cross-lingual language model pretraining.
[9] Pre-training of deep bidirectional transformers for language understanding.
[11] Bootstrap methods: another look at the jackknife. The Annals of Statistics.
[12] MultiWOZ 2.1: a consolidated multi-domain dialogue dataset with state corrections and state tracking baselines.
[13] The GEM benchmark: natural language generation, its evaluation and metrics.
[14] DeBERTa: decoding-enhanced BERT with disentangled attention.
[15] Long short-term memory.
[16] Texar: a modularized, versatile, and extensible toolkit for text generation.
[17] Understanding the role of objectivity in machine learning and research evaluation.
[18] Speech and Language Processing: An Introduction to Natural Language Processing.
[19] BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
[20] Conversational chatbots with memory-based question and answer generation.
[21] Multi-turn dialogue response generation with autoregressive transformer models.
[22] AdapterHub: a framework for adapting transformers.
[23] Association for Computational Linguistics.
[24] Language models are unsupervised multitask learners.
[25] Exploring the limits of transfer learning with a unified text-to-text transformer.
[26] Pluricentric languages: differing norms in different nations.
[27] Recipes for building an open-domain chatbot.
[28] Is multilingual BERT fluent in language generation?
[29] Perplexity minimization for translation model domain adaptation in statistical machine translation.
[30] Neural machine translation of rare words with subword units.
[31] Multilingual is not enough: BERT for Finnish.
[32] SuperGLUE: a stickier benchmark for general-purpose language understanding systems.
[33] Towards zero-label language learning.
[34] A computer program for the study of natural language.
[35] mT5: a massively multilingual pre-trained text-to-text transformer.
[36] DialoGPT: large-scale generative pre-training for conversational response generation.
[37] The design and implementation of XiaoIce, an empathetic social chatbot.