authors: Gehrmann, Sebastian; Adewumi, Tosin; Aggarwal, Karmanya; Ammanamanchi, Pawan Sasanka; Anuoluwapo, Aremu; Bosselut, Antoine; Chandu, Khyathi Raghavi; Clinciu, Miruna; Das, Dipanjan; Dhole, Kaustubh D.; Du, Wanyu; Durmus, Esin; Dušek, Ondřej; Emezue, Chris; Gangal, Varun; Garbacea, Cristina; Hashimoto, Tatsunori; Hou, Yufang; Jernite, Yacine; Jhamtani, Harsh; Ji, Yangfeng; Jolly, Shailza; Kumar, Dhruv; Ladhak, Faisal; Madaan, Aman; Maddela, Mounica; Mahajan, Khyati; Mahamood, Saad; Majumder, Bodhisattwa Prasad; Martins, Pedro Henrique; McMillan-Major, Angelina; Mille, Simon; Miltenburg, Emiel van; Nadeem, Moin; Narayan, Shashi; Nikolaev, Vitaly; Niyongabo, Rubungo Andre; Osei, Salomey; Parikh, Ankur; Perez-Beltrachini, Laura; Rao, Niranjan Ramesh; Raunak, Vikas; Rodriguez, Juan Diego; Santhanam, Sashank; Sedoc, Joao; Sellam, Thibault; Shaikh, Samira; Shimorina, Anastasia; Cabezudo, Marco Antonio Sobrevilla; Strobelt, Hendrik; Subramani, Nishant; Xu, Wei; Yang, Diyi; Yerukola, Akhila; Zhou, Jiawei

title: The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

date: 2021-02-02

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. However, due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of corpora and evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the initial release for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.

* Correspondence to gehrmann@google.com

Natural language generation is the task of automatically generating understandable texts, typically using a non-linguistic or textual representation of information as input (Reiter and Dale, 2000).
These texts aim to fulfill an underlying communicative goal (e.g., to produce a summary of an article) while remaining faithful to the input information, fluent, grammatical, and natural-looking. An NLG system needs to be robust to shifts in the data distribution and be able to produce text in many different languages. Finally, it is often desired that repeated interactions with the model produce diverse outputs, for example, to explain concepts in multiple ways or to become a more interesting conversational agent. All these optimization objectives can often be conflicting (Hashimoto et al., 2019) and, as a result, evaluations that focus only on a single aspect may fail to recognize the drawbacks of a particular method. To demonstrate this trade-off, consider an improvement on the CNN-DM summarization dataset (Hermann et al., 2015; Nallapati et al., 2016) measured by the ROUGE-L metric (Lin, 2004). Since ROUGE only tests the extent to which a generated summary has a lexical overlap with a reference summary, it can erroneously produce high scores for fluent, yet meaningless and unfaithful outputs as long as many of the same words are used (Maynez et al., 2020; Gabriel et al., 2020). Moreover, ROUGE tends to favor systems that produce longer summaries (Sun et al., 2019). It is thus crucial to carefully assess the progress of NLG toward all of its goals at the same time in ways that evolve alongside the models. This is currently not the case; new models are evaluated on different datasets, most of which focus only on the English language (Bender, 2019), using these flawed metrics. Moreover, while human evaluations of generated texts can provide complementary insights to automatic evaluation (Manning et al., 2020), they can also lead to contradictory results since studies often omit crucial replication details and assume different definitions of the measured quantities (Howcroft et al., 2020). We propose a living benchmark called GEM (Generation, Evaluation, and Metrics) that aims to enable research on a wide range of NLG challenges. To avoid the fallacy of encouraging hill climbing on a leaderboard (Linzen, 2020), GEM focuses on an in-depth evaluation of model outputs across human and automatic evaluation that aims to uncover shortcomings and opportunities for progress. As datasets, metrics, and models improve, the benchmark environment will improve as well, replacing "solved" tasks with more challenging ones, incorporating newly developed metrics, and addressing discovered flaws in the experimental setup, as demonstrated in Figure 1. Making all model outputs available under an open-source license will support evaluation research and integrating new metrics will, in turn, help their adoption and increase the robustness of model evaluations. The initial set of eleven included datasets is presented in Table 1. They measure specific generation challenges, such as content selection and planning, surface realization, paraphrasing, simplification, and others (Reiter and Dale, 2000; Gatt and Krahmer, 2018). In addition to those challenges, GEM datasets also differ in their communicative goals, languages, the noisiness of data, and resource availability, which allows us to evaluate the consistency of evaluation schemes. About half of the datasets have multiple references and more than half were post-processed to improve data quality. The sizes range from 5k to 500k data points.

Figure 1: The opportunities of living benchmarks and pitfalls of evaluation. As models improve, we need consistent evaluations such that models can be compared to each other. This can only happen if we develop robust human evaluation standards and improve our automated metrics. Otherwise, results are challenging to interpret and compare to each other. Finally, as models improve and metrics saturate, we need to evaluate them on more challenging datasets instead of continuing to move sideways on old ones. GEM aims to provide this environment for natural language generation.
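To make the lexical-overlap failure mode discussed above concrete, the following toy sketch (Python; an illustration only, not the official ROUGE implementation or part of the GEM tooling, and all strings are invented) scores a faithful paraphrase and an unfaithful near-copy against the same reference with a ROUGE-1-style unigram F1.

    from collections import Counter

    def unigram_f1(candidate: str, reference: str) -> float:
        """ROUGE-1-style unigram overlap F1 (toy illustration only)."""
        cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    reference = "the company reported record profits in 2019"
    faithful_paraphrase = "profits hit an all-time high at the firm last year"
    unfaithful = "the company reported record losses in 2019"  # contradicts the reference

    print(unigram_f1(faithful_paraphrase, reference))  # ~0.24: heavily penalized for rewording
    print(unigram_f1(unfaithful, reference))           # ~0.86: rewarded despite the factual error

The faithful paraphrase is punished for using different words, while the contradictory output scores highly because it shares almost all tokens with the reference.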
GEM features seven languages across all tasks and two of the datasets do not include English at all. To be able to properly assess the performance of models in a way that is robust to the shortcuts a model can take, we additionally introduce challenging test sets that probe for specific modeling aspects (Perez-Beltrachini and Gardent, 2017; Ribeiro et al., 2020). To ensure that research with GEM is conducted responsibly, all the datasets are documented in an NLG-specific version of data cards (Bender and Friedman, 2018; Gebru et al., 2018) that we developed and for which we release a template and guide.

Table 1: A description of all the datasets included in GEM. The tasks vary in communicative goal, data size, and input type. * indicates changes from the originally published dataset made for GEM. Among its entries: MLSum (summarize relevant points within a news article; *de/es; *520k; Articles), Schema-Guided Dialog (provide the surface realization for a virtual assistant; en; *165k; Dialog Act), ToTTo (produce an English sentence that describes the highlighted cells in the context of the given table), and WikiLingua (produce high quality summaries of an instructional article; *en/es/ru/tr/vi; *175k; Article).

Disclaimer: This paper currently describes the initial release of the GEM training and validation sets in support of the announcement of the shared task at ACL 2021. Some aspects of GEM are deliberately omitted and will be publicized upon release of the test sets. We will update this paper at that time to reflect the changes and extensions. More information can be found on our website https://gem-benchmark.com/.

In this section, we summarize common criticisms of benchmarks in NLP, discuss how they apply to NLG, and describe how we plan to address them. Then, we describe opportunities that GEM can provide. NLP benchmarks such as GLUE (Wang et al., 2019b) are common for natural language understanding (NLU) tasks. They aggregate multiple tasks under a unified evaluation framework, which enables researchers to fairly compare their models to others. Due to the improved model comparability, benchmarks are critical in measuring modeling progress. However, they also pose a risk that progress is reduced to the single number shown in a benchmark's leaderboard and thus may encourage blindly optimizing it without regard to other considerations like model size or fairness (Ethayarajh and Jurafsky, 2020). This is especially challenging for benchmarks in NLG since, as discussed above, the performance cannot be described through a single metric and it is often not clear what metric to optimize for. This shortfall can be seen in benchmarks like DecaNLP (McCann et al., 2018) and GLGE, which include NLG tasks but focus only on a single metric and, as a result, may mischaracterize a system's performance. Moreover, an easy-to-use data infrastructure also disincentivizes researchers from interacting with and conducting in-depth analyses of the data sets that models are trained on.
The limited analysis delegates to the creators of the benchmark the responsibility of ensuring that all included datasets have been collected fairly (Denton et al., 2020). The dataset and benchmark creators thus must provide in-depth statements that describe the data characteristics and surface potential issues, and they must consider these issues when selecting datasets for a benchmark (Gebru et al., 2018; Bender and Friedman, 2018). These dangers emphasize that the selection of datasets for a benchmark needs to be done carefully, that the setup has to remain flexible enough to address newly found limitations, and that the benchmark should not reduce to climbing a leaderboard. Instead, a living benchmark that can adjust its datasets and specific evaluation metrics can be much more powerful and long-lived. This can, for example, be seen in Dynabench (https://dynabench.org/; Potts et al., 2020), which has a static evaluation but interactively adds more test data through a human-in-the-loop approach.

Increasing multilingualism of NLG research. Another potentially harmful choice by benchmark creators is the choice of the languages of the included datasets. It is often assumed that work on English transfers to other languages (Bender, 2011). However, this assumption does not consider differences between the languages that lead to higher modeling complexity, for example, a richer morphology or a flexible word order. Still, the majority of work in NLP and almost all benchmarks exclusively focus on English (e.g., Wang et al., 2019b; McCann et al., 2018). Even if multiple languages are considered, the availability of data in a language often does not reflect the number of speakers of that language. This means that work on languages with little available data can potentially impact many more people than work on highly resourced languages (Joshi et al., 2020). As a result, many recent benchmarking and dataset creation efforts in NLU develop and focus on tasks that are inherently multilingual or which explore cross-lingual transfer. For example, XTREME (Hu et al., 2020) introduces a benchmark covering 40 languages across multiple NLU and retrieval tasks, XCOPA (Ponti et al., 2020) is a commonsense reasoning dataset for eleven languages, and MLQA (Lewis et al., 2020b) is a dataset for extractive question answering across seven languages. We can observe a similar recent trend in natural language generation, where MLSum (Scialom et al., 2020) and WikiLingua (Ladhak et al., 2020) were created as multilingual summarization datasets. There have also been first steps toward including NLG tasks in multilingual NLU benchmarks. For example, XGLUE includes Question and News Title Generation (Liang et al., 2020). Unfortunately, XGLUE reduces the generation evaluation to BLEU-4, a metric that is inadequate for NLG (Reiter, 2018). There have also been multiple shared tasks in NLG that focus on multilingualism, for instance, the shared task on multilingual surface realization, which includes eleven languages (Mille et al., 2018, 2019, 2020). The shared task on document-level generation and translation featured German and English generation challenges (Heafield et al., 2020). The WebNLG+ shared task asked participants to contribute models that can realize text in Russian and English (Ferreira et al., 2020). A benchmark that focuses only on NLG can enable much richer evaluation (as described in the next sections), and promote non-English datasets. In addition, it can ensure that the datasets created for those shared tasks continue being evaluated.
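Working with multilingual corpora of this kind is straightforward with the Hugging Face datasets library. A minimal sketch, under the assumption that the German portion of MLSum is hosted on the Hub under the name "mlsum" with a "de" configuration and "text"/"summary" columns (the dataset card is the authoritative source for exact names and fields):

    from datasets import load_dataset

    # Assumption: German MLSum is available as load_dataset("mlsum", "de");
    # check the dataset card for the exact configuration and column names.
    mlsum_de = load_dataset("mlsum", "de", split="validation")

    example = mlsum_de[0]
    print(example["text"][:200])   # source article (German)
    print(example["summary"])      # reference summary
    print(len(mlsum_de))           # number of validation examples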
Providing a testbed for automated evaluation. Most traditional automated metrics, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), measure the n-gram overlap between a reference and the generated text. However, in most cases, there is more than one correct way to generate a text, especially in tasks with a latent content planning or selection step (Reiter and Dale, 2000). That means that a correct solution may score low on a metric. While multiple references alleviate the issue somewhat, these metrics still have a low correlation with human judgments (Reiter, 2018; Fabbri et al., 2020). To address the issue, the machine translation community has been organizing yearly metrics shared tasks which produce metrics that achieve a high correlation (Stanojević et al., 2015; Bojar et al., 2016, 2017; Ma et al., 2018, 2019; Mathur et al., 2020b). The latest metrics focus on semantic equivalence instead of lexical similarity, which improves the correlations drastically. However, recent work by Fabbri et al. (2020) demonstrates that this may not hold in summarization, where the automated metric BERTScore (Zhang et al., 2020b) does not improve upon the correlation of ROUGE. Moreover, Mathur et al. (2020a) and Freitag et al. (2020) find that when comparing two high-quality systems, differences according to a metric may also stem from how references are written or flaws in the metric itself. Given that automated metrics perform differently across tasks, setups, and languages, a multi-task NLG benchmark has the opportunity to act as a testbed to evaluate how the latest advances in automated metrics perform on these different tasks. The benchmark can facilitate this research through the release of system outputs and associated human annotations, which is what we are planning to do with GEM. Moreover, we allow the integration of additional metrics into our living benchmark system, which enables a much faster adoption.

Developing reproducible human evaluation standards. In recent work, Howcroft et al. (2020) investigated NLG papers from the last twenty years and found that evaluation methodologies differ drastically across papers. Moreover, in most cases, it is not even mentioned what the human evaluation aims to measure, and definitions of measures like "accuracy" or "fluency" are inconsistent. They thus suggest reporting standards for criteria and methods, following a classification system proposed by Belz et al. (2020). In addition, regularly scheduled shared tasks like WMT have led to standardization of human evaluation setups and enabled controlled experimentation with them. GEM has the opportunity to develop reproducible standards for how human evaluation for NLG tasks beyond translation should be conducted while at the same time incorporating lessons from related work. Acting on the same need, the recently proposed GENIE (Khashabi et al., 2021) system aims to automate and standardize the human evaluation of different NLG systems, however with the contrasting goal of reducing the evaluation to a leaderboard-like score.

As highlighted in Figure 1, the selection of included datasets is an integral part of a benchmark. They should be challenging for models, but it should still be possible to evaluate models trained on them. Moreover, the datasets should cover a wide range of relevant generation challenges that allow for findings to be as general as possible.
Finally, the datasets should cover tasks that are interesting for contributors to work on, to facilitate the wide adoption of the benchmark. To collect datasets with those desired properties, the selection methodology for GEM is composed of three steps. First, we elicited a set of proposals from everyone involved in the effort. Second, we identified criteria for the selection. Third, all GEM members voted on individual dataset and criteria utilities. The final selection maximizes the utility under constrained resources, similar to a knapsack solver (a small illustrative sketch of this selection step follows the selection principles below). This can be seen as an extension of the selection process of SuperGLUE (Wang et al., 2019a), which had similar first and second steps but made the final decision based on which datasets were harder for a baseline model to solve after identifying a final set of candidates. Since we are going to introduce challenge sets, the baseline performance of models on a dataset matters less.

Dataset Elicitation. In the first step, all GEM participants were asked to suggest datasets following the schema provided in Appendix A. The categories included multiple brief categorizations, such as a description of the challenge that this dataset provides, its high-level task, and the communicative goal of an agent trained on the data. Following our goal to focus on non-English languages, we further asked for the languages included in the dataset, as well as the language locale. This step yielded 35 proposed datasets, listed in Appendix B.

Estimating Task+Criterion Utility. The second step focused on identifying the criteria that would inform the final selection. The initial set of criteria was selected through open discussion involving all members. We split criteria into "hard" and "soft" ones: hard criteria would lead to the definite inclusion/exclusion of a task if (not) satisfied, while soft criteria inform the utility of the remaining tasks. All GEM members filled out a survey asking them to rate, on a 5-point Likert scale, how much they wanted to see a task included in GEM. Additionally, we posed yes/no questions for all considered hard criteria and various questions about the soft criteria (e.g., "what percentage of the tasks should feature non-English language?", or "do we prefer noisy or clean datasets?"). Finally, the survey included open text fields that asked for (1) comments on any of the tasks, (2) comments or suggestions on hard exclusion criteria, and (3) suggestions of additional criteria. The full list of questions is shown in Appendix C. The survey received 28 responses, revealing that the initial version of GEM should include a median of 10 tasks or an average of 12. Of those tasks, about a third should feature non-English language.

Selected Criteria. For the hard criteria, there was an agreement to focus only on open-access datasets and that concurrent or past shared tasks for the same datasets are not an issue. Overall, the sentiment determined the following selection principles:

• We focus on diverse high-level tasks over a single high-level task evaluated in-depth. However, each high-level task should include multiple datasets.
• We focus on clean datasets to avoid conflating model mistakes and learned noise.
• We include a mix of high- and low-resource datasets.
• We focus on data with interesting test sets.
• We should not focus on the quality of current evaluation strategies for a given dataset.
• We prefer multi-reference datasets since those have been shown to lead to more robust automatic evaluation.
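The sketch referenced above illustrates the flavor of this knapsack-like selection step in Python. The ratings, costs, and budget are invented for illustration; the actual GEM selection additionally weighed the hard and soft criteria discussed in this section rather than relying on a single score.

    # Hypothetical data: per-dataset Likert ratings (1-5) from the member survey
    # and a rough "cost" (e.g., evaluation effort). All values are invented.
    ratings = {
        "dataset_a": [5, 4, 4, 2], "dataset_b": [3, 3, 4, 5],
        "dataset_c": [1, 2, 4, 3], "dataset_d": [5, 5, 3, 4],
    }
    cost = {"dataset_a": 2, "dataset_b": 1, "dataset_c": 1, "dataset_d": 3}
    budget = 4  # total resources available for the benchmark

    def utility(scores):
        # Map Likert ratings to -1/0/+1 and average, mirroring the ranking
        # step described in the next section.
        signs = [1 if s > 3 else -1 if s < 3 else 0 for s in scores]
        return sum(signs) / len(signs)

    # Greedy knapsack-style selection: highest utility per unit cost first.
    ranked = sorted(ratings, key=lambda d: utility(ratings[d]) / cost[d], reverse=True)
    selected, spent = [], 0
    for d in ranked:
        if utility(ratings[d]) > 0 and spent + cost[d] <= budget:
            selected.append(d)
            spent += cost[d]

    print(selected)  # ['dataset_b', 'dataset_a']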
High-Level Tasks. Since these principles dictate that we should focus on a small set of high-level tasks, we used the free-text replies to evaluate the interest in different high-level tasks. Grouping the proposed tasks yielded the following candidates: Summarization, Dialog, Simplification/Compression, Question Answering, Creative Writing, Data-to-Text, and Question Generation. There was a preference to exclude image inputs and question answering because those tasks add complexity to the evaluation beyond the generated text. Moreover, since creative generation tasks like story generation and poetry generation suffer even more from inadequate evaluation approaches, there was a consensus to not include them. There was, however, a strong preference for the high-level tasks Summarization, Data-to-Text, and Dialog.

Specific Datasets. The final selection is shown in Table 1. To arrive at the selection, we first ranked all datasets by their average rating. For this, we treated positive ratings as 1, negative ratings as -1, and neutral ratings as 0. The highest-ranked datasets were E2E with 0.577, XSum with 0.538, and ToTTo with 0.461. Unfortunately, non-English datasets were ranked lower, with only WebNLG and MLSum among the top 15 datasets. We grouped all datasets by their high-level tasks and selected a group that would not violate the selection principles (e.g., only high-resource tasks). If two datasets fit, we picked the one with a higher interest rating. Among the 11 datasets, we have seven different languages, and the dataset sizes range from 5,000 examples to 1.5M, with most datasets between 50-150k examples. Two of them do not include English at all, which we hope reduces the dependence of the modeling approaches on anglo-centric pretraining (Anastasopoulos and Neubig, 2020). The high-level tasks include Dialog, Summarization, Data-to-Text, and Simplification. About half of the datasets have multiple references and more than half had post-processing steps applied to them to ensure high data quality.

We produce data cards (Bender and Friedman, 2018; Gebru et al., 2018) for all data sets in GEM, for which we developed an NLG-specific template. In addition to describing the data itself, the cards acknowledge potential limitations of a dataset regarding its creation process and describe its real-world use cases to ensure that the research is conducted responsibly.

These datasets are the base selection, and as part of GEM, we may change datasets and how they are used. For example, we may improve the training sets, make the test sets more challenging, or probe for specific skills a model must exhibit with test-only datasets (Perez-Beltrachini and Gardent, 2017; Linzen, 2020; Ribeiro et al., 2020; Schlegel et al., 2020). We may also ask to evaluate a single model on multiple test sets, following the design by Dua et al. (2019). For this release of the training sets, we are including modifications to several of the datasets: (1) MLSum: We excluded all languages besides Spanish and German since the sources for other languages disallow scraping content. Additionally, we removed all duplicate items (i.e., items with the same input text) and we used langdetect to filter out examples that were in the wrong language (see the sketch below). In total, 147 examples were removed from the German portion (0.06%) and 7,417 examples were removed from the Spanish portion (2.5%).
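The sketch below approximates this kind of cleaning in Python with the langdetect package. It is a simplified illustration rather than the exact GEM preprocessing code, and the field names and example texts are placeholders.

    from langdetect import detect

    def clean_split(examples, expected_lang="de"):
        """Drop exact input duplicates and examples not in the expected language."""
        seen_inputs = set()
        kept = []
        for ex in examples:  # each ex is assumed to be a dict with "text" and "summary" keys
            text = ex["text"]
            if text in seen_inputs:
                continue  # duplicate item: same input text as an earlier example
            seen_inputs.add(text)
            try:
                if detect(text) != expected_lang:
                    continue  # wrong-language example
            except Exception:
                continue  # too little or too ambiguous text to detect a language
            kept.append(ex)
        return kept

    examples = [
        {"text": "Der lange Artikel beschreibt die Lage der Wirtschaft im Detail.",
         "summary": "Kurzfassung des Artikels."},
        {"text": "This long article describes the state of the economy in detail.",
         "summary": "Short summary of the article."},  # filtered out for lang "de"
    ]
    print(len(clean_split(examples, "de")))  # expected: 1 (the German example)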
(2) XSum: Summaries in this dataset often have divergence issues between the source and target texts since gold summaries are introductory sentences prefacing each article. Models agnostic to such noise are vulnerable to hallucinations (Dhingra et al., 2019). To combat this, we fine-tuned a BERT-based (Devlin et al., 2019) classifier on 500 document and gold summary pairs, manually annotated for faithfulness (Maynez et al., 2020), and excluded all document-summary pairs from the original XSum dataset for which the classifier did not confidently (p(faithful) > 0.8) judge the summary to be faithful to the document. (3) Schema-Guided Dialog: We are focusing on the response-generation part of the dataset and thus reformatted the dataset to treat the service agent utterances as the targets to be generated and the previous customer utterance and the agent's dialog act as the input. We additionally reformat the dialog acts to directly conform to the format described in the paper (Kale and Rastogi, 2020). (4) WikiLingua: We focus on the same five languages that were benchmarked in its original release (en, es, ru, tr, vi). Specifically, we are focusing on assessing the cross-lingual alignment ability by varying the input language but always generating English. The modifications to the remaining datasets will affect only test sets and thus be released later.

Since the GEM test sets and final metrics selection have not been released yet, we describe an experimental setup that will ensure that participating models are trained correctly and evaluated on publicly available data with available metrics that will give a sufficient indication of a model's performance. To do this, we are reporting the results of the baseline models on the validation sets.

Much of the recent modeling progress in NLP can be attributed to the rise of the pretrain-then-finetune paradigm which has led to consistently better results. This finding is consistent with human judgments for summarization, as shown by Fabbri et al. (2020), among others. However, many of the tasks included in GEM may not benefit from a language model encoder since their input is not natural language. We thus apply a variety of different architectures that vary in size, complexity, and training schema. Our main baselines are T5 with 60M parameters (Raffel et al., 2020) and BART with 139M parameters (Lewis et al., 2020a). For non-English datasets, we use their multilingual counterparts mT5 (Xue et al., 2020) and mBART (Liu et al., 2020b). We additionally train the following baselines on a subset of tasks: TGen (with added language model and lemma tags, denoted as TGen+/++) (Dušek and Jurčíček, 2016b), an architecture for generation from dialog acts; an LSTM-based sequence-to-sequence model with attention (Bahdanau et al., 2015); DialoGPT (Zhang et al., 2020c), a pretraining approach for conversational models; and PEGASUS (Zhang et al., 2020a), which uses a summarization-specific pretraining schema that masks and predicts entire sentences. For WikiLingua, we additionally report results on a setup proposed by Ladhak et al. (2020) which includes first training a monolingual model followed by finetuning with the correct source language, coupled with synthetic data generated through translation (mBART+). Almost all baselines can be reproduced on a GPU-based Colaboratory notebook within 2-3 hours.
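A minimal fine-tuning sketch in the spirit of these baselines, using the Hugging Face transformers and datasets libraries. It is not the official GEM baseline script; the dataset name, column names, model size, and hyperparameters are illustrative assumptions (shown here for XSum-style summarization with T5-small).

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    # Assumption: XSum is available on the Hub as "xsum" with "document"/"summary" columns.
    raw = load_dataset("xsum")
    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    def preprocess(batch):
        inputs = tokenizer(["summarize: " + d for d in batch["document"]],
                           max_length=512, truncation=True)
        labels = tokenizer(batch["summary"], max_length=64, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="t5-xsum-baseline",
                                      per_device_train_batch_size=8,
                                      num_train_epochs=1,
                                      predict_with_generate=True),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()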
As mentioned above, GEM provides a testbed for automated metrics and can be used to popularize newly developed ones. Thus, models are evaluated via a constantly expanding list of metrics and, to avoid overfitting to known metrics, we will use metrics on the test submissions that are not included in this initial writeup. Consequently, the baseline results are an incomplete list which will be expanded upon the announcement of the test metrics. The set of metrics can be computed via the framework described at https://gem-benchmark.com/shared_task which comprises metrics in the following categories:

Lexical Similarity. We include multiple "traditional" metrics as baseline metrics, notably BLEU (Papineni et al., 2002), ROUGE-1/2/L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). These metrics can often be gamed; for example, ROUGE can be improved by increasing the output length of the model (Sun et al., 2019). Moreover, the reliability of these metrics depends on the quality and number of the references (Mathur et al., 2020a; Freitag et al., 2020). However, on a system level, they still correlate well with human judgments for some tasks (Reiter, 2018).

Semantic Equivalence. More recently, metrics that rely on pretrained language models have shown improved correlations with human judgments on the segment level. We thus include BERTScore (Zhang et al., 2020b), a metric based on the similarity of sentence embeddings, and BLEURT (Sellam et al., 2020), a metric that is fine-tuned on human ratings.

Probing for Faithfulness. While not included in this initial release, we want to note another approach that has shown promise in summarization. The approach relies on the insight that a reader of a reference and generated summary should be able to answer the same question, regardless of how the summary is phrased. There has been much development toward these QA-based approaches (Eyal et al., 2019; Scialom et al., 2019; Durmus et al., 2020; Wang et al., 2020, among others) and they can provide an alternative angle to model evaluation that does not highly correlate with other evaluation approaches (Fabbri et al., 2020). In addition to faithfulness, there have also been related efforts to provide more fine-grained and interpretable metrics, for example to measure consistency in data-to-text problems (Opitz and Frank, 2020; Dhingra et al., 2019) or to combine multiple measures such as entailment and similarity (Kané et al., 2020).

Diversity. As argued by Hashimoto et al. (2019) among many others, NLG models intrinsically trade off diversity and quality. A model can produce more diverse outputs through sampling but at the cost of output quality. To account for this aspect, we compute multiple diversity metrics, starting with those proposed for the analysis of the results of the E2E NLG challenge (Dusek et al., 2020) and by van Miltenburg et al. (2018). These include the Shannon entropy (Shannon and Weaver, 1963) over unigrams and bigrams (H1, H2), the mean segmented type-token ratio over segment lengths of 100 (MSTTR; Johnson, 1944), the ratio of distinct n-grams over the total number of n-grams (Distinct-1,2), and the count of n-grams that only appear once across the entire test output (Unique-1,2; Li et al., 2016).
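These diversity measures are simple to compute directly from system outputs. The following plain-Python sketch (an illustration, not the GEM evaluation framework) computes the unigram entropy (H1), Distinct-1, Unique-1, and the mean output length for a toy list of outputs.

    import math
    from collections import Counter

    def diversity_metrics(outputs):
        """Unigram entropy (H1), Distinct-1, and Unique-1 over a list of system outputs."""
        tokens = [tok for text in outputs for tok in text.lower().split()]
        counts = Counter(tokens)
        total = sum(counts.values())
        h1 = -sum((c / total) * math.log2(c / total) for c in counts.values())
        distinct_1 = len(counts) / total                      # distinct unigrams / all unigrams
        unique_1 = sum(1 for c in counts.values() if c == 1)  # unigrams appearing exactly once
        mean_len = total / len(outputs)                       # mean output length in tokens
        return {"H1": h1, "Distinct-1": distinct_1, "Unique-1": unique_1, "mean_length": mean_len}

    outputs = [
        "the team won the match",
        "the team won the final",
        "a decisive victory for the visitors",
    ]
    print(diversity_metrics(outputs))

The bigram variants (H2, Distinct-2, Unique-2) follow the same pattern after replacing the token list with adjacent token pairs.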
System Characterization. The final section of metrics will characterize the systems. While the focus of this section will be on qualitative descriptions through model cards, we also gather quantitative information that is not necessarily associated with a judgment. As part of this, we collect the number of parameters of a system, as suggested by Ethayarajh and Jurafsky (2020). For each task, we additionally report the vocabulary size over the output (|V|) and the mean output length of a system (Sun et al., 2019).

One of the central aims of GEM is to measure the progress in NLG without misrepresenting the complex interactions between the sometimes contradictory measures. We thus will not distill the complex interplay of the data, metrics, and model outputs into a single number or statement, and we do not present results in a traditional leaderboard. Instead, we developed an interactive result exploration system that allows analyses of model results, and which we describe in this section. To further motivate this change, consider the following conclusion someone may draw from looking at a leaderboard: System Foo performs the best. Our interactive system aims to enable more nuanced statements such as: System Foo leads to consistent performance increases in Bar-type metrics on challenges that measure Baz while maintaining equal performance on most metrics of type Qux. A screenshot of our system is presented in Figure 2. In addition, our baseline results are presented in a tabular view in Tables 2 and 3.

Table 3: Results of the baselines we release with GEM, focusing on diversity of the outputs and neutral system characterizations.

Our interactive system is centered around a parallel coordinates plot (Inselberg, 1985) which shows all results as lines through parallel axes. Every line intersects the axes at the corresponding mapped value. For instance, see the red line representing the results for task "ToTTo" of baseline "t5-small". Filters can be applied along axes (see the BLEURT axis in Figure 2) and the filtered selection is highlighted through bold lines. A selection can be a set of metrics, systems, or tasks. This style of presentation has not been used before for a benchmark. The closest prior work is by Fu et al. (2020) for named-entity recognition, which allows similar filtering and sorting, but presents the results in a table. However, the parallel coordinates approach can scale to a much greater number of metrics than a table. Moreover, by using a parallel coordinates plot instead of a table, it is easy to spot patterns that span multiple metrics, systems, or tasks. For example, the highlighted line in Figure 2 uncovers that, for the T5 baseline on ToTTo, the diversity metrics score higher than for other systems while scoring lower on reference-based metrics. Since we only have a single baseline for ToTTo, it is unclear whether this difference can be attributed to the dataset or the system, but this relationship will be uncovered once we receive submissions.

The final system will additionally be able to display the model cards and other related meta-information associated with submissions. It will also be able to show (and compare) exemplary outputs for each test set. Those two features will improve the transparency of the results and systems to those who are not familiar with a task and provide necessary information to those who consider using a particular system. The combination of all components will enable analysis on a quantitative, individual, and qualitative level which can support formulating new research hypotheses and gathering in-depth insights about system performance. For example, the functionality to compare human annotation and automatic measures could lead to a better understanding of how fluency affects BERTScore.
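A parallel coordinates view of this kind can be reproduced with standard tooling. The sketch below uses pandas and matplotlib with invented metric values; it is a toy stand-in for, not the implementation of, the GEM result explorer.

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    # Invented system-level scores for illustration only.
    results = pd.DataFrame([
        {"system": "t5-small",  "BLEU": 0.32, "ROUGE-L": 0.41, "BLEURT": -0.12, "Distinct-2": 0.55},
        {"system": "bart-base", "BLEU": 0.35, "ROUGE-L": 0.44, "BLEURT": -0.05, "Distinct-2": 0.48},
        {"system": "pegasus",   "BLEU": 0.34, "ROUGE-L": 0.46, "BLEURT": -0.02, "Distinct-2": 0.43},
    ])

    # One line per system, one vertical axis per metric.
    parallel_coordinates(results, class_column="system", colormap="viridis")
    plt.title("Baseline results across metrics (toy data)")
    plt.tight_layout()
    plt.show()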
In addition to the interactive self-directed result exploration, our shared task features an evaluation and analysis part. Instead of dictating the interpretation of the modeling shared task results, we will release all system outputs and metrics in this second part, and participants of this part may run their own evaluation and conduct interesting analyses.

This section lists the currently active developments and the long-term steps we will take to ensure that GEM will continue to evolve and improve. In addition to applying consistent metrics to existing test sets, mining specific model behavior, such as model generalization capabilities or performance under targeted cases, is also key for improvement. This is difficult to assess through evaluations on i.i.d. test splits. We will thus release specific challenge sets to evaluate data-to-text and text-to-text models. In addition to enabling a more specific breakdown of how a model performs in the presence of challenging inputs, the set of system outputs on these test sets also constitutes a rich corpus that enables further error analysis and research. We apply multiple strategies to create the special test sets, in particular (i) the alteration of the existing test sets (e.g., the introduction of distractors), (ii) the breaking down of the existing test sets into subsets with pre-established specificities (e.g., feature- or frequency-controlled subsets), and (iii) the compilation of new test sets (e.g., compiling out-of-vocabulary inputs). Some of the test sets will also not be evaluated, and are developed to produce a rich set of outputs for future (manual) error analyses. The test sets will be released according to the schedule listed on our website https://gem-benchmark.com/.

GEM can be used to develop reproducible and consistent human evaluation strategies for generated text. This task involves selecting and defining which quantities of the generated text should be measured, developing annotation schemes and rater guidelines to capture these quantities accurately, and infrastructure to annotate system outputs. This process is complicated by the fact that GEM includes different task setups such as summarization, dialogue, simplification, and data-to-text. To approach this task, we will follow the recently proposed taxonomy of human evaluation measures by Belz et al. (2020) and follow the reporting strategies proposed by Howcroft et al. (2020). All shared task participants will be asked to provide gold annotations on system outputs, which we will then use to evaluate the consistency of crowdsourced annotations.

Many of the initial datasets in GEM are focused on (American or British) English; we see this release as a starting point for the collection of new datasets to improve the inclusiveness of other languages and cultures. From the task point of view, to ensure the longevity of the dataset, we want it to be practical and socially beneficial. Through GEM, we have developed a set of desired criteria for NLG datasets and we aim to apply this knowledge to data collection and actively work toward reducing the disparity in data availability between languages (Joshi et al., 2020). To this end, we are focusing on a task that requires content selection, planning, and surface realization in a grounded scenario. The idea is in the prototyping stage, with prospects broadly towards dialog response generation and topic summarization in multiple languages.
We plan to do so by collaborating with speakers of low-resourced languages through a participatory research approach, as suggested by ∀ et al. (2020). Toward this goal, GEM welcomes anyone interested in collaborating on this effort.

GEM currently focuses on tasks that deterministically transform an input into an output. With the increasing use of NLG models in real-world applications, how to enable and evaluate personalized NLG systems (e.g., in dialect or formality) remains challenging. Several related tasks have been proposed, for example, the transfer of writing style from informal to formal (Rao and Tetreault, 2018), personalization of machine translation systems to align with particular personal traits (Mirkin and Meunier, 2015), or persona-guided response generation of dialogue systems (Zhang et al., 2018). We envision our framework to be extended (e.g., dataset, evaluation) to incorporate this line of user-centric NLG.

To activate the benefits of a living benchmark that is focused on evaluation, we commit to regular updates for GEM. We invite contributions in the form of model outputs, analyses, and metrics at any time and will automatically update the results presented on our website to incorporate them. For the updates to the dataset selection, we want to consider the input of the wider NLG research community. To do so, we will set up a yearly selection process similar to the one described in Section 3. The first update process will be run after the GEM workshop at ACL 2021. To be able to have a robust comparison between different versions of GEM, we will only replace a small subset of datasets at a time.

In this paper, we have introduced GEM, a living natural language generation benchmark with a focus on evaluation. While GEM does not claim to instantly solve all issues of benchmarks in NLG, we aim to provide an environment in which systems can be tested in a principled manner and which can elevate the prominence of interesting evaluation approaches. By providing a testbed to easily conduct experiments across many datasets and evaluate in a repeatable, consistent, and more interpretable way, we will be able to track progress toward the goals in NLG research much more clearly. Moreover, we will be able to extend and shape GEM in the future to include more multilingual datasets, which will assist in their adoption across the wider research community.

GEM is a large effort with a decentralized organization that is split into different task-specific subgroups. To acknowledge everyone's contribution, we list the contribution statements below for all groups.

Steering Committee. Antoine Bosselut, Esin Durmus, Varun Prashant Gangal, Sebastian Gehrmann, Laura Perez-Beltrachini, Samira Shaikh, and Wei Xu make up the steering committee. Sebastian Gehrmann coordinates and leads the GEM effort. All others provide feedback and discuss larger decisions regarding the direction of GEM and act as conference organizers for the ACL 2021 workshop.

Data2Text. Ondrej Dusek wrote the data cards for E2E NLG and Czech Restaurants data and a TF loader for Czech Restaurants. He also supplied baseline outputs for E2E, Czech Restaurants, and WebNLG. Sebastian Gehrmann supplied baseline outputs for E2E, WebNLG, and CommonGen. Yacine Jernite wrote the data card for CommonGen and the Hugging Face loaders for Czech Restaurants and WebNLG. Teven Le Scao wrote the Hugging Face loader for E2E. Simon Mille and Anastasia Shimorina wrote the data card for WebNLG.

Table2Text. Varun Gangal and Miruna Clinciu are part of this group.
Miruna Clinciu was responsible primarily for DART and Varun Gangal for ToTTo, while maintaining a close correspondence and understanding between them to ensure all steps, such as code structure, preprocessing primitives, and baselines, were as uniform as possible.

Simplification. Dhruv Kumar, Mounica Maddela, and Wei Xu contributed to the GEM Simplification task. Dhruv Kumar created the data cards for the datasets, added Wiki-Auto and Turk/ASSET datasets to TFDS, and integrated the SARI metric into the GEM evaluation framework. Mounica Maddela created baselines for the task and added the Turk benchmark corpus to Hugging Face and TFDS. Wei Xu helped in the organization and planning of the task setup.

Automated Evaluation. Ondrej Dusek wrote the base code and included BLEU, Meteor, ROUGE, and referenceless metrics (the latter based on code supplied by Emiel van Miltenburg). He also prepared reference sets for E2E, Czech Restaurants, and WebNLG. Sebastian Gehrmann included BLEURT and BERTScore and prepared the reference sets. Dhruv Kumar included SARI and adapted the code for source-based metrics. Nishant Subramani helped with code refactoring. Miruna Clinciu, Emiel van Miltenburg, and Thibault Sellam provided feedback and participated in discussions.

Human Evaluation. Samira Shaikh was the point of contact for this working group. She led the discussions to make progress on the group goals. She also worked with the group to select the general evaluation criteria as well as the criteria for dialogue and simplification tasks. Khyathi Chandu and Miruna Clinciu worked on selecting evaluation criteria for the summarization task and participated in the group discussions. Simon Mille provided support on using the criteria taxonomy and the annotated evaluation sheets for selecting and defining the criteria to use, and worked on selecting the D2T criteria. Vitaly Nikolaev and Sashank Santhanam worked on selecting evaluation criteria for dialog and simplification tasks. João Sedoc worked with the group to select the evaluation criteria in general as well as the specific ones for dialog and simplification. He also helped to select among annotation interfaces. Anastasia Shimorina worked with the group to select the evaluation criteria and participated in the discussions. Chris Emezue, Sebastian Gehrmann, Khyati Mahajan, and Yufang Hou participated in discussions.

Website and Submission System. Aman Madaan, Moin Nadeem, Hendrik Strobelt, and Sebastian Gehrmann are part of this group. Sebastian Gehrmann developed the website. Aman Madaan wrote the initial version of the result presentation. Hendrik Strobelt leads the visualization effort for interactive exploration of results.

Model Infrastructure. Yacine Jernite wrote the initial script template for evaluating and fine-tuning Hugging Face models with the CommonGen example. Sebastian Gehrmann generalized the script to work with other datasets. Tosin Adewumi wrote a script for fine-tuning the DialoGPT model for dialogue datasets. Juan Diego Rodriguez worked on the infrastructure to fine-tune mBART on MLSum.

Data Sheets and Statements. Salomey Osei, Pawan Sasanka Ammanamanchi, Juan Diego Rodriguez, Sebastian Gehrmann, Yacine Jernite, and Angelina McMillan-Major are part of this group. The Data Sheet structure was adapted from a combination of designs created for the Hugging Face Datasets library by Angelina McMillan-Major and Yacine Jernite and one written by Sebastian Gehrmann.
Juan Diego Rodriguez and Yacine Jernite wrote initial statements for ASSET and CommonGen, respectively. The feedback on those was used to improve the structure of the final template.

Challenge Sets. Simon Mille, Emiel van Miltenburg, Kaustubh Dhole, Varun Prashant Gangal, Saad Mahamood, and Laura Perez-Beltrachini proposed and discussed ideas of interest for the data-to-text and the text-to-text tasks. Emiel van Miltenburg, Saad Mahamood, and Simon Mille work on the creation of the data-to-text datasets, while Varun Prashant Gangal, Kaustubh Dhole, and Laura Perez-Beltrachini work on the text-to-text datasets.

Crowdsourcing New Data. Chris Emezue, Rubungo Andre Niyongabo, Aremu Anuoluwapo, Khyathi Chandu, Yufang Hou, Samira Shaikh, Varun Prashant Gangal, and Dimitra Gkatzia are members of this group. Khyathi Chandu worked on identifying where the current datasets fall short to motivate the crowdsourcing of data for a new task. Based on the suggestions from collaborators, she wrote two task proposals in the domains of long-form text, conversations, and data-to-text that address an array of challenges in generation and easily scale to multiple languages. Samira Shaikh participated in the discussions and gave feedback on the task proposals in the pilot study phase. Dimitra Gkatzia looked into potential resources for crowdsourcing. Chris Emezue and Rubungo Andre Niyongabo explored potential low-resource African languages for crowdsourcing. We are in the process of piloting the tasks internally.

References
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
MLQA: Evaluating cross-lingual extractive question answering.
A diversity-promoting objective function for neural conversation models.
Visual question generation as dual task of visual question answering.
Shuguang Liu, Fan Yang, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training.
CommonGen: A constrained text generation challenge for generative commonsense reasoning.
ROUGE: A package for automatic evaluation of summaries.
How can we accelerate progress towards human-like linguistic generalization?
Generating Wikipedia by summarizing long sequences.
Multilingual denoising pre-training for neural machine translation.
The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems.
Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance.
Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges.
A human evaluation of AMR-to-English generation systems.
Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics.
Qingsong Ma, and Ondřej Bojar. 2020b. Results of the WMT20 metrics shared task.
On faithfulness and factuality in abstractive summarization.
The natural language decathlon: Multitask learning as question answering.
The first multilingual surface realisation shared task (SR'18): Overview and evaluation results.
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019).
The third multilingual surface realisation shared task (SR'20): Overview and evaluation results.
Measuring the diversity of automatic image descriptions.
AmbigQA: Answering ambiguous open-domain questions.
Personalized machine translation: Predicting translational preferences.
Abstractive text summarization using sequence-to-sequence RNNs and beyond.
Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization.
The E2E dataset: New challenges for end-to-end generation.
Towards a decomposable metric for explainable evaluation of text generation from AMR.
Bleu: a method for automatic evaluation of machine translation.
ToTTo: A controlled table-to-text generation dataset.
Analysing data-to-text generation benchmarks.
XCOPA: A multilingual dataset for causal commonsense reasoning.
Dynasent: A dynamic benchmark for sentiment analysis.
Data-to-text generation with entity modeling.
DART: open-domain structured data record to text generation.
Exploring the limits of transfer learning with a unified text-to-text transformer.
Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer.
Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset.
CoQA: A conversational question answering challenge.
A structured review of the validity of BLEU.
Building natural language generation systems.
Beyond accuracy: Behavioral testing of NLP models with CheckList.
Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models.
MLSUM: The multilingual summarization corpus.
Answers unite! Unsupervised metrics for reinforced summarization models.
BLEURT: Learning robust metrics for text generation.
A mathematical theory of communication.
BIGPATENT: A large-scale dataset for abstractive and coherent summarization.
What should I ask? Using conversationally informative rewards for goal-oriented visual dialog.
Results of the WMT15 metrics shared task.
How to compare summarizers without target length? Pitfalls, solutions and re-examination of the neural summarization literature.
A dataset and evaluation metrics for abstractive compression of sentences and short paragraphs.
Asking and answering questions to evaluate the factual consistency of summaries.
SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
GLUE: A multi-task benchmark and analysis platform for natural language understanding.
Challenges in data-to-document generation.
Optimizing statistical machine translation for text simplification.
Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer.
MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines.
PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization.
Personalizing dialogue agents: I have a dog, do you have pets too?
BERTScore: Evaluating text generation with BERT.
Chinese poetry generation with recurrent neural networks.
DIALOGPT: Large-scale generative pre-training for conversational response generation.

Appendix A. Participants were required to provide information for the following categories when suggesting a dataset for GEM:
High-level Task, e.g., data-to-text, or summarization
entity tracking/generation, referring expression generation, surface realization, content selection
Communicative goal, e.g., provide specific information, or entertainment, or accomplish a task
Wikipedia, or news articles
Language(s) en-US, es-MX
10. Input modality, e.g., text, graph, table, images
11. Data Quality / potential Issues, e.g., noisy, clean, biased, code-mixing (different languages/writing systems)
Evaluation strategies (in original paper / papers that use dataset)

Appendix B. Proposed datasets (recovered entries):
Alex Context NLG (Dušek and Jurčíček)
Bangla Natural Language Image to Text
SQUAD Question Generation
SR'11, SR'18, SR'19
Ubuntu Dialogue Generation
Visual Question Generation

Appendix C. Criteria Selection Survey. As part of our selection process, we queried all GEM members about the utility of tasks and selection criteria. The questions below were included in the survey:
• For each suggested task
• We should exclude tasks that are the focus of a shared task in 2021
• We should exclude tasks that were the focus of a shared task since 2020
• We should exclude tasks that were ever part of a shared task
LDC or ELRA)
• We should exclude datasets that are not freely available for download
• We should exclude tasks that require encoding anything but text (e.g., images or graphs)
• We should include # tasks in GEM
• X% of the tasks should feature non-English language(s)
• Diversity of tasks is more important than focus on an NLG task (by including multiple datasets for the same task)
• We should include noisy and clean datasets
• We should include low- and high-resource datasets
• We should prefer tasks with non-iid test sets or specific challenge sets
• We should prefer tasks with test sets with multiple references
• If we include an NLG task (e.g., simplification or data2text), we need multiple datasets for that task
• We should include a set of tasks with no clear evaluation strategy
• We should focus on tasks with reliable automatic metrics

The authors of this paper not named in the groups participated in initial discussions, participated in the surveys, and provided regular feedback and guidance. Many participants commented on and helped write this paper. We additionally thank all participants of INLG 2019, the Generation Birds-of-a-Feather meeting at ACL 2020, the EvalNLGEval Workshop at INLG 2020, and members of the generation challenge mailing list of SIGGEN for their participation in the discussions that inspired and influenced the creation of GEM.