Robustness Gym: Unifying the NLP Evaluation Landscape
Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, Christopher Ré
January 13, 2021

Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym's utility to practitioners, we conducted a real-world case study with a sentiment-modeling team, revealing performance degradations of 18%+. To verify that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10%+, while state-of-the-art summarization models struggle on examples that require abstraction and distillation, degrading by 9%+. Robustness Gym can be found at https://robustnessgym.com/

Advances in natural language processing (NLP) have led to models that achieve high accuracy when train and test data are independent and identically distributed (i.i.d.). However, analyses suggest that these models are not robust to data corruptions [Belinkov and Bisk, 2018], distribution shifts [Hendrycks et al., 2020, Miller et al., 2020], or harmful data manipulations [Jia and Liang, 2017], and they may rely on spurious patterns for prediction. In practice, these vulnerabilities limit successful generalization to unseen data and hinder deployment of trustworthy systems. A consequence of this is the proliferation of public-use systems that were later revealed to be systematically biased [Hamilton, 2018, Hao, 2019, Kayser-Bril, 2020, Knight, 2019, Stuart-Ulin, 2018], such as recruiting tools biased against women.

1. Named Entity Linking (Section 5.1). We compare commercial APIs from MICROSOFT, GOOGLE and AMAZON to open-source systems BOOTLEG, WAT and REL across 2 benchmark datasets (WIKIPEDIA, AIDA). We find that commercial systems struggle to link rare or less popular entities, are sensitive to entity capitalization and often ignore contextual cues when making predictions. MICROSOFT outperforms other commercial systems, while BOOTLEG displays the most consistent performance across a variety of slices. On AIDA, we find that a simple heuristic NEL method outperforms all commercial systems.

2. Summarization (Section 5.2).
We propose and implement 5 subpopulations that capture summary abstractiveness, content distillation, information dispersion [Grusky et al., 2018], positional bias, and information reordering [Kedzie et al., 2018]. We compare 7 models on the CNN-DailyMail dataset across these subpopulations. All models struggle on summaries that discard content, require higher amounts of abstraction or contain more entities. Surprisingly, models with very different prediction mechanisms make similar errors, suggesting that existing metrics are unable to capture meaningful performance differences.

Robustness Gym continues to be under active development, and we welcome feedback and suggestions from the community.

We describe the problem of evaluating machine learning models, motivate a shift towards continual evaluation, lay out 3 challenges today in making this shift, and situate this in the context of existing tools and work.

Generally, validation of a trained model for a task consists of evaluating the model on a set of examples that are drawn from the training distribution [Bishop, 2006]. Assuming identical train and test distributions (i.i.d. data), validation performance estimates the model's performance at test time. In practice, the train and test distributions can be different [Taori et al., 2020]. This distributional shift is a natural consequence of changing real-world conditions and evolving expectations of the model's capabilities. For instance, a model that detects entities in news articles will become outdated as new entities emerge over time. Standard validation overestimates true performance in this case, since it does not preempt performance degradation due to changing conditions. Researchers and practitioners in these circumstances often rely on intuition and an understanding of their domain to create evaluations and perform model selection. Recent work suggests that models often exploit spurious correlations when making predictions and are not robust when evaluation moves beyond i.i.d. data [Hendrycks et al., 2019]. This lack of robustness makes models susceptible to failure under even the slightest distributional shifts [Miller et al., 2020], or when deployed [Stuart-Ulin, 2018]. Systematic and continual evaluation is necessary to understand the model's limitations, and as we discuss next, standard evaluation practices often fall short.

We view evaluation as a continual process from the practitioner's perspective. In practice, constant reevaluation is necessary in order to assess if a model should continue to be used in light of new information about its limitations. By contrast, traditional evaluation addresses challenges that relate to generating a static artifact of the model's performance (e.g., computing an aggregate measure of performance on a test set [Bishop, 2006] or more fine-grained measures using a suite of tests [Ribeiro et al., 2020]). Prior work on the construction of evolving benchmarks introduced dynamic evaluation, allowing a community of practitioners to collaboratively build challenging benchmarks. We focus here on the individual's perspective, and how to equip them with tools that support the continual evaluation paradigm. This raises a fresh set of challenges that are not traditionally addressed by standard evaluation.

Next, we identify three of these challenges (the paradox of choice, idiomatic lock-in, and workflow fragmentation) and highlight how existing tools and research fall short of addressing them.

Ideal.
Given a practitioner's task, needs, constraints and prior knowledge, give them guidance on what evaluation to run next.

Challenges. Evaluation is a complex, unstructured process for practitioners, since it can be confusing to choose what evaluation to run next. These decisions are frequent in continual evaluation. Here, practitioners accumulate an understanding of their model's limitations and manage changing needs, which should (ideally) inform what they evaluate next. While there are no easy answers to these questions, we initiate a study of how practitioners can systematically make these decisions (Section 3.1), by identifying key decision variables such as their task, evaluation needs, resource constraints and history of evaluations.

Ideal. Equip the practitioner with flexible tools to create and utilize evaluation examples that are best suited to the evaluation they want to run.

Challenges. Once developers decide what they want to evaluate, they can suffer from lock-in to a particular idiom of evaluation after they adopt a tool. Our analysis suggests that most tools and research today serve a subset of 4 evaluation idioms:
1. Subpopulations. Identifying subpopulations of a dataset where the model may perform poorly. Example: short reviews (< 50 words) in the IMDB review dataset.
2. Transformations. Perturbing data to check that the model responds correctly to changes. Example: substituting words with their synonyms in the IMDB review dataset.
3. Attacks. Perturbing data adversarially to exploit weaknesses in a model. Example: adding the word "aaaabbbb" to the end of reviews makes the model misclassify.
4. Evaluation sets. Curating new examples on which to evaluate the model. Example: authoring new movie reviews in the style of a newspaper columnist.

These idioms are not exhaustive, but shed light on how evaluation is typically conducted. In Table 1, we use this categorization to summarize the tools and research available today for the natural language inference (NLI) task. All of these limitations can make it difficult for practitioners, who are forced to glue together a combination of tools. Each tool meets different developer needs, and has its own abstractions and organizing principles, which takes away time from users to inject their own creativity and expertise into the evaluation process. We address these challenges with Robustness Gym (Section 3.2), which uses an open-interface design to support all 4 evaluation idioms, and provides a simple workflow to scaffold users.

Ideal. Enable practitioners to store, version and share evaluation data, communicate findings and collaborate to take advantage of others' work.

Challenges. As practitioners evaluate, they need to keep track of progress and communicate results. Evaluation tools today let users save their progress, but provide no support for semantic versioning [Preston-Werner, 2013] and sharing findings. This is made more difficult when trying to consolidate evaluations and results across multiple tools. General-purpose data storage solutions solve this problem, but require significant user effort to customize. Reporting findings can be difficult since there is no consensus on how to report when performing evaluation across multiple idioms.

Figure 2: Illustration of the Contemplate → Create → Consolidate loop (Section 3).

Our findings are summarized in Table 2. Only a small fraction (6.0%) of models carry model cards with any evaluation information.
Qualitatively, we found low consistency in how users report findings, even for models trained on the same task. This suggests that it remains difficult for users to report evaluation information consistently and easily. In Section 3.3, we describe the support that Robustness Gym provides for versioning evaluations in testbenches, and easily exporting and reporting findings with Robustness Reports.

To address the challenges we highlighted in the previous section, we propose the Contemplate → Create → Consolidate loop for performing continual evaluation. In this framework, practitioners will
1. Contemplate (Section 3.1) what evaluation to run next, using guidance on key decision variables,
2. Create (Section 3.2) slices of data for evaluation using Robustness Gym,
3. Consolidate (Section 3.3) findings using the testbench and reports in Robustness Gym.
Figure 2 illustrates this continual evaluation loop, which we describe in more detail next.

As we highlighted in Section 2.3 (Paradox of Choice), practitioners may find it difficult to choose the appropriate evaluation among the large number of possibilities. We provide guidance to practitioners by focusing on key decision variables: the task, the evaluation goal, the resource constraints, and the history of prior evaluations. We connect these variables to decisions about which evaluation idiom and particular evaluations may be most appropriate. We emphasize that there is no "silver bullet" here, and our goal is to initiate research in how to make these decisions systematically. Figure 2 visualizes and enumerates these decision criteria (top-left), embeds them in our evaluation loop, and highlights the actions available to the user in the form of which evaluation idiom and specific evaluation to choose (top-right). We describe the decision criteria below.

Task. We consider scenarios when the practitioner's task can suggest evaluations that are already known or easily available:
• Input/output structure. The structure of the task may constrain the types of evaluations that may be performed. For example, subpopulations based on lexical overlap may only be applied when the task is a function of two or more inputs (e.g., natural language inference accepts as input a premise and hypothesis). Prior research on similarly structured tasks can provide inspiration on a new or understudied task.

Evaluation goals. We consider 3 broad evaluation goals: testing generalization (spurious correlations, sensitivity to noise, distributional artifacts), detecting bias (gender bias, dependence on sensitive attributes), and ensuring security (vulnerability to malicious users). The practitioner's interest in these goals should influence the evaluations they choose.
• Generalization. Predefined out-of-distribution data splits may be used to evaluate a model's ability to generalize outside of the specific dataset on which it was trained [Gardner et al., 2020].
• Detecting bias. Depending on the task, evaluation sets may be available to test for a model's bias with respect to particular protected attributes (e.g., gender bias in coreference in the case of Winogender and Winobias). If no such datasets exist, they may be synthesized by performing hand-crafted transformations with respect to particular protected attributes, or by creating subpopulations that contain the groups under consideration.
• Security.
A user might be interested in security and understanding their system's vulnerabilities; for example, a spammer may try to use adversarial attacks to bypass a spam email filter [Biggio et al., 2013]. Towards this end, the user should focus their evaluations on adversarial attacks.

Resource constraints. Constraints are central to determining the evaluations feasible for a practitioner.
• Compute. If compute is limited (e.g., no GPUs are available), subpopulations may be most appropriate since they can reuse predictions, while attacks should be avoided since they can be extremely compute-intensive.
• Data. Access to data can be a bottleneck that dictates what evaluations are possible. Some tasks may require the use of proprietary or protected data (e.g., clinical notes in hospitals, or customer data in a company), making procurement and use more difficult. Transformations applied to existing data, such as with generative modeling, can be valuable in narrowing the data gap in this case.
• Human resources. Some evaluation strategies require a large amount of manual effort (e.g., creating custom evaluation sets). Evaluation strategies that require constructing hand-crafted rules (e.g., subpopulations) may also be time consuming. Standard transformations (e.g., paraphrasing) that augment existing datasets may help alleviate these efforts, or automated approaches to creating synthetic datasets (e.g., few-shot generation using GPT-3 [Brown et al., 2020]) may be preferred.
• Expertise. A user's expertise will determine whether they are able to create custom evaluations versus relying on existing ones. Domain expertise may be required to author custom evaluation sets or write custom rules for generating subpopulations. Technical expertise may be needed to write customized code for certain types of robustness tests (e.g., adversarial attacks), and should be sought if required.

Prior evaluations. The history of prior evaluations and the stage of robustness testing will also influence the choice of the next evaluation to perform. We describe 4 evaluation strategies that practitioners can use to guide continual evaluation efforts.
• Easy → Hard. Initial evaluations might focus on simple tests such as robustness to standard transformations (e.g., synonym substitution). Models shown to be robust against these simpler tests might then be tested on harder challenge sets or adversarial attacks.
• Coarse → Fine. Early evaluation should typically focus on coarse evaluations with large slices of data (e.g., performance on long vs. short inputs). Later stages of evaluation should drill down into fine-grained slices of relevance to model deployment (e.g., queries about the Beatles in a question-answering system).
• Explore → Exploit. Early evaluation stages are more exploratory, as users sift through a large number of slices to search for weaknesses. Over time, it becomes more clear where a model is more or less performant, and later evaluation stages can exploit this knowledge to develop a more fine-grained understanding of performance.
• Generic → Targeted. Initial evaluations can draw on prior knowledge and community know-how of common evaluations. As evaluation proceeds, focus shifts to developing new evaluations that are most appropriate to the user's goal of deploying their model.

As evaluation proceeds, users should consider keeping prior evaluations as a form of regression testing [Wahl, 1999].
Much like in software, changes to the model should not degrade performance on slices where the model previously performed well.

As highlighted in Section 2.4 (Idiomatic Lock-In), practitioners can get locked into a single tool that supports only a few evaluation idioms. We introduce Robustness Gym (RG), a toolkit that enables broad evaluation across multiple idioms. Figure 3 provides a visual depiction of the abstractions in RG, while Python examples for RG are in Tables 4, 5 and 6 of the appendix. At a high level, RG breaks robustness testing into a two-stage workflow:

1. Caching information. First, practitioners typically perform a set of common pre-processing operations (e.g., tokenization, lemmatization) and compute useful side information for each example (e.g., entity disambiguation, coreference resolution, semantic parsing) using external knowledge sources and models, which they cache for future analysis. A large part of practitioner effort goes into generating this side information, which can be expensive to compute, and into standardizing it to a format that is convenient for downstream analysis. This layer of complexity can make it difficult for them to share their evaluation with others.
RG Support. CachedOperation is an abstraction in RG to derive useful information or generate side information for each example in a dataset by (i) letting users run common operations easily and caching the outputs of these operations (e.g., running the spaCy pipeline [Honnibal et al., 2020]); (ii) storing this information alongside the associated example so that it can be accessed conveniently; (iii) providing a simple abstraction for users to write their own operations.

2. Building slices. Second, practitioners use the examples and their cached side information to construct slices, i.e., collections of examples for evaluation.
RG Support. SliceBuilder is an abstraction to retrieve available information for an example and create slices of data from them by (i) providing retrieval methods to access inputs and cached information conveniently when writing custom code to build slices; (ii) providing specialized abstractions for specific evaluation idioms: transformations, attacks and subpopulations.

This breakdown naturally separates the process of gathering useful information from the nuts and bolts of using that information to build slices. Table 3 contains examples of CachedOperations and SliceBuilders that will be available in Robustness Gym. Robustness Gym relies on a common data interface provided by the datasets library from HuggingFace, which is backed by Apache Arrow [Foundation, 2019]. This ensures that all operations in Robustness Gym interoperate with HuggingFace models, and can be exported easily.

As highlighted in Section 2.5 (Workflow Fragmentation), users can find themselves consolidating evaluation results across several tools and evaluation idioms. Robustness Gym addresses this fragmentation by providing users a TestBench abstraction. Using this, users can assemble and version a collection of slices, which represents a suite of evaluations. Robustness Gym tracks the provenance of these slices, making it possible to identify (i) the data source that the slice originated from; (ii) the sequence of SliceBuilders by which a slice was constructed. This makes it possible for another user to reproduce or redo analysis in a collaboration, through sharing of a TestBench. Robustness Gym also provides a general-purpose tool for creating Robustness Reports for any model on a TestBench. Users can also use Robustness Reports on their own, allowing them to generate reports for evaluations that are not performed in RG.
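To make the two-stage workflow concrete, the standalone sketch below mimics it using the HuggingFace datasets interface that Robustness Gym builds on; it deliberately avoids the toolkit's own CachedOperation and SliceBuilder classes (whose exact signatures are not reproduced here), so every call is plain datasets functionality and the example data is invented for illustration.

from datasets import Dataset

# Toy NLI examples; in practice this would be a benchmark such as SNLI.
examples = Dataset.from_dict({
    "premise": ["A man is playing a guitar on stage.", "Two dogs run in a park."],
    "hypothesis": ["A musician performs.", "The animals are asleep."],
    "label": [0, 2],
})

# Stage 1 (caching information): compute side information once and store it
# alongside each example, so later slice builders can reuse it.
def cache_side_info(example):
    example["hypothesis_tokens"] = example["hypothesis"].split()  # stand-in for a spaCy pipeline
    return example

examples = examples.map(cache_side_info)

# Stage 2 (building slices): reuse the cached information to carve out a
# subpopulation, here hypotheses shorter than 5 tokens.
short_hypothesis_slice = examples.filter(lambda ex: len(ex["hypothesis_tokens"]) < 5)
print(len(short_hypothesis_slice))

Because the cached column lives in the dataset itself, any number of slice builders can reuse it without recomputing the preprocessing.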
To incentivize standardization in reporting, RG includes Standard Reports for several tasks. The Standard Report is comprehensive, static and is backed by a TestBench that contains slices from all evaluation idioms. It can be generated in either a PDF or LaTeX format to be added to the appendix of a paper. Reports reduce user burden in communicating findings, and make it easier to standardize reporting in the community. In the future, Robustness Gym will also include an interactive tool for generating reports that allows users to pick and choose slices of interest based on their robustness goals and constraints.

In this section, we discuss how users with varying expertise can use Robustness Gym to perform continual evaluation and robustness testing. We describe user personas at 3 skill levels (beginner, intermediate, and advanced) and explain a possible path through the Contemplate → Create → Consolidate process for each of them. In every case, we assume that the user's goal is to analyze the performance of an NLI model. Figure 2 illustrates how these user personas can be situated into this workflow.

Contemplate. The user's goal is to perform exploratory robustness testing for the NLI task. Because the user is new to NLP and robustness testing, they lack the knowledge to choose specific slices or write custom slices. Therefore they decide to run the Standard Report for NLI.
Create. The user is able to create the report with a few clicks in the RG interface. They select "Standard Report", "Ternary Natural Language Inference" (task), "SNLI" (dataset), "BERT-Base" (model), and click "Generate Report".
Consolidate. The Standard Report is shown in Figure 4. The user gleans several initial insights from this report. For example, they see that the model is vulnerable to common typing mistakes due to low accuracy on the KEYBOARDAUG slice; the predicted class distribution column further reveals that this noise causes the model to predict contradiction significantly more frequently than entailment or neutral. The user is able to easily share the generated PDF of this report with their colleagues, with whom they can iterate on additional robustness tests for misspellings.

Contemplate. This user is interested in exploring gender bias in NLI models. Specifically they would like to test cases where specific gendered pronouns are present in the premise or hypothesis. They are willing to write minimal code to instantiate existing SliceBuilder classes with custom parameters but do not want to write the code from scratch. Therefore they decide to create slices using built-in subpopulation SliceBuilders.
Create. The user applies the existing HASPHRASE class in order to create subpopulations with female pronouns in the hypothesis (see the sketch at the end of this section).
Consolidate. The user generates a report for immediate analysis and makes the TestBench available on GitHub in order to collaborate with the broader community.

Contemplate. This user is interested in performing robustness tests for spurious correlations in NLI related to surface-level similarities between premise and hypothesis. They are particularly interested in evaluating whether models rely on the premise and hypothesis being of similar length in order to detect entailment. As they are performing a novel analysis, they plan on writing custom logic to create the appropriate slices. They consider two types of slices: subpopulations and transformations, as described below.
Create. The user utilizes the existing SCORESUBPOPULATION class, which constructs subpopulations using arbitrary scoring functions. They create a custom scoring function len_diff, which returns the absolute difference in length between the hypothesis and premise, and then create a SliceBuilder for the subpopulation of examples that score in the top 10% (see the sketch at the end of this section). The user also utilizes existing SliceBuilders such as the LEXICALOVERLAP class, which creates subpopulations based on the lexical overlap between premise and hypothesis. Additionally, they transform the dataset using classes such as EASYDATAAUGMENTATION [Wei and Zou, 2019]. They can then compose this transformation with the custom SliceBuilder described earlier to create a larger evaluation set.
Consolidate. The user generates a report for immediate analysis, and also generates an appendix for a paper to share results with the research community. They make their code and testbench available on GitHub so that others may reuse and refine their approach.
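The paper's code listings for the intermediate and advanced personas are not reproduced above, so the sketch below approximates what those snippets compute, again in plain Python over the datasets interface rather than with the toolkit's HasPhrase and ScoreSubpopulation classes (whose import paths and signatures may differ); the pronoun list and percentile cutoff are illustrative choices.

import numpy as np
from datasets import load_dataset

snli = load_dataset("snli", split="validation")

# Intermediate user: subpopulation whose hypothesis contains a female pronoun.
female_pronouns = {"she", "her", "hers", "herself"}
female_hypothesis = snli.filter(
    lambda ex: any(tok in female_pronouns for tok in ex["hypothesis"].lower().split())
)

# Advanced user: score each example by the absolute premise/hypothesis length
# difference and keep the top 10% of scores (largest length mismatch).
def len_diff(ex):
    return abs(len(ex["premise"].split()) - len(ex["hypothesis"].split()))

scores = np.array([len_diff(ex) for ex in snli])
threshold = np.percentile(scores, 90)
large_len_diff = snli.filter(lambda ex: len_diff(ex) >= threshold)

print(len(female_hypothesis), len(large_len_diff))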
We validate the Contemplate → Create → Consolidate workflow with Robustness Gym through a real-world case study. We conducted a 3-hour long virtual case study with a member of the team that built the Einstein sentiment system, which is part of Salesforce's cloud offerings (https://einstein.ai/products/community-sentiment).

Pre-study questionnaire. Our pre-study questionnaire elicited information on the team's task (e.g., sentiment modeling, question answering, etc.), what metrics they use for evaluation (e.g., accuracy, F1, etc.), how they evaluate robustness (e.g., standard validation/testing, out-of-distribution data, bias testing, attacks) and what their evaluation goals are (e.g., security, generalization). Their responses suggest that their evaluations were mainly on a proprietary validation set that included some out-of-distribution data, and their main interest was in understanding the potential bias for identity groups. We also asked them to rate on a Likert scale (1-5), whether they would "like to evaluate the robustness of [their] model more thoroughly than [they] do today" (agreement 4/5) and "would benefit from having a library that gives [them] the tools to evaluate the robustness of [their] models" (agreement 5/5). The format and other details about the questionnaire are in Appendix A.1.

Study. The study took place during the COVID-19 pandemic and was conducted virtually with the user from the sentiment team. Due to difficulties in guiding them virtually, one of the authors shared their screen and conducted all the experimentation throughout the 3-hour period. We followed the Contemplate → Create → Consolidate loop, and we highlight the key steps that we took through the study period.
• Contemplate (1). We first identified resource constraints: the user provided us with their evaluation data, and gave us black-box access to their model. We used a CPU-only MacBook Pro for all computation. Since the team had previously analyzed the sensitivity of their model to mentions of race/gender/religion/ethnicity, we decided to first verify performance on subpopulations of their dataset with identity-sensitive words.
• Create (1). We constructed slices for evaluating performance using a SliceBuilder that searched for identity words.
We found no degradations compared to the average performance of the model on nine identity-sensitive words.
• Contemplate (2). Next, after discussion with the user, we considered whether the model could have performance disparities along different topics.
• Create (2). We next evaluated the model on subpopulations that contained topic-specific words. We found that the model did poorly on some topics, with performance degradations of up to 18%.
• Contemplate (3). Next, we set out to understand whether the model was robust to input perturbations. The user highlighted instances of input noise that they wanted to gather more information on.
• Create (3). We used 4 different transformations for simulating typing errors and paraphrasing text, and found that performance degraded by 6%.
• Contemplate (4). Lastly, they wanted to investigate whether the model was robust to larger distributional shifts in the inputs.
• Create (4). We downloaded and used an open-source sentiment dataset, and found that performance degraded by 5%.
• Consolidate (1). We collated all the slices into a testbench, and generated a report to share with other members of their team.

We performed 4 iterations of (Contemplate → Create), resetting the evaluation objective after each iteration, and using Robustness Gym to investigate them. Overall, we evaluated the system on 172 different subpopulations, 1 open-source evaluation set from the internet, and 4 different transformations, all in the 3-hour period. We observed a total of 12 subpopulations where performance degraded significantly. This performance deterioration occurred under all 4 types of transformations as well. Lastly, since we did not have access to the model for training, we made prescriptions for augmentation-based training to improve performance on examples where the model underperformed.

Post-study questionnaire. We conducted a post-study questionnaire with the user, where we asked them to provide feedback on Robustness Gym and the overall study. We elicited feedback on "how likely [they] were to incorporate Robustness Gym in [their] workflow" (very likely 5/5), and the perceived "ease of use of Robustness Gym" (high 5/5). In feedback related to the utility of the 4 evaluation idioms in Robustness Gym, they found subpopulations to be "very insightful", and were enthusiastic about the ability to perform various evaluations in a single tool. Lastly, the robustness report gives information on how the team could make improvements and work towards adopting continual evaluation for their system.

Robustness Gym makes it easy for researchers and practitioners to perform novel analyses of existing tasks and models. To demonstrate this, we use Robustness Gym to investigate fine-grained performance on 2 tasks: named entity linking (NEL) and text summarization. For NEL, we present the first fine-grained analysis of NEL across 3 widely used commercial APIs, and 3 state-of-the-art academic systems. For summarization, we analyze 7 state-of-the-art models for text summarization trained on the CNN/DailyMail dataset.

We analyze the fine-grained performance of commercial and state-of-the-art systems for named entity linking (NEL). NEL is a fundamental component of both search and question-answering systems such as conversational assistants, and has a widespread impact on the performance of commercial technology.
Given some text, the NEL task involves the identification of all entity mentions, and contextualized linking of these mentions to their corresponding Wikipedia entries, e.g., "She drove a Lincoln to Lincoln" would link the first mention of Lincoln to Lincoln_Motor_Company and the second mention of Lincoln to Lincoln,_Nebraska. Each identified mention (e.g., "Lincoln") is typically mapped to a candidate list of Wikipedia entries (e.g., all "Lincoln"-related Wikipedia entries) before disambiguation. Our goal is to use Robustness Gym to understand where existing NEL systems fall short.

Metrics. For WIKIPEDIA, we compare performance on recall; WIKIPEDIA is sparsely labeled, so we do not report precision or F1 scores, which can be misleading. For AIDA, we compare performance on Macro-F1.

• kg-relation contains examples where the entity being linked is related to another entity in the sentence. This serves as useful contextual information that can aid disambiguation.
• one-of-the-two contains examples where the gold entity is one of the two most popular candidates in the list of candidates, and both have similar popularity. These examples require careful use of context to disambiguate.
• share-1-type contains examples where the sentence contains 3 consecutive entities that share the same type affordance. These type affordances can be used as contextual cues for disambiguation.
• strong-affordance contains examples where the sentence has words that are highly associated (as measured by tf-idf) with the gold entity's type(s). Again, these words can be used as contextual cues.
• unpopular contains examples where the gold entity is the second or less popular entity in the list of candidates, and the most popular entity is at least 5× more popular than the second. These examples require the model to overlook popularity in favor of preferring a more uncommon entity.
Lastly, we also consider performance on popular entities, which corresponds to examples where the entity mention is one of the top 800 most popular entity mentions.

Bootleg is best overall. Overall, we find that BOOTLEG is the best-performing system, while MICROSOFT is the best-performing commercial system. BOOTLEG outperforms other systems by a wide margin, with a 12-point gap to the next best system (MICROSOFT), while MICROSOFT in turn outperforms other commercial systems by more than 16 points.

Performance degrades on rare entities. For all systems, we find that performance on head slices is substantially better than performance on tail/toe slices. BOOTLEG is the most robust across the set of slices that we consider. In particular, we note that GOOGLE and AMAZON struggle on tail and torso entities, while MICROSOFT's performance degrades more gracefully. GOOGLE's model is particularly adept at popular entities, where it outperforms MICROSOFT by more than 11 points.

For AIDA, we compare performance on Macro-F1, since AIDA provides a dense labeling of entities (and therefore computing precision is meaningful). Similar to WIKIPEDIA, we find that BOOTLEG is the best-performing system overall on AIDA, while MICROSOFT is the best-performing commercial system.

Figure 6: Robustness Report for NEL on AIDA. Performance reported using the Macro-F1 metric.

Performance on topical entities. Interestingly, all models struggle on some topical slices; on the NFL slice, for example, all models degrade significantly, with BOOTLEG outperforming other models by 20+%.
Both GOOGLE and MICROSOFT display strong performance on some topics (e.g., GOOGLE on alpine sports and MICROSOFT on skating).

Popularity heuristic outperforms commercial systems. Somewhat surprisingly, we find that POP outperforms all commercial systems by 1.7 points. In fact, we note that the pattern of errors for POP is very similar to those of the commercial systems (e.g., performing poorly on NBA, NFL and NHL slices). This suggests that commercial systems sidestep the difficult problem of disambiguating ambiguous entities in favor of returning the more popular answer. Similar to WIKIPEDIA, GOOGLE performs best among commercial systems on examples that contain the most popular entities (top 10% entity popularity). Overall, our results suggest that state-of-the-art academic systems substantially outperform commercial APIs for NEL.

Next, we analyze the performance of state-of-the-art summarization models using Robustness Gym. We selected summarization as an instance of a text-generation task to demonstrate the versatility of Robustness Gym for prediction tasks beyond classification or sequence labeling. For example, we show how slices can be computed based not only on the input text but also the ground-truth label and other cached information. We present a unified view of robustness testing of summarization systems that is inspired by a diverse set of approaches to this problem [Grusky et al., 2018].

Slices. Below, we define several heuristics for identifying subpopulations of summarization datasets for robustness testing. See Appendix A.3 for additional details.
• abstractiveness is the degree to which the reference summary requires consolidating and reframing content from the source document [Grusky et al., 2018]. Summaries range from extractive, where a subset of sentences is directly selected from the source document, to abstractive.
• distillation is the degree to which the reference summary discards content from the source document. Highly distilled summaries require models to carefully select what to present in the summary.
• position is the average location, in the source document, of where information in the summary comes from. High positions require models to use information that appears later in the source document.
• dispersion is the degree to which the reference summary uses content that is distributed broadly across the article versus concentrated in a particular region. High dispersion requires the method to attend to multiple parts of the source document to consolidate information.
• ordering is the similarity with which content in the source and reference summary are ordered. Summaries that change or reverse the ordering of content require models to reason over how to best present contextual information.
We also consider slices based on the length of the source document and the number of contained entities, which serve as proxy measures for the complexity of content to be summarized. We include a Robustness Report in Figure 7, and describe results below.

Models struggle to abstract and distill. All models perform worst on the "most distilled" subpopulation, i.e. on examples where the model must discard a large amount of information in the article to construct a summary. Models also struggle on the examples that required the most abstraction. In contrast, both extractive and abstractive models excel on extractive examples ("least abstractive").

Abstractive models have less positional bias.
Extractive models have large gaps in performance between examples where the summary can be constructed using the early ("earliest positions") vs. late ("latest positions") parts of the article; e.g., all extractive models have a 9+ point gap between these subpopulations. Abstractive models have smaller gaps, e.g., PEGASUS has a gap of only 5.9 points.

Errors are highly correlated. All summarization models, whether extractive or abstractive, degrade and improve on the same populations of data. This is surprising, since these models use quite different prediction mechanisms; e.g., abstractive models like T5 appear to offer no relative advantage on the "most abstractive" examples compared to the Lead-3 baseline (both models are 9 points worse than their overall performance). We note that the development of reliable evaluation metrics in summarization continues to be an active area of research, and it is likely that current evaluation metrics are unable to capture some meaningful differences that may exist.

Figure 7: Robustness Report for summarization on CNN-DailyMail. Performance reported using the ROUGE-1 metric.

Our work is related to many machine learning research areas including AutoML, Ethical ML, Interpretable ML, as well as Error Analysis.

AutoML. Automated machine learning is an area of research that focuses on creating tools that help remove the manual efforts in building machine learning systems [Snoek et al., 2012]. Traditionally, these have focused on data wrangling, feature and model selection, and hyperparameter optimization. More recently with hardware acceleration, AutoML has expanded to include neural architecture search (NAS) [Pham et al., 2018]. Although AutoML aims to provide tools for efficient and robust models, it only focuses on training and not evaluations [Feurer et al., 2015]. Robustness Gym on the other hand focuses on removing the manual effort in evaluations of machine learning models across a suite of robustness tests. Robustness Gym is complementary to these tools in that it enables users to understand likely performance degradations and preempt those before they become errors.

We introduced Robustness Gym, an evaluation toolkit that supports a broad set of evaluation idioms, and can be used for collaboratively building and sharing evaluations and results. To address challenges faced by practitioners today, we embedded Robustness Gym into the Contemplate → Create → Consolidate continual evaluation loop. Our results suggest that Robustness Gym is a promising tool for researchers and practitioners.

We also define a match function, which returns the index i of the sentence in the article with the greatest similarity to the summary sentence s_j, where M_{i,j} denotes the similarity (e.g., ROUGE-1) between article sentence a_i and summary sentence s_j:

match(j) = argmax_i M_{i,j}

Based on these formalisms, we define 3 metrics:

Position. The mean position of the matched sentences in the article:

position(A, S) = (1/N) Σ_{j=1}^{N} match(j), where N is the number of sentences in the reference summary.

This metric is inspired by previous work showing that summarization models may be biased toward sentences at the beginning of an article [35, 38, 43].

Dispersion. The degree to which summary sentences match content that is distributed broadly across the article versus concentrated in a particular region. We define dispersion as the variance of the position of the matched sentences:

dispersion(A, S) = (1/N) Σ_{j=1}^{N} (match(j) − µ)^2

where µ is the mean match position, which equals position(A, S), defined earlier. This metric is related to Extractive Fragment Density [24], which measures the degree to which extracted text in the summary comes from a contiguous sequence versus being broadly sourced from the article.

Order. The similarity in ordering between the summary sentences and the matched article sentences. Specifically, we compute the Spearman rank correlation between the positions of sentences in the reference summary and the positions of their matched counterparts in the article:

order(A, S) = spearman((match(j))_{j=1}^{N}, (j)_{j=1}^{N})

This metric is inspired by prior work in summarization evaluation that studied the effects of shuffling sentences in the source article, revealing a significant degradation in performance in news articles compared to other domains [38].
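The sketch below is an illustrative reimplementation of these three quantities from a precomputed sentence-similarity matrix M (rows indexed by article sentences, columns by summary sentences); it follows the formulas above rather than the toolkit's built-in code, and the example matrix is invented for demonstration.

import numpy as np
from scipy.stats import spearmanr

def matched_positions(M: np.ndarray) -> np.ndarray:
    # match(j): index of the article sentence most similar to summary sentence j.
    return M.argmax(axis=0)

def position(M: np.ndarray) -> float:
    # Mean position of the matched article sentences.
    return float(matched_positions(M).mean())

def dispersion(M: np.ndarray) -> float:
    # Variance of the matched positions around their mean.
    return float(matched_positions(M).var())

def order(M: np.ndarray) -> float:
    # Spearman rank correlation between summary order and matched article order.
    matches = matched_positions(M)
    return float(spearmanr(matches, np.arange(len(matches))).correlation)

# Example: a 4-sentence article and a 2-sentence summary.
M = np.array([[0.1, 0.7],
              [0.8, 0.2],
              [0.3, 0.1],
              [0.2, 0.4]])
print(position(M), dispersion(M), order(M))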
Code. We provide example code snippets for Robustness Gym in Tables 4 (CachedOperation), 5 (SliceBuilder), and 6 (TestBench, Report), below.

LaTeX Report. Figure 8 is an example of a report generated in LaTeX format. The code for the figure was auto-generated and the figure was simply included in the appendix.

Figure 8: Robustness report for the textattack/bert-base-uncased-snli model on the SNLI dataset. The report lays out scores for each evaluation, broken out by category. Citations: [7, 10, 46, 48, 54, 82]. Note: the LaTeX figure and caption above are auto-generated using report.latex().
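As a usage illustration of the consolidate step, the standalone sketch below groups named slices into a testbench-like dictionary, evaluates a model on each, and prints a small report; it mimics the TestBench and Report workflow in plain Python rather than reproducing the toolkit's classes or its report.latex() output, and the slice names and examples are hypothetical.

from typing import Callable, Dict, List

def evaluate_testbench(
    testbench: Dict[str, List[dict]],   # slice name -> list of labeled examples
    predict: Callable[[dict], int],     # model under test
) -> Dict[str, float]:
    report = {}
    for slice_name, slice_examples in testbench.items():
        correct = sum(predict(ex) == ex["label"] for ex in slice_examples)
        report[slice_name] = correct / max(len(slice_examples), 1)
    return report

# Example usage with two tiny slices and a trivial constant "model".
testbench = {
    "female-pronoun@hypothesis": [{"hypothesis": "She is sleeping.", "label": 1}],
    "large-length-difference":   [{"hypothesis": "A person.", "label": 0}],
}
for slice_name, accuracy in evaluate_testbench(testbench, predict=lambda ex: 0).items():
    print(f"{slice_name}: {accuracy:.2f}")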
References
• Fairsight: Visual analytics for fairness in decision making
• Crosscheck: Rapid, reproducible, and interpretable model evaluation
• One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques
• Synthetic and natural noise both break neural machine translation
• Evasion attacks against machine learning at test time
• Pattern recognition and machine learning (information science and statistics)
• A large annotated corpus for learning natural language inference
• FairVis: Visual analytics for discovering intersectional bias in machine learning
• Slice-based learning: A programming model for residual learning in critical data slices
• Using the framework
• The PASCAL recognising textual entailment challenge
• BanditSum: Extractive summarization as a contextual bandit
• The reusable holdout: Preserving validity in adaptive data analysis
• SummEval: Re-evaluating summarization evaluation
• TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)
• Efficient and robust automated machine learning
• Empirical evaluation of pretraining strategies for supervised entity linking
• Arrow: A cross-language development platform for in-memory data
• Evaluating NLP models via contrast sets
• BAE: BERT-based adversarial examples for text classification
• Stress-testing neural models of natural language inference with multiply-quantified sentences
• Breaking NLI systems with sentences that require simple lexical inferences
• Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies
• Annotation artifacts in natural language inference data
• Amazon built an AI tool to hire people but had to shut it down because it was discriminating against women
• Twitter is investigating after anecdotal data suggested its picture-cropping tool favors white faces
• Facebook's ad-serving algorithm discriminates by gender and race
• Using pre-training can improve model robustness and uncertainty
• The many faces of robustness: A critical analysis of out-of-distribution generalization
• Teaching machines to read and comprehend
• spaCy: Industrial-strength Natural Language Processing in Python
• Are natural language inference models IMPPRESsive? Learning implicature and presupposition
• Adversarial examples for evaluating reading comprehension systems
• Earlier isn't always better: Sub-aspect analysis on corpus and system biases in summarization
• Learning the difference that makes a difference with counterfactually-augmented data
• Google apologizes after its Vision AI produced racist results
• Content selection in deep learning models of summarization
• Rethinking AI benchmarking
• The Apple Card didn't 'see' gender, and that's the problem
• Neural text summarization: A critical evaluation
• Natural language inference from multiple premises
• BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
• SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment
• Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference
• The effect of natural distribution shift on question answering models
• Model cards for model reporting
• TextAttack: A framework for adversarial attacks in natural language processing
• Explaining machine learning classifiers through diverse counterfactual explanations
• Stress test evaluation for natural language inference
• Analyzing compositionality-sensitivity of NLI models
• Adversarial NLI: A new benchmark for natural language understanding
• InterpretML: A unified framework for machine learning interpretability
• Bootleg: Chasing the tail with self-supervised named entity disambiguation
• Efficient neural architecture search via parameter sharing
• From TAGME to WAT: A new entity annotator
• Collecting diverse natural language inference problems for sentence representation evaluation
• Exploring the limits of transfer learning with a unified text-to-text transformer
• Snorkel: Rapid training data creation with weak supervision
• EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference
• Beyond accuracy: Behavioral testing of NLP models with CheckList
• How well do NLI models capture verb veridicality?
• Gender bias in coreference resolution
• ConjNLI: Natural language inference over conjunctive sentences
• Behavior analysis of NLI models: Uncovering the influence of three factors on robustness
• SherLIiC: A typed event-focused lexical inference benchmark for evaluating natural language inference
• Data augmentation for discrimination prevention and bias disambiguation
• Practical Bayesian optimization of machine learning algorithms
• Microsoft's politically correct chatbot is even worse than its racist one
• Measuring robustness to natural distribution shifts in image classification
• The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models
• REL: An entity linker standing on the shoulders of giants
• An overview of regression testing
• Universal adversarial triggers for NLP
• AllenNLP Interpret: A framework for explaining predictions of NLP models
• SuperGLUE: A stickier benchmark for general-purpose language understanding systems
• EDA: Easy data augmentation techniques for boosting performance on text classification tasks
• The What-If Tool: Interactive probing of machine learning models
• A broad-coverage challenge corpus for sentence understanding through inference
• Transformers: State-of-the-art natural language processing
• HuggingFace's Transformers: State-of-the-art natural language processing
• Errudite: Scalable, reproducible, and testable error analysis
• Neural extractive text summarization with syntactic compression
• Do neural models learn systematicity of monotonicity inference in natural language?
• OpenAttack: An open-source textual adversarial attack toolkit
• Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models
• PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization
• Gender bias in coreference resolution: Evaluation and debiasing methods

The pre-study form asked what task the team works on (e.g., language modeling, summarization, and others), what metrics the team uses for evaluating their models (accuracy, P/R/F1, exact match, BLEU, ROUGE, or other generation metrics, and others), and how they evaluate robustness (standard val/test datasets, out-of-distribution examples or datasets for generalization testing, axiomatic bias tests, adversarial attacks, and model cards). The form also asked the user to rate on a Likert scale of 1-5, 1 being strongly disagree and 5 being strongly agree, the following two statements: "I would like to evaluate the robustness of my model more thoroughly than I do today" and "I would benefit from having a library that gives me the tools to evaluate the robustness of my models."

At the end of the study, the team rated "very likely" for both ease of use and eagerness for using Robustness Gym as part of their workflow. One question was about rating the usefulness of the 4 evaluation idioms in the study. The team rated subpopulations and adversarial attacks as 5/5, transformations as 4/5, and eval sets as 3/5 on a scale of 1-5, 1 being "not useful" and 5 being "very useful". For the key takeaways of the study, the team found subpopulation slices as being "very insightful".

A.2 Named Entity Linking
For the AIDA test-b dataset, we follow [18] to split each passage in the dataset into examples.

Abstractiveness. The degree to which the reference summary is abstractive versus extractive [24], based on the proportion of n-grams in the reference summary that are not in the article. Formally, we define the abstractiveness of a summary S of an article A as

abstractiveness(A, S) = 1 − rouge_precision(A, S)

Note that rouge_precision(A, S) equals the proportion of n-grams in the reference summary that are also in the article.
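As a worked illustration of this definition, the sketch below computes a unigram version of the n-gram precision and the corresponding abstractiveness score; the paper's metric is stated in terms of ROUGE precision, so an actual implementation would typically rely on a ROUGE package rather than this simplified whitespace tokenizer.

from collections import Counter

def ngram_precision(article: str, summary: str, n: int = 1) -> float:
    # Fraction of summary n-grams that also appear in the article (clipped counts).
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    article_ngrams, summary_ngrams = ngrams(article), ngrams(summary)
    overlap = sum((summary_ngrams & article_ngrams).values())
    total = sum(summary_ngrams.values())
    return overlap / total if total else 0.0

def abstractiveness(article: str, summary: str, n: int = 1) -> float:
    # Complement of n-gram precision: higher means more novel summary content.
    return 1.0 - ngram_precision(article, summary, n)

print(abstractiveness("the cat sat on the mat", "a cat rested on a mat"))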
The abstractiveness metric is essentially the complement of the Extractive Fragment Coverage metric.

Distillation. The degree to which the reference summary is distilled from a larger quantity of content in the article.

We also consider 3 fine-grained metrics that rely on the similarities between sentences in the article and sentences in the reference summary. For these metrics, we define a sentence-similarity matrix M, where M_{i,j} is a similarity score (e.g., ROUGE-1) between sentence a_i in the article and sentence s_j in the summary. We provide the sentence-similarity matrix M as a built-in abstraction in Robustness Gym, from which a variety of metrics may be computed. Sharing this abstraction not only promotes code reuse, but also lowers the computational cost when performing multiple evaluations.

This work was part of a collaboration between Stanford, UNC, and Salesforce Research and was supported by Salesforce AI Research grants to MB and CR. KG and NR conceived the idea of Robustness Gym. KG, NR, and JV made significant overall contributions to the toolkit. ST and JW ran initial experiments on some NLP tasks. SZ and CX provided useful feedback. MB and CR provided detailed guidance on the NLP/robustness and MLSys areas, respectively. We are thankful to Han Guo, Laurel Orr, Jared Dunnmon, Chris Potts, Marco Tulio Ribeiro, and Shreya Rajpal for helpful discussions and feedback. We gratefully acknowledge the support of VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.