key: cord-0162274-81cvyli1 authors: Kennedy, Chris J.; Bacon, Geoff; Sahn, Alexander; Vacano, Claudia von title: Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application date: 2020-09-22 journal: nan DOI: nan sha: a888bcf2cc7ad2c0d597344b3ee7e0ceb9847469 doc_id: 162274 cord_uid: 81cvyli1 We propose a general method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT). We decompose the target construct, hate speech in our case, into multiple constituent components that are labeled as ordinal survey items. Those survey responses are transformed via IRT into a debiased, continuous outcome measure. Our method estimates the survey interpretation bias of the human labelers and eliminates that influence on the generated continuous measure. We further estimate the response quality of each labeler using faceted IRT, allowing responses from low-quality labelers to be removed. Our faceted Rasch scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction on new data. The ratings on the theorized components of the target outcome are used as supervised, ordinal variables for the neural networks' internal concept learning. We test the use of an activation function (ordinal softmax) and loss function (ordinal cross-entropy) designed to exploit the structure of ordinal outcome variables. Our multitask architecture leads to a new form of model interpretation because each continuous prediction can be directly explained by the constituent components in the penultimate layer. We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers to measure a continuous spectrum from hate speech to counterspeech. We evaluate Universal Sentence Encoders, BERT, and RoBERTa as language representation models for the comment text, and compare our predictive accuracy to Google Jigsaw's Perspective API models, showing significant improvement over this standard benchmark. Across fields of knowledge (science, engineering, medicine, etc.) phenomena of interest are often labeled by humans as discrete, dichotomous variables: speech is toxic or not, an MRI scan shows cancer or is clear, a stoplight is red or green. Simple variables may be inherently binary or ordinal in nature. For example, an online advertisement might be clicked by a person viewing a webpage, or might not -a partial click between 0 and 1 does not make sense. But if a discrete variable represents a concept with complexity or granularity, those {0, 1} values may reflect the simplification of an underlying continuous spectrum or latent variable. Consider an online message that is rated for its sentiment using the discrete, ordered labels of "positive", "neutral", or "negative" (B. Liu 2015) . Those labels can reasonably be viewed as summarizing what is ultimately a continuous, infinitely divisible spectrum ranging from extremely negative to extremely positive sentiment. Physical quantities such as temperature and weight can be measured as interval variables where magnitudes are meaningful: we can subtract the measurements of two units and have an estimate of distance on the original scale. 
We can make factual statements such as "today's temperature is 5 degrees warmer than yesterday" or "I lost 10 pounds after dieting". The development of physical measurement systems required major, long-term investments in science and engineering that led to a theory of those physical systems (Chang 2004). Thanks to those fixed scales we can receive a weather report with specific temperatures provided, or stand on a scale and view our current weight. But in the fields of machine learning, natural language processing, and many other areas, current practice would provide a binary prediction. As a mental exercise, what if today's expected weather forecast were reported as two values: hot or cold? Classification models could only estimate Pr(Weather = Hot | Data) and we might hear: "Today's weather forecast: expected hot, with 55% probability, but a 45% chance of being cold." It sounds absurd, but that is how we treat many variables in science today, including our construct of interest in this paper: hate speech. "Classification" of discrete outcome variables has seemingly become synonymous with "supervised learning", so rarely are continuous variables analyzed. Our question becomes: how can we aspire to emulate physical scales like temperature, which we take for granted in our daily lives, and construct interval measurements for arbitrary variables? A method to estimate a continuous spectrum for human-generated variables could be valuable for several reasons. If the variable is recorded in order to determine the application of interventions, a continuous measurement would allow many different thresholds to be defined with varied policy prescriptions. A binary outcome would support only two policy alternatives, and research based on the binary version would lose most of its underlying value if the implicit threshold of interest changed (Streiner 2002). When the construct of interest is an outcome variable, such as in a randomized trial or an observational study, having a continuous measurement will increase statistical power compared to a dichotomous or ordinal variable (Cohen 1983; Senn 2003). If the variable is a confounder in a causal inference analysis, a continuous measurement would more strongly reduce residual confounding compared to a discretized version of that variable (Royston et al. 2006). Similarly, if the variable is a pre-treatment adjustment variable in a randomized trial, the variable will more effectively improve precision if continuous rather than binary (Dawson et al. 2012). In this work we propose a methodology to construct continuous, interval variables by combining two complementary techniques: many-facet Rasch measurement, a form of item response theory, and supervised deep learning. Rasch measurement theory involves the application of a probabilistic multilevel statistical model to create continuous, interval scales out of multiple survey questions, measured as binary or ordinal variables. When those survey items (or "components") are completed by human test-takers we need a way to automate the human ratings for unseen future data; otherwise we would need human test-takers to review any future observation in order to create its measurement on the scale. Supervised deep learning provides an automated approximation of the human ratings by learning to predict those ratings on arbitrary new observations, which also serves as a new form of model explanation. 
The deep learning model can alternatively predict the continuous scale directly and skip the intermediate step of predicting the ratings. Our methodology can be applied to any variable that can be theorized as a continuous spectrum and is based on human ratings. The underlying source data is also general: it might be text, images, videos, time series, waveforms, or audio. As long as the underlying data source is unstructured and reviewed by humans, the deep learning model has the potential to approximate those human ratings for automation purposes. We apply this novel method to the measurement of hate speech, a social problem that has received extensive attention from policymakers, researchers, and firms. While existing attempts to detect hate speech use predictions derived from discrete labels, our method measures the hateful content of speech by scaling multiple labels of multiple components. Our manuscript is structured as follows: we begin by providing background on the theoretical foundations of hate speech and previous machine learning approaches. We then describe our methodology, starting with the theoretical development of a hate speech spectrum, which is operationalized as a survey instrument, the collection of social media comments and crowdsourced labeling process, use of faceted Rasch theory to statistically transform the labeled data into an interval variable, and application of deep learning models to predict that interval variable using only the text of the comment. We next present the results of our work, which consist of the Rasch scaling followed by the deep learning performance evaluation. We conclude with a discussion of current limitations and the many possible extensions of this method. We hope that this work can help spur wider usage of Rasch measurement theory to go beyond simplified discrete variables and instead develop continuous, interval variables across fields. We demonstrate the effectiveness of our method in measuring hate speech, a complex linguistic phenomenon with consequential social and political impacts. The accurate measurement of, and intervention to counter, hate speech may prevent psychological harm, dissipate extremist groups, and prevent downstream violent, even genocidal, events. However, the difficulty of identifying this rare and complex phenomenon has made interventions costly, error-prone, and rare. The harm in hate speech is significant. Within niche extremist communities, hate speech serves as a mechanism for recruitment and as a radicalizing influence, building cultures of hatred that can lead to hate crimes and terrorist violence (Tsesis 2002). In areas with long-term ethnic conflicts, hate speech may precipitate or otherwise foment mass genocidal violence (R. A. Wilson 2017). Even given the many harms of hate speech, legislation, corporate content moderation, or outright censorship to combat it can also cause harm. Laws intended to protect groups that are minorities in numbers within the power structure can be used against the very people they were intended to protect (Strossen 2018). Research on online hate speech is often conducted behind corporate walls given the centrality of large internet firms in addressing its harms. Open, transparent, and reproducible research on hate speech is vital to balancing the relative harms of hate speech against content removal. Computationally analyzing hate speech is a challenging task for several reasons. 
First, there is no consensus on a systematic definition of hate speech (Sellars 2016) or an existing measurement instrument. Second, the scale of communication and the different linguistic forms it takes necessitate algorithmic approaches. Third, hate speech constitutes a small proportion of online communication (1% or less), making it difficult to create well-powered training datasets for algorithmic approaches. Fourth, human reviewers who generate the training data often do not agree on how to label training observations, even when given detailed criteria and intensive training to evaluate speech (Kiela et al. 2020). Finally, extremist groups often use sarcasm, ambiguity, and coded language to obscure the hateful nature of their communications, to evade moderation, and ultimately, to organize actions. The combined impact of this mixture of challenges is that we cannot currently track hate speech in a consistent manner over time. Like many social or linguistic phenomena, definitions of hate speech are contested and vary over temporal and geographic contexts. Even if a definition were readily agreed upon, hate speech would be difficult to measure numerically. Many existing approaches use a binary theoretical construct: speech either contains hate speech or not. In identifying hate speech, whether at small, medium, or large scale, human raters, novice or expert, label texts based on a common definition. Due to the binary construct, these labels are also frequently binary, and in large-scale computational analyses out-of-sample predictions fall along the binary construct with associated error. Table 1 summarises some prominent existing approaches, noting the number of construct levels and outcome measurement granularity. Our approach improves on the existing literature in two ways: in our integration of the hate speech and counter-speech literature, we create a theoretical construct with 8 levels, allowing for more categories of hateful content in speech. We then create a labeling instrument with 32-48 measurement levels. These levels, which are described in the labeling instrument section in detail, reduce measurement error by using multiple items per construct and allow us to accurately distinguish between the 8 theorized construct levels. We validate the measurement of these levels in section 4.2. The scale of online communications has made computational approaches a necessity for characterizing hate speech patterns. In particular, machine learning for hate speech analysis has become a well-studied topic over the past decade. Early work (e.g. Warner et al. 2012) used simple, non-deep learning algorithms for estimation, such as naive Bayes, support vector machines, or linear regression. Text features were typically unigram TF-IDF scores, and sometimes additional syntactic features like part-of-speech tagging (Davidson et al. 2017). The field gradually transitioned to word embeddings, often pre-trained, where the best featurization approach was comment word embedding averages or paragraph embeddings (Djuric et al. 2015; Nobata et al. 2016). Those features could then be combined with gradient boosted decision trees or Bayesian additive regression trees.[1] More recently there has been a shift to deep learning methods, beginning with long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and gated recurrent unit networks (GRUs), that either use pre-trained word embeddings or learn their own embeddings from the raw text (Badjatiya et al. 
2017; Zhang et al. 2018). Attention mechanisms are beginning to be used in deep learning for hate speech research (Founta et al. 2018), but their successors, transformer units (Vaswani et al. 2017), have not yet been widely adopted for hate speech research. [1] These were our best results from a pilot study we completed in October 2017. Model-based transfer learning methods provided a further breakthrough in supervised natural language processing (Ruder 2018), including the best-in-class architectures OpenAI's GPT-2 (Radford et al. 2019), fast.ai's ULMFiT (Howard et al. 2018), and Google's BERT (Devlin et al. 2018). Those contextualized word representation methods are only beginning to be used in hate speech research. The algorithms column of Table 1 summarises the existing approaches in prominent previous work. Developing a labeled corpus to produce an accurate measurement tool for hate speech on social media requires overcoming two obstacles: the rarity of hate speech across all posts and differences in linguistic styles of different social media platforms. We overcome the first challenge by pre-processing 54 million total comments with a hypothesis score model, then labeling comments from different quantiles of likely hate content. To overcome the second problem, we sample comments from three platforms: YouTube, the most used social media platform globally, Twitter, a platform with heavily political content that engages with identity, and Reddit, a longer-format forum home to some radical communities. The 50,000 labeled comments across 3 platforms form the most representative, and one of the largest, training sets of hate speech data. Reviewer reliability is a particularly difficult challenge in hate speech research. Definitional variation, along with differences in labeler knowledge, context, and ideology, can result in inconsistent labels. Prior approaches refine their definition and measurement interest to try to maximize inter-coder reliability. We take a different approach, adjusting the final continuous score based on estimated labeler bias from our faceted Rasch IRT model. We use our method to measure hate speech, a complex social phenomenon that is difficult to define, measure, and analyze at scale. Our methodological approach begins with constructing a systematic conceptualization of hate speech in online communication, operationalized through a survey instrument, and proposes an efficient, debiased prediction algorithm using deep learning. In the following sections, we describe the process of construct theorization using a reference set of comments, the measurement of sub-constructs (components) using a labeling instrument, the crowdsourcing of labeling to a network of interlinked reviewers, the debiased scaling of labeled data using faceted Rasch measurement theory, and the integration of the Rasch model into a novel multitask deep learning architecture for debiased, explainable, interval prediction. Hate speech has many definitions across academic disciplines, legal and regulatory doctrines, and in common vernacular. When attempting to systematically measure a social concept, "I know it when I see it" will vary greatly depending on who is doing the identification: life experience, familiarity with language, and historical context all vary across individuals, whether experts or laypeople. Difficult-to-measure phenomena like hate speech are an ideal application of our method, but require careful theorization and translation to measurement instruments. 
We draw from the legal definition of hate crimes in the United States that protects against discriminatory actions targeting one of the following protected groups: race, religion, ethnicity, nationality, gender, sexual orientation, gender identity, and disability. In identifying groups within these broad categories, we include subjugated groups that have been discriminated against in the United States, as well as power-dominant groups who have not. Targeting of a group or an individual on the basis of their membership in a group is common to most definitions of hate speech (Sellars 2016). Not only do we adopt this convention, but we allow for intersectional or overlapping identities to be selected for further analysis. We consider intersectional identities and the possibility of compounding hate speech directed at an individual who belongs to multiple groups. Speech can also lead to individual acts of violence and, when targeted against a group, genocide and extermination. The "dangerous speech" framework ties the effects of hateful speech to actions that it can incite (Benesch et al. 2018). Dehumanization, such as radio broadcasts in Rwanda referring to the Tutsi people as cockroaches, is directly linked to later genocidal killing of that group. Incitement towards violence is a narrowly defined concept under US law, and the dangerous speech framework that we use takes a broader view of the link between cause and effect. Sellars (2016) points out that the accumulated effects of anti-Semitic or racist speech can have multi-generational impacts on the well-being of individuals in a group born long after the hateful speech was originally created. Given the complexities of these concepts, we focus on calls to individual violence or collective extermination, with the idea that these are the final step after expression of hate and deeming a group inferior or inhuman. Table 2 describes the eight levels of our theorized hate speech scale. The positive levels on the scale designate hate speech of increasing severity. Unlike many existing scales, our typology includes both neutral and positive identity speech, represented by 0 and negative values, respectively. Following Anti-Defamation League (2016) and Stanton (2013), we place speech supporting the systematic killing of a specific group as the most severe form of hate speech. Viewing other types of hate speech as pathways to genocide, we pay special attention to individual threats of violence and dehumanization that may justify violence. From a manual review of social media comments, we curated a reference set, a small corpus of example text for each conceptual level. We selected 10 comments to serve as examples of each of our theoretical levels, totalling 80 comments. In concert with construct development using existing literature, we manually reviewed thousands of comments from our corpus, oversampling on comments with high hypothesis scores. We also selected reference set comments for each level that yielded a diversity of target groups, text length, and linguistic styles. 
Table 2. Theorized levels of the hate speech scale:
5 Genocide: Support for or intention of systematically killing all or a large number of a protected identity group
4 Violence: Threat or support of physical force or emotional abuse intended to hurt or kill members of a protected identity group
3 Dehumanization: Depriving a protected group of human-like qualities, such as comparison to an animal, insect, or disease
2 Hostility: Unfriendliness or opposition to a protected identity group, such as through slurs, profanity, or insults
1 Bias: Inclination or preference against a protected identity group, including prejudice
0 Neutral: Descriptive or other non-harmful references to identity groups
-1 Supportive: Respectful, prideful, or other solidarity-based messaging about a protected identity group(s)
-2 Counterspeech: Response to hate speech that seeks to undermine its impact and standing
Iteratively, we selected comments that we felt best exemplified the levels of hate speech, and when we found ambiguities, used the comments to refine the definitions of each level. This allowed us to identify distinguishing features of speech and create measurement items to capture these distinctions. These reference set comments served two purposes: to aid in construct development by providing clear examples between levels and as empirical scaffolding to link human raters in our crowdsourced measurement. We describe the importance of the reference set for construct theorization and measure development below and its integration with linking labelers in Section 3.4. Example comments from each level of the reference set are shown in Table 3. In order to evaluate the comments we needed one or more human reviewers (also called "annotators", "labelers", "judges", "raters", or "moderators") to examine the comment text and provide data to estimate where each comment fell on each of our theorized components of hate speech. Most labeling tasks for training data give the labeler a detailed definition of the construct and then ask them to assign a binary label to each data point (e.g. designating a block of text as hate speech or not, designating whether an image contains a stop sign). This approach has two shortcomings: labelers cannot indicate uncertainty, and if the construct has multiple components that labelers differ on, the label does not indicate on which element they disagree. The approach described below overcomes both of these issues by decomposing the construct of hate speech into multiple labeling items and by giving labelers Likert-style response options to incorporate uncertainty. The labeling instrument, similar to a survey instrument, has three sections: 1) identity target items, which establish whether the comment targets a protected group, 2) scale items that measure the content of the comment along several distinguishing features of hate speech, and 3) a set of demographic questions asked about the labelers. The target items and scale items are asked for each of the comments that labelers review, and then are followed by the demographic items. One of the scale items, sentiment, is asked before the target items, to get one measure for comments that target a non-protected group or no group at all. If no identity groups were mentioned, labelers were not asked any remaining scale items and proceeded to the next comment. If at least one identity group was mentioned, they were asked to specify the sub-identity group(s),[2] and then asked the remaining scale items. All comments were also rated on a binary hate speech item sourced from Siegel et al. 
(2019) to allow comparison to the current best practice in binary hate-speech measurement. Differences in labeler knowledge and views make consistent annotations difficult to obtain. We address differences in labeler knowledge by providing a dictionary tool for niche slurs that appear in the comments we showed to labelers. Using a new dictionary, slur words were underlined in our survey user interface. If the user moved their mouse over the underlined slur word they would be shown a tooltip stating "This word may be a slur against identity group [X]" (see Figure 1 for an example). This user interface feature was intended to reduce response variation due to varying awareness of slur terms, as well as to make noticeable any coded slur language in the comment (for more details on the problem of covert slurs see Magu et al. 2017) . After rating the comments reviewers were asked a series of demographic questions about themselves, followed by an optional free response feedback item. The demographic items included the reviewer's gender, education, race, year of birth, income, religion, sexual orientation, and political ideology. We sourced our comments from three major social media platforms: YouTube, Twitter, and Reddit. We chose these platforms for their popularity, as respectively, they are used by 73%, 22%, and 11% of U.S. adults (Perrin et al. 2019) . Prior work on hate speech has often focused on a single platform, commonly Twitter, but our goal was to study hate speech in a variety of settings and to ultimately build an algorithmic model to accurately measure hate speech across multiple platforms (Fortuna et al. 2018) . We used public APIs to download recent comments posted to each site. Comments were considered eligible for labeling if they were written primarily in English and were not too short (< 4 characters) or too long (> 600 characters) after removing URLs, phone numbers and contiguous whitespace. Our comment collection took place between March and August 2019. On Reddit we collected all comments from the real-time stream of the subreddit "/r/all". For Twitter, we collected tweets from Twitter's streaming API, which is a random sample of all tweets on Twitter. YouTube required additional consideration because one must first select videos and then download comments associated with the selected videos. We searched for videos within proximity of the top 300 most populated U.S. cities in order to focus on videos originating in the U.S. and most likely to contain English comments with U.S.-based authors. From those videos we then downloaded all comments and responses. Hate speech is a rare phenomenon, estimated at less than 1% of online comments when viewed as a binary outcome, so randomly sampling from the collected comments would not have been efficient. That is, the outcome in the labeled data would be highly imbalanced at < 1% hate speech and 99% non-hate speech, which would make it difficult for statistical machine learning analysis to find patterns that differentiate between hate speech and non-hate speech and costly for our labeling process. Instead, we used a sampling method that would increase the relevance of the labeled comments to our theorized levels of hate speech; we targeted an even distribution of labeled comments across our 8 levels (12.5% each). We also wanted to avoid common shortcuts to increase rates of hate speech in labeled text, such as filtering on slur terms or Twitter hashtags. 
Those approaches would artificially reduce the linguistic variation in the comments and allow the deep learning to learn those shortcuts (confounded associations) without capturing true patterns (i.e. causal relationships), which is known as the "Clever Hans" effect (Heinzerling 2019; Niven et al. 2019) . In an effort to maximize the generalizability of our deep learning algorithm, we maintained a positive probability of selection for all sampled comments (i.e. no comments would be excluded based on their word usage). Our sampling method relied on two dimensions for stratified sampling: 1) a relevance estimate of how likely the comment was to contain a target identity group, and 2) a hypothesis score for how hateful the comment was estimated to be. Both scores were built from a pilot set of 4,000 labeled comments, using pre-trained Universal Sentence Encoder representations (TensorFlow) plus a genetically optimized prediction head (Olson, Urbanowicz, et al. 2016) . For identity prediction the genetic optimization algorithm selected a multilayer perceptron model while for the hypothesis score it selected a random forest. 3 With each future iteration of the project we can leverage the models developed in the prior iteration to improve the stratified sampling efficiency. We used the identity relevance and hate speech hypothesis scores to create five stratification bins: 1) irrelevant (i.e. estimated to contain no references to identity groups), 2) relevant and low on predicted hate speech score (potential counterspeech or positive identity speech), 3) relevant and moderate on predicted hate speech score (neutral), 4) relevant and high on predicted hate speech score (low or moderate intensity hate speech), and 5) relevant and very high on predicted hate speech score (violent hate speech). We heavily oversampled bins 2, 4, and 5, and undersampled bins 1 and 3. Because this stratification scheme covered all comments, each comment had a positive probability of being sampled, but we improved the likelihood of labeling comments that were some form of hate speech or counterspeech. As in a case-control study, this biased sample could be re-weighted back to the original population of comments through inverse probability weighting (Horvitz et al. 1952) . We incorporated platform sample size targets such that our labeled data consisted of 40% sourced from Reddit, 40% from Twitter, and 20% from YouTube. The sampled comments were then compiled into groups of 4 "original comments", stratified across our bins so that each group contained comments across the hypothesized hate speech spectrum. Each comment group was randomly allocated to 4 comment batches to ensure 4 ratings per comment, and each batch also included 6 reference set comments stratified across our 6 reference set levels. This design was chosen to generate a single network across all raters, which were linked through the comment groups plus the random selection from the reference set. In Figure 2 we show a simplified example of overlapping comment ratings that yield a single network across reviewers. This experimental design further ensured that every reviewer would receive comments across our hate speech scale; we eliminated the risk that by random chance some raters would only review comments in a narrow range of the scale. We connected the labeling instrument, hosted on Qualtrics, to a comment batch server using web service requests in order to reserve comment batches and then mark them as completed. 
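As an aside on the sampling stage, the sketch below illustrates the stratified selection logic with hypothetical bin probabilities (the actual probabilities were tuned to our corpus and are not reproduced here); the essential property is that every comment retains a positive probability of selection, so the labeled sample can later be re-weighted to the full comment population via inverse probability weighting (Horvitz et al. 1952).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical sampling probability per stratum; every bin stays > 0 so that
# inverse probability weighting can recover population-level estimates later.
SAMPLING_PROB = {
    "irrelevant": 0.01,
    "relevant_low_hate": 0.30,       # potential counterspeech / supportive speech
    "relevant_moderate_hate": 0.05,
    "relevant_high_hate": 0.40,
    "relevant_very_high_hate": 0.60,
}

def stratified_sample(comments: pd.DataFrame) -> pd.DataFrame:
    """comments must have a 'stratum' column; returns sampled rows with IPW weights."""
    p = comments["stratum"].map(SAMPLING_PROB)
    selected = rng.random(len(comments)) < p
    sampled = comments.loc[selected].copy()
    sampled["ipw_weight"] = 1.0 / p[selected]   # weight back to the source population
    return sampled
```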
The comment batch server was hosted as a Python serverless function in Google Cloud with a MySQL database backend. Each reviewer was given a random comment batch of 26 comments that was not already reserved by another worker and that had not yet been completed. If a given comment batch was not completed within 10 hours it was returned to the pool of unreserved comments. Human reviewers were recruited from Amazon Mechanical Turk to complete our labeling instrument hosted on an external site. Each labeler was given 26 comments (6 reference set comments and 20 "original comments") to label. Median time to complete the instrument was 49 minutes. Participants were compensated $7 for their participation in our study, yielding a median pay rate of $8.57 per hour. A manual review of the worker feedback on the task showed high satisfaction with the compensation for the task, and appreciation that the results would contribute to an understanding of social media conversations. Our scaling procedure, as described in this section, converted the collection of ordinal ratings from crowdsourced human reviewers into a continuous, linear hate speech scale. We identified Rasch item response theory (IRT) as the appropriate psychometric framework to analyze the ratings and to transform them into the continuous score. We selected the Rasch family of item response models because their theoretical properties have provably distinct advantages over other forms of IRT, leading to their elevated status as "the necessary and sufficient process for measurement" (Wright 1992). Rasch (1960, 1980) first described the requirements for objective measurement. Only Rasch models are founded on Fisherian sufficient statistics (Fisher 1934) that allow the latent variable, rater, and item parameters to be estimated separately using additive models (Andrich 2011). Notably, Rasch models satisfy the five requirements of invariant measurement (Engelhard Jr 2013, Ch. 1) as translated into our hate speech application:
1. Item invariance: measurement of comments must be independent of the particular items (survey questions) used for data collection.
2. Non-crossing comment response functions: a more hateful comment must always have a higher probability of a more hateful response option than a less hateful comment.
3. Comment invariance: the calibration of items and response options must be independent of the particular comments that were labeled.
4. Non-crossing item response functions: a comment must have a higher (more indicative of hatefulness) expected response on an easier item (lower difficulty) compared to a harder item (higher difficulty).
5. Comments, items, item responses, and raters must be simultaneously located on a single underlying continuous latent variable.
The benefit of invariant measurement is that the resulting scale is not specific to the particular survey items, raters, or comments that we analyzed in this study. Instead, we have created a measurement system for hate speech that can be applied to future text, incorporate improved survey items, and use different raters, while maintaining the ability to analyze all such objects on the original scale we have constructed in this initial work. In other words, we are able to construct a measurement device for hate speech that provides stability of meaning over a long time period, rather than an arbitrary metric that is valid only for our currently acquired data and instrument. More detailed discussions of the "Rasch rationale" are provided in M. 
Wilson (2004, Ch. 6) and Wright and Masters (1982, Ch. 1). The partial credit model (Masters 1982) was the appropriate model within the Rasch family because our labeling instrument did not use the same set of response options for each item. We avoided the two-parameter or three-parameter families of IRT models such as the graded response model (Samejima 2016) or generalized partial credit model (Muraki 1992). Those non-Rasch models do not admit the invariance properties of Rasch measurement. Specifically, the estimation of item discrimination parameters (aka "slope parameter") in a 2- or 3-parameter model implicitly results in the item difficulties no longer being separable from person (comment) abilities (Wright and Masters 1982, Ch. 1, p. 8). That result can be visualized by comparing the item characteristic curves from such models and noting that they must cross, because they do not have the same shape (M. Wilson 2004, Ch. 6, p. 110-113). We further avoided principal component analysis and factor analysis because they also do not transform raw scores into objective measures. Rather, they analyze the item responses as though they were already linear measures, when in fact they are only ordinal categorical ratings (Wright 1996). Moreover, neither method is sample invariant: they generate results that are intrinsically limited to the specific comments, items, and raters that were observed (Rasch 1953). In the late 1980s Linacre extended the Rasch family of models to include judge-mediated assessments (Linacre 1987, 1989), which applies to our labeled dataset generation, where a human reviewer completed a survey in which they analyzed textual comments. Treating the rater as an additional facet of the scaling procedure enabled the estimation of a rater "fixed effect" or "severity" parameter with the same hate speech scale units as the comment scores, item difficulty estimates, and item step thresholds. This severity parameter can be viewed as an estimate of survey interpretation bias, where raters vary in how aggressively or loosely they interpret the scale items. Those rater fixed effects then no longer influence the statistical estimation of the comment abilities, item difficulties, or item steps. The result is a more objective estimate of the hate speech score for an individual comment that is independent of the severity of the raters who happen to be assigned to review that comment. Extensive details on faceted Rasch models can be found in Engelhard Jr and Wind (2018), Eckes (2015), and Linacre (1989). With the faceted partial credit model the probability of a given response to an item can be written formally as the following equation (Eckes 2015; Linacre and Wright 2002):
log(p_nijk / p_nijk-1) = θ_n − δ_i − α_j − τ_k
where:
• p_nijk is the probability of comment n being rated as response k by rater j on item i,
• p_nijk-1 is the probability of comment n being rated as response k − 1 by rater j on item i,
• θ_n is the ability of comment n,
• δ_i is the difficulty of item i,
• α_j is the first-order bias ("severity") of rater j,
• τ_k is the difficulty of receiving rating k relative to rating k − 1.
Faceted Rasch models include a noteworthy implication for inter-rater (kappa) reliability: it is not essential that different raters provide the same responses to an item when analyzing a certain comment. There is a growing body of related literature showing that a single "true" labeled response is often unrealistic (Aroyo et al. 2019; Geva et al. 2019; Palomaki et al. 2018). 
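To make the model concrete, the sketch below uses hypothetical parameter values (not estimates from our data) to show how the additive structure above converts a comment ability, item difficulty, rater severity, and step thresholds into a probability distribution over response options.

```python
import numpy as np

def facets_pcm_probs(theta, delta, alpha, tau):
    """Response-option probabilities under the faceted partial credit model.

    theta: comment ability, delta: item difficulty, alpha: rater severity,
    tau: array of step thresholds (tau[k] governs the step from k-1 to k).
    Returns P(response = 0..K) for one comment x item x rater combination.
    """
    # Adjacent-category log-odds: log(p_k / p_{k-1}) = theta - delta - alpha - tau_k
    steps = theta - delta - alpha - tau
    log_numerators = np.concatenate([[0.0], np.cumsum(steps)])  # category 0 is the reference
    probs = np.exp(log_numerators - log_numerators.max())       # stabilize before normalizing
    return probs / probs.sum()

# Hypothetical values: a fairly hateful comment, an easy item, a lenient rater,
# and four step thresholds for a 5-option Likert-style item.
print(facets_pcm_probs(theta=1.2, delta=-0.5, alpha=-0.3, tau=np.array([-1.0, -0.2, 0.4, 1.1])))
```

Increasing the severity α in this sketch shifts probability mass toward less hateful response options for the same comment and item, which is precisely the interpretation bias that the faceted scaling removes from the comment scores.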
Rather than identical ratings across raters, we want within-rater consistency in item interpretation, so that each rater's estimated severity acts as a strong summary measure of their individual style of rating. In fact, for raters with very different estimated severities, we would expect them to provide different ratings on an item when analyzing the same comment; that would be consistent with the measurement model (and common sense). Reliability of ratings and unbiasedness are two distinct phenomena; for example, a given comment may exhibit high reliability due to multiple raters agreeing on a biased assessment (Henning 1996). This is a marked psychometric departure from prior studies of hate speech or other supervised natural language processing topics, which have commonly relied on inter-rater reliability as the primary quality metric for dataset labeling (Ross et al. 2016). We used Facets software (Linacre 2019) to conduct the many-facet Rasch scaling. A sequence of four scaling estimates was conducted, in which increasing percentages of low-quality raters were removed and response options were collapsed to reduce noise in the estimates. The details of the rater quality analysis and response collapsing are described in the appendix. After scaling, we trained an algorithm to estimate a mapping from the raw text to our latent hate speech score. Deep learning has shown the best performance for this type of task, provided that sample size is not too small. The current best architectures are based on transfer learning (Pan et al. 2009) and Transformer units, such as T5, ALBERT, and RoBERTa. These architectures consist of a language model that is first trained in an unsupervised fashion on large amounts of general text, typically millions of Wikipedia articles and about 10,000 books. This teaches the algorithm the meaning of words within their context, allowing it to read new types of text. We then conducted the supervised learning: we supplied the raw comment text as our input observations, with the ultimate goal of predicting the latent hate speech score. We made four novel changes in our supervised approach: 1) rather than predict the hate speech score directly, we instead predicted the responses to each survey item using a multitask architecture (i.e. multiple outputs within a single model) (Ruder 2017), 2) each survey item was directly analyzed as an ordinal outcome using the consistent rank logits method of ordinal softmax activation and ordinal cross-entropy loss (Cao et al. 2019), 3) we supplied the rater's estimated survey interpretation bias (severity) as an additional non-text input to allow the model to adjust its understanding of the likely item response, and 4) we tagged known slurs in the raw text of the comments, giving the deep learning models comparable information to what the human annotators were provided. The individual item response predictions could then be transformed into latent scores using the estimated IRT parameters. See Figure 3 for a depiction of this architecture. We believe that this approach represents a new way of integrating deep learning with item response theory for measuring phenomena that can be decomposed into multiple items that are reviewed by human raters. We now describe the details of those steps. The simplest supervised learning strategy would be to predict the continuous hate score directly, and to ignore the intermediate item ratings that led to that score. With this architecture the data structure would consist of one observation per comment. 
The estimated severity (bias) for each labeler would not need to be incorporated as an auxiliary input because the score was already debiased through the IRT scaling. The loss function could simply be mean-squared error and the architecture could consist of a Transformer-based representation subnetwork applied to the raw text, followed by one or more dense layers to learn the mapping to the continuous score. In lieu of the dense layers, the language representation could be applied to generate a summary feature vector for each observation, and traditional machine learning such as XGBoost, BART (Chipman et al. 2010), SuperLearner ensembling (Polley et al. 2019), etc. could be applied to predict the continuous score. We implemented both the full neural architecture (dense layers) as well as the traditional machine learning structure with TPOT optimization (Olson and Moore 2019) as benchmark options. We hypothesized that an improved supervised learning architecture would instead attempt to predict the human rating on each of the survey items (as listed in Table A1). Those intermediate predictions could then be transformed into the continuous score using the IRT parameters estimated during the scaling process. The exciting insight of this approach is that the items facilitate what could be called directly supervised concept learning. [Figure caption: Comment text is fed into a deep natural language processing algorithm to convert it into a fixed vector representation. Then a series of fully connected layers learns how to combine that vector representation into latent variables that can predict the response to each item on the labeling instrument. The fully connected layers also take as input the estimated rater bias for each comment rating to adjust their expected item response predictions. The predicted item responses are then transformed via IRT into the continuous hate speech score.] [Figure caption: Comment text is fed into a deep natural language processing algorithm to convert it into a fixed vector representation. One or more hidden layers, or traditional machine learning algorithms, learn a function to map that fixed representation to minimize mean-squared error loss for predicting the continuous hate score.] In a typical neural architecture the final hidden layer has developed the highest-level concepts that best predict the outcomes in the output layer based on the architectural hyperparameters: loss function, number of hidden units, activation function, random initialization of weights, dropout, and optimization over multiple epochs of a certain batch size. Exactly what those final concepts mean is not immediately obvious, and is the result of a greedy stochastic process that reflects a local optimum. In contrast, with our method we know from our theorization and development of the construct exactly which concepts need to be learned, as well as the nonlinear function that transforms those concepts into the continuous score, and we have labeled data for each of those concepts: the survey item responses from the human labelers. This is a powerful shortcut in the supervised learning process that largely eliminates the need for architecture search or hyperparameter optimization in that final concept layer of the neural network. We know how many hidden units there are, the form of the loss function for those units, and the nonlinear activation function to transform those units into the final output. 
It also provides direct supervision during the optimization process for those concepts: we don't have to rely on backpropagation of errors from the final continuous output. While there could be a separate model for each item, a multitask architecture was advantageous for three reasons: efficiency, generalizability, and convenience. Multitask architectures can improve efficiency and generalization because they bias the network to learn a shared representation of concepts that explains multiple related outputs (Caruana 1997; Goodfellow et al. 2016, §7.7). Multitask architectures are gaining increasing adoption across deep learning tasks, including face analysis (Ranjan et al. 2017), language modeling (CITE), self-driving cars (Karpathy 2019), etc. We examined the relatedness of items in our scale through a correlation heatmap (Figure 8) and found strong levels of correlation, supporting the likely benefit of multitask learning. Multitask architectures also offer convenience by packaging multiple outputs into a single architecture, reducing the lines of code needed for training and prediction. Each item served as an output in the network, and the label was an ordinal Likert-style variable, typically with 5 response options, such as: {strongly disagree, disagree, neutral, agree, strongly agree}. The loss function for each output (task) would benefit from acknowledging the ordinal nature of the outcomes: predicted probability mass assigned farther from the true label is worse than probability mass allocated to the adjacent label(s) (Hou et al. 2016). For example, when a human rater labels a comment as "strongly disagree" on the "calls for violence" item, the predicted probability of "strongly agree" should contribute much more to the loss than "disagree", because it is a much worse prediction. This has been recognized as a desire for unimodal probability distributions when conducting ordinal classification (Beckham et al. 2017; Costa et al. 2008). Cross entropy, the standard loss function used for discrete labels, does not do this. It only examines the probability placed on the true label and encourages that probability to be maximized. We incorporated the method of Cao et al. (2019), termed consistent rank logits (CORAL), to explicitly handle the ordinal nature of predicting the rating on each item. When applied to an item with k possible ratings (e.g. 5 for a Likert-style item), the CORAL method decomposes an ordinal regression into k − 1 binary classification tasks in which the model learns to predict if the expected response is greater than or equal to each possible rating; the highest rating value is skipped because its probability prediction will be the remainder. The loss function becomes the (possibly weighted) sum of the cross-entropies of each of these binary classification tasks, which we call ordinal cross-entropy. The set of binary classifiers uses shared weights combined with separate trainable bias parameters, which guarantees that the output yields a unimodal probability distribution over the possible item ratings; for this reason we term the ordinal activation function ordinal softmax. We implemented a Keras version of the CORAL method for this work (Kennedy 2020). Once the multi-task deep learning model has been trained to generate item-level predictions (i.e. probability distribution over each response option), those predictions need to be fed into an anchored faceted Rasch model to be converted into the predicted hate score. 
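The sketch below shows one way to implement a CORAL-style ordinal head and ordinal cross-entropy loss in Keras for a single item with five response options; it illustrates the general technique rather than reproducing the exact implementation released for this work (Kennedy 2020).

```python
import tensorflow as tf

NUM_OPTIONS = 5  # e.g. strongly disagree ... strongly agree

class CoralHead(tf.keras.layers.Layer):
    """Shared weights plus K-1 trainable biases -> cumulative probabilities P(response > k)."""
    def __init__(self, num_options=NUM_OPTIONS, **kwargs):
        super().__init__(**kwargs)
        self.fc = tf.keras.layers.Dense(1, use_bias=False)   # weights shared across thresholds
        self.biases = self.add_weight(
            name="coral_biases", shape=(num_options - 1,), initializer="zeros")

    def call(self, x):
        logit = self.fc(x)                       # (batch, 1) shared logit
        return tf.sigmoid(logit + self.biases)   # (batch, K-1) cumulative probabilities

def ordinal_crossentropy(y_true, y_pred):
    """y_true: integer ratings in 0..K-1; y_pred: the K-1 cumulative probabilities."""
    k = tf.shape(y_pred)[-1]
    y_true = tf.reshape(tf.cast(y_true, tf.int32), (-1, 1))
    # Extended binary targets: 1 wherever the true rating exceeds the threshold index.
    levels = tf.cast(tf.range(k)[tf.newaxis, :] < y_true, tf.float32)
    return tf.reduce_mean(tf.keras.losses.binary_crossentropy(levels, y_pred))
```

In the multitask setting one such head and loss is attached per survey item, and the probability of each discrete response option can be recovered by differencing the cumulative probabilities; the remaining question is how to convert those per-item distributions into the single predicted hate score.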
The simplest approach would be, for each of the 10 items, to predict the response option (label) with the highest estimated probability. For example, if the model estimated that the genocide item would be answered as "yes" at 10% probability and "no" at 90% probability, the model's label prediction would be "no". So the model would generate estimated item labels for each comment, and those predicted ratings would be scaled by running the partial credit scaling procedure with item and response parameters anchored (fixed) to the values from the original faceted Rasch scaling. The measurement model for this deep learning scaling differs from the Rasch scaling we used to create the continuous outcome variable because we have only a single rater -the deep learning model itself. That is the reason that we use the partial credit model for this step, rather than a faceted Rasch model that estimates a severity parameter for each rater. There are two key downsides to the partial credit scaling as described: 1. The number of unique predictions generated by the partial credit model is limited to the cardinality of the raw scores. This is because the raw score (the sum of the individual item ratings) is a sufficient statistic for the continuous latent variable generated by Rasch measurement models. In our case the final item setup had 33 possible raw scores (0 -32), so only 33 unique point estimates could be generated by the partial credit model transformation. Although certainly more granular than predicting a yes/no or ordinal label, that seems somewhat coarse when covering a continuous spectrum from -8 to +5, which could limit the predictive performance of this version of the model. 2. Possibly more importantly, in selecting the most probable rating we discard the information contained in the predicted probability distribution for each item, i.e. how confident the model is about a predicted item rating for a given comment. That information seems quite valuable, and is missing from the original labeled data from the human raters -we only know the single response option that they selected (presumably the highest probability response option from their mental model of the rating task). 4 Incorporating that probability information will also improve the sensitivity of overall model performance to improvements in the multi-task item predictions: architecture improvements will translate into improved probability predictions more efficiently than improved label (rating) predictions. How best to solve these problems then, generating more precise predictions that incorporate the confidence information from the predicted probability distributions? We chose to take a plausible value sampling approach, which resolved both issues. Rather than select the most probable response for each item, we sampled many possible item ratings for each comment based on the probability distribution for each item. Each "plausible value" for an item rating was selected at random from the possible response options, with discrete probability equal to the model's predicted probability for each response option. In other words, we asked the model to rate each comment many times, and to select each item rating by a probability-weighted sample of the response options. Those replicated ratings were then scaled simultaneously in the anchored partial credit model to yield an estimated hate score for each comment. 
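A minimal sketch of this plausible value sampling step is shown below, assuming a hypothetical item_probs list holding the model's predicted probability distribution over response options for each item of a single comment; the sampled rating vectors are what get passed to the anchored partial credit scaling.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_plausible_ratings(item_probs, n_replications=32):
    """Draw probability-weighted item ratings for one comment.

    item_probs: list of 1-D arrays, one per item, each summing to 1 over its
    response options. Returns an (n_replications, n_items) array of sampled
    ratings, i.e. the comment "rated" many times by the model.
    """
    draws = [rng.choice(len(p), size=n_replications, p=p) for p in item_probs]
    return np.stack(draws, axis=1)

# Example with hypothetical predictions for a 3-item instrument.
probs = [np.array([0.7, 0.2, 0.1]),            # 3-option item
         np.array([0.1, 0.2, 0.4, 0.2, 0.1]),  # 5-option Likert item
         np.array([0.05, 0.15, 0.8])]
replicated = sample_plausible_ratings(probs, n_replications=8)
```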
We tested different replication counts (1, 2, 4, 8, 10, 16, 24, 32, 64, 128, 256) to examine the trade-offs of predictive performance and computation time.
3.6.6 Dynamic adjustment via revised slur tagging
As noted previously, the raw comment text was processed so that any known slurs that had been highlighted during the annotation process were also tagged for use in the deep learning model training. This gave the models the potential to incorporate slur tagging into their contextual language representation, if it was found to be useful. While our intent with the slur tagging was primarily to improve the quality of the labeled data, it has an interesting implication for the deep learning side as well. In effect, it partially decouples the understanding of slur terms from the model's predictions, meaning that we do not have to keep the slur tagging fixed as it was implemented during human annotation. This has a major advantage: as coded slur terms and other covert language evolve in the future, or as we simply develop a more strongly theorized list of slur terms, we can update our slur dictionary to reflect those changes and improve our model's ability to measure contemporary hate speech, even when those terms were never seen in the original training data. We call this dynamic adjustment of slur tagging during prediction. For example, our training data predates the coronavirus epidemic, so the training data contains no examples of COVID-related hate speech, such as the racist phrase "kung flu." However, if we add that phrase to our slur dictionary and tag it appropriately in future raw comments, our model prediction then incorporates that side knowledge to better measure the hatefulness of comments that adopt that novel racial slur. We can also do the chronological reverse: when applying our model to older text, such as historical literature, we can modify the slur tagging preprocessing to reflect the slurs as they were used at that time. This slur tagging process can further benefit from improvements in the tagging methodology: model-based tagging, rather than simple dictionary lookups, can reduce false positives where context demonstrates that a word or phrase is unlikely to be intended as a slur. We now report on our observed results from the evaluation of the allocation of comments to raters (judging plan), application of faceted IRT to our labeled comments (scaling), and our accuracy at predicting the hate score using deep learning on the raw comment text. We evaluated the network linkage across the raters, original comments, and reference set comments. We confirmed that our batch creation procedure generated a single linked network, with no disjoint subsets, that would allow the estimation of rater severity ("fixed effects"). The diameter of the network was 6, meaning that any two points were connected by traveling across no more than 6 edges. The average distance was 3.6, meaning that any two nodes were typically about 4 edges apart. Our faceted partial credit model achieved a case (comment) reliability of 0.94. Our estimated rater separation reliability was also 0.94, suggesting that our judging plan resulted in high accuracy at estimating the individual severity for the labelers. Following Linacre (1999), we confirmed that for each item the average hate score increased monotonically with more hateful response options. The calibrated Wright Map showing our scale across comments, items, item steps, and raters is shown in Figure 5. 
The estimated item difficulties (Table 4 and Figure 5) were consistent with our hypothesized item difficulties, which speaks to the construct validity of our instrument. Our item fit statistics exhibited reasonable fit to our assumed Rasch model. Both infit (inlier sensitive) and outfit (outlier sensitive) mean-squares fell within acceptable ranges. We reviewed the scaling results on our reference set comments, which were each rated 500 to 1,200 times. Figure 6 displays the ability point estimates for each reference set comment grouped into its qualitative level. The average score within each level showed the expected monotonic increase in hate speech for increasing levels. However, the level pairs 2 & 3 and 4 & 5 did not show large absolute differences in their average hate speech scores. We also noted that some comments appeared better placed in adjacent levels based on their hate speech scores. These results imply that we should revisit our theorization for the reference set, review the criteria for the comments to determine if they are better placed in adjacent levels, and consider substituting ambiguous comments with more clearly exemplary comments. We may want to simplify the reference set by merging levels 2 & 3 and levels 4 & 5. [Figure 6. Evaluation of scaling on reference set: ability estimate for each reference set comment. The horizontal blue line is the average score within the theorized qualitative level.] We examined the distribution of the hate speech score across our three platforms (Figure 7). YouTube and Reddit looked very comparable. Twitter showed a similar distribution but shifted to the left, with a noticeable reduction in scores at the high end of our hate speech scale. This result suggests that we may need to improve the allocation of Twitter comments into batches in order to increase the percentage of comments on the hateful side of the spectrum. Our observed results may also indicate that Twitter already conducts some sort of automated filtering of hateful comments before allowing the comments to be downloaded through their API, or that the Perspective API is less accurate on Twitter comments. Alternatively, this may suggest differential item functioning of our labeling instrument for Twitter comments, if they have linguistic characteristics that cause the raters to interpret them differently from comments sourced from YouTube or Reddit. It is important to note, though, that these results reflect the unweighted distribution of our labeled sample, which was intentionally skewed during the comment collection and batch creation process. Therefore these distributions should not be seen as population estimates of the true score distribution on each platform, but rather as descriptions of our training data. Future work will apply weighting corrections to the training data to estimate population-level parameters. A review of the correlation across items, shown in Figure 8, suggested that a multi-task deep learning architecture could benefit from sharing information across each item rating prediction. After scaling, we used supervised deep learning to estimate a mapping from the raw text to the continuous hate speech score. Our current best models used a RoBERTa-Large contextual language base to process the raw text into a vector representation (Y. Liu et al. 2019), followed by a 64-unit hidden layer with 10% dropout, and included fine-tuning of the language representation subnetwork using HuggingFace software's Keras implementation (Chollet et al. 2015; Wolf et al. 2019). 
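For illustration, the sketch below assembles the direct-score variant of this setup with HuggingFace's TFRobertaModel inside Keras; the 64-unit hidden layer and 10% dropout mirror the description above, while details such as the pooling choice, sequence length, and learning rate are assumptions rather than the exact published configuration.

```python
import tensorflow as tf
from transformers import TFRobertaModel

MAX_LEN = 128  # assumed maximum token length

def build_direct_score_model():
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

    roberta = TFRobertaModel.from_pretrained("roberta-large")
    hidden = roberta(input_ids, attention_mask=attention_mask).last_hidden_state
    pooled = hidden[:, 0, :]                                  # <s> token embedding as comment summary

    x = tf.keras.layers.Dense(64, activation="relu")(pooled)  # 64-unit hidden layer
    x = tf.keras.layers.Dropout(0.10)(x)                      # 10% dropout
    score = tf.keras.layers.Dense(1, name="hate_score")(x)    # continuous IRT hate score

    model = tf.keras.Model([input_ids, attention_mask], score)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5), loss="mse", metrics=["mae"])
    return model
```

The multitask variant replaces the single hate_score output with one head per survey item (categorical or CORAL-style ordinal) and concatenates the estimated rater severity onto the pooled representation before the dense layers.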
Results from our model testing are shown in Table 5. Direct prediction of the continuous hate score has currently achieved the lowest root mean-squared error (RMSE), although our proposed multitask networks that are transformed via IRT achieved comparable performance and slightly lower mean absolute error, with the benefit of explainability. In the multitask outcome representation we did not find a benefit from ordinal modeling of the items compared to categorical modeling.

Table 5. Supervised learning results. FE = feature extraction (i.e. freezing the weights in the language representation layers). FT = fine-tuning of a transformer-based architecture, as compared to feature extraction. WWM = whole-word masking, referring to the form of masked language modeling. CV = cross-validation. RMSE = root mean-squared error. Corr = linear (Pearson) correlation. MAE = mean absolute error. Columns report the prediction algorithm and outcome representation.
a Because this is a probability prediction for a binary outcome, we linearly calibrate the probability output to our continuous outcome by fitting an OLS regression on the training set with the probability output as the only feature, then applying that fitted regression to the validation data.

Grounded by theorization, our empirical results showed that we were successful in both major sub-projects of this work: 1) creating a 10-item labeling instrument processed with faceted Rasch modeling to yield a continuous scale for hate speech, including counterspeech, and 2) predicting that scale on unseen social media comments using Transformer-based natural language processing. That prediction could be accomplished either by directly predicting the hate score from the raw text, or by taking advantage of the Rasch transformation: first predicting the human ratings on the ten constituent components of hate speech (i.e. the multitask architecture), and then aggregating multiple possible ratings of each comment (plausible values) into a predicted hate score with a fixed-parameter Rasch scaling. The two prongs of this work allow arbitrary text to be placed on an interval spectrum ranging from genocidal hate speech at one extreme to supportive identity speech at the other. In sum, we successfully created an initial measurement system for hate speech. Future work will no doubt provide further improvements to the scale creation and prediction algorithm, as well as the source data collection and the comment labeling system.

While we have focused on our hate speech problem of interest, the methodology we have developed is applicable whenever humans review data to make a judgment and that summary judgment can be based on multiple component parts. The data could be images or video just as well as text, or other data structures such as time series. Within natural language processing, sentiment analysis is one of the most widely used supervised learning problems; we suspect that applying our method to measure sentiment would lead to substantial improvements in that subfield. Automated essay grading might be an even more straightforward application of our method. Possible video applications include the automated evaluation of technical skill derived from videos of surgical procedures, where a multi-component labeling instrument ("OSATS") is already used by human judges (Martin et al. 1997), and the assessment of the quality of surgical wound closure (Blencowe et al. 2019).
Numerous other applications could likely be developed in future years.

Our work thus far is promising but has some known limitations. The ten items used for scaling deserve further comparison and development; we may be able to remove certain items or add additional items to increase reliability. During the scale development our estimated item fit statistics encouraged the collapsing of response options to improve invariance. This suggests that reviewers had difficulty consistently differentiating between response options for several items. This is quite reasonable given that our response options were primarily Likert-style "strongly agree" to "strongly disagree"; those options are inherently subjective and ambiguous. Additional theorization, qualitative review, and pilot testing of improved response options could lead to better measurement characteristics, particularly for the items with greater estimated difficulty such as genocide and violence. Clearer, more objective response options that are consistently interpreted should then reduce the variance of the rater severity parameter and increase the precision of the latent variable. That increased precision may better separate the latent variable estimates for the different theorized levels in our reference set. Further refinement of the reference set and theorized levels may help to provide even more granular, distinct levels of measurement. Incorporation of the rater bundle model into our estimation procedure will also provide a more accurate estimate of instrument reliability (M. Wilson and Hoskens 2001).

In addition to those incremental improvements, our use of item response modeling opens up powerful methods for the analysis of bias and fairness. The field of measurement has long examined bias in exam questions through the lenses of differential item functioning (DIF) and differential rater severity (Myford et al. 2003). DIF methods can test whether one of our 10 components of hate speech does not operate consistently for certain subsets of comments. For example, if reviewers are more likely to rate comments with a racial dialect as being disrespectful (Sap et al. 2019), we can statistically identify the issue and work to create alternative item wording that mitigates the bias. If certain individual reviewers interpret the items more harshly for comments that include sexist speech, we can also identify that statistically and correct for that bias in our analysis. These types of analyses are standard practice when analyzing item response theory scales, and can be executed within a formal statistical framework that covers the wide variety of identity groups included in our labeled data. Evaluation and improvement of bias & fairness will be a significant part of our future work in this line of research.

With regard to the deep learning modeling, we have not yet reached the limits of supervised performance using our existing training data. The language representation backend can be continually upgraded as new language models are developed, such as T5 (Raffel et al. 2019). In-domain pretraining of the language model on our own corpus of social media comments would likely improve performance, an upgrade we are currently exploring. Both the multitask architecture and the implementation of ordinal loss and activation likely have untapped performance improvements, such as those that might be discovered through neural architecture search or through customized task weighting (Klyuchnikov et al. 2020).
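As one concrete way to implement an ordinal output head, the sketch below uses a cumulative-threshold encoding in the spirit of rank-consistent ordinal regression (CORAL, cited in our references). It is a generic illustration rather than the exact ordinal activation and loss used in our experiments; the number of categories is an assumption.

```python
import tensorflow as tf

NUM_CATEGORIES = 5  # e.g., a 5-option ordinal survey item

def ordinal_targets(labels, num_categories=NUM_CATEGORIES):
    """Encode an ordinal label k as K-1 cumulative binary targets:
    category k maps to [1]*k + [0]*(K-1-k)."""
    thresholds = tf.range(1, num_categories, dtype=tf.int32)
    return tf.cast(tf.expand_dims(labels, -1) >= thresholds, tf.float32)

def ordinal_head(hidden):
    """K-1 sigmoid units, trained with binary cross-entropy against the
    cumulative targets above."""
    return tf.keras.layers.Dense(NUM_CATEGORIES - 1, activation="sigmoid")(hidden)

# Example encoding: category 3 of 5 -> [1, 1, 1, 0]
print(ordinal_targets(tf.constant([3])).numpy())
```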
The integration of the IRT transformation with the multitask predictions warrants substantial future experimentation to answer key questions. Can we approximate the IRT transformation through an integrated neural subnetwork? Is plausible value sampling the best way to capture uncertainty in the predicted ratings? What about using the item ratings as a pretraining objective for a two-stage model that predicts the continuous score directly, or integrating the item ratings and the continuous score into a single multitask network? Robustness to misspellings and other adversarial orthography would be yet another helpful improvement (Sun et al. 2020). Lastly, numerous low-level optimization improvements are likely possible for the training of deep models, including learning rate scheduling, next-generation optimizers (e.g. Yogi; Zaheer et al. 2018), and differential learning rates.

We have a number of longer-term goals for future work in our hate speech research agenda. Most urgently, there is a need to apply this measurement technology to causal inference problems, such as estimating the effects of policy changes, current events, and user interface interventions on hate speech, as well as to purely descriptive work reporting on temporal hate speech trends. Extending our current model to English-speaking countries outside of the United States (e.g. the United Kingdom, Canada, or English-speaking African states) will require a broader theorization of hate speech that accounts for variation in culture and configurations of vulnerable populations. Expanding our platform sources will likely continue to be fruitful for generalizability and linguistic robustness; we hope to include data from Facebook, Instagram, Wikipedia, Twitch, and WhatsApp in later work. Expanding to additional languages, especially Sinhala, Khmer, Arabic, Hindi, and Portuguese, will allow our work to be relevant to developing countries where ethnic conflict, genocide, and extremist violence may be more overt than in the United States. Application of our model and methodology to low-resource languages will be facilitated by our dynamic slur tagging approach, where existing slur lexicons can support early pilot models prior to the development of a full labeled corpus. Releasing our models to select partners through an API will facilitate incorporating them into browser plugins, social media platforms, and other user interface interventions. We also plan to apply our models to extremist literature, such as The Turner Diaries (Pierce et al. 1978) and Hitler's Mein Kampf, to better understand the role of hate speech in radicalizing literary works.

In this paper we described the development of a novel, holistic methodology for measuring hate speech in a scalable, debiased, explainable manner. Based on prior literature, we theorized eight qualitative levels on a scale ranging from genocidal hate speech to counterspeech and collected empirical observations as examples of each level (the reference set). We developed a labeling instrument to record ordinal ratings on 10 components of hate speech through a reviewing process. We collected online comments from three major social media platforms (YouTube, Twitter, and Reddit) and sampled them in a way that focused our labeling on comments more likely to be hate speech or counterspeech, while maintaining generalizability by ensuring that all collected comments had a positive probability of selection in our sampling procedure.
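That sampling constraint (oversampling comments likely to be hate speech or counterspeech while keeping every comment's selection probability positive) can be sketched as follows. The score values, probability floor, and sample sizes are illustrative assumptions rather than our actual sampling design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "likely hate/counterspeech" scores for a pool of candidate comments.
scores = rng.uniform(size=10_000)

# Selection probabilities proportional to the score, but floored so that
# every comment retains a positive probability of selection.
floor = 0.02
weights = np.maximum(scores, floor)
probs = weights / weights.sum()

sample_idx = rng.choice(len(scores), size=1_000, replace=False, p=probs)
```

Because every comment has a known, nonzero selection probability, sampling weights can later be used to correct the labeled data back toward population-level estimates.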
We created a crowdsourcing-based labeling procedure to allocate comments to reviewers and yield a network linking all reviewers to each other through overlapping comment reviews, facilitating the estimation of the survey interpretation bias of each reviewer (a rater "fixed effect"). We fit the faceted Rasch partial credit model to create a sample-invariant scale for hate speech that placed comments, survey instrument items, and raters on the same continuous metric, and adjusted each estimated comment hate speech score for the estimated survey interpretation bias of the raters who happened to rate that comment. The statistical diagnostics from the Rasch model allowed us to evaluate the quality of each reviewer and remove crowdsource workers with low-quality responses. Finally, we applied supervised, multitask, Transformer-based deep learning with rater bias as an auxiliary input, followed by an IRT nonlinear post-processing transformation with plausible value sampling, to learn an estimator that maps raw text to the hate speech score in a robust, explainable manner. That deep learning model was encouraged to gain a more general understanding of language through training on data from three separate social media platforms.

Separately, each of these steps represents a novel contribution to the hate speech literature. In combination, we believe our methodology constitutes a paradigm shift in the understanding and measurement of hate speech, and in supervised learning of human-labeled data more broadly. We hope that our work will encourage other researchers to adopt Constructing Measures-style theoretical development & measurement in the study of complex social phenomena, including a transition from dichotomous or ordinal outcomes to continuous, linear scales estimated via Rasch-based item response modeling, corrected for the survey interpretation bias of reviewers, and integrated into explainable multitask deep learning architectures. For future updates on our project, data, and models, please visit hatespeech.berkeley.edu.

This work was funded by the UC Berkeley D-Lab, the Anti-Defamation League, the Berkeley Institute for Data Science, and Google Jigsaw.

A.1 Scale items from labeling instrument

Table A1. Scale items from labeling instrument. These items operationalize our hate speech construct and are aggregated into a continuous score via Rasch measurement theory. The tenth item, a binary label for hate speech, was included for benchmarking purposes; however, because it was part of the annotation process, we also incorporated it into our scaling procedure.
Item: Does this comment contain hate speech, defined as "bias-motivated, hostile and malicious language targeted at a person/group because of their actual or perceived innate characteristics, especially when the group is unnecessarily labeled (e.g., "jew bankers", "n– hooligans")." The comment contains hate speech if slurs or derogatory language are used toward any person or group, regardless of whether or not the target is a member of the slur-target group (e.g. a comment calling Trump a faggot or a n***** still contains hate speech).
a This item is converted to a binary value of 0 if no identities are present and 1 if any identities are present. It is not directly used in the scale, but if a comment is not flagged as having at least one identity group target then the remaining scale items are skipped in our labeling instrument.
b The question wording and response options for this item are sourced from Siegel et al. (2019).
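The aggregation of these item ratings into a continuous score relies on a many-facet partial credit model. The sketch below computes category probabilities under that general model class, with comment ability theta, rater severity rho, and item step difficulties delta; the numeric values are invented for illustration and are not estimates from our data.

```python
import numpy as np

def facet_pcm_probs(theta, rho, deltas):
    """Category probabilities for one comment x item x rater combination.

    theta: comment ability (hate speech score)
    rho: rater severity (survey interpretation bias)
    deltas: step difficulties for the item (length = n_categories - 1)
    """
    # Cumulative sums of (theta - rho - delta_j); category 0 contributes a zero term.
    steps = np.concatenate(([0.0], np.cumsum(theta - rho - np.asarray(deltas))))
    exp_steps = np.exp(steps - steps.max())  # subtract max for numerical stability
    return exp_steps / exp_steps.sum()

# Illustrative values: a fairly hateful comment, a slightly severe rater,
# and a 4-category item with increasing step difficulties.
print(facet_pcm_probs(theta=1.5, rho=0.3, deltas=[-1.0, 0.0, 1.0]))
```

Holding the item and rater parameters fixed at their estimated values is what allows the same transformation to be applied to the multitask network's predicted item ratings at prediction time.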
A.2 Identity group targets from labeling instrument

Disability: people with physical disabilities (e.g., use of a wheelchair), people with cognitive disorders (e.g., autism) or learning disabilities (e.g., Down syndrome), people with mental health problems (e.g., depression, addiction), visually impaired people, hearing impaired people, no specific disability.

Our data structure for the supervised deep learning treated each review of a comment as an observational unit. For a typical comment that was reviewed four times, there would be four observations in our training data. Each observation has the same raw comment text but a unique rater severity for its reviewer, plus the corresponding item ratings that the reviewer provided for that comment. This differs from prior studies, which have generally aggregated the data to one observation per comment, taking the mean or mode of the binary outcome label(s) from the reviewers and usually discarding comments where inter-rater agreement did not exceed a certain threshold. See Table A3 for an example of the supervised learning data structure.

Due to this repeated measures (or hierarchical) data structure we could not use simple randomization to create training/test splits. By random chance it would be likely that some of the reviews of an individual comment would land in the training set and some in the test set, making the test set performance no longer an unbiased estimator of performance on future unseen comments (i.e. "data leakage" across the training/test partition). It was therefore important to conduct clustered randomization at the comment level when creating training/test splits, taking the hierarchical data structure into account so that all reviews of a given comment are assigned either to training or to test, and not both.
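Clustered randomization of this kind is available off the shelf. The sketch below uses scikit-learn's GroupShuffleSplit with a hypothetical data frame standing in for our review-level data structure; the column names and values are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical repeated-measures data: one row per review of a comment.
reviews = pd.DataFrame({
    "comment_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "comment_text": ["some comment text"] * 8,
    "rater_severity": [0.1, -0.3, 0.4, 0.0, 0.2, -0.1, 0.5, -0.2],
    "hate_score": [1.2] * 4 + [-0.8] * 4,
})

# Split at the comment level so all reviews of a comment land in the same fold,
# preventing leakage across the training/test partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(reviews, groups=reviews["comment_id"]))
train, test = reviews.iloc[train_idx], reviews.iloc[test_idx]
```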
Crowdsourced labeling is generally thought to provide research-quality data with improved diversity over convenience samples such as undergraduate students (Berinsky et al. 2012). However, it remained important to evaluate the quality of the raters who created our labeled data. A subset of workers seek to maximize their effective hourly compensation by completing Amazon Mechanical Turk tasks as quickly as possible, even though their answers may be less accurate than at a slower completion speed. That is a rational pursuit of self-interest, provided that it does not lead to their work being rejected upon review. In the case of our study, and academic research in general, ethical considerations precluded us from rejecting the compensation of any study participants, even if their answers were not usable. Experienced crowdsourced workers might be aware of that protocol for academic studies, which could incentivize response satisficing. The main effects of poor-quality labeler data are increased variability in the item fit statistics and ability scores, and reduced reliability estimates for the scores, raters, and items.

In order to evaluate the quality of raters' responses, we first examined the percentage of comments for which the rater had flagged one or more identity groups as being the target of the message. Our experimental design ensured that all comment batches contained six comments from our reference set, excluding the neutral level. That meant that at least 6 of the 26 comments in each batch were known to include an identity group target (23%). The remaining 20 "original" comments were sourced from a pool of comments that were stratified on our identity model: 90% of those comments exceeded a threshold of 82% probability of containing an identity group, with the remaining 10% downsampled from comments scoring below 82%. The threshold of 82% was chosen based on manual inspection. The average predicted identity probability among comments exceeding the threshold was 90.4%, whereas for comments below the threshold the average was 27.1%. Therefore the expected identity rate in the 20 original comments was 60%, or 12 comments, and the overall 26-comment batch was expected to contain roughly 18 comments with identity group targets (71%).

Figure A1a displays a histogram of the identity rate across all 11,143 raters. It is a bimodal distribution, with one peak at 73%, corresponding to 19 of 26 comments being flagged as having an identity, and another at 4%, corresponding to 1 of 26 comments being flagged as having an identity. As noted previously, each batch consisted of at least 23% comments containing an identity group target.

We layered on a second dimension of rater quality: the infit mean-squared statistic, a rater fit diagnostic that is calculated during the Rasch scaling (Linacre 2002). Infit mean-squared has a minimum value of 0 and a maximum of infinity, with an expected value of 1 (or 0 on the log scale). Raters with an infit mean-squared greater than 1 had more randomness or noise in their responses than expected by the Rasch model. Those with a statistic less than 1 had less randomness than expected, suggesting that they may have favored certain response options. Values greater than 2 have been interpreted as degrading the measurement system (ibid.).

We reviewed potential exclusions on both the infit mean-squared statistic and the identity percentage statistic. We chose to exclude raters with an infit mean-squared statistic exceeding 1.9 or less than 0.37, or with an identity rate less than 20%. Those thresholds led to 24% of raters being removed, leaving 8,472 as providing acceptable data quality. The post-exclusion scatter plot showed a reasonable bivariate normal distribution, and the smoothed identity rate became nearly flat across the infit mean-squared statistic.

Figure A1. Analysis of rater quality and exclusion of low-quality raters. (a) Rate of flagging one or more identities in labeled comments. (b) Comparison of identity rate versus infit mean-square statistic, with lowess smooth in blue. (c) Potential cutpoints for excluding low-quality raters. (d) Updated rater distribution after applying exclusions.

Raters were also excluded based on the number of demographic questions that were skipped, the number of submissions from each unique IP address, reverse geocoding of the response IP to a location outside the United States, matching of the IP to known proxies, the duration taken to complete the survey, or an extreme value for estimated survey interpretation bias. After all exclusions we included the labels from 7,619 raters out of 11,143 total (31.6% excluded). More details on these filters will be included in future versions of this document. Following all rater exclusions we re-fit the item response model and used those estimates as our primary scaling result.
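Applying the exclusion rules above is a simple filtering step. The sketch below assumes a hypothetical per-rater summary table with the infit mean-squared statistic and identity flag rate already computed; the column names and values are illustrative, not our actual data.

```python
import pandas as pd

# Hypothetical per-rater diagnostics from the Rasch scaling and labeling data.
raters = pd.DataFrame({
    "rater_id": ["a", "b", "c", "d"],
    "infit_meansq": [0.95, 2.40, 0.30, 1.10],
    "identity_rate": [0.71, 0.65, 0.73, 0.04],
})

# Exclusion thresholds reported above: infit mean-squared outside (0.37, 1.9)
# or an identity flag rate below 20%.
keep = (
    raters["infit_meansq"].between(0.37, 1.9)
    & (raters["identity_rate"] >= 0.20)
)
acceptable_raters = raters[keep]
```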
Our inclusion of a binary hate speech item in our labeling instrument allows the comparison of our interval measure to what would be possible with only a binary item response. An initial question might be: can rater agreement on the hate speech item approximate the magnitude provided by the continuous scale? Figure A2 shows that rater agreement on a single hate speech item is unfortunately a poor approximation of an interval measure. The statistical association is moderate (correlation = 60%), though highly significant (p < 0.00001), with rater agreement explaining 36% of the variance of the measure. In addition to its other benefits (invariance, debiasing, explainability, interval measurement), the continuous measure contains approximately three times the information of the single-item rater agreement approximation.

Figure A2. Insufficiency of a binary hate speech item. Rater agreement on a binary hate speech item fails to capture the magnitude or extremity of speech on a continuous hate speech spectrum.

References

Rating scales and Rasch measurement
Empowering young people in the aftermath of hate
Crowdsourcing Subjective Tasks: The Case Study of Understanding Toxicity in Online Discussions
Deep learning for hate speech detection in tweets
Unimodal probability distributions for deep ordinal classification
Dangerous Speech: A Practical Guide
Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk
A qualitative study to identify indicators of the quality of wound closure
Rank-consistent ordinal regression for neural networks
Multitask learning
Inventing temperature: Measurement and scientific progress
XGBoost: A scalable tree boosting system
BART: Bayesian additive regression trees
The cost of dichotomization
The unimodal model for the classification of ordinal data
Automated hate speech detection and the problem of offensive language
Dichotomizing continuous variables in statistical analysis: a practice to avoid
Hate me, hate me not: Hate speech detection on Facebook
BERT: Pre-training of deep bidirectional transformers for language understanding
Measuring and mitigating unintended bias in text classification
Hate speech detection with comment embeddings
Introduction to many-facet Rasch measurement
Invariant measurement: Using Rasch models in the social, behavioral, and health sciences
Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments
Containing Papers of a Mathematical and Physical Character 144
A survey on automatic detection of hate speech in text
A unified deep learning architecture for abuse detection
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
Deep learning
NLP's Clever Hans Moment has Arrived. The Gradient
Accounting for nonsystematic error in performance ratings
A generalization of sampling without replacement from a finite universe
Squared earth mover's distance-based loss for training deep neural networks
Universal language model fine-tuning for text classification
Multi-Task Learning in the Wilderness
CORAL: Ordinal Regression in Keras
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
NAS-Bench-NLP: Neural architecture search benchmark for natural language processing
An extension of the Rasch model to multi-facet situations
Investigating rating scale category utility
What do infit and outfit, mean-square and standardized mean?
Construction of measures from many-facet data
Sentiment analysis: Mining opinions, sentiments, and emotions
RoBERTa: A robustly optimized BERT pretraining approach
Detecting the hate code on social media
Objective structured assessment of technical skill (OSATS) for surgical residents
A Rasch model for partial credit scoring
A generalized partial credit model: Application of an EM algorithm
Detecting and measuring rater effects using many-facet Rasch measurement: Part I
Probing Neural Network Comprehension of Natural Language Arguments
Abusive language detection in online user content
TPOT: A tree-based pipeline optimization tool for automating machine learning
Automating Biomedical Data Science Through Tree-Based Pipeline Optimization
A Case for a Range of Acceptable Annotations
A survey on transfer learning
Share of U.S. adults using social media, including Facebook, is mostly unchanged since 2018
The Turner Diaries
Language models are unsupervised multitask learners
Exploring the limits of transfer learning with a unified text-to-text transformer
An all-in-one convolutional neural network for face analysis
On simultaneous factor analysis in several populations
Probabilistic models for some intelligence and attainment tests
Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis
Dichotomizing continuous predictors in multiple regression: a bad idea
An overview of multi-task learning in deep neural networks
Graded response models
The risk of racial bias in hate speech detection. Annual Meeting of the Association for Computational Linguistics
Defining hate speech
Disappointing dichotomies
Trumping Hate on Twitter? Online Hate Speech in the 2016 US Election Campaign and its Aftermath
The Ten Stages of Genocide. Genocide Watch
Breaking up is hard to do: the heartbreak of dichotomizing continuous data
Hate: Why we should resist it with free speech, not censorship
Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT
Destructive messages: How hate speech paves the way for harmful social movements
Attention is all you need
Detecting hate speech on the world wide web
Constructing measures: An item response modeling approach
The rater bundle model
Incitement on trial: prosecuting international speech crimes
HuggingFace's Transformers: State-of-the-art Natural Language Processing
The International Objective Measurement Workshops: Past and future
Rating scale analysis
Adaptive methods for nonconvex optimization
Detecting hate speech on Twitter using a convolution-GRU based deep neural network

Acknowledgments

We thank Nora Broege, Laura Jakli, Ben Gebre-Medhin, Christopher Hench, Simal Ozen Irmak, Aniket Kesari, Renata Barreto-Montenegro, Aaron Culich, and Kimberly Izar for their valuable contributions to the project. For research assistance we acknowledge the thoughtful work of Andy Chu, Chengzhi Huang, Jinlin He, Eshaan Pathak, and Violet Yao. For helpful comments and collaboration we are grateful to Rachel Rosen, Sebastian Raschka, Victor Vargas, Brittan Heller, D. Alex Hughes, Ahmad Sultan, Adam Anderson, David Mongeau, Karen Draney, Daniel Kelley, Perman Gochyev, Alan Hubbard, Mark van der Laan, and especially Mark Wilson. We thank our item panel pilot testers and study participants for their comment reviews, particularly given the emotionally taxing nature of the material.