authors: Nghiem, Huy; Morstatter, Fred title: "Stop Asian Hate!": Refining Detection of Anti-Asian Hate Speech During the COVID-19 Pandemic date: 2021-12-04 *Content warning: This work displays examples of explicit and strongly offensive language. The COVID-19 pandemic has fueled a surge in anti-Asian xenophobia and prejudice. Many have taken to social media to express these negative sentiments, necessitating the development of reliable systems to detect hate speech against this often under-represented demographic. In this paper, we create and annotate a corpus of Twitter tweets using 2 experimental approaches to explore anti-Asian abusive and hate speech at finer granularity. Using the dataset with less biased annotation, we deploy multiple models and also examine the applicability of other relevant corpora to accomplish these multi-task classifications. In addition to demonstrating promising results, our experiments offer insights into the nuances of cultural and logistical factors in annotating hate speech for different demographics. Our analyses together aim to contribute to the understanding of the area of hate speech detection, particularly towards low-resource groups. The coronavirus pandemic has led to an unprecedented disruption in the daily lives of millions globally since its rise at the end of 2019 (Wu, Chen, and Chan 2020). Nations have carried out extensive strategies to slow down the spread of the largest outbreak in recent history. Countries such as the United States, England, and India have implemented multiple waves of lockdowns and mask mandates to keep the virus in check. Nevertheless, the negative impacts of the disease extend beyond medical casualties, giving rise to a multitude of social, economic and psychological challenges (Chu et al. 2020). Historically, outbreaks have been associated with xenophobia and othering of certain groups. That the virus first emerged from Wuhan, China, along with allegations of the virus's chiropteran origins, has only served to fuel xenophobic attitudes against Asians in Western countries (Joubin 2020). In the United States (U.S.), Asian Americans not only have to contend with the risk of infection, but have also become targets of prejudice and discrimination because of their ethnicity (Tessler, Choi, and Kao 2020). In March 2020, the FBI issued warnings that "hate crimes against Asian Americans likely will surge across the United States," since "a portion of the US public will associate COVID-19 with China and Asian American populations" (Margolin 2020). In June 2020, reports showed that over 80% of self-reported anti-Asian hate crimes occurred outside of the victims' private residences (Tessler, Choi, and Kao 2020). Despite these aggravating trends, there has been a lack of concerted effort at the federal level to address anti-Asian sentiments in the U.S. (Campbell and Ellerbeck 2020). Anti-Asian xenophobia not only manifests in physical settings, but also on social media. Since the virus first emerged, offensive terms, such as "China virus" or "Kung flu", have been used in lieu of its scientific name. A study on Twitter by (Hswen et al. 2021) finds that tweets containing the hashtag "#chinesevirus" are more than twice as likely as those with "#covid19" to contain anti-Asian sentiments.
That even prominent political figures adopted these terms further encouraged their usage and incited discriminatory associations (Budhwani, Sun et al. 2020). Since social media has become a vital channel of communication in the last decade, the dissemination of negative information through this channel poses psychological and mental health risks for its users. (Zhong, Huang, and Liu 2021) found an association between increased mental health toll and social media usage among residents of Wuhan, China, the first epicenter of the outbreak. (Tahmasbi et al. 2021) also discovered a significant rise in old and new Sinophobic slurs on Twitter induced by the coronavirus pandemic. Major social media platforms have implemented various strategies to combat the negative consequences of abusive and hate speech. Facebook primarily relies on human moderators to review offensive content, a reactive approach that poses psychological health risks for its workers, while others (Twitter, Instagram) deploy automatic filters to warn users of explicit content (Ullmann and Tomalin 2020). More passive alternative approaches include reliance on users to report offensive content. Hate speech detection has thus received increased attention due to its relevance to social media applications. Mirroring the relative lack of resources to address anti-Asian hate crimes, there is comparatively little Asian-focused literature in this area (Ayo et al. 2020; Misra et al. 2020). In this paper, we attempt to address this gap in research. Our primary contributions are as follows: • Creation of an annotated COVID-relevant Twitter corpus on abusive and hate speech on multiple dimensions • Performing a comparative exercise on 2 schemes to reduce bias in data annotation • Development and analyses of multiple models to automatically detect anti-Asian abusive and hate speech. Our work contributes to the field of automatic hate speech detection, an arguably less traumatic approach for human users to preemptively flag sensitive content. Furthermore, analyses of our models' performance reveal insights into the unique challenges in detecting abuse and hate speech towards low-resource demographics. Hate Speech Detection Hate speech detection is a complex and challenging task. A set of unifying standard definitions is yet to exist, leaving much room for subjectivity in discerning what constitutes hate and abusive speech. There are also overlapping - and at times competing - definitions from related works on offensive, toxic, hostile or prejudiced speech that present added difficulty for generalizability (MacAvaney et al. 2019). In this work, we use the following definitions put forth by (Founta et al. 2018) based on extensive literature review: • Abusive Speech: "Any strongly impolite, rude or hurtful language using profanity, that can show a debasement of someone or something, or show intense emotion." • Hate Speech: "Language used to express hatred towards a targeted individual or group, or is intended to be derogatory, to humiliate, or to insult the members of the group, on the basis of attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender." Literature on hate speech often focuses on manifestations of racism, sexism and discrimination against minority groups (Fortuna and Nunes 2018).
Models for hate speech detection have also grown in diversity and complexity, ranging from Logistic Regression and Bayesian Networks to Genetic Algorithms and Deep Neural Networks, and from singular to ensemble classifiers (Ayo et al. 2020). Researchers have also investigated adjacent topics, such as cyberbullying and toxicity classification (Kim et al. 2021; Obadimu et al. 2019). Twitter has been a popular source of data for these works, followed by Reddit, Facebook, and Gab (Madukwe, Gao, and Xue 2020). More recent works have even explored non-textual data, such as pictures and videos (Das, Wahi, and Li 2020; Wu and Bhandary 2020). Bias in Data Construction of hate speech datasets typically requires human annotation. Clarity of instructions and annotators' backgrounds can both influence the labelling process. Biases in annotation may be amplified by algorithms and models, resulting in unfair or discriminatory targeting of certain groups if deployed in the real world (Mehrabi et al. 2021). (Davidson, Bhattacharya, and Weber 2019)'s study of 5 common hate speech datasets revealed inherent racial bias in annotation against tweets written in African American Vernacular English (AAVE), which was further magnified in downstream classifiers' predictions. Various studies have explored ways to reduce annotators' bias. (Vidgen and Derczynski 2020) stressed the importance of presenting clear guidelines for annotators to reduce subjectivity and biases in developing data for hate speech classifiers. (Zampieri et al. 2019) presented a hierarchical approach that first asked annotators to determine whether a post is offensive, then subsequently identify the type (group vs. individual) and identity of the targets. Using a corpus of posts taken from the newspaper Times of Malta, (Assimakopoulos et al. 2020) performed a comparative experiment on annotating the type and target of hate speech between simple binary and hierarchical annotation schemes inspired by Critical Discourse Analysis. In both of these studies, the hierarchical scheme tends to result in higher rater agreement by directing annotators to recognize the implicit linguistic means of hate speech. Lack of Asian-oriented Literature Researchers have progressively contributed more datasets in recent years. (Vidgen and Derczynski 2020) surveyed 63 datasets on hate speech and related topics in multiple languages, with English being the most numerous. Among them, earlier works typically offer only a binary label on the speech type, whereas later works tend to explore multi-label schemes. In addition, there are comparatively fewer resources on data with annotations on the specific target of hate speech. (Madukwe, Gao, and Xue 2020) analyzed 17 well-known datasets, and only 1 of them identified the targeted demographics of abuse, including Asians (Warner and Hirschberg 2012). Generally, hate speech datasets do not often explore granularity at the target level. The lack of available annotated hate detection data extends especially to Asians and related subgroups. Though Asians are often victims of racism in Western countries, training data on hate speech against this group is in scarce supply (Ayo et al. 2020; Fortuna and Nunes 2018). As racism-fuelled attacks on Asians grow increasingly common in real life and on social media as the COVID-19 pandemic progresses, there emerges a need for scientific literature to address these issues. (Ziems et al. 2020) contributed a hand-labelled dataset of 2,400 tweets to identify hate and counterhate speech.
(Vidgen et al. 2020) created a dataset of 20,000 tweets relevant to COVID-19 with labels that categorize them as hostility, criticism, meta-discussion, or non-related to East Asia/Asians. Our work also contributes a labelled Twitter dataset, albeit with different methodologies to be discussed below. We select our corpus from the Public Coronavirus Twitter Dataset. Starting on January 28, 2020, the collectors leveraged Twitter's API to collect tweets that contain relevant keywords. This repository is updated on a weekly basis, and contains over 150 million Tweet IDs at this article's time of writing. As hate speech occurrences are proportionately very rare on Twitter, we utilize a boosting approach proposed by (Founta et al. 2018) that involves sampling using targeted keywords. First, we collect a set of anti-Asian phrases that includes both existing (e.g., ch*nk, g*ok) and emergent, COVID-related (e.g., china lied people die, chinavirus) phrases from Hatebase.org and relevant literature (Tahmasbi et al. 2021; Ziems et al. 2020; Vidgen et al. 2020; Keum and Miller 2018). Second, we also collect a similar set of anti-Black phrases. Since AAVE elements are fairly common in tweets and have been shown to correlate with labelling bias, we decide to incorporate anti-Black phrases to explore their potential influence on annotation (Davidson, Bhattacharya, and Weber 2019; Blodgett, Green, and O'Connor 2016). After inspection, the final anti-Asian collection contains 49 phrases and the anti-Black collection 56 phrases (detailed in Appendix 6). To the best of our knowledge, this list - though non-exhaustive - consists of the most representative phrases targeted towards Asian and Black demographics in hate speech literature. Each candidate tweet is tentatively categorized as either anti-Asian or anti-Black if it contains a phrase from the corresponding set, Interracial if phrases from both sets are present, and Normal if none applies. We restrict our sampling pool to only original tweets written in English from the general corpus. To compensate for the natural imbalance of hateful tweets (the positive classes), we sequentially select a tweet at random from each of the 4 classes until we reach 500 tweets for each month from July to December 2020. The final corpus has 3,000 tweets.
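To make the sampling procedure above concrete, the following minimal Python sketch reproduces the tentative keyword-based categorization and a class-balanced monthly draw. The phrase sets shown are small illustrative subsets of the 49 anti-Asian and 56 anti-Black phrases in the appendix, and the function names and per-class quota are our own illustrative choices, not the authors' released code.

```python
import random

# Illustrative subsets of the phrase lists in the appendix
# (49 anti-Asian and 56 anti-Black phrases in total).
ANTI_ASIAN = {"chinavirus", "china lied people die", "kung flu"}
ANTI_BLACK = {"sheboon", "dindu nuffin"}

def tentative_label(text: str) -> str:
    """Tentatively bucket a tweet by which phrase set(s) it matches."""
    t = text.lower()
    has_asian = any(p in t for p in ANTI_ASIAN)
    has_black = any(p in t for p in ANTI_BLACK)
    if has_asian and has_black:
        return "Interracial"
    if has_asian:
        return "anti-Asian"
    if has_black:
        return "anti-Black"
    return "Normal"

def sample_month(tweets, per_class=125, seed=0):
    """Draw a class-balanced sample for one month (e.g., 4 x 125 = 500 tweets)."""
    rng = random.Random(seed)
    buckets = {"anti-Asian": [], "anti-Black": [], "Interracial": [], "Normal": []}
    for tweet in tweets:
        buckets[tentative_label(tweet)].append(tweet)
    sample = []
    for pool in buckets.values():
        sample.extend(rng.sample(pool, min(per_class, len(pool))))
    return sample
```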
Template Design We design separate templates for 2 annotation schemes, standard (STD) and hierarchical (HER), to determine which method better mitigates potential bias in labelling. Both templates contain a collapsible instructional pane at the top, followed by the tweet's content and prompts to identify the following attributes: • (Speech) Type: Abusive, Hate, Normal • Level of Aggression: Not Aggressive, Somewhat Aggressive, Very Aggressive • Target: Neither, Only anti-Asian, Only anti-Black, Both Anti-Asian and Anti-Black. Figure 1 provides corresponding illustrations. While much work has been done on the categorization of abusive and hate speech, we include the 2 other attributes to further enhance granularity. Since hate speech is an umbrella category, it is important to also identify at whom and how strongly the negativity is directed. Though Asians and Blacks consist of diverse sub-groups, for the scope of this work, we do not focus on this level of distinction. Inspired by (Sap et al. 2019), we prime annotators at the beginning of the instructions: "Note on Dialect and Context: Some words may be considered offensive generally, but not in specific contexts, especially when used in certain dialects or by minority groups. Please consider the context of the tweet while selecting the options below." We then provide definitions and an example for each option. In an effort to enhance consistency in the field, we use verbatim the definitions provided by (Founta et al. 2018) and (Chatzakou et al. 2017) for the questions on Type and Level of Aggression, respectively. In the STD template, the items are arranged in the aforementioned order. We phrase the prompts instructively (refer to Figure 1). Annotators are allowed to select multiple options for Type and a single option for the other questions. However, if Normal is selected for Type, then Not Aggressive and Neither must be selected for the other 2 questions. In contrast, the HER template has a few modifications. Based on the hierarchical schemes of (Assimakopoulos et al. 2020) and (Zampieri et al. 2019), we rephrase the prompts as questions and present them in a different order to direct annotators' consideration to the different aspects laid out in the preceding definitions. We first inquire about the Level of Aggression communicated by the tweet. Subsequent questions are displayed only if a choice other than Not Aggressive is selected; otherwise, they return default options upon submission. The task of identifying the target of aggression is split into 2 parts: first, annotators identify whether the target is an Individual or a Group. If the former, we present the follow-up question: "Is the individual targeted due to their affiliation to a racial group, and if so, which one?". If the latter, we urge the annotators to again consider the context and dialect of the tweet as they identify the target's race (Figure 1). These wordings are meant to help annotators deliberate more holistically. Note that the options are rephrased to be consistent with the questions, but are still equivalent to their STD counterparts. Finally, we ask annotators to identify the tweet's Type. Similar to its counterpart in the STD template, this question allows selection of both the Abusive and Hate options, and only Normal otherwise. We present this question last to give annotators the opportunity to deliberate based on their preceding options. Crowdsourcing We solicited the templates as separate jobs to workers on Amazon Mechanical Turk (AMT), a popular crowdsourcing platform for this type of task. A Human Intelligence Task (HIT) consists of a single set of responses on a template attached to a tweet drawn from our corpus. Each HIT requires 3 sets of annotations from workers to be considered complete. We required workers to possess certain qualifications: a HIT approval rate of at least 95%, location in either Canada or the U.S., and consent to viewing adult themes. In addition, we asked annotators to complete a pre-screen survey on their demographic background. Workers received $0.06 for each successfully submitted HIT and were capped at approximately 300 HITs to ensure diversity of opinions. The corpus was annotated exhaustively using the STD template first, and then the HER version. Since the AMT template only supported UTF-8 encoding, we stripped the tweets of incompatible characters before uploading. We manually review and approve all HITs after completion. Figure 2 shows the demographic breakdown between the cohorts. The STD cohort consists of 86 workers who annotated the tweets on the standard template, and the HER cohort of 52 who worked on the hierarchical template.
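As a reading aid, the sketch below shows one way the HER template's branching could be resolved into the three attributes: the gating on Not Aggressive and the two-part Target question collapsing to a single label. The dictionary keys and defaults are hypothetical and only mirror the behaviour described above; this is not the actual AMT template logic.

```python
def resolve_her_response(resp: dict) -> dict:
    """Collapse one HER-template submission into (Type, Aggression, Target).

    `resp` is a hypothetical response dict, e.g.:
    {"aggression": "Somewhat Aggressive", "target_kind": "Individual",
     "target_race": "Only anti-Asian", "types": ["Abusive"]}
    """
    if resp["aggression"] == "Not Aggressive":
        # Follow-up questions stay hidden and return defaults on submission.
        return {"types": ["Normal"], "aggression": "Not Aggressive",
                "target": "Neither"}

    # Only the target question matching the Individual/Group choice counts;
    # both parts offer equivalent race options.
    return {"types": resp.get("types") or ["Normal"],
            "aggression": resp["aggression"],
            "target": resp.get("target_race", "Neither")}
```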
There are 11 workers who are part of both cohorts; we did not reject them from participating in the HER tasks since they only contributed 284 (3.2%) assignments on the preceding STD tasks. We observe apparent gender parity, a generally high level of education, and a majority of individuals in the 25-49 age range in both cohorts. Most notable is the dominant representation of workers who identify as White in either cohort. Chi-square tests of independence reveal a statistically significant association for every pair of workers' demographic variables and annotation items (α=0.05), with the exception of age vs. race of target group and gender vs. race of target individual for the HER template. However, we note that these 2 questions on race are heavily conditioned on the preceding questions, where the defaults are usually observed. Our annotation approach differs markedly from those of the 2 aforementioned Asian-oriented datasets. (Ziems et al. 2020)'s dataset is annotated entirely by the authors. This dataset categorizes tweets into Hate, Counterhate, or Neutral. On the other hand, (Vidgen et al. 2020) collected tweets from 11 to 17 March 2020 and enlisted 26 annotators with at least undergraduate education, aged between 18 and 25, mostly female (75%), all of whom had extensive prior training on hate speech. These annotators also came from Europe and South America. Both of these works employed annotators with close to expert-level backgrounds. In contrast, we implement measures to capture a diverse pool of opinions to be consistent with a real-world setting. Furthermore, while these works investigate related, but ultimately different, aspects of anti-Asian prejudice during COVID-19, we deliberately use annotation labels consistent with recent cornerstone works on hate speech to contribute to the area's convergence of literature (Founta et al. 2018; MacAvaney et al. 2019). Since the HER approach has a 2-part format for the Target question, we only consider the answer to the part corresponding to whether Individual or Group was chosen in the preceding question. We observe that DH (the annotation set produced with the HER template) yields higher inter-rater agreement for each item than its DS (STD template) counterpart. We derive the final set of annotations for each tweet using majority voting. For tweet Type, 'Abusive' is the default when ties occur with 'Hate'. We explore potential annotator bias between the 2 approaches. More specifically, we focus on annotator bias with respect to AAVE usage, as has been explored by (Davidson, Bhattacharya, and Weber 2019; Sap et al. 2019). We extract pAAVE, the probability of the tweet's text coming from this dialect, and calculate the point-biserial correlation between each category's one-hot encoded variable and pAAVE for each set (included in Table 2). All correlation coefficients are statistically significant at the α = 0.05 level, with the exception of the Target=Both category, which has negligible prevalence in either set. We observe a general decrease in the magnitude of the inverse correlation between pAAVE and the categories Normal, Neither/NA and Not Aggressive from DS to DH. On the other hand, the correlation with Hate decreases by 20%, and that with anti-Black particularly significantly, by 54%. The correlation with Very Aggressive does increase; however, we observe that this is likely due to a redistribution of some Not Aggressive labels to the other 2 categories within the Level of Aggression attribute. Furthermore, we observe that a large number of tweets categorized as Hate in the DS set are classified as Normal or Abusive in DH. Similarly, tweets considered anti-Black in DS are mostly re-labelled as Neither/NA in DH, while the number of anti-Asian tweets remains largely stable. These statistics corroborate the notion that the HER approach is significantly less likely to result in tweets with AAVE elements being labelled negatively. We therefore consider DH superior due to its reduction of this dimension of bias and use it as the final dataset for subsequent tasks.
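The bias check above can be reproduced with a few lines of Python. The sketch below assumes a per-tweet pAAVE probability is already available (e.g., from a dialect inference model such as the one accompanying Blodgett, Green, and O'Connor 2016) and that the final labels sit in a pandas DataFrame; the column names are illustrative.

```python
import pandas as pd
from scipy.stats import pointbiserialr

def aave_correlations(df: pd.DataFrame, label_col: str, p_col: str = "pAAVE"):
    """Point-biserial correlation between each one-hot label and pAAVE.

    df has one row per tweet, a categorical label column (e.g., the final
    majority-vote Type, Aggression or Target label) and a pAAVE column.
    """
    results = {}
    for category in df[label_col].unique():
        one_hot = (df[label_col] == category).astype(int)
        r, p_value = pointbiserialr(one_hot, df[p_col])
        results[category] = (r, p_value)
    return results

# Hypothetical usage: compare the two annotation sets on the Type attribute.
# corr_ds = aave_correlations(ds_labels, label_col="type")
# corr_dh = aave_correlations(dh_labels, label_col="type")
```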
To be consistent with the processed content seen by AMT workers, we remove all non-ASCII characters from the tweets and convert them to lower case. We further remove all retweet identifiers ('rt'), and substitute hyperlinks with the url token. All patterns of emoticons and emojis are also removed. Mentions of other usernames are replaced with the user token. Groups of repetitive patterns are replaced with a single representative. Since hashtags may express sentiments relevant to our tasks, we process them with a specialized approach. We first remove the '#' sign. If a hashtag is in our list of hate phrases or among the corpus collectors' specified keywords, we keep it as is. For other hashtags, we segment them into separate tokens using the Ekphrasis Python library.
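A minimal sketch of this preprocessing pipeline is given below. The Ekphrasis call follows that library's documented Segmenter interface, while the regular expressions, the handling of repeated characters, and the KEEP_HASHTAGS set are simplified stand-ins for the full hate-phrase and keyword lists; treat it as an approximation of the steps described above rather than the authors' exact code.

```python
import re
from ekphrasis.classes.segmenter import Segmenter  # hashtag word segmentation

seg = Segmenter(corpus="twitter")
# Stand-in for the hate phrases and corpus keywords that are kept intact.
KEEP_HASHTAGS = {"chinavirus", "covid19"}

def preprocess(tweet: str) -> str:
    text = tweet.encode("ascii", "ignore").decode()  # drop non-ASCII (incl. most emojis)
    text = text.lower()
    text = re.sub(r"\brt\b", " ", text)              # retweet identifiers
    text = re.sub(r"https?://\S+", " url ", text)    # hyperlinks -> url token
    text = re.sub(r"@\w+", " user ", text)           # mentions -> user token
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # one simple way to collapse repeats

    def expand_hashtag(match):
        tag = match.group(1)
        return tag if tag in KEEP_HASHTAGS else seg.segment(tag)

    text = re.sub(r"#(\w+)", expand_hashtag, text)   # '#' sign removed either way
    return re.sub(r"\s+", " ", text).strip()

# print(preprocess("RT @someone Check https://t.co/x #ChinaVirus #StayHomeStaySafe!!!"))
```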
Since its inception, BERT (Bidirectional Encoder Representations from Transformers) has gained widespread recognition due to its competitive performance and versatility in various Natural Language Processing (NLP) tasks (Devlin et al. 2018). (Mozafari, Farahbakhsh, and Crespi 2019) combined pre-trained BERT language models with downstream deep learning architectures and found them capable of generalizing to new datasets in hate speech detection tasks. BERT combined with CNN and LSTM architectures in particular considerably augmented their models' ability to pick up hate-related signals (Mozafari, Farahbakhsh, and Crespi 2019). RoBERTa, a model that optimizes the training of BERT, has emerged and outperformed its predecessor on several benchmarks (Liu et al. 2019). In our experiments, we combine RoBERTa with these deep learning architectures to fine-tune on downstream classification tasks. Baseline This model uses a Support Vector Machine (SVM) classifier trained on input created by combining unigram and bigram TF-IDF vectors of 2000 and 1000 features, respectively. SVM is a well-studied Machine Learning model that has been shown to be capable in hate speech detection tasks (Fortuna and Nunes 2018). The following architectures all leverage the uncased, base pre-trained RoBERTa model as the initial component, further fine-tuned for our classification tasks. RoBERTa NN The pooled output of the classification token ([CLS]) is fed to an intermediate fully-connected layer of size 384 with a Dropout layer and a LeakyReLU activation function, which is then connected to 3 final linear classification layers whose sizes correspond to the number of labels for each attribute. RoBERTa LSTM In contrast to the preceding model, we use RoBERTa's last hidden state (the sequence of outputs from the last encoder for each input token) as input to a subsequent Bidirectional LSTM layer with a hidden size of 768. This recurrent neural network's outputs from each direction are then concatenated and fed to the 3 linear classification layers, similarly to RoBERTa NN. RoBERTa CNN After experimenting, we use the outputs of the last 6 encoders of RoBERTa (instead of all 12) as input to the convolutional layer to circumvent computational resource constraints while still preserving comparable performance. A convolutional operation (kernel size 3, stride 1, padding 1) is applied to the matrix constructed by stacking and reshaping these 6 outputs to produce 3 channels. We then apply a Max Pool operation (kernel size 3, stride 1) followed by a LeakyReLU activation function, whose outputs are connected to the intermediate and final classification layers. RoBERTa Founta and RoBERTa Vidgen Since Asians are a low-resource demographic in the hate speech detection literature, we explore relevant, available datasets for the potential of transfer learning on our tasks (Vidgen and Derczynski 2020). We create a dataset of 6000 tweets from the original (Founta et al. 2018) set with an equal distribution of the categories Abusive, Hate (Hateful in the original work) and Normal. For (Vidgen et al. 2020)'s dataset, we examine the authors' original definitions, draw a balanced sample of 3000 tweets from the following categories, and assign them equivalent labels for each of our native attributes as described below: • Criticism of an East Asian entity: (Abusive, Somewhat Aggressive, anti-Asian) • Hostility against an East Asian entity: (Hate, Very Aggressive, anti-Asian) • Non-Related: (Normal, Not Aggressive, Neither/NA). After training for 5 epochs using each of the preceding architectures, RoBERTa LSTM outperforms RoBERTa NN and RoBERTa CNN, yielding the best macro-F1 at 0.80 on the Founta dataset and 0.75 on the Vidgen dataset on the task of classifying speech type. We henceforth refer to these models by their respective dataset for nomenclatural brevity. More importantly, these results are comparable to those reported in works using the full datasets, offering evidence of the capability of our proposed architectures (Swamy, Jamatia, and Gambäck 2019; Vidgen et al. 2020). We use these pretrained values to initialize the parameters of common layers before training on our own tasks. We split the corpus into Train, Validation and Test sets using a 0.7:0.1:0.2 ratio on the randomly shuffled corpus. To compensate for the imbalance in classes (categories), we assign weights to each observation based on its category for each task using the formula w_c = N / (C * n_c), where N denotes the total number of observations in the set, C the number of classes, and n_c the number of points in the designated class. We use Task 1, Task 2 and Task 3 to refer to the classification of the tweet's Type, Level of Aggression, and Target, respectively. The Baseline SVM model was tuned using grid search. For the RoBERTa-based models, the 3 tasks correspond to the 3 final linear classification layers; these are trained concurrently with respect to Cross Entropy Loss functions and optimized using the AdamW algorithm. We select the best value for the following hyperparameters based on the macro-F1 score of Task 1 (final value in bold): learning rate ∈ {1e-5, 2e-5, 3e-5}, batch size ∈ {10, 20, 30}, maximum sequence length ∈ {150, 200, 225, 300}, dropout rate ∈ {0.1, 0.2, 0.3}. Note that CNN models have their batch size set to 15 due to resource constraints. The RoBERTa NN, RoBERTa LSTM and RoBERTa CNN models are then trained for 5 epochs, and RoBERTa Founta and RoBERTa Vidgen for 3 epochs. We repeat these training regimens using 5 distinct seeds for each non-Baseline model. Final predictions are set to be the category with the highest logit.
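The sketch below illustrates the general shape of such a multi-task setup: a RoBERTa encoder feeding a bidirectional LSTM, three task-specific linear heads, and per-class weights w_c = N / (C * n_c) applied through weighted cross-entropy with AdamW. It is a simplified reading of the description above (for instance, the intermediate 384-unit layer and dropout are omitted), using standard PyTorch and Hugging Face transformers APIs, and is not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class RobertaLstmMultiTask(nn.Module):
    """RoBERTa encoder + BiLSTM with 3 task-specific classification heads."""
    def __init__(self, n_type=3, n_aggr=3, n_target=4, hidden=768):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.heads = nn.ModuleDict({
            "type": nn.Linear(2 * hidden, n_type),
            "aggression": nn.Linear(2 * hidden, n_aggr),
            "target": nn.Linear(2 * hidden, n_target),
        })

    def forward(self, input_ids, attention_mask):
        seq = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(seq)
        pooled = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # forward + backward final states
        return {task: head(pooled) for task, head in self.heads.items()}

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """w_c = N / (C * n_c), as in the text; labels is a LongTensor of class ids."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    return len(labels) / (num_classes * counts)

# Joint training sketch: one weighted cross-entropy per task, summed, AdamW optimizer.
# model = RobertaLstmMultiTask()
# criteria = {t: nn.CrossEntropyLoss(weight=class_weights(train_labels[t], n))
#             for t, n in [("type", 3), ("aggression", 3), ("target", 4)]}
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# logits = model(batch_ids, batch_mask)
# loss = sum(criteria[t](logits[t], batch_labels[t]) for t in criteria)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```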
To better explore the performance of our models, we present both category-specific means and standard deviations for Precision (P), Recall (R) and F1 scores averaged over 5 seeds for each non-Baseline model on each task in Table 3. For the Baseline model, only results from the best set of hyperparameters are reported. We omit the result for the category Target=Both since all models are unable to yield a correct classification due to its extremely low prevalence (1 of 600 tweets). In contrast, Table 4 displays the corresponding macro metrics, denoted by mP, mR and mF1 respectively. From Table 4, our RoBERTa-based models consistently outperform the Baseline almost across the board. We observe that all models are highly capable of classifying the negative major classes in each task, and tend to struggle with the positive but minor classes. For Task 1, distinguishing between Abusive and Hate speech proves challenging. This notion is not surprising, as similar struggles have been noted in analogous works (Swamy, Jamatia, and Gambäck 2019; Founta et al. 2018). For Task 2, the non-Baseline models yield balanced results between Precision and Recall with respect to the Somewhat Aggressive category, with F1 scores exceeding 0.6. The rarer and semantically more extreme Very Aggressive category expectedly has worse results. With respect to identifying the target's racial group, our models perform respectably when the target is Asian. In Table 4, we exclude the categories anti-Black and Both from the calculations of macro metrics in Task 3. The non-Baseline models yield comparable results across all tasks. However, we note that RoBERTa CNN tends to yield a wider range of deviation (notably for mP Task 1, mP Task 2, mF1 Task 2) and also takes significantly longer to train (∼270s per epoch in our setting) with greater computational resources due to its higher number of parameters. RoBERTa LSTM strikes the best balance between processing time (∼170s per epoch, similar to RoBERTa NN's), parsimony of resources and performance. More interestingly, models pre-trained on external datasets demonstrate superior performance in at least 1 task. Both RoBERTa Founta and RoBERTa Vidgen produce markedly higher macro Recalls for Task 1 compared to the natively trained counterparts. We note that these 2 models' Recalls for Hate speech far outstrip (> 0.8) those of the native models at the cost of lower Precision. They also produce the highest macro F1 scores (0.64) in Task 2, with RoBERTa Founta yielding the highest Recall for the Very Aggressive class. That RoBERTa Founta displays improved results for Task 2 is noteworthy since it is only pre-trained on Task 1 using the Founta dataset. Based on macro results, RoBERTa Vidgen emerges as the best performer across all tasks. This model has the advantage of being fine-tuned on a semantically adjacent dataset in terms of domain. In fact, pre-training on less similar, but still relevant, data like Founta still yields appreciable improvements. Our experimental results are consistent with findings by (Swamy, Jamatia, and Gambäck 2019) and (Yin and Zubiaga 2021), who remarked on the importance of similarity between classes when generalizing across datasets. Our work complements theirs by examining abusive and hate speech at a more fine-grained level, particularly for a low-resource demographic.
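As a small illustration of how macro metrics can be restricted to the categories actually evaluated (e.g., dropping the rare anti-Black and Both target classes from Task 3, as done for Table 4), the sketch below uses scikit-learn's labels argument; the label-to-index mapping is hypothetical.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical Target label indices: 0=Neither/NA, 1=anti-Asian, 2=anti-Black, 3=Both.
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Macro P/R/F1 computed only over the listed label indices."""
    mp, mr, mf1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(labels), average="macro", zero_division=0
    )
    return {"mP": mp, "mR": mr, "mF1": mf1}

# macro_metrics(test_targets, predicted_targets)  # excludes anti-Black and Both
```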
We investigate our models' performance on distinguishing Abusive and Hate speech in Task 1. Table 5 is the confusion matrix for RoBERTa LSTM, the model with the most balanced performance with respect to these 2 classes. Among the 64 tweets classified as Abusive, 44% belong to the Normal class while only 8% are Hate. For tweets classified as Hate, 45% are actually Normal and 36% are Abusive. Generally, distinguishing between the minor classes Abusive and Hate is challenging, particularly with respect to the former. We train a Doc2Vec model to represent each tweet in our dataset as a vector of size 100 and use the tSNE algorithm to reduce them to 2 components, then plot a random sample of 200 tweets for each of the 3 Speech Type categories (Figure 3). Our natively trained RoBERTa models do tend to yield higher Recall scores for the Hate class compared to Abusive (Table 3). In the figure, a considerable portion of Hate tweets also lie elsewhere. Tweets of the Abusive class are even more dispersed among the 2 other classes, further illustrating the difficulty of correctly classifying them.
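The embedding visualization behind Figure 3 can be approximated as follows, using gensim's Doc2Vec and scikit-learn's t-SNE; whitespace tokenization and the training hyperparameters other than the vector size of 100 are our own assumptions.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE

def embed_and_project(tweets, vector_size=100, seed=0):
    """Doc2Vec embeddings (size 100) projected to 2 components with t-SNE."""
    docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(tweets)]
    d2v = Doc2Vec(docs, vector_size=vector_size, min_count=2, epochs=40, seed=seed)
    vectors = np.vstack([d2v.infer_vector(t.split()) for t in tweets])
    return TSNE(n_components=2, random_state=seed).fit_transform(vectors)

# coords = embed_and_project(dh_tweets)
# Plot `coords`, colouring ~200 randomly sampled tweets per Speech Type category.
```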
Distinguishing between Abusive and Hate speech is a challenging task not only for models, but also for annotators. (Founta et al. 2018) noted the non-trivial extent to which this confusion occurred in the creation of their data. We perform a cross-dataset experiment to investigate the substantial improvement in Recall for the Hate category in models pre-trained on external datasets. Due to the similarities in definitions and annotations, we use the RoBERTa LSTM model trained exclusively on our Founta dataset to classify 600 tweets sampled equally from each class in our DH set, and vice versa. As observed in Table 6, the model trained on Founta data misclassifies approximately 57% of our native Abusive tweets as Hate, and similarly misclassifies our Hate tweets as Normal. This model also clearly struggles to recognize the Abusive class. On the other hand, the model trained on our native data does not flag any tweet in the Founta set as Hate. In fact, this model designates about 50% of Founta's Abusive and Hate tweets as Normal. (Founta et al. 2018)'s Abusive and Hate tweets are collected from boosted samples based on sentiment polarity and the presence of offensive phrases. Their data contains only a single label on each tweet's speech type. Hence, Founta's Abusive and Hate tweets encompass a wide range of offensive sentiments that certainly do not involve COVID-19. In contrast, ours are collected with respect to particular demographics, and are specifically tied to the development of current events. In fact, sensitive phrases that frequently appear in both our Abusive and Hate categories, such as "chinavirus", are still considered acceptable by users of different political affiliations (Su et al. 2020). Tweets annotated as Hate by our annotators often express more pointed negative sentiments towards Asians. This notion is consistent with our model's total misclassification of Founta's Hate tweets in this experiment. These examples further illustrate that abusive and hate speech are anything but monolithic. Because they often target minority groups, there exists an important need for dedicated data resources to train reliable detection systems. The rise of anti-Asian sentiments induced by the COVID-19 pandemic has transpired in a relatively short period of time, necessitating the creation of efficient and reliable hate speech detection systems to address this need. Our models provide a viable framework that acts as an initial screener of abusive and hate speech. More specifically, this framework could automate prefacing potentially sensitive content with the now-common "Content Warning" tags. The inclusion of the Level of Aggression attribute may provide complementary adjudication, especially for cases where the boundary between Abusive and Hate is not clear. This approach has the benefit of reducing the potential risk to human viewers, both moderators and audience, while not totally censoring materials even in cases of misclassification. Our experimental results demonstrate the dynamic nature of hate speech. The advent of the COVID-19 pandemic not only engenders a resurgence of anti-Asian sentiments, but also catalyzes the emergence of new racially offensive terms (Ziems et al. 2020). The existing difficulties of abusive and hate speech detection naturally extend to these developments, such as the subjective nuances in designating appropriate labels. (Vidgen et al. 2020)'s pioneering work has proven to be advantageous to our tasks; however, it also exemplifies the need to consider cultural variations when annotating. We randomly select 100 tweets from each of Vidgen's 2 categories, Criticism of an East Asian entity and Discussion of East Asian prejudice, to be annotated using the previously described HER experimental setting. Figure 4 illustrates the results. According to (Vidgen et al. 2020)'s definitions, the former class is for tweets that contain "a negative judgement/assessment of an East Asian entity, without being abusive", while the latter is reserved for tweets that "discuss prejudice related to East Asians but do not engage in, or counter, that prejudice". 34% of Criticism tweets are considered Abusive and 12% Hate. More interestingly, 29% of Discussion tweets are deemed Hate speech. In addition to the variance in annotators' cultural and educational backgrounds, Vidgen's tweets are drawn from the outset of the pandemic (March 2020), whereas ours (July to December 2020) are from periods when COVID-19 was already well under way. As conversations involving the pandemic evolve, so could the nuances of related hate speech, contributing to the observed discrepancies in annotation. In reality, efforts to create sufficient datasets are often met with constraints on human or capital resources (Founta et al. 2018). Delays in development due to these obstacles only exacerbate the damage hate speech does to the targeted communities, who are often already under-represented in the literature. Researchers have explored the potential of transfer learning across datasets to fine-tune different aspects of hate speech detection (Yin and Zubiaga 2021; Mozafari, Farahbakhsh, and Crespi 2019; Madukwe, Gao, and Xue 2020). In fact, our experiments have demonstrated various degrees of cross-dataset learning dependent on domain similarity. To encourage generalizability, we make the following observations based on our results. First, standardization of definitions is important. Using consistent definitions and categories facilitates the assessment of datasets for different purposes. Our work hopes to contribute towards this convergence by following the methodological groundwork laid by (Founta et al. 2018). Second, abusive and hate speech datasets should identify the targeted demographics. As our results show that training models to identify the target of hate speech is feasible even with limited data, having labels at this level of granularity is helpful in triaging resources for under-represented communities.
Finally, researchers should also consider the logistical, chronological and cultural factors in the creation and usage of hate speech datasets, as they may non-trivially influence the output. As previously mentioned, Asians consist of diverse subgroups. The identifiers "Asians" and "Blacks" may conjure different connotations depending on location, and the struggles each sub-group faces may be disparate. Our work primarily focuses on a North American perspective; we invite other researchers to explore this line of work from other angles. Like many other efforts on the annotation of hate speech, ours also has to contend with budgetary constraints. While ideally having more annotators per tweet and a larger corpus would be beneficial, our constraints are not atypical for works on under-represented groups (Fortuna and Nunes 2018). Furthermore, our models only consider the tweets' textual content. Possible improvements in performance may result from incorporating multi-modal approaches, such as analysis of associated emojis and emoticons, or incorporating metadata about the tweets' authors and their networks. Though not necessarily attaining class-leading classification performance as in other more general works, our work aims to address some aspects of anti-Asian sentiment on Twitter, and hopefully provides promising avenues for further efforts.
References
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
Creating COVID-19 stigma by referencing the novel coronavirus as the "Chinese virus" on Twitter: quantitative analysis of social media data
Federal agencies are doing little about the rise in anti-Asian hate crime
Mean birds: Detecting aggression and bullying on twitter
Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set
Social consequences of mass quarantine during epidemics: a systematic review with implications for the COVID-19 response
Detecting Hate Speech in Multi-modal Memes
Racial Bias in Hate Speech and Abusive Language Detection Datasets
Bert: Pre-training of deep bidirectional transformers for language understanding
Large scale crowdsourcing and characterization of twitter abusive behavior
Association of "#Covid19" Versus Anti-Asian Racism during COVID-19
Racism on the Internet: Conceptualization and recommendations for research
You Don't Know How I Feel: Insider-Outsider Perspective Gaps in Cyberbullying Risk Detection
Anti-Asian Xenophobia and Asian American COVID-19 Disparities
Roberta: A robustly optimized bert pretraining approach
Hate speech detection: Challenges and solutions
In data we trust: A critical analysis of hate speech detection datasets
FBI warns of potential surge in hate crimes against Asian Americans amid coronavirus
A survey on bias and fairness in machine learning
Psychological impact of anti-Asian stigma due to the COVID-19 pandemic: A call for research, practice, and policy responses
A BERT-based transfer learning approach for hate speech detection in online social media
Identifying toxicity within youtube video comment
The risk of racial bias in hate speech detection
Time to stop the use of 'Wuhan virus', 'China virus' or 'Chinese virus' across the scientific community
Studying generalisability across abusive language detection datasets
"Go eat a bat, Chang!": On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19
The anxiety of being Asian American: Hate crimes and negative biases during the COVID-19 pandemic
Quarantining online hate speech: technical and ethical perspectives
Detecting East Asian prejudice on social media
Directions in abusive language training data, a systematic review: Garbage in, garbage out
Detecting hate speech on the world wide web
Detection of Hate Speech in Videos Using Machine Learning
The outbreak of COVID-19: An overview
Towards generalisable hate speech detection: a review on obstacles and solutions
Predicting the Type and Target of Offensive Posts in Social Media
Mental health toll from the coronavirus: Social media usage reveals Wuhan residents' depression and secondary trauma in the COVID-19 outbreak
Racism is a virus: Anti-asian hate and counterhate in social media during the covid-19 crisis
Anti-Asian Phrases
batsoup, bioattack, blame china, boycott china, bug men, bugland, ccp, chankoro, chicom, china is asshole, china is terrorist, china lie people die, china should apologize, china virus, china virus outbreak, chinaflu, chinazi, chinese propaganda, chinese virus, ching chong, chinigger, chink, chinkland, chinksect, communism kill, communistchina, fuck china, goloid, gook, gook eyed, gookie, gooklet, gooky eye, insectoid, make china pay, no asian allowed, no chinese allowed, oriental devil, pinkdick, ricenigger, wohan, wuflu, wuhancorona, wuhaninfluenza, wuhanpneunomia, wuhansars, yellow jew, yellow nigger, yellow peril
Anti-Black Phrases
africa, african't, boogat, bootlip, canigger, canigglet, canniglet, chimp out, chimp pack, chimped out, congoid, dindu, dindu nuffin, ghetto monkey, golliwog, hoodrat, jigga, jigger, jigro, jigroes, kneegroes, kneegrow, mammy, mook, moolinyan, mulatto, nappyhead, negro, niger, nigga, nigger, nigger knock, nigger rig, niggera, niggerdick, niggerette, niggerfag, niggerization, niggerize, niggerton, niggertown, niggerville, niggerwool, pavement ape, picaninnies, picaninny, piccaninnies, piccaninny, pickaninnies, pickinninies, pickinniny, sheboon, shit heel, spear chucker, suspook, yard ape