key: cord-0490366-v89w6lud authors: Hou, Yanfang; Putten, Peter van der; Verberne, Suzan title: The COVMis-Stance dataset: Stance Detection on Twitter for COVID-19 Misinformation date: 2022-04-05 journal: nan DOI: nan sha: de64de4c00ce7d29f00e778ea7a025f4ae4a8501 doc_id: 490366 cord_uid: v89w6lud During the COVID-19 pandemic, large amounts of COVID-19 misinformation are spreading on social media. We are interested in the stance of Twitter users towards COVID-19 misinformation. However, due to the relative recent nature of the pandemic, only a few stance detection datasets fit our task. We have constructed a new stance dataset consisting of 2631 tweets annotated with the stance towards COVID-19 misinformation. In contexts with limited labeled data, we fine-tune our models by leveraging the MNLI dataset and two existing stance detection datasets (RumourEval and COVIDLies), and evaluate the model performance on our dataset. Our experimental results show that the model performs the best when fine-tuned sequentially on the MNLI dataset and the combination of the undersampled RumourEval and COVIDLies datasets. Our code and dataset are publicly available at https://github.com/yanfangh/covid-rumor-stance In March 2020, the WHO declared the coronavirus (COVID-19) outbreak a global pandemic. We are also experiencing a COVID-19 infodemic, in which an excess of information is being created and shared to every corner of the world. In the early stages of the pandemic, misinformation surrounding COVID-19 was rampant due to a lack of knowledge about the virus itself and heated debates around how to best deal with it. It included false or misleading information and conspiracy theories towards the origin, scale, prevention, diagnosis, and treatment of the disease [34] . Misinformation causes confusion, leads people to reject public health measures such as vaccination and masks, and promotes unproven treatments [20] . With growing digitization, social media has become an important channel for information gathering and diffusion. Misinformation comes from a variety of sources, such as news sites, videos, and user posts. All of them can be easily shared on social media and further circulated and discussed by more users. Consequently, misinformation spreads rapidly on social media. At the time of posting, the veracity status of the information is usually not yet verified. Different social media users express different stances on its likely veracity and even share evidence supporting their views. Biber and Finegan [2] define stance as the expression of a speaker's standpoint and judgement towards a given proposition. These stances can be aggregated to measure public opinions and help determine the veracity of the rumors [38] . In this work 1 , we explore the stance expressed by tweets towards COVID-19 misinformation. Given a misinformation item and a tweet, our task is to classify each sentence pair into one of three categories: Favor, Against and Neither. For instance, considering the following misinformation item and tweet, we can deduce from the tweet that the tweeter is likely in favor of the misinformation: • We use the models fine-tuned on the MNLI dataset and then fine-tune them on the RumourEval or COVIDLies datasets. We evaluate the model performance on our dataset, and found that the model achieves the best result when fine-tuned sequentially on the MNLI dataset and the combination of the undersampled RumourEval and COVIDLies datasets. 
We also do a dataset ablation study to determine the extent to which each dataset contributes to solving the stance detection task. The remainder of this paper is organized as follows. Section 2 presents the available datasets for stance detection and COVID-19 misinformation, and methods for stance detection and limited annotations. Section 3 describes how we construct the COVMis-Stance dataset and the additional training datasets. Section 4 presents our reproduction of the experiments in the COVIDLies paper [12]. Section 5 explains our stance detection methods. Section 6 describes the experimental setups and results of our task. Section 7 discusses possible improvements, followed by our conclusions in Section 8. Appendix A describes how we construct misinformation items, extract keywords, and implement the data sampling strategy.

Table 1: Existing stance detection datasets.
Dataset | Target | Context | Label Space
SemEval-2016 [17] | Topic | Tweet | Favor, Against, Neither
FNC-1 [22] | Headline | Article | Agree, Disagree, Discuss, Unrelated
FEVER [32] | Claim | Evidence | Supported, Refuted, NotEnoughInfo
RumourEval-2019 [8] | Tweet | Reply | Support, Deny, Query, Comment
COVID-19-Stance [7] | Topic | Tweet | In-favor, Against, Neither
COVIDLies [12] | Claim | Tweet | Agree, Disagree, No stance

In this section, we survey the existing work on data and methods relevant to our task. We present the available datasets for stance detection and COVID-19 misinformation. Also, we investigate stance detection methods and existing solutions to limited labeled data. We summarize the existing stance detection datasets in Table 1. As these datasets are used for stance detection in various domains, they differ considerably in their input texts, i.e. target and context. Among them, only COVIDLies has COVID-19 misinformation as its target. SemEval-2016 Task 6 [17] presented a benchmark dataset for determining whether a tweet is in favor of, against, or neither towards a given topic. Five topics are used in the dataset: Atheism, Climate Change is a Real Concern, Feminist Movement, Hillary Clinton, and Legalization of Abortion. Compared to this task, our task is to determine the stance towards a rumor, which is usually an entire statement rather than a short entity or topic. The 2017 Fake News Challenge Stage 1 (FNC-1) [22] is the task of determining the stance of a news body text towards the news headline. The labels are agree, disagree, discuss (the body text discusses the same topic as the headline but does not take a position) and unrelated (the body text discusses a different topic than the headline). In contrast, our task focuses on social media text rather than news. A tweet is usually shorter than a news body and contains many informal expressions. FEVER Stance detection can also be used for fact-checking. Thorne et al. [32] presented the Fact Extraction and Verification (FEVER) shared task for classifying a claim as supported or refuted by Wikipedia evidence, or notenoughinfo when the retrieved evidence is not relevant or informative. In addition to domain differences from our task, some claims in the FEVER dataset require the composition of evidence from multiple sentences, while each example in our dataset contains only one tweet for a given misinformation item. The RumourEval 2019 task [8] shared a dataset about the stance expressed in the tweets in a conversation thread towards the rumor mentioned in the source tweet.
The labels are support, deny, query (the replying tweet requires additional evidence for the rumor veracity) and comment (the replying tweet does not have a clear stance). The stance of each tweet towards the rumor is inferred from the contextual conversation. In contrast, the rumors in our dataset are not tweets; they are collected from news articles. Also, our tweets are independent and have no conversational relationships. Datasets related to COVID-19 Two types of targets in COVID-19 stance tasks attract great interest: controversial topics and rumors. Glandt et al. [7] released the COVID-19-Stance dataset, a collection of annotated tweets that express the stance towards four targets: Anthony S. Fauci, M.D., Keeping Schools Closed, Stay at Home Orders, and Wearing a Face Mask. Hossain et al. [12] published a stance detection dataset, called COVIDLies, consisting of 6761 tweets with their annotated stance on COVID-19 misinformation. A number of COVID-19 misinformation datasets have been released since the pandemic outbreak: CoAID [5], FakeCovid [29], ReCOVery [37], CMU-MisCOV19 [14], COVID-19FakeNews [21], and COVIDLies [12] (Table 2: Datasets for COVID-19 misinformation, including the information types for each dataset and whether it comprises tweets relevant to the misinformation). Most of these datasets do not annotate the stance of tweets towards the misinformation, so we need to annotate the stance ourselves. Therefore, we focus specifically on three aspects of the misinformation datasets: the source from which the misinformation is collected, whether relevant tweets are included, and if not, whether information is provided to facilitate the collection of relevant tweets. Fact-checking sites Fact-checking websites are an important source for collecting fake news. Cui et al. [5] presented the CoAID dataset, which includes true and fake news articles or claims related to COVID-19. Fake news is collected from articles on fact-checking sites, and fake claims are from the WHO official website, the WHO Twitter account, and Medical News Today (MNT). Also, they provide relevant tweets by using the titles of news articles as search queries. However, such queries generate a limited number of related tweets due to the long titles, and they are biased towards retrieving tweets that support the fake news. Therefore, if we use the CoAID dataset, it is necessary to extend the Twitter data. Shahi et al. [29] published a multilingual fact-checking news dataset called FakeCovid. In the process of collecting data, Snopes and Poynter are used as a bridge to get the links of fact-checking articles. Compared to CoAID, FakeCovid does not include the news articles referred to in the fact-checking articles. This means we cannot utilize any information from news articles to extend the tweets. News sites Some studies do not resort to fact-checking sites but collect news directly from news sites. Zhou et al. [37] proposed the ReCOVery dataset for COVID-19 news credibility research. They first determined unreliable news sites, relying on two sites that rate news media (NewsGuard and Media Bias/Fact Check), and then collected fake news from these sites using keywords. Also, tweets are obtained based on the URL of each news article. Social media Some studies identify fake information from social media platforms. Memon et al. [14] proposed CMU-MisCOV19, a diverse set of annotated COVID-19 tweets. They used pre-defined keywords and hashtags to retrieve tweets and annotated them according to their topics and veracity. The fake information is mentioned in tweets, similar to RumourEval-2019 [8].
This dataset is more appropriate for exploring the stance of a reply or quote tweet towards the rumor mentioned in a source tweet, while our task does not focus on conversational contexts. Patwa et al. [21] collected COVID-19 news from both social media and fact-checking sites, and manually verified whether they are real or fake. The dataset consists of the content of each post with its annotated veracity, but no additional information. Wikipedia The COVIDLies [12] dataset described in Section 2.1 includes 86 misinformation items, which are collected from a Wikipedia article about COVID-19 misinformation [34] . This section describes the stance detection methods in three cases: (1) when the target is a topic; (2) the Fake News Challenge FNC-1 task; (3) fact-checking as stance detection. In the SemEval-2016 Task 6, Mohammad et al. [17] provided strong baseline results based on an SVM classifier with word n-grams and character n-grams features, but it performs poorly on unseen topics. How to generalize models across topics is an important research direction for stance detection. Reimers et al. [26] explored the use of contextualized word embeddings (ELMO and BERT) and topic information for argument classification. They use the concatenation of the topic and the sentence as the BERT input and then perform Softmax classification using the [CLS] token from the BERT output. The results show that it improves the macro F1 score by about 15% points over the model without topic information. Based on this study, Reuver et al. [27] investigated the extent to which the cross-topic model is topic-independent. They found the BERT model fluctuates on different topics. In addition, they analyzed the important lexical features through SVM classification, and conjectured that BERT models rely more on topic-specific features for stance detection than topic-independent lexical features. A first step into the detection of fake news is understanding what a person or medium is saying about a particular topic or news items. So the first task of the Fake News Challenge (FNC) focussed on stance detection (FNC-1) [22] : given a headline and a secondary text discussing the headline, the stance of the secondary text needs to be predicted (agrees/disagrees/discusses/unrelated). Most participants addressed the problem by constructing neural networks and incorporating multiple hand-engineered features. The second-ranking system [9] , called featMLP, is an ensemble of multi-layer perceptrons (MLP) with six hidden and a Softmax layer each. The input features consist of word unigrams, the similarity of word embeddings, topic models based on non-negative matrix factorization, latent Dirichlet allocation, and latent semantic indexing. Hanselowski et al. [10] did a retrospective analysis for the methods and found that the most useful features are lexical features, followed by the topic model-based features. Since the FNC-1 competition, pretrained models have achieved great improvements in NLP tasks. Slovikovskaya et al. [30] investigated the performance of transformer-based models on the FNC-1 task by exploiting two strategies: (1) Based on featMLP, they add the BERT sentence embeddings of two input sequences along with two similarity scores between them as model features. 
This results in a 2% points increase in macro F1 and a 7% points increase for the most difficult disagree class; (2) The transformer-based models (BERT, RoBERTa, XlNet) are fined-tuned on the FNC-1 data, and they perform 11%-16% higher in macro F1 than the base featMLP. The FEVER task [32] is commonly seen as a stance detection task. Specifically, the pipeline of the FEVER task usually consists of three components: document retrieval (select documents related to the claim), sentence selection (extract the most relevant sentences as evidence), and claim verification (determine whether the claim is supported or refuted by the evidence). The third component is about stance detection, so we mainly describe its solution. The winning system of the FEVER competition was proposed by Nie et al. [19] . They framed the three-stage FEVER task as a similar semantic matching problem and proposed the Neural Semantic Matching Network (NSMN), a modification of ESIM [3] . For claim verification, the input sequences are a claim and a set of evidential sentences. First, each input token consists of the following features: Glove and Elmo embeddings, WordNet embeddings, and semantic relatedness scores from the two upstream stages. Second, a bidirectional LSTM (BiLSTM) layer is used to encode each token. Then, a dot-product attention mechanism is utilized to obtain the aligned token representation. In addition to the original representation, the element-wise difference and product between the encoded and aligned representations are also combined to model complex interactions. Next, the model takes the upstream compound representation and keeps using BiLSTM to construct the representation. Finally, the max-pooling, along with their absolute difference and element-wise multiplication is used for classification. This system achieves a FEVER score of 64.0, about 2.3 times greater than the baseline result. In recent years, many studies use pretrained models for the FEVER task. Soleimani et al. [31] investigated the effect of BERT for sentence selection and claim verification, and finally achieves a FEVER score of 69.7. Jiang et al. [13] take advantage of the T5 model for claim verification. T5 [24] is a sequence-to-sequence transformer-based model, pre-trained on a multi-task mixture of unsupervised and supervised datasets by reframing all NLP tasks into a unified text-to-text format. Given a claim and a set of candidate evidence, they fine-tune the model with the following input template: query : q sentence 1 : s 1 ... sentence L : s L relevant : where q and s i are the claim and evidence sentences. query, sentence i and relevant indicate that they are followed by the claim, evidence, and stance strings, respectively. The model will generate one of three tokens: supported, refuted and noinfo. This system finally attains a FEVER score of 75.87. The lack of large amounts of labeled data is the main challenge for COVID-19 related tasks. In previous work, two ideas are commonly used to address this problem, one is data augmentation and the other is transfer learning, i.e. transferring knowledge from a source setting to a different target setting. As both cover a wide range of topics, here we just describe some advanced methods or methods relevant to our task. Self-training With a small manually labeled dataset, Miao et al. [15] adopted self-training and knowledge distillation for data augmentation. 
Specifically, a teacher model is first trained with the manually labeled dataset, and then used to generate pseudo labels for the unlabeled data. After that, a student model is initialized with the identical architecture and parameters as the teacher model and trained with the union of manually and pseudo labeled data. This student model becomes a new teacher model and this process iterates over several times. The experiments show that the student model outperforms the teacher model by about 10% points in terms of accuracy. Intermediate tasks One way to teach the model abilities is to fine-tune the model on a data-rich intermediate task before task-specific fine-tuning. Pruksachatkun et al. [23] investigated the effect of various intermediate tasks by performing extensive experiments. They found that the intermediate tasks which require high-level inference and reasoning capabilities work best. The MNLI task is also used as an intermediate task in this study, which improves the accuracy by an average of 0.7% points across 10 target tasks. Prompt learning Prompt or pattern-based training has emerged as an effective method of exploiting pretrained language models for few-shot learning. It reduces different NLP tasks into a masked language modeling problem, thus making better use of the knowledge encoded in the pretrained models. Hardalov et al. [11] explored the effect of this method on few-shot cross-lingual stance detection. The prompt in the stance task has the format: [CLS]The stance of the following CONTEXT is [MASK] the TARGET. [SEP] The CONTEXT and the TARGET are replaced by the corresponding content of each example. For instance, given the context "I am so happy that Donald Trump lost the election." and the target "Donald Trump", the input text is "[CLS]The stance of the following I am so happy that Donald Trump lost the election. is [MASK] the Donald Trump. [SEP]". The masked token is expected to be against. In few-shot settings, this work transfers knowledge from English stance datasets to multi-lingual stance tasks, which substantially improves the model performance on multi-lingual datasets. Multi-dataset learning Mixing datasets from different domains or sources is often used to overcome resource limitations and improve the generalization ability of the model. Schiller et al. [28] presented a new stance detection benchmark that learns from multiple stance detection datasets. In the multi-dataset learning (MDL) setting, all datasets share the BERT architecture, but each has its own dataset-specific dense layer on top. Each dataset retains domainspecific information while sharing information through the encoder. With this framework, they fine-tune the model on all datasets simultaneously, showing a sizable improvement in the overall performance compared to the model trained on individual datasets. This study uses 10 stance detection datasets, while our task involves only four datasets. Thus, they have more diverse datasets and more complicated issues to handle. Another difference is that this study combines only datasets for the same task, while one of our training datasets is from a different task. In this section, we explain the collection and annotation of the COVMis-Stance dataset. Also, we describe three additional datasets used in our experiments: MNLI, RumourEval, and COVIDLies. We compare the datasets to ours and explain why we use them for training. Figure 1 shows the data construction pipeline of COVMis-Stance. 
Our dataset consists of three components: misinformation, tweets, and labels. Misinformation comes from the CoAID dataset [5] , which includes COVID-19 fake news or claims on websites. To collect tweets related to our misinformation items, we build search queries using news titles, URLs of news or fact-checking articles, and news keywords, and fetch tweets matching these queries. After cleaning and sampling tweets, we annotate each tweet with the stance towards the misinformation item. Misinformation We use the misinformation items in the CoAID dataset [5] . It consists of fake news and claims related to COVID-19. Fake news was collected from articles on fact-checking sites, and fake claims were from the WHO official website 4 , the WHO official Twitter account 5 and Medical News Today (MNT) 6 . It provides abundant information for each misinformation item, including the titles, URLs, and contents of news articles, the titles and URLs of fact-checking articles, and the tweet IDs retrieved by news titles. For instance, an article 7 on a fact-checking website refutes fake information in a news article 8 , titled "BOMBSHELL: WHO Coronavirus PCR Test Primer Sequence is Found in All Human DNA". We sort the misinformation items based on the number of tweets retrieved by news titles and URLs, from highest count to lowest, and we select the top 200 items for our task. Each misinformation item is described as a sentence based on the titles of news articles or fact-checking articles. We preprocess the descriptions of some misinformation items to make them clear and specific, for example, splitting a compound misinformation item into multiple items. This is described in Appendix A.1 in detail. Tweet retrieval To obtain tweets relevant to our misinformation items, we use the following queries to retrieve tweets: • Titles of news articles: CoAID has provided the tweet IDs retrieved by news titles, so we directly fetch these tweets by IDs. • URLs of news articles and fact-checking articles: The URLs are from CoAID. If the fake news is from Twitter, we will get the tweet itself and retrieve the tweets containing the URL of this tweet. • Keywords: The keywords are manually extracted from each misinformation item. Appendix A.2 describes how to construct the keywords specific to the rumors. Since the Twitter Search API only searches recent tweets published in the past seven days, we use a tool called Snscrape to retrieve tweets. Snscrape 9 is a public, free scraper for social networking services (SNS). It supports the search service on Twitter and returns the discovered items. After obtaining the tweet IDs, we fetch the tweets with detailed information through Twarc 10 , which is a command-line tool and Python library for collecting and archiving Twitter JSON data via the Twitter API. Data cleaning We found that a large number of tweets with different IDs have the same contents. Also, some tweets are very short. Before sampling for annotation, we clean the data by following the steps below. 1. Some tweets are associated with multiple misinformation items, so the dataset contains duplicate tweet IDs. We remove duplicate tweets and randomly keep one in the dataset. 2. Remove non-English tweets based on the language identification of each tweet. 3. Exclude the tweets with fewer than 10 words. 
4 https://www.who.int/ 5 https://twitter.com/WHO 6 https://www.medicalnewstoday.com/articles/coronavirus-myths-explored 7 https://healthfeedback.org/claimreview/human-dna-alone-does-not-produce-a-positive-result-on-the-rt-pcr-test-for-sars-cov-2/ 8 https://pieceofmindful.com/2020/04/06/bombshell-who-coronavirus-pcr-test-primer-sequence-is-found-in-all-human-dna/ 9 https://github.com/JustAnotherArchivist/snscrape 10 https://github.com/DocNow/twarc Table 3 : Examples of COVMis-Stance 4. Exclude the tweets with the same contents. we fit a TF-IDF vectorizer by using the whole tweet data and then compute the cosine similarity for each pair of tweets. If the similarity score is greater than the threshold (80%), we assume both tweets are the same. We cluster the same tweets into one group and randomly keep one tweet for each group. 5. Exclude the misinformation items with fewer than 24 relevant tweets. Data sampling Our sampling strategy is based on two principles: (1) keeping a good balance in the number of examples between different misinformation items; (2) covering examples of different query types. Here, we consider fact-checking URLs and news URLs as different query types to ensure the number of Against examples. Specifically, we sample 24 tweets for each misinformation item. In these 24 tweets, we randomly select 6 tweets from each query type. However, some query types might not have enough tweets. In this case, if we still need m examples to reach 24, we randomly sample m tweets from the rest of the tweets. According to this strategy, we calculate the probability of being selected for each example, and randomly select examples from the entire set based on these probabilities. Therefore, the number of examples for each misinformation item is not exactly the same, but almost the same. Appendix A.3 describes in detail how to calculate the probability of being selected for each example. Annotation Our annotation task is to annotate each tweet-rumor pair into one of three labels: Favor, Against, and Neither. The core instructions given to annotators for determining stance are shown below: • Favor: The tweet is in favor of the misinformation or promotes the dissemination of fake news articles. • Against: The tweet denies the misinformation or promotes the propagation of fact-checking articles. • Neither: Neither of the above. It is usually one of these cases: (1) The tweet is unrelated to the misinformation; (2) The tweet questions the veracity of the misinformation; (3) The tweet has no clear stance for the misinformation. In our annotation task, one type of examples is automatically labeled. If the tweet is retrieved by the URL of a fact-checking article, we directly annotate it as Against. The rest of the examples were manually annotated by two annotators. Both annotators are master students studying AI-related programs, and English is their teaching language. For every 12 misinformation items, we calculate the inter-rater agreement on the annotated examples, discuss the examples on which the two annotators disagree, summarize the existing problems, and improve the annotation guideline. If we still cannot obtain a consistent result, we will ask a third person for advice. We annotated a total of 2631 tweets towards 111 misinformation items, of which 604 Against tweets are automatically annotated. The Cohen's Kappa score for the manual annotation is 0.67, indicating substantial agreement between annotators (0.61-0.80). Some examples are presented in Table 3 . 
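The near-duplicate filtering in step 4 of the data cleaning above can be sketched as follows. This is a minimal illustration, assuming the tweets are available as a list of strings; the 0.8 similarity threshold follows the description above, while the function and variable names are ours rather than from the released code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def drop_near_duplicates(tweets, threshold=0.8, seed=42):
    """Group tweets whose TF-IDF cosine similarity exceeds the threshold and
    randomly keep one tweet per group."""
    rng = np.random.default_rng(seed)
    vectors = TfidfVectorizer().fit_transform(tweets)   # fit on the whole tweet set
    sims = cosine_similarity(vectors)                   # pairwise similarity matrix

    unassigned = list(range(len(tweets)))
    kept = []
    while unassigned:
        pivot = unassigned[0]
        # All still-unassigned tweets that are near-duplicates of the pivot (incl. the pivot itself).
        group = [j for j in unassigned if sims[pivot, j] > threshold]
        kept.append(int(rng.choice(group)))              # randomly keep one tweet per group
        unassigned = [j for j in unassigned if j not in group]
    return [tweets[i] for i in sorted(kept)]
```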
The distribution of labels and query types is shown in Table 4. MNLI The Multi-Genre Natural Language Inference (MNLI) dataset [35] is a collection of premise-hypothesis pairs labeled as Entailment, Contradiction, or Neutral. A premise p entails a hypothesis h if a human reading p would infer that h is most likely true. For example, given the premise "A turtle danced", we could infer that "A turtle moved", but the reverse is not certain. Table 5 shows the class distribution of the MNLI dataset. We see that the training set has a nearly equal number of examples across the three labels. Similar to the NLI task, our stance detection task also focuses on the relationship between two sentences. We also have a label space similar to that of the NLI task. In addition, MNLI is a cross-domain dataset, covering ten distinct genres of written and spoken English. Thus, the MNLI dataset can provide a large and diverse set of training instances. RumourEval The RumourEval-2019 dataset [8] is about the stance expressed in tweets in a conversation thread towards the rumor mentioned in the source tweet. The labels are support, deny, query and comment. Both RumourEval and COVMis-Stance are about the stance of tweets towards rumors, but they differ in four aspects: • Sources of rumors: Our rumors are from news articles, while the rumors in RumourEval are from tweets. As a result, the source tweets also take a stance towards the rumors, so we exclude the examples where the source tweets deny the rumors. • Relations of tweets: The tweets in a conversation have a reply relationship in RumourEval, while our tweets are independent. • Label space: RumourEval defines four labels. It has one more than ours, but these labels do not contradict ours. Thus we directly map Query and Comment to Neither in our task. • Class distribution: The RumourEval dataset has a different class distribution from ours, as can be seen from Table 6. There is a class imbalance in the RumourEval dataset: most examples belong to the Neither class, and there are more Favor examples than Against. COVIDLies The COVIDLies dataset [12] is a collection of annotated tweets that contain the stance towards a set of misinformation items. Each tweet-rumor pair is classified into one of three labels: Agree, Disagree and No Stance. The statistics of COVIDLies v0.2 are shown in Table 6. The label distribution is heavily skewed towards No Stance, and the Agree tweets have a higher proportion than Disagree. Both COVIDLies and our dataset have rumors as targets and social media text, but they differ greatly in the class distribution. This is caused by different tweet selection strategies. For COVIDLies, Hossain et al. measure the similarity between tweets and misinformation items using BERTScore, described in Section 4.1, and select the 100 most similar tweets for each misinformation item. Table 7 presents Favor examples from the above datasets. In the MNLI dataset, the premise tends to be more specific, while the hypothesis is more general. In the COVIDLies and our dataset, a misinformation item is usually a sentence that roughly summarizes the content of a rumor, while the tweet is usually longer and probably describes the event in more detail. Such a difference is less obvious in the RumourEval dataset, because in the conversational context, the replying tweet is a response to the previous tweet rather than a restatement.
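To reuse these datasets for training, their label spaces are aligned with our three classes as described above (Query and Comment map to Neither), and with the NLI labels when the MNLI-fine-tuned models are used (Section 5.2). The following is a minimal sketch of this harmonization; the exact label strings in the released data files are assumptions on our side.

```python
# Align the label spaces of the auxiliary datasets with our three-class scheme.
# The exact spelling/casing of the source labels is an assumption, not taken from the releases.
RUMOUREVAL_TO_OURS = {
    "support": "Favor",
    "deny": "Against",
    "query": "Neither",    # no clear stance, only a request for evidence
    "comment": "Neither",  # no clear stance
}

COVIDLIES_TO_OURS = {
    "agree": "Favor",
    "disagree": "Against",
    "no_stance": "Neither",
}

# Correspondence with the NLI labels used by the MNLI-fine-tuned models (Section 5.2).
OURS_TO_NLI = {
    "Favor": "entailment",
    "Against": "contradiction",
    "Neither": "neutral",
}

def harmonize(label: str, mapping: dict) -> str:
    """Map a source-dataset label onto the Favor/Against/Neither scheme."""
    return mapping[label.lower()]
```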
In this section, we reproduce the stance detection experiments from the COVIDLies paper [12], because this work also aims to detect stance in tweets towards COVID-19 misinformation. For each misinformation item and tweet pair, the task of this work is to predict whether the tweet Agrees with, Disagrees with, or takes No Stance towards the misinformation. Hossain et al. frame the stance detection task as an NLI problem, mapping the tweet to the premise, the misinformation to the hypothesis, and Agree, Disagree and No Stance to Entailment, Contradiction and Neutral respectively. They train the models on the MNLI dataset and then evaluate their performance on the COVIDLies dataset. The methods are described as follows. Logistic regression Two feature sets are used separately for logistic classification: (1) the concatenation of the unigram and bigram TF-IDF vectors of both sentences; (2) the concatenation of the average GloVe embeddings of each sentence. Sentence-BERT Reimers et al. [25] proposed Sentence-BERT, which adds a pooling layer to the output of BERT to obtain the representation of each sentence. Average pooling is used in this work. Given a tweet and a misinformation item, they use a Siamese BERT network to obtain the sentence representations u and v respectively. Then they concatenate u and v with the element-wise difference |u − v| as features for Softmax classification. The classifier output o is the probability that the input example belongs to each of the three labels, given by o = softmax(W_t · [u; v; |u − v|]), where W_t is the trainable weight matrix with a size of 3n × k, n is the dimension of the sentence embeddings and k is the number of labels. BERTScore and Sentence-BERT This is a two-step method. It first classifies whether each tweet-rumor pair is relevant, where Relevant means that the tweet either agrees or disagrees with the rumor. Then it determines whether the pair is Agree or Disagree. In the first step, it uses BERTScore to measure the relevance of each pair and classifies the pairs with high BERTScores as Relevant. Then the Sentence-BERT model is used to determine Agree or Disagree in the second step. BERTScore [36] was originally proposed to evaluate the quality of generated text relative to gold references. Given a candidate sentence x̂ and a reference sentence x, it consists of two components to obtain the similarity score of the two sentences: (1) it computes a cosine similarity score for each token in the candidate sentence with each token in the reference sentence using BERT embeddings; (2) it aggregates these token-level similarities into recall (R), precision (P) and F1. It matches each token in x to the most similar token in x̂ to calculate recall, and each token in x̂ to the most similar token in x to calculate precision. The scores are given by R = (1/|x|) Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i⊤ x̂_j and P = (1/|x̂|) Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_i⊤ x̂_j, where the recall is the average matching score over all tokens in x, and the precision is the average matching score over all tokens in x̂. The F1 score is used in the experiments. Models We reproduce the results of five stance detection models. All models are trained on the MNLI dataset, and the COVIDLies dataset is used as the test set to evaluate the performance of these models. Among them, we retrain the linear models on the MNLI dataset (models 1 and 2). Due to limited GPU resources, we do not retrain the SBERT models, but use the trained models from the original study for prediction (models 3, 4, and 5). Table 8 shows the label distributions of both datasets. COVIDLies v0.2 has the same source as COVIDLies v0.1, but with more instances. They also have similar label distributions. Another difference is that in the COVIDLies v0.2 dataset we use, username mentions in each tweet are replaced with @username due to Twitter's privacy policy.
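As a concrete reference for the Sentence-BERT classifier described above, the following is a minimal PyTorch sketch of the classification head: the two sentence embeddings u and v are concatenated with their element-wise difference |u − v| and passed through a single linear layer of size 3n × k. The model name, example sentences and mean pooling here are placeholders, not the exact configuration used by Hossain et al. [12].

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SBertStanceClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(3 * hidden, num_labels)    # W_t with size 3n x k

    def embed(self, enc):
        out = self.encoder(**enc).last_hidden_state            # (batch, seq_len, hidden)
        mask = enc["attention_mask"].unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1)        # average pooling

    def forward(self, enc_tweet, enc_misinfo):
        u = self.embed(enc_tweet)                               # Siamese encoding: the same
        v = self.embed(enc_misinfo)                             # encoder is used for both inputs
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)                        # logits; softmax is applied in the loss

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tweet = tokenizer(["The PCR test does not react to human DNA."], return_tensors="pt")
misinfo = tokenizer(["The PCR test primer sequence is found in all human DNA."], return_tensors="pt")
logits = SBertStanceClassifier()(tweet, misinfo)                # shape: (1, 3)
```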
Metrics Three metrics are used for evaluation per label: precision, recall and F1 score. The macro average is used to evaluate the overall performance of the models, which is the unweighted mean of the metrics of each label. • Precision: The fraction of examples predicted as a label that actually belong to that label, given by P = tp/(tp + fp), where tp is the number of true positives and fp is the number of false positives. • Recall: The fraction of examples of a label that are correctly predicted, given by R = tp/(tp + fn), where fn is the number of false negatives. • F1: This is calculated as the harmonic mean of precision and recall, given by F1 = 2 · (P · R)/(P + R). Table 9: Reproduction results of stance detection on COVIDLies. The prefix orig indicates the results from the original study [12], and the prefix repr indicates our reproduction results. The score differences are only the result of the test set differences and are not related to the models, since the models we use have the same parameters as those in the original study. This is because the linear models are deterministic and do not involve any randomness, and the SBERT models are taken from the original study. On the other hand, we observe that four of the five models decrease by 0.1%-3.3% points in macro F1 after expanding the test set and anonymizing usernames, and only SBERT increases by 0.5% points in macro F1 (F1 = 32.7 vs 32.2). In this section, we describe the methods for our stance detection task, including model architectures and training strategies. We use two model architectures for stance detection: SBERT and Cross-Encoder. SBERT has been shown to be effective for stance detection on the COVIDLies dataset. The model details are described in Section 4.1. In the following part, we describe the architecture of Cross-Encoder and explain how it differs from SBERT. With the architectures of SBERT and Cross-Encoder, we use the models fine-tuned on the MNLI dataset, and then fine-tune them on the RumourEval or COVIDLies datasets. We explain this fine-tuning process in detail. BERT Both SBERT and Cross-Encoder use BERT as an encoder to obtain token representations. BERT [6] is a transformer-based language model developed by Google for NLP tasks. It can represent each token based on its bidirectional context. This is due to the architecture of BERT, which is a multi-layer bidirectional Transformer encoder [33]. The encoder uses a self-attention mechanism to learn word representations from both left and right contexts. BERT is pretrained on the BookCorpus and English Wikipedia using a combination of a masked language modeling objective (some tokens are randomly masked and the model predicts these words) and next sentence prediction (given a pair of sentences, the model predicts whether one sentence follows the other or not). Pretrained models carry abundant language knowledge, and the model can be easily adapted to the target task by fine-tuning the model parameters using relatively small data from the target task. Cross-Encoder For the BERT Cross-Encoder, a pair of sentences is passed to BERT simultaneously. The input starts with a special token [CLS], followed by the concatenation of a tweet and a misinformation item with the [SEP] token delimiting them. The [CLS] token embedding from the BERT output is used as the aggregate sequence representation for classification, since it is used to predict whether a sentence pair is coherent or not during pre-training. After BERT encoding, the [CLS] representation is fed into a fully-connected layer, which projects the hidden size onto the label size. We use the Cross-Entropy loss as the optimization objective. Let p(ŷ_i) be the probability distribution over the three classes for the i-th example, given by p(ŷ_i) = softmax(φ⊤ h_i^c), where h_i^c is the [CLS] token representation from the BERT output of the i-th example.
Here, θ are the BERT parameters, and φ are the parameters of the dense layer with a size of m × k, where m is the dimension of the BERT embeddings and k is the number of labels. The Cross-Entropy loss l is computed as l = −(1/n) Σ_i Σ_k w_k · y_{i,k} · log p(ŷ_{i,k}), where n is the number of examples in a mini-batch, y_i is the label vector of the i-th example, which is 1 for the true class and 0 for the other classes, and w_k is the manual rescaling weight given to each class, which defaults to 1. Figure 2 shows the architectures of SBERT and Cross-Encoder. There are two main differences between SBERT and Cross-Encoder in the sentence-pair classification task: (1) Sentence representation: the two sentences are fed into SBERT individually, while for Cross-Encoder, the concatenation of both sentences is passed to the network. Thus, SBERT can produce an independent representation for each sentence, while Cross-Encoder cannot. (2) Token-level interaction: Cross-Encoder applies attention to the tokens of both sentences across all transformer layers, while SBERT lacks token-level interaction between the two sentences. The model in our task is fine-tuned sequentially on the MNLI dataset and two stance detection datasets. The reason for fine-tuning in two stages is that these datasets are for different tasks and differ greatly in data size. Similar task NLI is a high-level understanding task that involves reasoning about the semantic relationships within sentences [4]. It is similar to our task and we can easily map our labels to the NLI labels. Also, this task has much larger training datasets, which helps alleviate the problem of limited annotations. Therefore, we use the models fine-tuned on the MNLI dataset. We map Favor, Against and Neither in our task to Entailment, Contradiction and Neutral in NLI respectively. The NLI task is similar to ours but still in a different domain, so we further fine-tune the models on two stance detection datasets, i.e. RumourEval and COVIDLies. Both datasets target rumors and contain social media text. However, their class distributions are quite different from our dataset. Therefore, we adopt the following strategies for this problem: • Rescaling class weights [1]: This method takes the cost of prediction errors into account. Specifically, misclassification of the minority class is penalized more heavily than that of the majority class. The class weight w_k is given by w_k = max(x)/x_k, where x is the vector with the class counts. • Undersampling: We remove some examples from the majority classes of both datasets. Specifically, we keep all Against examples and randomly select examples from Favor and Neither with a fixed probability, which is equal to the expected sample number divided by the class frequency. The expected sample number is 400 for COVIDLies and 550 for RumourEval. After undersampling, a combination of both balanced datasets is also used in our experiments. Table 10 shows the statistics of the balanced datasets after undersampling. In this section, we first experiment with how tweets and misinformation correspond to the premises and hypotheses in the NLI task. Second, we use the models fine-tuned on the MNLI dataset, and then fine-tune them on the RumourEval or COVIDLies datasets. We evaluate the model performance on our dataset. Finally, we do a dataset ablation study to identify the contribution of each dataset to solving the stance detection task.
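The two rebalancing strategies of Section 5.2, as used in the following experiments, can be sketched as below. This is a minimal illustration: the function and variable names are ours, and the expected sample sizes (400 for COVIDLies, 550 for RumourEval) follow the description above.

```python
import random
from collections import Counter

def class_weights(labels):
    """Rescaled class weights w_k = max(x) / x_k, where x are the class counts."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}

def undersample(examples, labels, expected=400, keep_all="Against", seed=42):
    """Keep all Against examples; sample the other classes with probability
    expected / class_frequency, as described in Section 5.2."""
    random.seed(seed)
    counts = Counter(labels)
    kept = []
    for example, label in zip(examples, labels):
        if label == keep_all or random.random() < expected / counts[label]:
            kept.append((example, label))
    return kept

# The weights can be passed to the weighted Cross-Entropy loss of Section 5.1, e.g. in PyTorch:
#   weights = torch.tensor([w[c] for c in ("Favor", "Against", "Neither")])
#   loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```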
NLI models Due to limited GPU resources, we did not fine-tune the models on the NLI dataset ourselves, but directly used the publicly available models that have been fine-tuned on the MNLI dataset. The two models are SBERT and Cross-Encoder, respectively. • mnli-sbert-ct: This model has the architecture of SBERT and was released by Hossain et al. [12] . It is initialized with the COVID-Twitter-BERT weights and fine-tuned on the MNLI dataset. COVID-Twitter-BERT is a BERT-large model, further pretrained on a large corpus of COVID-19 related tweets [18] . • mnli-bert-ct: This model has the architecture of Cross-Encoder and was provided by Muller et al. [18] . It uses the COVID-Twitter-BERT-v2 weights and is fine-tuned on the MNLI dataset. COVID-Twitter-BERT-v2 is identical to COVID-Twitter-BERT but trained on more data. Tweet preprocessing Before passing the sentences into the network, we process tweet texts according to the data preprocessing method used in COVID-Twitter-BERT [18] . • Whitespace: Replace tab( ), newline( ) and carriage return ( ) characters by spaces, and replace multiple spaces with a single space. • Username: Replace username mentions with "twitteruser". For multiple username mentions, prefix "twitteruser" with the number. e.g. replace "@user @user" with "2 twitteruser". • URL: Replace URL with "twitterurl". For multiple URLs, prefix "twitterurl" with the number. e.g. replace "http://... http://.." with "2 twitterurl". • Emoji: Convert emojis into text aliases. e.g. the thumbs-up emoji becomes ":thumbs_up". This is implemented by an existing Python package called Emoji 11 . Metrics We use precision, recall, and F1 to measure the model performance, the same as in the COVIDLies paper [12] . The definitions of these metrics are described in Section 4.2. In addition, we add the accuracy metric, which is the percentage of correct predictions across all instances. Table 11 : Results of two sentence mappings from NLI to our task. We present the macro F1 of the NLI models on the full RumourEval, COVIDLies, and COVMis-Stance datasets. p, h and mis refer to premise, hypothesis and misinformation respectively. Experiment We experimentally investigate how the tweet and misinformation in our task correspond to the premise and hypothesis in the NLI task. We apply two NLI models to the full RumourEval, COVIDLies and COVMis-Stance datasets with two mappings: (1) misinformation mapped to premises and tweets mapped to hypotheses; (2) misinformation mapped to hypotheses and tweets mapped to premises. The NLI models are mnli-sbert-ct and mnli-bert-ct. Results Table 11 presents the results of different sentence mappings. The NLI models perform better on the stance detection datasets if we map tweets to premises and misinformation to hypotheses, rather than the reverse order. For RumourEval, this mapping performs about 2% points higher in F1 score for both SBERT and Cross-Encoder. This improvement is more pronounced on the COVIDLies and COVMis-Stance datasets. For COVIDLies, the F1 score improves by 5% points for SBERT and 7% points for Cross-Encoder. For COVMis-Stance, the F1 score improves by 8% points for SBERT and 9% points for Cross-Encoder. We notice that the macro F1 of SBERT on COVIDLies is 2% points lower than our reproduced result in Table 9 , under the condition that the same SBERT model and test set are used for both experiments (F1 = 38.83 vs 40.8). 
We found that this is because tweets are preprocessed in the sentence mapping experiment, while we did not do so in the reproduction experiment. The reason for the F1 drop after tweet preprocessing may be that the mnli-sbert-ct model was originally proposed for stance detection in COVIDLies, and it may be validated on the COVIDLies dataset without tweet preprocessing. Experiments We fine-tune the NLI models on each of the following five datasets and evaluate the model performance on our dataset, which is randomly divided into a validation set and a test set in a ratio of 2:8. We use the validation set for parameter tuning and the test set for evaluation. • RumourEval (full): To alleviate the class imbalance problem in this dataset, we pass to the loss function the class weights described in Section 5.2. • RumourEval (balanced): The model is fine-tuned on the balanced RumourEval, which is the result of undersampling on the majority classes as described in Section 5.2. • COVIDLies (full): The class weights are also applied when fine-tuning the model on the full COVIDLies dataset, so that the model can adjust to the class imbalance. • COVDILies (balanced): The balanced COVIDLies dataset as described in Section 5.2. • Combined dataset: A combination of the balanced RumourEval and COVIDLies datasets to increase the diversity and number of training instances. Configurations The parameter settings are described as follows: • Maximum input sequence length: The input sequence is padded or truncated to 128 tokens for SBERT and 256 tokens for Cross-Encoder. This refers to the SBERT setting in the COVIDLies paper [12] . We also analyze sentence lengths of our experimental datasets. All sentences are within 61 words for COVIDLies and within 60 words for the COVMis-Stance dataset. For RumourEval, about 1% of sentences exceed 128 words. Thus, limiting the sequence length to 128 or 256 tokens should have little impact on model performance, and also be computationally efficient. • Loss function: Cross-Entropy loss. The class weights are 1 by default. They are rescaled for the full RumourEval or COVIDLies dataset. • Optimizer: We use Adam optimizer with the learning rate of 2 × 10 −5 , L2 weight decay of 0.01, learning rate warm-up over the first 10% of training steps, and linear decay of the learning rate. Most of these parameters follow the BERT fine-tuning settings for GLUE tasks [6] , except for the warm-up steps, which equals 10,000 in the BERT paper. As we have fewer training steps, we set this value as a ratio rather than an exact number of steps. • Batch size: Number of samples processed before the model updating. The batch size is set to be 8, which depends on the memory size of our GPU. • Number of epochs: Number of complete passes through the training dataset. The epoch is set to be 3, referring to the BERT paper [6] . Table 12 presents the stance detection results of all models on our test set. The first lines of SBERT and Cross-Encoder show the results of the NLI models, which are used as the baseline results in our experiments. We make the following observations. • Best model: The Cross-Encoder fine-tuned on the combined dataset achieves the best result in terms of macro F1. It outperforms the Cross-Encoder baseline by 16% points in macro F1 (F1 = 57.28 vs 41.83). • Fine-tuning: Fine-tuning the NLI models on RumourEval or COVIDLies improves the model performance but with two exceptions. Both of them are SBERT and fine-tuned on the full but unbalanced datasets. 
These two datasets are the full RumourEval and COVIDLies datasets. They show a slight decrease in macro F1 compared to the SBERT baseline (F1 = 39.30 for RumourEval, F1 = 39.38 for COVIDLies). • SBERT vs Cross-Encoder: Overall, the Cross-Encoders outperform the SBERT models fine-tuned on the same data. They perform about 2%-5% points higher in macro F1 than SBERT. This is because the two sentences interact across all transformer layers of the Cross-Encoder, while for SBERT, they do not have any interaction before the classifier. From a category perspective, the Cross-Encoders perform better on the Favor and Against classes than SBERT, but do not improve on the Neither class. Specifically, the Cross-Encoders perform on average 5% points higher in F1 than SBERT on the Favor class, excluding the full RumourEval model, which shows a difference of 30% points in F1 (F1 = 70.12). Comparison between queries Table 13 shows the results of the Cross-Encoders for different query types as well as a majority class baseline (each instance is classified into the majority class). For the examples sourced by URL retrieval, all models have a lower accuracy than the majority class baseline. Among them, the RumourEval model has the highest accuracy but an F1 score 0.7% points lower than the model trained on the combined dataset. The COVIDLies model performs much worse than the other two models, with an accuracy 18% points lower than the RumourEval model and an F1 score 10% points lower. For the examples sourced by keyword retrieval, all models outperform the majority class baseline. The COVIDLies model performs the best, closely followed by the model trained on the combined dataset. The RumourEval model performs the worst, with an accuracy 7% points lower than the COVIDLies model and an F1 score 1% point lower. Table 14: Results of the dataset ablation study. We run each model five times with different random seeds and present the average values of accuracy and macro F1 as well as the standard deviation (SD). Results Table 14 shows the results of our dataset ablation study. The results of the only-one experiment are not fully consistent with those of the all-without-one experiment. In the only-one experiment, COVIDLies performs the best, slightly better than MNLI, and RumourEval ranks last. However, in the all-without-one experiment, COVIDLies has the least impact, while MNLI has the biggest impact, followed by RumourEval. Table 15 presents some typical examples misclassified by the Cross-Encoder fine-tuned on the combined dataset. We analyze each example as follows. • The model incorrectly predicts Neither as Favor. The tweet has a high degree of lexical overlap with the misinformation, but it is irrelevant. • The model incorrectly predicts Against as Favor. The tweet rewrites the misinformation, but it disagrees with the misinformation. • The tweet shares the fake news article and expresses a supportive view. It is not literally similar to the rumor, which might lead the model to incorrectly predict it as Neither. • The last example is a reply tweet. It seems to refute the rumor, but this is not entirely clear from the text alone. We found that this tweet is retrieved by the URL of a fact-checking article. Also, by checking the contextual conversation, we confirm that it should be flagged as Against. Such cases are difficult for the models due to insufficient information, although the tweet text has some indications.
In this section, we discuss what improvements can be made from five perspectives: data construction, solutions to the examples retrieved by different queries, class imbalance, multi-dataset learning, and differences between SBERT and Cross-Encoder. We constructed the COVMis-Stance dataset through data collection, sampling, and annotation. There are several aspects that could be improved in this process. • Selection of misinformation items: We select the misinformation items based on the number of related tweets retrieved by news titles and URLs, as this number reflects the interest level on this topic to some extent. However, it sometimes deviates from the actual interest level, because it does not take into account the related tweets that do not share URLs but discuss the same topics. Also, the tweeters might share news articles because of other events rather than the corresponding rumor. On top of that, we observe that the political-related rumors rank relatively high with this method. This might limit the diversity of COVID-19 misinformation items. For such rumors, people's bias towards political figures or parties influences the stance on rumors. This also cannot be used as supporting information for determining the rumor veracity. Therefore, only considering the number of relevant tweets might be insufficient for misinformation selection, and improving the quality of misinformation items might help the study. • Description of misinformation items: Some misinformation items are not the same as the titles of news articles or fact-checking articles, since we do simple processing for them. However, we expect such processing brings minor improvement since pretrained language models are usually robust to such text noise. In addition, this manual checking takes time as the number of misinformation items increases. Therefore, we can use news headlines as our target without any manual intervention. This also can be used to verify the model's robustness. • Relevant tweets: We link tweets to COVID-19 rumors through URLs and keywords, but manual construction of keywords will be infeasible with an increasing number of misinformation items. An extended idea is to automate the construction of keywords. Besides, Hossain et al. [12] address it by regarding it as a misinformation retrieval task. Given a tweet, they select the most relevant misinformation item from the pre-defined misinformation items based on sentence similarity. Another idea is to first cluster the tweets that are discussing the same news into a group and then associate them with specific misinformation items. • • URL: The tweets share news articles and have comments from users. We found some of them hard to understand, because they mention the event details or respond to other events covered in news articles. For such cases, the news body text might be useful, since it helps understand the event to which the tweets are referring and provide additional information about the event. • Keywords: Compared to URLs, the tweet retrieved by keywords is usually a complete statement. For such instances, we can frame it as a semantic matching problem, i.e., whether two sentences express the same meaning. The NLI and the datasets on semantic similarity are good choices to fine-tune the models. • Reply: We see replying tweets via Twitter search, so we do not exclude this part from the dataset. 
However, putting such replies in conversational contexts may give a better understanding, The severe class imbalance exists in the RumourEval and COVIDLies datasets. We reconstruct a balanced dataset by undersampling the majority classes. The results show that this method is more effective than rescaling class weights in our task. However, undersampling reduces the number of training instances drastically, which has a negative effect on the training of large models. Oversampling the minority class is another solution to this problem, as it will guarantee all available training data is leveraged. Multi-dataset learning We fine-tune the model on the MNLI dataset and two stance detection datasets sequentially. We discuss possible improvements for the second phase, where the RumourEval and COVIDLies datasets are directly mixed to fine-tune the whole model. • RumourEval and COVIDLies perform quite differently in the examples obtained by different query types. Therefore, instead of directly mixing them, other combinations of datasets might be worth experimenting with, such as the method used in [28] , in which multiple datasets share the encoder but have their dataset-specific layers. • The model learns useful capabilities from the MNLI fine-tuning, as can be seen from the result that the model with only MNLI performs closely to that with only COVIDLies. Therefore, in the context where the undersampled RumourEval or COVIDLies datasets are not large, we have the question of whether fine-tuning the whole model is an optimal choice. Parameter-efficient fine-tuning methods might be another option, such as selecting a small number of weights to update or fine-tuning a separate, small network that is tightly coupled with the model [16] . SBERT vs Cross-Encoder SBERT underperforms Cross-Encoder in our task. The result is expected since two sentences have fewer interactions in SBERT than in Cross-Encoder. However, SBERT is more computationally efficient than Cross-Encoder, since it produces independent sentence representations. When the data becomes large, we could obtain the sentence representation of each misinformation item in advance and store them. When a tweet comes, we only need to calculate the sentence representation of the tweet by deep networks and combine it with the pre-computed sentence representation of the misinformation item for classification. In contrast, Cross-Encoder has to compute the vector representation for each combination of misinformation item and tweet, which will be much slower. Recently, many studies train a Cross-Encoder and distill the knowledge it learns into SBERT for computation efficiency. In this work, we studied stance detection on Twitter towards COVID-19 misinformation. We constructed a stance dataset, consisting of 2631 tweets with their stance towards COVID-19 misinformation. In addition to the dataset, we establish stance detection models with the SBERT and Cross-Encoder architectures. The results show that Cross-Encoder outperforms SBERT in terms of precision, recall, and F1. 
Conclusions

In this work, we studied stance detection on Twitter towards COVID-19 misinformation. We constructed a stance dataset consisting of 2631 tweets annotated with their stance towards COVID-19 misinformation. In addition to the dataset, we established stance detection models with the SBERT and Cross-Encoder architectures. The results show that Cross-Encoder outperforms SBERT in terms of precision, recall, and F1.

We leverage BERT models fine-tuned on the NLI dataset. The model performance drops considerably without this step, indicating that using the data-rich NLI corpus as an intermediate task improves the performance of stance detection. Furthermore, the sentence correspondence from NLI to stance detection has a great impact on the model performance: the model achieves better results when the tweet and the misinformation item are mapped to the premise and the hypothesis, respectively.
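A minimal sketch of this sentence correspondence, assuming a publicly available NLI cross-encoder checkpoint rather than the exact model used in our experiments:

# Sketch of the input ordering: the tweet as premise, the misinformation item
# as hypothesis. The checkpoint name is a placeholder assumption.
from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

tweet = "They are putting 5G towers in every classroom!"    # premise
misinfo = "5G is being forcibly installed in schools"       # hypothesis

scores = nli_model.predict([(tweet, misinfo)])  # logits over the three NLI classes
print(scores)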
We fine-tune the NLI models on the RumourEval and COVIDLies datasets. Both datasets are heavily imbalanced in class distribution, and this difference from our data negatively influences the model performance. The experimental results show that undersampling is more effective than rescaling class weights for resolving this problem in our task. We found that the RumourEval model is better at predicting the examples sourced by URL retrieval, whereas the COVIDLies model performs better on the examples sourced by keyword retrieval. The Cross-Encoder fine-tuned on the mixture of RumourEval and COVIDLies combines the advantages of both datasets, achieving the best results among all models.

For future work, we recommend adopting different methods for the examples retrieved by different query types. Other multi-dataset learning methods should also be explored. In addition, we have published our dataset to facilitate cross-dataset evaluation in related studies, i.e., training and testing on different datasets.

References
[1] Datastories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis.
[2] Adverbial stance types in English.
[3] Enhanced LSTM for natural language inference.
[4] Supervised learning of universal sentence representations from natural language inference data.
[5] CoAID: COVID-19 healthcare misinformation dataset.
[6] BERT: Pre-training of deep bidirectional transformers for language understanding.
[7] Stance detection in COVID-19 tweets.
[8] SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours.
[9] Description of the system developed by team Athene in the FNC-1.
[10] A retrospective analysis of the fake news challenge stance-detection task.
[11] Few-shot cross-lingual stance detection with sentiment-based pre-training.
[12] COVIDLies: Detecting COVID-19 misinformation on social media.
[13] Exploring listwise evidence reasoning with T5 for fact verification.
[14] Characterizing COVID-19 misinformation communities using a novel Twitter dataset.
[15] Twitter data augmentation for monitoring public opinion on COVID-19 intervention measures.
[16] Recent advances in natural language processing via large pre-trained language models: A survey.
[17] SemEval-2016 task 6: Detecting stance in tweets.
[18] A natural language processing model to analyse COVID-19 content on Twitter.
[19] Combining fact extraction and verification with neural semantic matching networks.
[20] Office of the Surgeon General et al. Confronting health misinformation: The US Surgeon General's advisory on building a healthy information environment.
[21] Fighting an infodemic: COVID-19 fake news dataset.
[22] The Fake News Challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news.
[23] Intermediate-task transfer learning with pretrained language models: When and why does it work?
[24] Exploring the limits of transfer learning with a unified text-to-text transformer.
[25] Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
[26] Classification and clustering of arguments with contextualized word embeddings. In: Annual Meeting of the Association for Computational Linguistics.
[27] Is stance detection topic-independent and cross-topic generalizable? A reproduction study.
[28] Stance detection benchmark: How robust is your stance detection? In: KI - Künstliche Intelligenz.
[29] FakeCovid: A multilingual cross-domain fact check news dataset for COVID-19.
[30] Transfer learning from transformers to fake news challenge stance detection (FNC-1) task.
[31] BERT for evidence retrieval and claim verification.
[32] The fact extraction and VERification (FEVER) shared task.
[33] Attention is all you need.
[34] COVID-19 misinformation.
[35] A broad-coverage challenge corpus for sentence understanding through inference.
[36] BERTScore: Evaluating text generation with BERT.
[37] ReCOVery: A multimodal repository for COVID-19 news credibility research.
[38] Detection and resolution of rumours in social media: A survey.

Appendix A
Each misinformation item is described as a sentence based on the titles of fake news or fact-checking articles. If both titles are provided by CoAID, we manually choose the simpler and clearer one. For example, a fact-checking article has the title "The RT-PCR test for the virus that causes COVID-19 detects human DNA on chromosome 8, therefore all tests will give a positive result", and the corresponding news article has the title "WHO Coronavirus PCR Test Primer Sequence is Found in All Human DNA". We use the title of the news article, because it is simpler and does not involve the technical term "chromosome". In addition, we process the descriptions of misinformation items as follows.

• Split: Some rumors contain several inaccurate points, and the discussions about these points are also scattered, so we manually split such rumors into multiple items.
• Merge: Some misinformation items are from the same news but might be reported by different media or checked by different websites. In such cases, we merge them into one item.
• Correct: Some titles from fact-checking websites are not the rumors themselves but the corrected information; for example, the title "No evidence that 5G is being forcibly installed in schools" is a clarification of the fake news. We therefore change it into "5G is being forcibly installed in schools".
• Simplify: Some titles are the exact words spoken by certain people, and we summarize them into one sentence to make them clear and complete. For example, for the title "The U.S. went from 75,000 flu deaths last year in America to almost 0; are there allocation games being played to manipulate the truth?", we rewrite it as "Trump claims that flu deaths in America are down to almost zero and data is being manipulated".

Keyword extraction follows the principles below. In addition, we test the keywords on Twitter's search interface to obtain as many relevant tweets as possible.

• We first use relevant entity names as queries. For example, for the rumor "Shanghai government officially recommends Vitamin C for COVID-19", the keywords are Shanghai, Vitamin C, and COVID-19.
• If the claims are made by famous people, we usually add the individual's name to make the query specific. For example, the keywords for "Nobel laureate Luc Montagnier claimed that the coronavirus genome contained sequences of HIV (the virus that causes AIDS)" are Luc Montagnier, coronavirus, and HIV.
• Some verbs convey important information in a sentence, indicating the relationship between two entities or the object's action. For the sentence "Trump administration refused to get coronavirus testing kits from the WHO", the keywords are Trump, refused, and WHO test kits.
• Sometimes the keywords based on the above strategies still return a large number of irrelevant tweets. In such cases, we directly use a sentence as the query. For example, the tweets for the item "Quotes Joe Biden as saying people who have never died before are now dying from coronavirus" are retrieved with the query "people who have never died before are now dying from coronavirus".
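As a hypothetical starting point for automating keyword construction, as suggested in the discussion, entity names and noun chunks could be extracted automatically and then filtered manually. The spaCy model name below is an assumption, not part of our pipeline.

# Sketch of automated keyword candidates from a rumor sentence.
# Requires the en_core_web_sm model (an assumption):
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
rumor = "Shanghai government officially recommends Vitamin C for COVID-19"
doc = nlp(rumor)

entities = [ent.text for ent in doc.ents]                # named entities as candidates
noun_chunks = [chunk.text for chunk in doc.noun_chunks]  # noun phrases as candidates
keywords = list(dict.fromkeys(entities + noun_chunks))   # de-duplicate, keep order
print(keywords)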
For the sentence "Trump administration refused to get coronavirus testing kits from the WHO", the keywords are Trump, refused and WHO test kits.• Sometimes the keywords based on the above strategies still give a large number of irrelevant tweets. For such cases, we directly use a sentence as the query. For example, the sentence "Quotes Joe Biden as saying people who have never died before are now dying from coronavirus." are retrieved by the query "people who have never died before are now dying from coronavirus". According to the sampling strategy described in Section 3.1, each example will be assigned a probability of being selected. Specifically, for each query type of each misinformation item, if the number of relevant tweets does not exceed 6, then the chosen probability is 1; otherwise, the probability is the sum of two items. One is the probability of being selected the first time p 1 = 6/N , where N is the number of tweets of this query type; the other is the probability of being selected the second time p 2 = (1 − p 1 ) · m/N r , where m is the number of tweets stilled needed to reach 24, and N r is the number of remaining tweets for this misinformation item. After determining the selected probability for each example, if a random number between 0.0 and 1.0 is less than the probability, then this example will be selected for annotation; otherwise this will not be selected.