title: Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement authors: Sharma, Karishma; Ferrara, Emilio; Liu, Yan date: 2022-02-24 Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate this content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level when the stance of the user does not align with the news source or article credibility. We propose a framework that uses a detection model self-trained on the initial weak labels, with uncertainty sampling based on the entropy of the model's predictions, to identify potentially inaccurate labels and correct them using self-supervision or relabeling. The framework will incorporate the social context of the post, in terms of the community of its associated user, to surface inaccurate labels towards building a large-scale dataset with minimal human effort. To provide labeled datasets that distinguish misleading narratives, where information might be missing significant context or have inaccurate ancillary details, the proposed framework will use the few labeled samples as class prototypes to separate high-confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated by providing a large-scale misinformation dataset on COVID-19 vaccines. In recent times, malicious and coordinated promotion of misinformation, coupled with uncertainties in real-world events, has sparked a plethora of false and misleading information on social media platforms [24]. Social media platforms struggle to eliminate this content effectively and in a timely manner, and have recently been attempting to address the problem through more crowdsourced approaches to misinformation identification [3]. Twitter introduced 'Birdwatch' in 2021, which allows people to identify tweets they believe are misleading and provide notes with additional context. This is an effort to respond more quickly to the diverse false and misleading claims present on the platform. However, the biggest challenge for Twitter is ensuring that Birdwatch itself does not fall prey to malicious coordinated operations. Secondly, due to ideological biases about the truth, it is a challenge to build reliable consensus from platforms like Birdwatch [3]. The central challenge in timely misinformation detection, mitigation, and analysis is the difficulty in obtaining labeled misinformation datasets at scale, especially in new domains. Moreover, diverse and evolving false and misleading information, driven by changing real-world events, constantly surfaces on social media [22]. In the literature, the two primary approaches to constructing misinformation datasets are based on either collecting available fact-checked claims from organizations like Snopes, PolitiFact, etc. 
[14, 27], or utilizing news-source credibility labels based on reliable or unreliable sources listed by fact-checking organizations [21, 34]. The former approach suffers from claim selection bias and relies on a slow, human-intensive fact-checking process that does not scale, making it unsuitable for timely identification in new and evolving domains. The latter approach yields more diverse, large-scale misinformation-labeled social media posts from a handful of analyzed news sources, but can introduce inaccurate labels into the dataset. Proposed Approach: We address the above shortcomings with an alternate approach to constructing misinformation datasets. We propose to utilize news-source credibility labels as weak labels for social media posts, and use model-guided refinement of labels to construct large-scale, diverse misinformation datasets in new domains. The news-source credibility based labels can be inaccurate at the article or social media post level when the stance of the user does not align with the news-source or article credibility. Therefore, for label refinement, we propose to use self-supervision from any generic misinformation detection model, together with social context modeling of the social media posts. In this framework, we use a misinformation detection model trained on the initial weak labels, with uncertainty sampling based on the entropy of the model's predictions, to identify potentially inaccurate labels and correct them using self-supervision or relabeling. In addition, we incorporate the social context of the post, in terms of the community of its associated user, to model user credibility and stance in the discourse. The model-guided refinement is used to surface inaccurate labels iteratively and minimize human labeling effort, enabling timely scaling to large misinformation datasets with greater coverage. The model-guided confidence in the labels is used to filter out or correct inaccurate weak labels, and the resulting dataset of social media posts with their engagements is labeled as misinformation/reliable, with an associated model confidence in each label. The misinformation can be further separated into finer-grained labels (such as false, mixture, true [22]), which can be obtained after label refinement with a semi-supervised classification setup [10, 32] in the proposed approach. Specifically, to provide finer-grained labels, we use the few human-labeled examples as class prototypes to separate high-confidence examples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. More details on the labels and annotation guidelines are discussed later. The approach is demonstrated and applied to construct and provide the research community with a large-scale public misinformation dataset on COVID-19 vaccines. Contributions: Our contributions in this work are: • Model-guided label refinement approach for timely construction of large-scale misinformation datasets. • Label annotation guidelines and a flexible framework that can generalize to other misinformation domains. • Evaluation and construction of a public misinformation dataset on COVID-19 vaccine social media data from Twitter. In the following sections, we discuss the challenges in misinformation dataset construction, limitations of existing methods, related works, the proposed approach and experiments, and conclude with a discussion of limitations and future work. Misinformation datasets. 
Misinformation, referring to false and distorted facts on social media, has been addressed in numerous studies. Misinformation detection and mitigation techniques, and related datasets and tasks, are comprehensively surveyed in [22]. The construction of misinformation datasets is a central task to enable research on misinformation detection, mitigation, and analysis. Existing misinformation datasets are either general, such as over a specific time period [14], covering content, social media engagements, and temporal features [27], or topic-specific, such as a dataset on the Syrian war [19]. The label scheme of datasets and the type of information collected vary based on the specifics of the task. For instance, for claim verification with external knowledge, datasets include content and evidence collected from the web that supports or refutes the claims in the content [16]. The general detection task requires learning discriminative classifiers for misinformation claims, and usually includes content and its social media engagements; the labels depend on the distinction made during data collection, e.g., fake/real news [5], unreliable/reliable [34], rumors/non-rumors [14]. A comprehensive summary of several popular datasets in terms of their label classes and features is available in [22, 27]. Misinformation detection. Misinformation detection relies on learning discriminative features from labeled datasets, often utilizing propagation features, content features, and account features [14, 17]. Wang et al. [30] additionally use weak supervision from users' reports to augment labeled misinformation datasets with unlabeled examples for misinformation detection. Shu et al. [26] use weak social supervision to similarly improve misinformation detection, i.e., where social media engagements are abundant but labeled misinformation content is not, modeling the interactions between social media users and contents to improve discrimination of misinformation content. Both of these works are similar in flavor, in that they augment misinformation labeled datasets with auxiliary information to improve detection. In our work, we address how to scale the construction of misinformation labeled datasets using news-source credibility as initial weak labels. Label refinement. Label noise in real-world data is common, and there are many different approaches to detect, remove, or correct it which are relevant to this work [1, 11, 20]. Some works use local label inconsistencies in the feature space for detection [20], others utilize the training loss of deep neural network classifiers on the dataset to filter examples with high training loss in early epochs as noisy [1], or utilize entropy or variance in classifier predictions [8]. Other works focus on making classifiers more robust to label noise in datasets [29]. Active learning works address the selection of instances from unlabeled or labeled datasets that are most useful to obtain human labels for, in order to learn better models from the data, but they depend on the presence of an 'oracle', i.e., a human labeler, and utilize the expected model change from human labeling to select which instance to pick [11]. Here, we propose to additionally incorporate social context in label refinement, since in social media applications, the structure and context of social media users, as we show, provide relevant, complementary signals. 
We collected social media posts on COVID-19 vaccines using Twitter's streaming API from December 9, 2020 to February 24, 2021 with keywords related to the vaccines ("Vaccine", "Pfizer", "BioNTech", "Moderna", "Janssen", "AstraZeneca", "Sinopharm"). The stream fetches a ∼1% sample of all tweets containing at least one of the keywords from the platform in real time. The data collection period started just prior to the first Emergency Use Authorization of the Pfizer-BioNTech COVID-19 vaccine in the U.S. The dataset contains 4,764,701 unique user accounts with 15,158,523 collected tweets. Previous works that use news-source credibility for misinformation labeling include news sources analyzed by different fact-checking organizations [4]. Bozarth et al. 2020 found that differences in lists based on the fact-checker they are compiled from affect prevalence, but not the temporal trends or differences in narratives of misinformation vs. legitimate contents labeled by these methods. In this work, more than prevalence, we are interested in curating news sources to provide weak labels covering a diverse set of possible misinformation found in social media posts. Therefore, we compile news-source credibility labels from multiple fact-checking resources to encompass a wide range of low-credibility news sources. Following [21], we collect lists of unreliable and conspiracy news sources from three fact-checking resources: Media Bias/Fact Check, NewsGuard, and Zimdars [35]. NewsGuard maintains a repository of news publishing sources that have actively published false information during the COVID-19 pandemic. The listed sources from NewsGuard, accessed on September 22, 2020, are included, along with low and very-low factual reporting sources listed as questionable by Media Bias/Fact Check, and sources tagged with unreliable or related labels and conspiracy/pseudoscience from Zimdars' list. A list of reliable sources [21], covering high-factual sources, is also collected for obtaining the weak labels. In total, we obtained 1380 unreliable (or conspiracy) and 124 reliable sources. This choice of lists provides informative weak labels (ref. Sections 4 and 6) but can be replaced or updated with other resources on news-source credibility analysis as needed. On social media, content propagates through the network when accounts engage with posts by re-sharing (retweets), replying (reply tweets), or quoting (quote tweets are retweets with a comment). A reply tweet can also be retweeted or quoted, and likewise for quote tweets. Therefore, source posts receive direct and subsequent indirect engagements through propagation over the network. This flow of information is referred to as an 'information cascade' [33] or tweet cascade. We represent it as a sequence of tweets, ordered by their time-stamps. The source post is the first tweet in the cascade. Formally, a cascade can be represented with the user (u), tweet (tw), and temporal (t) features of when the users posted the engagements [18] as $C = [(u_1, tw_1, t_1), (u_2, tw_2, t_2), \dots, (u_n, tw_n, t_n)]$, ordered such that $t_1 \le t_2 \le \dots \le t_n$. Extracting tweet cascades. To extract the content cascades from the collected data, we use the retweet/reply/quote links between the tweets, available from their metadata, and construct a directed graph of the tweets. We find the weakly-connected components of this graph, and each corresponds to one tweet cascade [23]. 
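To make this concrete, the following is a minimal sketch (not the implementation used in this work) of cascade extraction via weakly-connected components, together with the weak-label assignment described next. The tweet dictionary fields (`referenced_id`, `timestamp`, `urls`) and the two placeholder domain lists are illustrative assumptions.

```python
from urllib.parse import urlparse
import networkx as nx

UNRELIABLE = {"example-unreliable.com"}   # placeholder for the compiled low-credibility domains
RELIABLE = {"apnews.com", "npr.org"}      # placeholder for the compiled high-factual domains

def domain(url):
    """Normalize a URL to its host (rough heuristic; requires Python 3.9+ for removeprefix)."""
    return urlparse(url).netloc.lower().removeprefix("www.")

def extract_cascades(tweets):
    """tweets: dict tweet_id -> {'referenced_id': parent id or None, 'timestamp': float,
    'user': str, 'urls': [str]}. Returns cascades as lists of tweet ids ordered by time."""
    g = nx.DiGraph()
    g.add_nodes_from(tweets)
    for tid, tw in tweets.items():
        if tw["referenced_id"] in tweets:            # retweet / reply / quote link
            g.add_edge(tw["referenced_id"], tid)
    return [sorted(comp, key=lambda t: tweets[t]["timestamp"])
            for comp in nx.weakly_connected_components(g)]   # one component = one cascade

def weak_label(cascade, tweets):
    """Weakly label a cascade from the news domains referenced by its source (first) tweet."""
    domains = {domain(u) for u in tweets[cascade[0]].get("urls", [])}
    if domains & UNRELIABLE:
        return 1        # unreliable / conspiracy
    if domains & RELIABLE:
        return 0        # reliable
    return None         # no recognized news-source reference; left unlabeled
```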
Weak labels using news-sources. The cascade is weakly labeled based on the news-source credibility lists if the source post references one of the news sources. The news-source label (unreliable, conspiracy, reliable) is assigned to the cascade as its weak label. We extracted tweet cascades from the collected Twitter data stream sample, keeping 490,638 user accounts that have at least 5 collected tweets in the sampled stream. The total number of tweets for these accounts is 9M. We weakly labeled the tweets as described, and obtained 10,377 reliable cascades and 4,267 unreliable or conspiracy cascades. These 14.6k cascades with weak labels will be used to construct the misinformation dataset as described in later sections. Existing works apply two primary approaches to construct misinformation labeled datasets, either from claims verified by fact-checking organizations, or using the credibility of the sources publishing the content. Both approaches suffer from drawbacks. We describe the approaches and summarize the drawbacks below. Fact-checking based labeling. One approach to collecting labeled misinformation contents, in the form of news articles, claims, or social media posts, is from contents verified by fact-checking websites (e.g., Snopes, PolitiFact). This approach is frequently used to construct datasets with a few hundred or thousand labeled misinformation contents [12, 14, 27]. Then, for the fact-checked claims, related social media engagements are collected by searching for content keywords using social media APIs (e.g., the Twitter search API). The matched social media posts containing these keywords are inspected to determine if they are relevant to the content [12], or the search keywords are refined until reasonably relevant matches of social media engagements are collected [14]. News credibility-based labeling. The other approach used for misinformation labeling is based on the credibility of news sources [4]. Social media posts referencing content from any of these sources are labeled based on the source credibility to provide a dataset of unreliable and reliable contents. This is frequently used to identify misinformation posts from social media discourse for timely analysis in new domains [28, 34]. The drawbacks are summarized as follows. • Fact-checking based labeling. It can have claim selection bias, since fact-checkers usually select claims to verify based on relevance or popularity (e.g., PolitiFact), which can limit the diversity of collected claims for detection and bias the analysis. It is also slow, human-intensive, and less scalable. • News source credibility-based labeling. It scales to many social media posts using a handful of unreliable (questionable) and reliable sources, resulting in more diversity in claims, but has inherent label noise at the article or social media post level. Therefore, it can only provide weak labels. First, we analyze the correlation between the labels from the two approaches. We collected fact-checked claims from Snopes.com and NewsGuard on COVID-19 vaccines. For each fact-checked claim, we find tweet cascades that discuss the claim by searching for text matches to words related to the claim, e.g., "Myth: The COVID-19 vaccine will use microchip surveillance technology created by Bill Gates-funded research." We search for source tweets with the words "chip", "microchip", "surveillance" for matches. If nothing is found, we refine the search with "gates" and sample to check for matches. NewsGuard provides only Myths (false claims), while Snopes provides varying factuality labels (true, mostly true, mixture, false, etc.). 
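A minimal sketch of the keyword-based matching of fact-checked claims to source tweets described above; the per-claim keyword sets and the fallback refinement are illustrative assumptions rather than the exact queries used.

```python
def match_claim(source_tweets, keywords, refine=None):
    """Return ids of source tweets whose text contains any claim keyword.
    source_tweets: dict tweet_id -> text; keywords / refine: lowercase word lists."""
    hits = {tid for tid, text in source_tweets.items()
            if any(k in text.lower() for k in keywords)}
    if not hits and refine:   # e.g., fall back to "gates" for the microchip myth above
        hits = {tid for tid, text in source_tweets.items()
                if any(k in text.lower() for k in refine)}
    return hits

# Example usage for the Bill Gates microchip myth:
# match_claim(tweet_texts, ["chip", "microchip", "surveillance"], refine=["gates"])
```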
From the Snopes collection tagged as COVID-19 vaccines, we obtained the claims labeled as one of these types (tagged as fact-checked) or labeled as news articles (AP News, The Conversation). Associated Press (AP) News is rated as very high factual reporting and least biased by Media Bias/Fact Check. Therefore, we take claims from AP News as reliable. For more reliable claims, we also directly crawl the websites of AP News and NPR News (same factual and bias rating as AP News) for news articles (extracting the article heading, claim/short description, and date) using Python web scraping. Snopes, AP News, and NPR News together give us 400 claims, which we sample to find matching tweet cascades. We inspected each source tweet and labeled it based on the stance of the tweet towards the fact-checked claim or reliable news article as 'true, mostly true, mixture, false, mostly false, unproven, debunk', similar to Snopes. We found 256 tweet cascades to label based on the Snopes fact-checks and news articles. This forms our evaluation test set of tweets with human expert ground-truth labels. To additionally construct a validation set of human-labeled tweets, we used stratified sampling of 150 additional tweets from the 14k tweet cascades and labeled them based on similar annotation guidelines as Snopes, described in the next subsection. In Fig. 1, we compare the news-source credibility based labels (unreliable/conspiracy/reliable) with the inspected fact-checked claim based labels for the human-labeled tweets in the evaluation test and validation sets mentioned above. Overall, the news credibility labels appear to be well correlated with actual human labels. Individual inaccuracies can still exist, but with the large-scale weakly-labeled data smoothing out individual errors, we could learn to refine the weak labels to construct the misinformation dataset. Labeling misinformation is already challenging, more so because misinformation is not easy to specify [22]. It can lie on a spectrum of truth, including false, conspiracy [7], and misleading or distorted information such as missing or misleading context or a mixture [13, 31]. We find that the fact-checking organization Snopes uses a well-defined label schema that is general enough to fit any domain, and yet manages to cover all types and nuances of distortion we found upon examining tweets in the vaccines dataset, and generally in the literature [22]. Snopes includes several labels to cover varying degrees of truth and other deceptive tactics like miscaptioned, misattributed, and scam. We work with the 6 most relevant Snopes categories, and add the 'Debunk' category based on what we observed in tweets. These cover even very specific types of anti-vaccine misinformation and science distortions [13]. The label scheme is proposed below, derived from Snopes and tweet inspection. We refine the label definitions to make the distinctions between them and their coverage explicitly clear, based on the inspected tweet data, for labeling social media posts based on their factualness. Guideline: Label the tweet based on what the tweet is trying to say or claim, and how factual its claim is. Choose one of the below labels for the tweet: • True: Primary elements of the claim are demonstrably true. • Debunk: The tweet calls out or debunks inaccurate information. • Mostly true: Primary elements of a claim are demonstrably true, but some of the ancillary details surrounding the claim may be inaccurate. 
• Mixture: The claim has significant elements of both truth and falsehood (including, e.g., significant missing context or misleading framing that might cause one to be misled about the truth). • Mostly false: Primary elements of a claim are false, but ancillary details may be accurate. • False: Primary elements of a claim are false or conspiratorial. • Unproven: There is insufficient evidence that the claim is true, but declaring it false would require the difficult (if not impossible) task of proving a negative. We evaluated the label scheme and guidelines on a random subset of 200 samples from the collected tweet cascades. We compute the inter-annotator agreement for the tweets between two annotators, one graduate non-native English speaker familiar with misinformation research and one undergraduate native English speaker not familiar with the research. The agreement is moderate if considered across the 7 label categories (0.61 Cohen's kappa), and substantial (0.77 Cohen's kappa) if binarized as (true, debunk, mostly true) vs. (mixture, mostly false, false, unproven) as high-level abstractions. Both annotators followed the same guideline and instructions with typical and difficult examples (noted in Appendix A). We propose an alternate approach to constructing misinformation datasets at scale, addressing the shortcomings of existing approaches. We propose to use news-source credibility as weak labels and leverage the large-scale weakly-labeled data with label refinement to construct misinformation datasets, minimizing time-consuming human labeling efforts that do not scale. In the previous section, we observed that the news credibility labels are correlated overall with actual fact-checked claim based labels, and with the large scale of the weakly-labeled dataset smoothing out individual errors, we could learn to remove inaccuracies. The inaccuracies in these weak labels arise at two levels: 1) Article level. First, not all content published by misinformation news sources necessarily contains misinformation, although these sources tend to be unreliable or repeatedly violate journalistic reporting principles, enough to be included as low factual reporting sources by experts. 2) Tweet level. Secondly, the weak label can be incorrect based on the stance of the social media post towards the content from the news source. The post or tweet might reference content from the source with a supporting viewpoint or restate it as is, or, in other cases, oppose or distort the content from the source, which would result in a mislabeling at the level of a tweet. In Fig. 2, we provide the proposed framework for misinformation dataset construction and labeling in new domains at scale. The weak labels unreliable, conspiracy, and reliable on tweet cascades are from news-source credibility. They are utilized to construct the initial dataset with weak labels in {0, 1}, where 1 denotes unreliable/conspiracy and 0 denotes reliable. Our goal is to remove or correct inaccurate weakly-labeled instances in the dataset, and output high-level distinctions with misinformation labeled as 1 and reliable information labeled as 0, together with model-guided predictions of confidence in the labels. In the proposed framework, we make use of any generic misinformation detection model to guide the weak label refinement. Classifiers are often utilized to estimate uncertainty in the instance labels from the loss or model predictions in label noise methods [1, 6]. 
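As a concrete illustration of prediction-uncertainty scoring (the entropy criterion is formalized in the next paragraph), the sketch below flags weakly labeled instances whose predicted class distribution has high entropy, or whose confident prediction contradicts the weak label. The entropy threshold and array names are assumptions for illustration, not values from this work.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of each row of predicted class probabilities, shape (n, num_classes)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def flag_suspect_labels(probs, weak_labels, entropy_threshold=0.5):
    """probs: (n, 2) detection-model probabilities; weak_labels: (n,) array in {0, 1}.
    Returns boolean masks for low-confidence and confidently-inconsistent instances."""
    h = prediction_entropy(probs)
    predicted = probs.argmax(axis=1)
    low_confidence = h > entropy_threshold                       # near the decision boundary
    inconsistent = (h <= entropy_threshold) & (predicted != weak_labels)
    return low_confidence, inconsistent
```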
In this work, we use the entropy of the misinformation detection model's predictions to measure closeness to the decision boundary [6]. High entropy indicates greater model uncertainty about the label. The entropy in the model predictions for an instance $x$ is defined as $H(x) = -\sum_{c \in C} p_c(x) \log p_c(x)$, where $p(x)$ is the vector of predicted probabilities from the detection model and $C$ is the set of classes, here $c \in \{0, 1\}$. We train the detection model on all weakly-labeled data, and then filter out high-entropy instances. We also filter out tweets with low-entropy model predictions if the initial weak label and the predicted model label for the tweet are inconsistent with each other. This covers instances where the model is confident in its prediction but either the weak label or the predicted label is incorrect. With the filtered dataset, we retrain the detection model, and repeat until the improvement of the model on a small held-out human-labeled or fact-checked validation set is marginal. This iterative self-training improves the detection model and its signals of the inaccuracies in the weak labels. In each iteration, the retrained detection model is applied to all instances in the initial dataset to calculate the entropy scores and filter for the next iteration, as it can now make more informed filtering decisions than in the previous iterations. In misinformation applications, social media engagements are known to provide useful signals for misinformation detection [27]. Here, we propose that the social context can also be useful in guiding the construction of misinformation labeled datasets. We describe how social context can be modeled and leveraged in this application. We incorporate the social context of the post using the community of its associated user to model a user account's credibility and stance in the discourse. Social media discourse tends to be segregated into echo-chambers of user accounts sharing similar opinions [9]. User accounts follow each other based on their interests, and become more exposed to content that aligns with their interests and ideologies [22]. The retweet graph between user accounts that retweeted each other's tweets can be used to identify user communities. Retweets are seen as a form of endorsement of content, and edges with at least two retweets are retained to capture links of similar interests between the user accounts [9]. We identify user account communities from the retweet graph using the Louvain method [2]. To leverage the social context, we identify communities that dominantly post or share misinformation sources vs. reliable sources. Several works have found that informed and misinformed user accounts exhibit echo-chambers in their network structure [15, 25]. For the identified communities, if the tweets of user accounts in a community dominantly contain references to misinformation news sources, the community is likely to be less credible, or more misinformed. Misinformation communities would involve either malicious groups promoting misinformation, or groups with beliefs that support or are vulnerable to believing and sharing misinformation on the topic of the discourse [15]. We can thereby leverage this to encode a user account's credibility and stance with respect to misinformation on the topic of the discourse. Here, we denote the social context of a tweet as the community of the user account that posted the tweet. 
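The sketch below illustrates the social-context signal just described: build the retweet graph keeping edges with at least two retweets, detect Louvain communities, and mark each community as dominantly unreliable, dominantly reliable, or mixed based on the weak labels of its members' news-referencing tweets. The 0.7 dominance threshold and the data structures are assumptions for illustration; it is not the implementation used in this work.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities  # requires networkx >= 2.8

def build_retweet_graph(retweet_counts):
    """retweet_counts: dict (user_a, user_b) -> number of times user_a retweeted user_b.
    Edges with at least two retweets are kept as endorsement links."""
    g = nx.Graph()
    for (a, b), count in retweet_counts.items():
        if count >= 2:
            g.add_edge(a, b)
    return g

def community_credibility(graph, user_weak_labels, dominance=0.7):
    """user_weak_labels: dict user -> list of weak labels (1 unreliable/conspiracy, 0 reliable)
    of the news-referencing tweets they posted. Returns dict user -> community tag."""
    context = {}
    for community in louvain_communities(graph, seed=0):
        labels = [l for u in community for l in user_weak_labels.get(u, [])]
        if not labels:
            tag = "mixed"
        else:
            frac_unreliable = sum(labels) / len(labels)
            tag = ("unreliable" if frac_unreliable >= dominance
                   else "reliable" if frac_unreliable <= 1 - dominance
                   else "mixed")
        for u in community:
            context[u] = tag                      # social context of a user's tweets
    return context
```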
Given the community structure, a tweet cascade is detected as possibly mislabeled if: • the user account belongs to an identified dominant misinformation (unreliable/conspiracy) news-source sharing community, but the tweet is weakly labeled as reliable; • or, the account is in a dominantly reliable information sharing community, but the weak label is unreliable/conspiracy. For mixed communities with no clear dominant reliable or misinformation sharing pattern, we have no definitive social context for label refinement, and use only the detection entropy. We jointly use the social context signal and the detection model entropy to guide the identification of post-level mistakes in the weakly-labeled data. The proposed framework (Fig. 2) is iterative and flexible. We can replace the misinformation detection model with any modeling choice, and use either self-supervision and/or human/model based relabeling. The detection model is first trained by itself with self-supervision from the model predictions. Then the improved detection model signals are jointly combined with the social context for further label refinement. The process is iteratively repeated, with evaluation of the detection model on a small held-out human-labeled or fact-checked validation set as a proxy for label quality in the large-scale dataset. The procedure for label refinement from the detection model and social context is described in Algorithm 1. The subroutine takes as input the instance (tweet cascade) with its weak label, the detection model trained in the previous iteration, and the social context. Given the model state, we generate three possible actions: (1) RETAIN weak label, (2) FLIP weak label, (3) QUERY label. The retain action keeps the instance with its weak label in the dataset, flip is model-guided relabeling (without human resources), and query is for active human relabeling of the model-suggested instance. If human resources are not available, then QUERY can be replaced by REMOVAL (discarding the instance due to low confidence in its label or due to contradictory confident signals from the detection model and social context). The states from the detection model and social context are defined as follows for an instance: • M-lc: high entropy in the detection model prediction (M-lc stands for low confidence, that is, high entropy). • M-consistent and M-inconsistent: if the entropy is low and the predicted label equals the weak label, then the instance is consistent (if the predicted label and weak label are opposite, then inconsistent). • S-unk: no social context signal, either because the user's community is not dominantly reliable or unreliable/conspiracy but a mixture, or because the user is not clustered in any main community. • S-consistent and S-inconsistent: the social context of a user account (its community label) is (in)consistent with the weak label of its tweet (as described earlier). Algorithm 1 takes the dataset instance, its weak label, the detection model, and the social context as input, and outputs one of the actions RETAIN, FLIP, or QUERY (e.g., if M-consistent and S-consistent, then RETAIN). The objective of the procedure is to minimize human relabel queries, and to incorporate high-confidence signals from both the detection model and social context to remove or correct as many inaccurate weak labels as possible, while keeping as many correctly weakly-labeled instances as possible. If the signals reinforce each other, the procedure can more confidently take an action without human label querying (or removal/discarding of the instance). Given the state, the appropriate action is selected by the procedure in Alg. 1. 
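Since the full action table of Algorithm 1 is not reproduced here, the sketch below gives one plausible mapping from the detection-model and social-context states to the RETAIN/FLIP/QUERY actions. The rules for the low-confidence and unknown-context cases are assumptions consistent with the stated objective (act without querying when the two signals reinforce each other, query when they conflict or are both weak), not the authors' exact procedure.

```python
import math

def entropy_state(prob_misinfo, weak_label, threshold=0.5):
    """Detection-model state from its predicted probability of misinformation."""
    p = min(max(prob_misinfo, 1e-12), 1 - 1e-12)
    h = -(p * math.log(p) + (1 - p) * math.log(1 - p))    # binary prediction entropy
    if h > threshold:
        return "M-lc"                                       # low confidence (high entropy)
    predicted = int(prob_misinfo >= 0.5)
    return "M-consistent" if predicted == weak_label else "M-inconsistent"

def social_state(community_tag, weak_label):
    """Social-context state from the user's community credibility tag
    ('unreliable', 'reliable', or 'mixed', as in the previous sketch)."""
    if community_tag == "mixed":
        return "S-unk"
    consistent = (community_tag == "unreliable") == (weak_label == 1)
    return "S-consistent" if consistent else "S-inconsistent"

def refine_action(m_state, s_state):
    """Illustrative (assumed) action table for label refinement."""
    if m_state == "M-inconsistent" and s_state == "S-inconsistent":
        return "FLIP"      # both signals confidently contradict the weak label
    if m_state == "M-consistent" and s_state in ("S-consistent", "S-unk"):
        return "RETAIN"    # confident model agreement, no contradicting social signal
    if m_state == "M-lc" and s_state == "S-consistent":
        return "RETAIN"    # uncertain model, but social context supports the weak label
    return "QUERY"         # conflicting or weak signals; replaced by REMOVE without a labeler
```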
Fine-grained semi-supervised classification. The dataset is refined based on retaining weak labels, model-based relabeling, and human relabeling or removal of instances. The retained and refined instances form the output constructed dataset, with the associated model confidence in each label. The fine-grained labels (false, unproven, mixture, mostly false, mostly true, true, and debunk) are obtained from human labeling, but only on selected instances. For the remaining instances, we can use a semi-supervised classification setup [10, 32] to obtain the fine-grained distinctions. The few obtained human-labeled instances become class prototypes to separate the rest into the seven classes. The distinctions can be very nuanced with varying degrees of truth, and difficult for a model to distinguish very accurately, so we provide these as auxiliary outputs. We study the proposed approach for constructing a large-scale public misinformation dataset on COVID-19 vaccines. We use iterative self-training of the misinformation detection model CSI [18], trained first on the initial weakly-labeled cascades. We use low-quality news sources for weak labels on the collected Twitter dataset, compiled from fact-checking resources as described earlier in Section 3. We have a total of 14.6k tweet cascades, with 10,377 weakly labeled as reliable and 4,267 weakly labeled as unreliable/conspiracy. With this setting, we experiment with the proposed framework for large-scale misinformation dataset construction from weak news-source labels. Table 2: Results for noise detection in weak labels with the proposed label refinement approach for misinformation dataset construction on COVID-19 vaccines. Evaluation metrics: Rec (noise recall), Prec (precision), Frac UQ (fraction of unwanted queries), F1 (F1 of detected noise in weak labels). The evaluation test set of tweet cascades contains 256 tweets with ground-truth fact-checked claim based labels, obtained by searching for tweets related to Snopes fact-checks and AP News/NPR News on COVID-19 vaccines and labeling them with the 7 fine-grained labels according to the annotation scheme and fact-checked claims. For experiments, a human-labeled validation set of 150 tweets, based on the annotation scheme and guidelines, is also constructed and held out from the 14.6k tweet cascades (as described in Sec. 4.2). 6.1.1 Evaluation tasks. We cannot directly measure the quality of the constructed misinformation dataset, since we cannot obtain ground-truth fact-checker (e.g., Snopes) labels on all 14.6k tweet cascades. We instead evaluate on the fact-checked claim based test subset of 256 tweet cascades using the following evaluation metrics: (i) Misinformation detection performance on the test set. Label quality in the dataset should be positively correlated with misinformation detection accuracy on ground-truth labeled data. (ii) Label correction accuracy on the validation and test sets, and (iii) the number of wasted queries generated in the label refinement procedure, to measure human resources that are inefficiently utilized. (ii) The baselines and proposed experiments are evaluated for label correction accuracy on the ground-truth test set and validation set. We have the initial weak labels and correct misinformation labels for the test and validation sets. Therefore, we can measure the recall (Rec), precision (Prec), and F1 of the noise in the weak labels (i.e., where the weak label and the ground-truth fact-checked label are not aligned). 
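A minimal sketch of how the noise-detection metrics on the labeled test set could be computed, assuming a boolean mask of instances the method flagged (selected for FLIP or QUERY, as described next) and a mask of actual weak-label errors; the function and variable names are illustrative.

```python
import numpy as np

def noise_detection_metrics(flagged, actual_noise):
    """flagged: bool array, instance selected for FLIP or QUERY by the method.
    actual_noise: bool array, weak label disagrees with the ground-truth label."""
    flagged, actual_noise = np.asarray(flagged), np.asarray(actual_noise)
    tp = np.sum(flagged & actual_noise)
    recall = tp / max(actual_noise.sum(), 1)      # fraction of actual noise detected
    precision = tp / max(flagged.sum(), 1)        # fraction of detections that are actual noise
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"Rec": recall, "Prec": precision, "F1": f1}
```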
The instances detected as noise by the methods are those selected for FLIP or QUERY (REMOVE) actions (as they are predicted by the method as possibly having a mislabeled weak label). Recall is the fraction of actual noise in the weak labels that is correctly detected by the methods, and precision measures the correctly recalled noise in weak labels out of all instances detected as noise by the methods. F1 is the harmonic mean of precision and recall. (iii) We additionally propose a metric to measure how efficiently the resources are utilized by the baselines and proposed methods. We define Frac UQ (the fraction of correctly weakly-labeled instances that are assigned the QUERY (REMOVE) action for human relabeling (removal), i.e., unwanted or wasted queries) as follows: $\text{Frac UQ} = \frac{|\{\text{QUERY action assigned}\} \cap \{\text{correct weak label}\}|}{|\{\text{correct weak label}\}|}$ (3). The number of instances with correct weak labels assigned the QUERY (REMOVE) action is the numerator, measuring human resource wastage. A lower value of Frac UQ is better, while maintaining high noise recall. Misinformation detection performance. In Table 1 we provide results of the proposed framework to construct misinformation datasets from weak labels. We trained the CSI [18] misinformation detection model on weak labels from news-source credibility to classify misinformation (unreliable/conspiracy) tweets from reliable information tweets, as a baseline. The held-out validation set is used by the detection model for early stopping in model optimization, and for calculating the threshold for detection based on the AUC curve, trading off sensitivity and specificity on the validation set. The reported results are on the held-out ground-truth test set of fact-checked labels. The same setting is used in all experiments. The removal (entropy filtering) guided by the detection model (self-training iterations 1 and 2) improves the classification on the ground-truth test set, indicative of higher label quality in the retained tweets. After 2 iterations, the improvement was insignificant. Further, incorporating social context modeling, we first evaluate Social-context only, wherein tweets with labels opposite to their community label (dominantly reliable or dominantly misinformation) are surfaced to be queried or removed. We find that combining the social context modeling and detection model guidance is more informative about possible mislabeling (tweets to be removed) in the weak labels (Social+Detection model). Finally, Social+Detection model (+label correction) is used to correct the labels that the two signals suggest should be oppositely labeled, and to remove the ones that the model is unsure about from either the detection model or the social context (i.e., using the label refinement procedure in Alg. 1). We find that model-guided label refinement for the construction of misinformation datasets is effective and significantly improves both the recall of misinformation detection (since the misinformation examples are fewer in the imbalanced data) and the precision of detected misinformation, as well as other metrics separating the two classes of misinformation and reliable information. Label correction accuracy and resource efficiency of label refinement. In Table 2, we provide the results of performance on label correction using the signals from the detection model and/or social context. The naive baseline trivially assumes all weak labels are mislabeled, and QUERYs all of them. 
Therefore, all correctly weakly-labeled instances are queried, with the worst resource utilization of 1, and low precision and F1 scores. The removal (entropy filtering) guided by the detection model (self-training iteration 2) has roughly 56% recall of inaccurate weak labels, with reasonable precision and low Frac UQ. The same holds for the Social-context only method. With the proposed approach (i.e., using the label refinement procedure in Alg. 1), combining the social context modeling and detection model guidance is more informative about possible mislabeling in the weak labels (Social+Detection model), and we see a large increase in recall on combining the two signals; some of the detected noise is now directly selected for the FLIP action instead of removal (or query), minimizing the wasted queries when FLIP is assigned to actually noisy instances. These evaluation metrics suggest how well the proposed method works at constructing high-quality misinformation labels, with the least cost incurred in terms of human labeling resources, or mistakes in the identification of possibly incorrect weak labels. Constructed misinformation dataset. For the constructed misinformation dataset derived from weak labels with the proposed method, in Fig. 3, we examine a scatter plot of instances over the predicted probability of misinformation from the detection model, which, as we see, is correlated with the fine-grained human labels available on the validation and test sets, capturing the varying degrees of truth. In Table 3, we show the fine-grained classification from human-labeled class prototypes on the remaining examples in the dataset, using 5-fold cross-validation on stratified splits of the validation plus test set for evaluation. For classification, we additionally labeled 400 instances to include as human-labeled class prototypes. We used extracted representations of tweet cascades from the CSI detection model used here to train an MLP. With class weighting, the fine-grained classifier has 0.57 weighted F1 distinguishing over the 7 nuanced label categories, which is a difficult task. The proposed label refinement approach is effective at constructing large-scale datasets from weak labels, with high recall of inaccurate weak labels when incorporating social context jointly with entropy filtering. We provide discussions of the proposed approach and three potential, concrete future research directions: (1) The weak labels are collected from news-source credibility, so posts with references to news sources form the basis of the dataset. The dataset might be more centered on news-worthy content, which is one limitation of the approach. Other sources to augment the weak labels could be considered in the future. Also, the annotation scheme label categories are general, but effective ways to construct expert examples as annotator guidelines in new domains could be explored. (2) The proposed approach models instance credibility and user credibility through entropy filtering and social context modeling. The label refinement procedure could be provided additional signals of news-source credibility by modeling each news source separately (since each might cause different noise rates, and unreliable sources are more mixed than conspiracy sources, as we observed). 
(3) The fine-grained classification is a difficult task, and future research could explore model-guided selection of instances for human labeling, to act as the most effective class prototypes, together with richer semi-supervised fine-grained classification techniques. To conclude, the proposed label refinement with social context modeling is a useful new approach for constructing misinformation datasets in a timely and scalable way for new or evolving domains. In Fig. 4, the instructions and guidelines specified for annotators are included. The annotators are asked to label in the context of when the tweet was posted, examining facts from high-factual, low-bias news article sources, fact-checking resources, and official information sources. The annotators are provided the screen name, news source domain, news source label, full tweet text (including the news URL hyperlink), and tweet timestamp to aid annotation. The article URL provides context for the tweet content, and is needed at times to understand the tweet's claim. Typical and trick examples with remarks in each category were provided to review and revisit while annotating, which is a useful guide for conveying the distinctions between label types.
References
[1] Unsupervised Label Noise Modeling and Loss Correction.
[2] Fast unfolding of communities in large networks.
[3] Twitter's 'Birdwatch' Aims to Crowdsource Fight Against Misinformation.
[4] Higher Ground? How Groundtruth Labeling Impacts Our Understanding of Fake News. Proceedings of the International AAAI Conference on Web and Social Media.
[5] Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository.
[6] Label-noise reduction with support vector machines.
[7] Characterizing social media manipulation in the 2020 US presidential election.
[8] Classification in the presence of label noise: a survey.
[9] Quantifying controversy on social media.
[10] Semi-supervised classification with graph convolutional networks.
[11] Robust active label correction.
[12] Rumor detection over varying time windows.
[13] The COVID-19 Vaccine Communication Handbook. A practical guide for improving vaccine communication and fighting misinformation.
[14] Detecting Rumors from Microblogs with Recurrent Neural Networks.
[15] Characterizing COVID-19 misinformation communities using a novel twitter dataset.
[16] Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media.
[17] Neural User Response Generator: Fake News Detection with Collective User Intelligence.
[18] CSI: A Hybrid Deep Model for Fake News Detection.
[19] FA-KES: a fake news dataset around the Syrian war.
[20] Noiserank: Unsupervised label noise reduction with dependence models.
[21] Characterizing Online Engagement with Disinformation and Conspiracies in the 2020 U.S. Presidential Election.
[22] Combating fake news: A survey on identification and mitigation techniques.
[23] Covid-19 on social media: Analyzing misinformation in twitter conversations. arXiv e-prints (2020).
[24] Identifying Coordinated Accounts on Social Media through Hidden Influence and Group Behaviours.
[25] COVID-19 Vaccines: Characterizing Misinformation Campaigns and Vaccine Hesitancy on Twitter.
[26] Detecting fake news with weak social supervision.
[27] FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media.
[28] Emily Vraga, and Yanchen Wang. 2020. A first look at COVID-19 information and misinformation sharing on Twitter.
[29] Learning from noisy labels with deep neural networks: A survey.
[30] Bin Zhong, Qiang Deng, and Jing Gao. 2020. Weak supervision for fake news detection via reinforcement learning.
[31] Fake news. It's complicated.
[32] Unsupervised Data Augmentation for Consistency Training.
[33] Modeling information diffusion in implicit networks.
[34] Recovery: A multimodal repository for covid-19 news credibility research.
[35] False, Misleading, Clickbait-Y, and Satirical 'News' Sources.
Acknowledgments
This work is supported by NSF Research Grant (CCF-1837131) and DARPA (HR001121C0169 and W911NF-17-C-0094). Views and conclusions are of the authors and should not be interpreted as representing the official policies of the funding agencies or the U.S. Government. We thank Feng Pan and Cindy Lin for helping with annotation guidelines and labeling efforts.