key: cord-0505629-svkq7pg3
authors: Nakov, Preslav; Da San Martino, Giovanni; Elsayed, Tamer; Barrón-Cedeño, Alberto; Míguez, Rubén; Shaar, Shaden; Alam, Firoj; Haouari, Fatima; Hasanain, Maram; Mansour, Watheq; Hamdan, Bayan; Ali, Zien Sheikh; Babulkov, Nikolay; Nikolov, Alex; Shahi, Gautam Kishore; Struss, Julia Maria; Mandl, Thomas; Kutlu, Mucahid; Kartal, Yavuz Selim
title: Overview of the CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News
date: 2021-09-23
journal: nan
DOI: nan
sha: 8629704268f9d81eb4bb9b37ba2420051deca372
doc_id: 505629
cord_uid: svkq7pg3

We describe the fourth edition of the CheckThat! Lab, part of the 2021 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting tasks related to factuality, and covers Arabic, Bulgarian, English, Spanish, and Turkish. Task 1 asks to predict which posts in a Twitter stream are worth fact-checking, focusing on COVID-19 and politics (in all five languages). Task 2 asks to determine whether a claim in a tweet can be verified using a set of previously fact-checked claims (in Arabic and English). Task 3 asks to predict the veracity of a news article and its topical domain (in English). The evaluation is based on mean average precision or precision at rank k for the ranking tasks, and macro-F1 for the classification tasks. This was the most popular CLEF-2021 lab in terms of team registrations: 132 teams. Nearly one-third of them participated: 15, 5, and 25 teams submitted official runs for tasks 1, 2, and 3, respectively.

The mission of the CheckThat! lab is to foster the development of technology to enable the (semi-)automatic verification of claims. Systems for claim identification and verification can be very useful as supportive technology for investigative journalism, as they could provide help and guidance, thus saving time [34, 45, 47, 97, 54]. A system could automatically identify check-worthy claims, make sure they have not been fact-checked already by a reputable fact-checking organization, and then present them to a journalist for further analysis in a ranked list. Additionally, the system could identify documents that are potentially useful for humans to perform manual fact-checking of a claim, and it could also estimate a veracity score supported by evidence to increase the journalist's understanding and trust in the system's decision.

CheckThat! at CLEF 2021 is the fourth edition of the lab. The 2018 edition [65] focused on the identification and verification of claims in political debates. The 2019 edition [31, 32] featured political debates and isolated claims, in conjunction with a closed set of Web documents to retrieve evidence from. In 2020 [15], the focus was on social media, in particular on Twitter, as information posted on this platform is not checked by an authoritative entity before posting and such posts tend to disseminate very quickly. Moreover, social media posts lack context due to their short length and conversational nature; thus, identifying a claim's context is sometimes key for effective fact-checking [23].

In the 2021 edition of the CheckThat! lab, we feature three tasks: 1. check-worthiness estimation, 2. detecting previously fact-checked claims, and 3. predicting the veracity of news articles and their domain. In these tasks, we focus on (i) tweets, (ii) political debates and speeches, and (iii) news articles.
Moreover, besides Arabic and English, we extend our language coverage to Bulgarian, Spanish, and Turkish. We further add a new task (Task 3) on multi-class fake news detection for news articles and topical domain identification, which can help direct the article to the right fact-checking expert [68].

Three editions of the CheckThat! lab have been held so far, and some of the tasks in the 2021 edition are reformulated from previous editions. Below, we discuss some relevant tasks from previous years.

Task 1 (2020). Given a topic and a stream of potentially related tweets, rank the tweets by check-worthiness for the topic [43, 82]. The most successful runs adopted state-of-the-art transformer models. The top-ranked teams for the English version of this task used BERT [24] and RoBERTa [70, 98]. For the Arabic version, the top systems used AraBERT [52, 98] and the multilingual BERT [42].

Task 2 (2020). Given a check-worthy claim and a dataset of verified claims, rank the verified claims, so that those that verify the input claim (or a sub-claim in it) are ranked on top of the list [82]. The most effective approaches fine-tuned large-scale pre-trained transformers such as BERT and RoBERTa. In particular, the top-ranked run fine-tuned RoBERTa [18].

Task 4 (2020). Given a check-worthy claim on a specific topic and a set of potentially relevant Web pages, predict the veracity of the claim [43]. Two runs were submitted for the task [94], using a scoring function that computes the degree of concordance and negation between a claim and all input text snippets for that claim.

Task 5 (2020). Given a political debate or a speech, segmented into sentences, together with information about who the speaker of each sentence is, prioritize the sentences for fact-checking [82]. For this task, only one out of eight runs outperformed a strong bi-LSTM baseline [59].

Task 1 (2019). Given a political debate, an interview, or a speech, segmented into sentences, rank the sentences by the priority with which they should be fact-checked [10]. The most successful approaches used neural networks for the classification of the individual instances. For example, Hansen et al. [40] learned domain-specific word embeddings and syntactic dependencies and used an LSTM with a classification layer on top of it.

Task 2 (2019). Given a claim and a set of potentially relevant Web pages, identify which of the pages (and passages thereof) are useful for assisting a human to fact-check that claim. There was also a second subtask, asking to determine the factuality of the claim [44]. The most effective approach for this task used textual entailment and external data [35].

Task 1 (2018) [9] was identical to Task 1 (2019). The best approaches used pseudo-speeches formed as a concatenation of all interventions by a debater [104], and represented the entries with embeddings, part-of-speech tags, and syntactic dependencies [39].

Task 2 (2018). Given a check-worthy claim in the form of a (transcribed) sentence, determine whether the claim is likely to be true, half-true, or false [17]. The best approach retrieved relevant information from the Web, and fed the claim with the most similar Web-retrieved text to a convolutional neural network [39].

Figure 1: The full verification pipeline. The 2021 lab covers three tasks from that pipeline: (i) check-worthiness estimation, (ii) verified claim retrieval, and (iii) fake news detection. The gray tasks were addressed in previous editions of the lab [16, 32].
The lab is organized around three tasks, each of which in turn has several subtasks. Figure 1 shows the full CheckThat! verification pipeline, and the three tasks we target this year are highlighted.

The aim of Task 1 is to determine whether a piece of text is worth fact-checking. In order to do that, we either resort to the judgments of professional fact-checkers or we ask human annotators to answer several auxiliary questions [3, 4], such as "does it contain a verifiable factual claim?", "is it harmful?", and "is it of general interest?", before deciding on the final check-worthiness label.

Subtask 1A: Check-worthiness of tweets. Given a stream of tweets, produce a ranked list of the tweets, ordered by their check-worthiness. This is a ranking task, focusing either on COVID-19 or on politics. It was offered in Arabic, Bulgarian, English, Spanish, and Turkish. The participants were free to work on any language(s) of their choice, and they could also use multilingual approaches that make use of all datasets for training.

Subtask 1B: Check-worthiness of debates or speeches. Given a political debate/speech, return a ranked list of its sentences, ordered by their check-worthiness. This is a ranking task, and it was offered in English.

Task 2 asks the following: given a check-worthy claim in the form of a tweet, and a set of previously fact-checked claims, rank these previously fact-checked claims in order of their usefulness to fact-check that new claim.

Subtask 2A: Detect previously fact-checked claims from tweets. Given a tweet, detect whether the claim it makes was previously fact-checked with respect to a collection of fact-checked claims. This is a ranking task, offered in Arabic and English, where the systems need to return a list of top-n candidates.

Subtask 2B: Detect previously fact-checked claims in political debates or speeches. Given a claim in a political debate or a speech, detect whether the claim has been previously fact-checked with respect to a collection of previously fact-checked claims. This is a ranking task, and it was offered in English.

Task 3 was offered for the first time, as a pilot task. It includes two subtasks.

Subtask 3A: Multi-class fake news detection of news articles. Given the text of a news article, determine whether the claims made in the article are true, partially true, false, or other. This is a classification task, offered in English.

Subtask 3B: Topical domain identification of news articles. Given the text of a news article, determine its topical domain [86]. This is a classification task involving six categories, including health, crime, climate, election, and education, and it was offered in English.

Here, we briefly describe the datasets for each of the three tasks. For more details, refer to the task description paper for each individual task [80, 81, 88].

Subtask 1A: Check-worthiness for tweets. We produced datasets in five languages with tweets covering COVID-19, politics, and other topics. We refer to these datasets as the CT-CWT-21 corpus, which stands for CheckThat! check-worthiness for tweets 2021. Table 1 shows statistics about the corpus.

For Arabic, the training set is sampled from the corpus used in the 2020 edition of the CheckThat! lab [43]; we only kept tweets with full agreement between the annotators. The tweets mainly cover politics and COVID-19. The newly collected testing set covers two political events: the Gulf reconciliation and the US Capitol riots. The tweets were labelled by two expert annotators, and the disagreements were resolved by discussion between the annotators.
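For several of the other datasets described below (e.g., Bulgarian, English, and Turkish), labels from three annotators were aggregated by majority voting, in some cases followed by a consolidator. The following is a minimal sketch of such an aggregation step; the function and label names are illustrative assumptions, not the lab's actual annotation tooling.

```python
# Sketch of majority-vote label aggregation with a consolidator fallback.
from collections import Counter

def aggregate_label(annotations, consolidator_label=None):
    """annotations: labels from the individual annotators, e.g. three per example."""
    label, count = Counter(annotations).most_common(1)[0]
    if count > len(annotations) / 2:
        return label                 # a clear majority decides
    return consolidator_label        # otherwise defer to the consolidator

# Example with three (made-up) annotator judgments.
print(aggregate_label(["check-worthy", "check-worthy", "not check-worthy"]))
```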
For Bulgarian, we created a new dataset focusing on COVID-19. The tweets were annotated by three annotators, and disagreements were resolved by majority voting, and then by a consolidator.

For English, the dataset also focused on COVID-19. For training, we released the data used in the CheckThat! lab of 2020 [82]. For testing, we annotated new instances, where we had three annotators per example, and we resolved the disagreements by majority voting, and then by a consolidator.

For Spanish, we had a new dataset. The tweets were manually annotated by journalists from Newtral, a Spanish fact-checking organization, and came from the Twitter accounts of 300 Spanish politicians.

For Turkish, the training set came from the TrClaim-19 dataset [53], whereas the testing set was labelled for this task by three annotators. We applied majority voting for aggregation. The training set covers important events in Turkey in 2019 (e.g., the earthquake in Istanbul and the military operation in Syria), whereas the test set focuses on COVID-19.

The datasets for Arabic, Bulgarian, and English have annotations for some auxiliary questions. For example, annotators were asked questions such as "Is the claim of interest to the public?" and "Would the claim cause harm?"

Subtask 1B: Check-worthiness for debates/speeches. For training, we collected 57 debates/speeches from 2012-2018, and we selected the sentences from the transcripts that were checked by human fact-checkers. After a political debate/speech, PolitiFact journalists publish an article fact-checking some of the claims made in it. We collected all such sentences and considered them check-worthy, and the rest non-check-worthy. However, as PolitiFact journalists only fact-check a few of the claims made in a debate/speech, there is an abundance of false negative examples in the dataset. To address this issue at test time, we manually looked over the debates from the test set and attempted to check whether each sentence contains a verified claim, using BM25 suggestions. Table 2 shows some statistics about the data. Note the higher proportion of positive examples in the test set compared to the training and the development sets. Further details about the CT-CWT-21 corpus for Task 1 can be found in [81].

Subtask 2B: Detecting previously fact-checked claims in political debates/speeches. We have 669 claims from political debates [79], matched against 804 verified claims (some input claims match more than one verified claim) in a collection of 19,250 verified claims in PolitiFact. Table 3 shows statistics about the CT-VCR-21 corpus for Task 2, including both subtasks and languages. CT-VCR-21 stands for CheckThat! verified claim retrieval 2021. Input-VerClaim pairs represent input claims with their corresponding verified claims by a fact-checking source. The input for subtask 2A (2B) is a tweet (a sentence from a political debate or a speech). More details about the corpus construction can be found in [80].

The process of corpus creation for Task 3 extends the AMUSED framework [83]. Starting with articles written by fact-checking organizations, we scraped the links to the original articles they verified, together with the factuality judgments. This process was done in two steps. First, in an automatic filtering step, all links to posts from social media channels or to multimedia documents were filtered out. In a second step, the remaining links were subjected to a manual checking process.
During this step, we additionally made sure that the scraped link actually pointed to the checked document and that the document still existed (thus eliminating error pages, articles with other content, etc.). After successful verification, we scraped the title and the full text of each article.

Subtask 3A: Multi-class fake news categorization of news articles. This subtask was offered in English only. We collected a total of 900 news articles for training and 354 news articles for testing from 11 fact-checking websites such as PolitiFact. The label from the original fact-checking site was given as a rating. However, due to the heterogeneous labeling schemes of different fact-checking organizations (e.g., false: incorrect, inaccurate, misinformation; partially false: mostly false, half false), we merged labels with shared meaning according to [84], resulting in the following four classes: false, partially false, true, and other. We provided an ID, the title of the article, the text of the article, and our rating as data to the participants. No further metadata about the article was made available in the dataset. The ID is a unique identifier created for the dataset, the title is the title given in the target article, the text is the full-text content of the article, and our rating is the normalized rating provided in one of the above four label categories.

Subtask 3B: Topical domain identification of news articles. This subtask was also offered in English only. We annotated a subset of the articles from subtask 3A with their topic: 318 articles for training and 137 articles for testing, in six different classes as shown in Table 4, based on [85]. We refer to the corpus as CT-FAN-21, which stands for CheckThat! 2021 Fake News. We provided the ID, the title, the text, and the domain as the data for this subtask. Here, the ID is the unique identifier, the title is the title of the news article, the text is the full-text content of the article, and the domain is the topical domain, expressed in terms of one of the above six categories.

For the ranking tasks, as in the two previous editions of the CheckThat! lab, we used Mean Average Precision (MAP) as the official evaluation measure. We further calculated and reported the reciprocal rank and P@k for k ∈ {1, 3, 5, 10, 20, 30} as unofficial measures. For the classification tasks, we used accuracy and macro-F1 score.
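To make these ranking measures concrete, here is a minimal sketch of P@k, average precision, and MAP over binary gold labels; the function names and the input format are illustrative assumptions, not the official scorer.

```python
# Illustrative implementation of the ranking measures used in Tasks 1 and 2.

def precision_at_k(ranked_labels, k):
    """Fraction of relevant items among the top-k of a ranked list of 0/1 gold labels."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Average of P@i over the positions i that hold a relevant item
    (assumes all relevant items appear somewhere in the ranked list)."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def mean_average_precision(all_ranked_labels):
    """MAP over a collection of ranked lists (e.g., one ranked tweet list per topic)."""
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)

# Example: gold relevance of items in the order the system ranked them.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 3), average_precision(ranking))  # 0.666..., 0.833...
```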
Below, we report the evaluation results for Task 1 and its two subtasks for all five languages. Fifteen teams took part in this task, with English and Arabic being the most popular languages. Four out of the fifteen teams submitted runs for all five languages; most of them trained independent models for each language (team UPV, by contrast, trained a single multilingual model). For all five languages, we had a monolingual baseline based on n-gram representations. Table 5 shows the performance of the official submissions on the test set, in addition to the n-gram baseline. The official run was the last valid blind submission by each team. The table shows the runs ranked on the basis of the official MAP measure and includes all five languages.

Arabic. Eight teams participated for Arabic, submitting a total of 17 runs (yet, recall that only the last submission counts). All participating teams fine-tuned existing pre-trained models, such as AraBERT and multilingual BERT models. We can see that the top two systems additionally worked on improving the training data: team Accenture used a label augmentation approach to increase the number of positive examples, while team bigIR augmented the training set with the Turkish training set (which they automatically translated to Arabic).

Bulgarian. Four teams took part for Bulgarian, submitting a total of 11 runs. The top-ranked team was bigIR. They did not submit a task description paper, and thus we cannot give much detail about their system. Team UPV had the second best system; they used a multilingual sentence transformer representation (S-BERT) with knowledge distillation. They also introduced an auxiliary language identification task, aside from the downstream check-worthiness task.

English. Ten teams took part in Task 1A for English, with a total of 21 runs. The top-ranked team was NLP&IR@UNED, who fine-tuned several pre-trained transformer models. They reported that BERTweet performed best on the development set; this model was pre-trained following the RoBERTa procedure on 850 million English tweets and 23 million COVID-19 related English tweets. The second best system (team Fight for 4230) also used BERTweet with a dropout layer, and further included pre-processing and data augmentation.

Spanish. Six teams took part for Spanish, with a total of 13 runs. The top team, TOBB ETU, explored different data augmentation strategies, including machine translation and weak supervision. However, their submission was a fine-tuned BETO model without any data augmentation. The first runner-up, GPLSI, opted for using the BETO Spanish transformer together with a number of hand-crafted features, such as the presence of numbers or of words from the LIWC lexicon.

Turkish. Five teams participated for Turkish, submitting a total of 9 runs. All participants used BERT-based models. The top-ranked team, TOBB ETU, fine-tuned BERTurk after removing user mentions and URLs. The runner-up team, SU-NLP, applied a pre-processing step that includes removing hashtags and emojis, and replacing URLs and mentions with special tokens. Subsequently, they used an ensemble of BERTurk models fine-tuned with different seed values. The third-ranked team, bigIR, machine-translated the Turkish text to Arabic and then fine-tuned AraBERT on the translated text.

All languages. Table 6 summarizes the MAP performance of all the teams that submitted predictions for all languages in Task 1A. We can see that team bigIR performed best overall.

Table 8 shows the official results for Task 2A in both Arabic and English. A total of four teams participated in this task, and they all managed to improve over the Elastic Search (ES) baseline.

Arabic. One team, bigIR, submitted a run for this subtask. They used AraBERT to rerank a list of candidates retrieved by a BM25 model. Their approach consists of three main steps. First, they constructed a balanced training dataset, where the positive examples correspond to the query relevances (qrels) provided by the organizers, while the negative examples were selected from the top candidates retrieved by BM25 such that they were not already labeled as positive. Second, they fine-tuned AraBERT to predict the relevance score for a given tweet-VerClaim pair, adding two neural network layers on top of AraBERT to perform the classification. Finally, at inference time, they first used BM25 to retrieve the top-20 candidate verified claims. Then, they fed each tweet-VerClaim pair to the fine-tuned model to get a relevance score and to rerank the candidate claims accordingly.
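The retrieve-then-rerank pattern described above can be sketched as follows. The libraries and the English cross-encoder checkpoint are stand-ins to illustrate the pipeline, not the team's fine-tuned AraBERT model.

```python
# Sketch of BM25 candidate retrieval followed by transformer-based reranking.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

verified_claims = ["Claim A ...", "Claim B ...", "Claim C ..."]  # collection of fact-checked claims
bm25 = BM25Okapi([c.lower().split() for c in verified_claims])   # naive whitespace tokenization
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in relevance model

def retrieve_and_rerank(tweet: str, top_k: int = 20):
    # Step 1: BM25 retrieves the top-k candidate verified claims.
    scores = bm25.get_scores(tweet.lower().split())
    candidates = sorted(range(len(verified_claims)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Step 2: the cross-encoder scores each (tweet, verified claim) pair.
    pairs = [(tweet, verified_claims[i]) for i in candidates]
    rerank_scores = reranker.predict(pairs)
    # Step 3: return the candidates reordered by the model's relevance score.
    return sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)
```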
As Table 8 shows, team bigIR outperformed the Elastic Search baseline by a good margin, achieving a MAP@5 of 0.908 versus 0.794 for the baseline.

English. Three teams participated for English, submitting a total of ten runs. All of them managed to improve over the Elastic Search (ES) baseline by a large margin. Team Aschern had the top-ranked system, which used TF.IDF, fine-tuned pre-trained sentence-BERT, and the reranking LambdaMART model. Their system is 13.4 MAP@5 points absolute above the baseline. The second best system, by team NLytics, used RoBERTa and was 5 MAP@5 points above the baseline.

Political Debates and Speeches. Table 9 shows the official results for Task 2B, which was offered in English only. Only three teams participated in this subtask, submitting a total of five runs, and no team managed to beat the Elastic Search (ES) baseline, which was based on BM25. Among the three participating teams, team DIPS was the top-ranked one. They used sentence BERT (S-BERT) embeddings for all claims, and computed the cosine similarity for each pair of an input claim and a verified claim from the dataset of previously fact-checked claims. They made their predictions by passing a sorted list of cosine similarities to a neural network. Team BeaSku was the second-best team; they used triplet loss training to fine-tune the S-BERT model. Then, they used the scores predicted by the fine-tuned model along with BM25 scores as features to train a reranker based on rankSVM. In addition, they discussed the impact of applying online mining of triplets, and they performed some experiments aiming at augmenting the training dataset with additional examples.

In this section, we present an overview of all task submissions for Tasks 3A and 3B. Overall, there were 88 submissions by 27 teams for Task 3A and 49 submissions by 20 teams for Task 3B. For Task 3, unlike the other tasks, each participant could submit up to 5 runs. After evaluation, we found that two teams from Task 3A and seven teams from Task 3B had submitted the wrong files, and thus we did not consider them for evaluation; we report the ranking for 25 teams for Task 3A and 13 teams for Task 3B. In Tables 10 and 11, we report the best submission of each team for Tasks 3A and 3B, respectively. In the following sections, we report the results for each of the subtasks.

Most teams used deep learning models, and in particular the transformer architecture, for this pilot task. There were no attempts to model knowledge with semantic technology, e.g., argument processing [30]. The best submission (team NoFake) was ahead of the rest by a rather large margin and achieved a macro-F1 score of 0.838. They applied BERT and made extensive use of external resources, in particular downloaded collections of misinformation datasets from fact-checking sites. The second best submission (team Saud) achieved a macro-F1 score of 0.503 and used lexical features, traditional weighting methods as features, and standard machine learning algorithms. This shows that traditional approaches can still outperform deep learning models for this task. Many teams used BERT and its newer variants; such systems are ranked after the second position. The most popular model was RoBERTa, which was used by seven teams. Team MUCIC used a majority voting ensemble with three BERT variants [12].
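As an aside, label-level majority voting of the kind used by such ensembles can be illustrated as follows; the three model outputs here are made up, and the function is a generic sketch rather than any team's implementation.

```python
# Minimal majority-voting ensemble over the predicted labels of several classifiers.
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: equally long label lists, one per model."""
    voted = []
    for labels in zip(*predictions_per_model):        # labels for one article across models
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

preds = [
    ["false", "true", "partially false"],             # e.g., a BERT-based model
    ["false", "other", "partially false"],            # e.g., a DistilBERT-based model
    ["partially false", "true", "false"],             # e.g., a RoBERTa-based model
]
print(majority_vote(preds))  # ['false', 'true', 'partially false']
```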
The participating teams that used BERT had to find solutions for handling the length of the input: BERT and its variants have limitations on the length of their input, but the texts in the CT-FAN-21 dataset, which consists of newspaper articles, are much longer. In most cases, heuristics were used to select part of the text. Overall, most submissions achieved a macro-F1 score below 0.5.

The second most popular neural network model was the recurrent neural network, which was used by six teams. Many participants also experimented with traditional text processing methods, as commonly used for knowledge representation in information retrieval. For example, team Kovachevich used a Naïve Bayes classifier with TF.IDF features for the 500 most frequent stems in the dataset [55]. Some lower-ranked teams used additional techniques and resources. These include LIWC [49], data augmentation by inserting artificially created similar documents [8], semantic analysis with the Stanford Empath Tool [26], and the reputation of the sites in a search engine result list obtained by searching with the title of the article [49].

The performance of the systems for Task 3B was overall higher than for Task 3A. The first three submissions were close together, and all used transformer-based architectures. The best submission, by team NITK NLP, used an ensemble of three transformers [57]. The second best submission (by team NoFake) and the third best submission (by team Nkovachevich) used BERT.

There has been work on checking the factuality/credibility of a claim, of a news article, or of an information source [11, 13, 51, 58, 64, 69, 73, 103]. Claims can come from different sources, but special attention has been paid to those from social media [37, 62, 66, 78, 79, 89, 101]. Check-worthiness estimation is still a fairly new problem, especially in the context of social media [34, 45, 46, 47]. A lot of research has been performed on fake news detection for news articles, which is mostly approached as a binary classification problem [71]. CheckThat! is related to several other initiatives at SemEval on determining rumour veracity and support for rumours [28, 36], on stance detection [63], on fact-checking in community question answering forums [61], on propaganda detection [27, 29], and on semantic textual similarity [2, 67]. It is also related to the FEVER task [93] on fact extraction and verification, as well as to the Fake News Challenge [38] and the FakeNews task at MediaEval [72].

We have presented the 2021 edition of the CheckThat! lab, which was the most popular CLEF-2021 lab in terms of team registrations (132 teams registered); about one-third of them actually participated: 15, 5, and 25 teams submitted official runs for tasks 1, 2, and 3, respectively. The lab featured tasks that span important steps of the verification pipeline: from spotting check-worthy claims to checking whether they have been fact-checked elsewhere before. We further featured a fake news detection task, in which we also predicted the class and the topical domain of news articles. Together, these tasks support the technology pipeline to assist human fact-checkers. Moreover, in line with the general mission of CLEF, we promoted multilinguality by offering our tasks in five different languages. In future work, we plan to extend the datasets with more examples and more information sources, and also to cover more languages.
Team SCUoL [6] (1A:ar:3) used typical pre-processing steps, including cleaning the text, segmentation, and tokenization. Their experiments consist of fine-tuning different AraBERT models, and their final results were obtained using AraBERTv2-base.

Team SU-NLP [22] (1A:tr:2) also used several pre-processing steps, including (i) removing emojis and hashtags, and (ii) replacing all mentions with a special token (@USER) and all URLs with the respective website's domain. If the URL points to a tweet, they replaced it with TWITTER and the respective user account name. They reported that this URL expansion method improved the performance. Subsequently, they used an ensemble of BERTurk models fine-tuned using different seed values.

Team TOBB ETU [100] (1A:ar:6 1A:bg:5 1A:en:10 1A:es:1 1A:tr:1) investigated different approaches to fine-tune transformer models, including data augmentation using machine translation, weak supervision, and cross-lingual training. For their submission, they removed URLs and user mentions from the tweets, and fine-tuned a separate BERT-based model for each language. In particular, they fine-tuned BERTurk, AraBERT, BETO, and the BERT-base model for Turkish, Arabic, Spanish, and English, respectively. For Bulgarian, they fine-tuned a RoBERTa model pre-trained on Bulgarian documents.

Team UPV [14] (1A:ar:8 1A:bg:2 1A:en:3 1A:es:6 1A:tr:4) used a multilingual sentence transformer representation (S-BERT) with knowledge distillation, originally intended for question answering. They further introduced an auxiliary language identification task, aside from the downstream check-worthiness task.

Team Aschern [25] (2A:en:1) used TF.IDF, fine-tuned pre-trained S-BERT, and the reranking LambdaMART model.

Team BeaSku [90] (2B:en:3) used triplet loss training to fine-tune S-BERT. Then, they used the scores predicted by the fine-tuned model along with BM25 scores as features to train a rankSVM re-ranker. They further discussed the impact of applying online mining of triplets. They also experimented with data augmentation.

Team DIPS [60] (2A:en:3 2B:en:2) calculated S-BERT embeddings for all claims, then computed a cosine similarity for each pair of an input claim and a verified claim. The prediction is made by passing a sorted list of cosine similarities to a neural network.

Team NLytics (2A:en:2 2B:en:4) approached the problem as a regression task, and used RoBERTa with a regression function in the final layer.

Team Black Ops [91] (3A:11) performed data pre-processing by removing stop-words and punctuation marks. Then, they experimented with decision tree, random forest, and gradient boosting classifiers for Task 3A, and found the latter to perform best.

Team CIC [8] (3A:10 3B:5) experimented with logistic regression, a multi-layer perceptron, support vector machines, and random forest, using stratified 5-fold cross-validation on the training data. Their best results were obtained using logistic regression for Task 3A, and a multi-layer perceptron for Task 3B.

Team CIC (3A:11) experimented with a decision tree, a random forest, and a gradient boosting algorithm. They found the latter to perform best.

Team CIVIC-UPM [48] (3A:7 3B:8) participated in the two subtasks of Task 3.
They performed pre-processing using a number of tools: (i) ftfy to repair Unicode and emoji errors, (ii) ekphrasis to perform lower-casing and to normalize percentages, times, dates, emails, phone numbers, and numbers, (iii) contractions for abbreviation expansion, and (iv) NLTK for word tokenization, stop-word removal, punctuation removal, and word lemmatization. Then, they combined doc2vec with transformer representations (Electra base, T5 small and T5 base, Longformer base, RoBERTa base, and DistilRoBERTa base). They further used additional data from Kaggle's Ag News, KDD2020, and Clickbait news detection competitions. Finally, they experimented with a number of classifiers such as Naïve Bayes, Random Forest, Logistic Regression with L1 and L2 regularization, Elastic Net, and SVMs. Their best system for subtask 3A used DistilRoBERTa-base on the text body with oversampling and a sliding window for dealing with long texts. Their best system for Task 3B used RoBERTa-base on the title+body text with oversampling, but no sliding window.

Team DLRG (3A:3 3B:4) experimented with a number of traditional approaches, such as Random Forest, Naïve Bayes, and Logistic Regression, as well as an online Passive-Aggressive classifier and different ensembles thereof. The best result for Task 3A was achieved by an ensemble of Naïve Bayes, Logistic Regression, and the Passive-Aggressive classifier. For Task 3B, the online Passive-Aggressive classifier outperformed all other approaches, including the considered ensembles.

Team GPLSI [77] (3A:16) applied the RoBERTa transformer together with different manually engineered features, such as the occurrence of dates and numbers or of words from LIWC. Both the title and the body were concatenated into a single sequence of words. Rather than going for a single multi-class setting, they used two binary models considering the most frequent classes, false vs. other and true vs. other, followed by one three-class model.

Team MUCIC [12] (3A:19 3B:12) used a majority voting ensemble with three BERT variants: they fine-tuned pre-trained BERT, DistilBERT, and RoBERTa models.

Team NITK NLP [57] (3A:5 3B:1) proposed an approach that included pre-processing and tokenization of the news article, and then experimented with multiple transformer models. The final prediction was made by an ensemble.

Team NKovachevich [55] (3A:13 3B:3) created lexical features. They extracted the 500 most frequent word stems in the dataset and calculated the TF.IDF values, which they used in a multinomial Naïve Bayes classifier. A much better performance was achieved with an LSTM model that used GloVe embeddings. A slightly lower F1 value was achieved using BERT. They further found RoBERTa to perform worse than BERT.

Team NLP&IR@UNED [49] (3A:4) experimented with four transformer architectures and input sizes of 150 and 200 words. In the preliminary tests, the best performance was achieved by ALBERT with 200 words. They also experimented with combining TF.IDF values from the text, all the features provided by the LIWC tool, and the TF.IDF values from the first 20 domain names returned by a query to a search engine. Unlike what was observed on the development dataset, in the official competition, the best results were obtained with the approach based on TF.IDF, LIWC, and domain names.

Team NLytics (3A:12 3B:7) fine-tuned RoBERTa on the dataset for each of the subtasks. Since the data is unbalanced, they used under-sampling.
They also truncated the documents to 512 words to fit into the RoBERTa input size.

Team NoFake [56] (3A:1 3B:2) applied BERT without fine-tuning, but used an extensive amount of additional data for training, downloaded from various fact-checking websites.

Team Pathfinder [95] (3A:9 3B:10) participated in both tasks and used multinomial Naïve Bayes and random forest, with the former performing better for both tasks. For Task 3A, they merged the classes false and partially false into one class, which boosted the model performance by 41% (a non-official score mentioned in the paper).

Team Probity (3A:20) addressed the multi-class fake news detection subtask with a simple LSTM architecture, where they adopted word2vec embeddings to represent the news articles.

Team Qword [96] (3A:23) applied pre-processing techniques, which included stop-word removal, punctuation removal, and lemmatization using a Porter stemmer. TF.IDF values were calculated for the words, and four classification algorithms were applied to these features. The best result was obtained by Extreme Gradient Boosting.

Team SAUD (3A:2) used an SVM with TF.IDF. They also tried Logistic Regression, Multinomial Naïve Bayes, and Random Forest, and found the SVM to work best.

Team Sigmoid [76] (3A:17) experimented with different traditional machine learning approaches, with multinomial Naïve Bayes performing best, and with one deep learning approach, namely an LSTM with the Adam optimizer. The latter outperformed the more traditional approaches.

Team Spider (3A:22) applied an LSTM, after a pre-processing step consisting of stop-word removal and stemming.

Team UAICS [26] (3A:6) experimented with various models, including BERT, LSTM, Bi-LSTM, and feature-based models. Their submitted model is a Gradient Boosting classifier with a weighted combination of three feature groups: bi-grams, POS tags, and lexical categories of words.

Team University of Regensburg [41] (3A:8) used different fine-tuned variants of BERT with a linear layer on top, and applied different approaches to address the maximum sequence length of BERT. Besides hierarchical transformer representations, they also experimented with different summarization techniques, both extractive and abstractive. They performed oversampling to address the class imbalance, as well as extractive summarization (using DistilBERT) and abstractive summarization (using distil-BART-CNN-12-6), before performing classification using fine-tuned BERT with a hierarchical transformer representation.

References

[1] QMUL-SDS at CheckThat! 2021: Enriching pretrained language models for the estimation of check-worthiness of Arabic tweets
[2] SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation
[3] Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms
[4] Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society
[5] AraFacts: The first large Arabic dataset of naturally occurring claims
[6] An AraBERT model for check-worthiness of Arabic tweets
[7] M82B at Check-That! 2021: Multiclass fake news detection using BiLSTM based RNN model
[8] Fake news detection using machine learning and data augmentation - CLEF2021
[9] Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness
[10] Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness
[11] VERA: A platform for veracity estimation over web data
[12] MUCIC at CheckThat! 2021: FaDo fake news detection and domain identification using transformers ensembling
[13] What was written vs. who read it: News media profiling using text analysis and social media context
[14] UPV at CheckThat! 2021: Mitigating cultural differences for identifying multilingual check-worthy claims
[15] Overview of CheckThat! 2020: Automatic identification and verification of claims in social media
[16] Proceedings of the eleventh international conference of the CLEF association: Experimental IR meets multilinguality, multimodality, and interaction
[17] Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 2: Factuality
[18] Buster.AI at Check-That! 2020: Insights and recommendations to improve fact-checking
[19] CLEF 2020 Working Notes. CEUR Workshop Proceedings
[20] Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings
[21] Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings
[22] SU-NLP at CheckThat! 2021: Check-worthiness of Turkish tweets
[23] A content management perspective on fact-checking
[24] Check square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features
[25] Aschern at CLEF CheckThat! 2021: Lambda-calculus of fact-checked claims
[26] UAICS at CheckThat! 2021: Fake news detection
[27] SemEval-2020 task 11: Detection of propaganda techniques in news articles
[28] SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours
[29] SemEval-2021 task 6: Detection of persuasion techniques in texts and images
[30] A framework for argument retrieval: Ranking argument clusters by frequency and specificity
[31] CheckThat! at CLEF 2019: Automatic identification and verification of claims
[32] Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims
[33] CLEF 2021 Working Notes. Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum. CEUR-WS.org
[34] A context-aware approach for detecting worth-checking claims in political debates
[35] UPV-UMA at CheckThat! lab: Verifying Arabic claims using a cross-lingual approach
[36] SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours
[37] TweetCred: Real-time credibility assessment of content on Twitter
[38] A retrospective analysis of the fake news challenge stance-detection task
[39] The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 fact checking lab
[40] Neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss
[41] University of Regensburg at CheckThat! 2021: Exploring text summarization for fake news detection
[42] bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic tweets by check-worthiness
[43] Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media
[44] Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 2: Evidence and factuality
[45] Detecting check-worthy factual claims in presidential debates
[46] Comparing automated factual claim detection against judgments of journalism organizations
[47] ClaimBuster: The first-ever end-to-end fact-checking system
[48] CIVIC-UPM at CheckThat! 2021: Integration of transformers in misinformation detection and topic classification
[49] NLP&IR@UNED at CheckThat! 2021: Check-worthiness estimation and fake news detection using transformer models
[50] DLRG@CLEF2021: An ensemble approach for fake detection on news articles
[51] Fully automated fact checking using external sources
[52] TOBB ETU at CheckThat! 2020: Prioritizing English and Arabic claims based on check-worthiness
[53] TrClaim-19: The first collection for Turkish check-worthy claim detection with annotator rationales
[54] Tiplines to combat misinformation on encrypted platforms: A case study of the 2019 Indian election on WhatsApp
[55] BERT fine-tuning approach to CLEF CheckThat! fake news detection
[56] NoFake at CheckThat! 2021: Fake news detection using BERT
[57] NITK NLP at CLEF CheckThat! 2021: Ensemble transformer model for fake news classification
[58] Detecting rumors from microblogs with recurrent neural networks
[59] NLP&IR@UNED at Check-That! 2020: A preliminary approach for check-worthiness and claim retrieval tasks using neural networks and graphs
[60] DIPS at CheckThat! 2021: Verified claim retrieval
[61] SemEval-2019 task 8: Fact checking in community question answering forums
[62] CREDBANK: A large-scale social media corpus with associated credibility annotations
[63] SemEval-2016 task 6: Detecting stance in tweets
[64] Leveraging joint interactions for credibility analysis in news communities
[65] Overview of the CLEF-2018 lab on automatic identification and verification of claims in political debates
[66] Automated fact-checking for assisting human fact-checkers
[67] SemEval-2016 task 3: Community question answering
[68] The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news
[69] FANG: Leveraging social context for fake news detection using graph representation
[70] Team Alex at Check-That! 2020: Identifying check-worthy tweets with transformer models
[71] A survey on natural language processing for fake news detection
[72] FakeNews: Corona virus and 5G conspiracy task at MediaEval 2020
[73] Credibility assessment of textual claims on the web
[74] NLytics at CheckThat! 2021: Check-worthiness estimation as a regression problem on transformers
[75] NLytics at CheckThat! 2021: Multi-class fake news detection of news articles and domain identification with RoBERTa - a baseline model
[76] Team Sigmoid at CheckThat! 2021: Multiclass fake news detection with machine learning
[77] GPLSI team at CLEF CheckThat! 2021: Finetuning BETO and RoBERTa
[78] The role of context in detecting previously fact-checked claims
[79] That is a known lie: Detecting previously fact-checked claims
[80] Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates
[81] Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates
[82] Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media
[83] AMUSED: An annotation framework of multi-modal social media data
[84] An exploratory study of COVID-19 misinformation on Twitter
[85] Exploring the spread of COVID-19 misinformation on Twitter
[86] FakeCovid: A multilingual cross-domain fact check news dataset for COVID-19
[87] CT-FAN-21 corpus: A dataset for fake news detection
[88] Overview of the CLEF-2021 CheckThat! lab: Task 3 on fake news detection
[89] Fake news detection on social media: A data mining perspective
[90] BeaSku at CheckThat! 2021: Fine-tuning sentence BERT with triplet loss and limited data
[91] Black Ops at CheckThat! 2021: User profiles analyze of intelligent detection on fake tweets notebook in shared task
[92] ClaimsKG: A knowledge graph of fact-checked claims
[93] FEVER: A large-scale dataset for fact extraction and VERification
[94] EvolutionTeam at CheckThat! 2020: Integration of linguistic and sentimental features in a fake news detection approach
[95] Classifier for fake news detection and topical domain of news articles
[96] Qword at CheckThat! 2021: An extreme gradient boosting approach for multiclass fake news detection
[97] It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction
[98] Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models
[99] Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation
[100] TOBB ETU at CheckThat! 2021: Data engineering for detecting check-worthy claims
[101] Enquiring minds: Early detection of rumors in social media from enquiry posts
[102] Fight for 4230 at CLEF CheckThat! 2021: Domain-specific preprocessing and pretrained model for ranking claims by check-worthiness
[103] Analysing how people orient to and spread rumours in social media by looking at conversational threads
[104] A hybrid recognition system for check-worthy claims using heuristics and supervised learning

Acknowledgments. The work of Tamer [...] This research is also part of the Tanbih mega-project, developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news", propaganda, and media bias, thus promoting digital literacy and critical thinking.

Appendix A: Systems for Task 1

The positions in the task ranking appear after each team name. See Tables 5-7 for further details.

Team Accenture [99] (1A:ar:1 1A:bg:4 1A:en:9 1A:es:5 1A:tr:5) used BERT and RoBERTa with data augmentation. They further generated additional synthetic training data using lexical substitution. To find the most probable substitutions, they used BERT-based contextual embeddings to create synthetic examples for the positive class. They further added a mean-pooling layer and a dropout layer on top of the model before the final classification layer.

Team Fight for 4230 [102] (1A:en:2 1B:en:1) focused its efforts mostly on two fronts: the creation of a pre-processing module able to properly normalize the tweets, and the augmentation of the data by means of machine translation and WordNet-based substitutions. The pre-processing included link removal and punctuation cleaning, as well as the expansion of quantities and contractions. All hashtags related to COVID-19 were normalized into one, and the hashtags were expanded. Their best approach was based on BERTweet with a dropout layer and the above-mentioned pre-processing.

Team GPLSI [77] (1A:en:5 1A:es:2) applied the RoBERTa and the BETO transformers together with different manually engineered features, such as the occurrence of dates and numbers or of words from LIWC. A thorough exploration of parameters was made using weighting and bias techniques. They also tried to split the four-way classification into two binary classifications and one three-way classification. They further tried oversampling and undersampling.

Team iCompass (ar:4) used several pre-processing steps, including (i) English word removal, (ii) removing URLs and mentions, and (iii) data normalization, which removed tashkeel and the letter madda from the text, as well as duplicates, and replaced some characters to prevent mixing. They proposed a simple ensemble of two BERT-based models: AraBERT and Arabic-ALBERT.
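Several of the Task 1 systems above describe similar tweet normalization steps (replacing URLs and mentions with special tokens, collapsing COVID-19 hashtags into one). The sketch below illustrates such a normalization step; the exact rules and placeholder tokens are assumptions rather than any team's actual code.

```python
# Illustrative tweet normalization for check-worthiness models.
import re

COVID_HASHTAGS = re.compile(r"#(covid[_-]?19|coronavirus|corona)\b", flags=re.IGNORECASE)

def normalize_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "HTTPURL", text)   # replace URLs with a special token
    text = re.sub(r"@\w+", "@USER", text)             # replace user mentions with a special token
    text = COVID_HASHTAGS.sub("#covid19", text)       # collapse COVID-19 related hashtags into one
    return re.sub(r"\s+", " ", text).strip()          # squeeze whitespace

print(normalize_tweet("Check this @someone https://t.co/xyz #Coronavirus  news"))
# -> "Check this @USER HTTPURL #covid19 news"
```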
Team NLP&IR@UNED [49] (1A:en:1 1A:es:4) used several transformer models, such as BERT, ALBERT, RoBERTa, DistilBERT, and Funnel-Transformer, and compared their performance. For English, they obtained better results using BERT trained on tweets. For Spanish, they used Electra.

Team NLytics [74] (1A:en:8 1B:en:3) used RoBERTa with a regression function in the final layer, approaching the problem as a ranking task.

Team QMUL-SDS [1] (1A:ar:4) used the AraBERT preprocessing function to (i) replace URLs, email addresses, and user mentions with standard words, (ii) remove line breaks, HTML markup, repeated characters, and unwanted characters such as emotion icons, (iii) handle white spaces between words and digits (non-Arabic or English), or a combination of both, as well as before and after brackets, and (iv) remove unnecessary punctuation. They addressed the task as a ranking problem, and fine-tuned an Arabic transformer (AraBERTv0.2-base) on a combination of the data from this year and the data from the CheckThat! lab 2020 (the CT20-AR dataset).
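As an illustration of treating check-worthiness as a ranking problem, as several of the systems above do, one can score each tweet with an already fine-tuned binary classifier and sort by the probability of the check-worthy class. The checkpoint below is a generic stand-in (not fine-tuned for this task), and the assumption that label index 1 is the positive class is ours.

```python
# Sketch: rank tweets by the predicted probability of being check-worthy.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # stand-in; in practice, a checkpoint fine-tuned for check-worthiness
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def rank_by_checkworthiness(tweets):
    enc = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Assumption: label index 1 corresponds to the "check-worthy" class.
        probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
    order = torch.argsort(probs, descending=True).tolist()
    return [(tweets[i], float(probs[i])) for i in order]  # best-ranked tweet first
```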