Automated Fact-Checking: A Survey
Xia Zeng, Amani S. Abumansour, Arkaitz Zubiaga
2021-09-23

Abstract. As online false information continues to grow, automated fact-checking has gained an increasing amount of attention in recent years. Researchers in the field of Natural Language Processing (NLP) have contributed to the task by building fact-checking datasets, devising automated fact-checking pipelines and proposing NLP methods to further research in the development of different components. This paper reviews relevant research on automated fact-checking covering both the claim detection and claim validation components.

While online content continues to grow at an unprecedented pace, the spread of false information online increases the potential to mislead people and cause harm. This leads to an increasing demand for fact-checking, i.e. a task consisting in assessing the truthfulness of a claim [Vlachos and Riedel, 2014], where a claim is defined as 'a factual statement that is under investigation' [Hanselowski, 2020]. A number of fact-checking organisations have been founded in recent years, e.g. FactCheck, PolitiFact, Full Fact, Snopes, Poynter and NewsGuard. Fact-checkers conduct laborious manual fact-checking, i.e. familiarising with the topic, identifying the claim, aggregating evidence, checking source credibility, verifying the claim and its reasoning chain, and checking for fallacies [Hanselowski, 2020]. However, the speed and efficiency of manual fact-checking cannot keep up with the pace at which online information is posted and circulated.

The journalism community can benefit from tools that, at least partially, automate the fact-checking process [Cohen et al., 2011, Hassan et al., 2017a, Konstantinovskiy et al., 2021], particularly by automating more mechanical tasks, so that human effort can instead be dedicated to more labour-intensive tasks [Babakar and Moy, 2016]. Restricting claims to those that are objectively fact-checkable makes the automation task more realistically achievable while reducing the volume of content needing manual fact-checking. Furthermore, recent progress in the fields of natural language processing (NLP), information retrieval (IR) and big data mining has demonstrated the potential for efficiently processing large-scale textual information online, which inspires automated fact-checking.

Researchers have developed valuable fact-checking datasets, pipelines and models, an effort which has also been supported by shared tasks, including RumourEval [Derczynski et al., 2017b, Gorrell et al., 2018], CLEF CheckThat! [Barron-Cedeno et al., 2020], ClaimBuster [Hassan et al., 2017c], FEVER [Thorne et al., 2018a, Thorne et al., 2019], SCIVER [Wadden et al., 2020], the Fake News Challenge [Pomerleau and Rao, 2017], and the HeroX fact-checking challenge [Francis and Fact, 2016]. Reflecting their different major concerns, proposed pipelines take various forms. For instance, ClaimBuster [Hassan et al., 2017c] designed a comprehensive pipeline of four components to verify web documents: a claim monitor that performs document retrieval; a claim spotter that performs claim detection; a claim matcher that matches a detected claim to fact-checked claims; and a claim checker that performs evidence extraction and claim validation. A similar pipeline was proposed by CLEF CheckThat!
[Nakov et al., 2021], which in its 2021 edition included three subtasks: first, perform claim detection to detect claims that are check-worthy; second, determine whether a claim has been previously fact-checked; and third, perform claim validation to determine the factuality of the detected claims. While some pipelines include claim detection, others are designed to tackle claim validation only, e.g. FEVER [Thorne et al., 2018a, Thorne et al., 2019] and SCIVER [Wadden et al., 2020] (note that the task of claim validation is referred to as fact-checking by some papers in the literature), assuming check-worthy claims are already at hand. Figure 1 depicts a comprehensive fact-checking pipeline as discussed in this survey, consisting of two components: (1) a claim detection component, which looks for claims that need checking and tries to find matches between claims when they are related to the same fact-check, and (2) a claim validation component, which retrieves the documents and rationales that can serve as evidence to fact-check a claim and ultimately performs the verification task, producing a verdict.

In this paper, we present a survey on automated fact-checking with special focus on claim detection and claim validation. The paper is structured as follows. Section 2 focuses on the task of detecting check-worthy claims, which is the very first task of a comprehensive automated fact-checking pipeline; it also contains a brief overview of claim matching. Section 3 presents the task of claim validation, which typically involves addressing evidence retrieval and claim verification together. In Section 4, we discuss advantages and drawbacks of current automated fact-checking pipelines, with a focus on current challenges and promising future directions. Section 5 presents closely related NLP tasks. Conclusions are drawn in Section 6.

Claim detection plays a crucial role in automated fact-checking systems, as all other components rely on the output of this stage. It aims to relieve fact-checkers of the burden of identifying claims and to minimise the volume of online content they need to deal with. The claim detection component is responsible for selecting claims that need to go through the rest of the fact-checking pipeline because they need to be checked, i.e. they require verification. For instance, a factual statement such as "He voted against the first gulf war" can be deemed a claim that should be fact-checked. In contrast, a piece of opinion such as "I think it's time to talk about the future" is not a claim that should be fact-checked [Hassan et al., 2017a]. Going further, one can also distinguish between check-worthy and non-check-worthy claims [Nakov et al., 2021], i.e. statements that are claims but may or may not be worth fact-checking. For example, one could argue that "the government invested more than 10 billion last year in education" is a claim that is worthy of fact-checking, whereas a claim such as "my friend had a coffee this morning for breakfast" may not be worthy of fact-checking. Researchers typically formulate the problem as taking a set of sentences as input (e.g. originating from a debate or conversation), and tackle it either as a classification task, where a binary decision is made on whether each input sentence constitutes a claim or not, or as a ranking task, where input sentences are ranked by check-worthiness so that the most check-worthy claims appear at the top of the list.
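To make the two formulations concrete, the following is a minimal sketch (not any published system) that trains a simple classifier on a handful of invented sentences and then uses its scores both for a binary claim/non-claim decision and for ranking by check-worthiness; the example sentences, labels and model choice are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: 1 = check-worthy claim, 0 = not (labels are invented).
sentences = [
    "He voted against the first gulf war.",
    "I think it's time to talk about the future.",
    "The government invested more than 10 billion last year in education.",
    "My friend had a coffee this morning for breakfast.",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)
clf = LogisticRegression().fit(X, labels)

# Classification view: a hard claim / non-claim decision per input sentence.
new_sentences = ["Unemployment fell by two percent last year."]
print(clf.predict(vectorizer.transform(new_sentences)))

# Ranking view: order sentences by predicted check-worthiness so that
# fact-checkers can prioritise the top of the list.
scores = clf.predict_proba(X)[:, 1]
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.2f}  {sentence}")
```

In practice the same scoring model can serve both views: thresholding its scores yields the classification output, while sorting by them yields the ranked list.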
In recent studies, several datasets were built with the purpose of training machine learning models to predict check-worthy claims, as shown in Table 1. The vast majority of datasets cover sentences pertaining to the political domain, largely as a result of events occurring around the US elections. In contrast, the CheckThat! Lab released English and Arabic datasets that contain a small number of instances related to COVID-19 in early 2020. The datasets of [Atanasova et al., 2019a] are the largest, while CW-USPD-2016 [Gencheva et al., 2017], CT-CWC-18, CT20-AR [Hasanain et al., 2020], and FactRank [Berendt et al., 2020] are an order of magnitude smaller, followed by other smaller datasets. Datasets use binary classes, in either single-label or multi-label form, depending on the annotation process. The annotation process varies across datasets: some datasets are automatically built by collecting claims from fact-checking websites, while other datasets rely on manual annotations given specific definitions of check-worthiness. Crowd-sourcing platforms have also been demonstrated to be helpful [Hassan et al., 2015]. Moreover, the majority of datasets are available in the English language, as opposed to a smaller number of datasets in the Arabic language. Most of these Arabic datasets are generated from translations of originally English datasets, except for CT20-AR, which is originally Arabic content. In addition, there is one dataset in Dutch and another one in Turkish, while datasets in other languages are not yet available.

ClaimBuster is the first automated fact-checking system that consists of integrated components tackling the entire fact-checking pipeline, starting from the claim detection component. Its claim detection component, called "claim spotter", classifies each input sentence as (1) a check-worthy factual claim, (2) an unimportant factual claim, or (3) a non-factual claim. This in turn assists fact-checkers by prioritising the most check-worthy claims, ranking them and evaluating the ranking with measures such as Precision at k (P@k) [Hassan et al., 2017b]. To develop this, a multi-class Support Vector Machine (SVM) classifier was built using features such as bag-of-words, Part-Of-Speech (POS) tags, and Entity Types (ET). The model achieved competitive performance and was considered the baseline to beat in subsequent works [Hassan et al., 2017a].

Another model called "CNC" (i.e. "Claim/not Claim") [Konstantinovskiy et al., 2021] builds on top of InferSent embeddings [Conneau et al., 2017], combining them with part-of-speech tags and named entities found in texts, which are fed to a Logistic Regression classifier. The authors of CNC set out primarily to improve the recall of their system, arguing that fact-checkers do not want to miss any claims (no false negatives), while they can deal with some false positives. While improving in terms of recall, CNC also achieved superior performance in F1 score.

Apart from traditional machine learning models, neural networks have also been studied for the claim detection task. For example, in the CheckThat! Lab 2019 shared task, LSTM and feed-forward neural networks were the most effective models used by the top two participants. Along with the use of neural networks, top participants also showed the usefulness of context (i.e. surrounding sentences) in improving claim detection performance [Elsayed et al., 2019].
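Before looking further at the role of context, a brief aside on evaluation: the snippet below illustrates the Precision at k (P@k) ranking measure mentioned above for ClaimBuster; the function and the example ranking are hypothetical and only meant to show how the metric is computed.

```python
def precision_at_k(ranked_labels, k):
    """ranked_labels: gold labels (1 = check-worthy, 0 = not), ordered by model score."""
    top_k = ranked_labels[:k]
    return sum(top_k) / k

# Suppose a model ranks six sentences and these are their gold labels in ranked order:
ranked_labels = [1, 1, 0, 1, 0, 0]
print(precision_at_k(ranked_labels, 3))  # 2 of the top 3 are check-worthy -> ~0.67
```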
The use of context was studied in more detail in another work conducted outside the shared task, in this case by [Atanasova et al., 2019b]. They studied the inclusion of context and discourse features along with sentence-level features. They used a Feed-Forward Neural Network (FNN) as the model, which was then evaluated as a ranking task, demonstrating the effectiveness of context and discourse features.

While all aforementioned works focused on English claims, there have also been efforts in other languages. For example, the ClaimRank model [Jaradat et al., 2018] was tested on Arabic claims (translated from claims originally in English). The Arabic claim detection model used Farasa [Abdelali et al., 2016] for tokenization and part-of-speech (POS) tagging, as well as MUSE embeddings. The first experiments on original Arabic data (rather than translated) were conducted in the CheckThat! 2020 shared task. Most participants proposed approaches fine-tuning pre-trained language models. For instance, the top-performing participant fine-tuned AraBERT v0.1 with neural networks [Williams et al., 2020]. Likewise, [Hasanain and Elsayed, 2020] fine-tuned multilingual BERT (mBERT) with different classification models. Another recent effort, called FactRank [Berendt et al., 2020], focused on claim check-worthiness detection for the Dutch language, in this case using a convolutional neural network (CNN) along with Platt scaling for an SVM model and a softmax to obtain the degree of check-worthiness.

Given that methods for claim detection have been applied in different settings or on different datasets, it is difficult to establish what the state-of-the-art model is today. However, a good proxy for determining the best-performing model is to look at the leaderboard of the most recent edition of the CheckThat! shared task.

Another task that has recently emerged is claim matching, also referred to as identifying previously fact-checked claims. For a claim spotted by the claim detection component, claim matching consists in determining whether this is a claim that exists in the database and can be resolved by a previous fact-check. The task is formulated as follows: given a check-worthy claim as input, and having a database of previously fact-checked claims, determine if any of the claims in the database is related to the input; in this case, the new claim would not need fact-checking again, as it was fact-checked in the past. It is normally framed as a ranking task, where claims in the database are ranked based on their similarity to the input claim [Shaar et al., 2020a]. This task comes right after the claim detection component, to determine if the claim is new, and can help avoid the need for running the claim validation component for a particular claim when it is found in the database.

There are two released datasets: one based on PolitiFact and the other based on Snopes. Initial explorations used BM25 [Robertson et al., 1994] and BERT-based models respectively, as well as an SVM re-ranker with features from both approaches [Shaar et al., 2020a]. In addition, CLEF2020-CheckThat! held a shared task on Verified Claim Retrieval using the Snopes dataset. While the baseline system is a simple BM25 system, shared task participants explored various scoring functions, including unsupervised approaches such as Terrier and Elastic Search scores, classic supervised models such as SVMs, and various BERT-based models [Shaar et al., 2020b].
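As a minimal illustration of the BM25 baseline used in these claim matching efforts, the sketch below ranks a small database of previously fact-checked claims against a new claim; the rank_bm25 package is one of several possible BM25 implementations, and the claims are invented, so this is an illustration rather than a reproduction of [Shaar et al., 2020a].

```python
from rank_bm25 import BM25Okapi

# A toy database of previously fact-checked claims (invented examples).
fact_checked_claims = [
    "The government invested 10 billion in education last year.",
    "Vaccines cause autism.",
    "The unemployment rate fell to 4% in 2019.",
]
bm25 = BM25Okapi([claim.lower().split() for claim in fact_checked_claims])

# Score every database claim against the newly detected claim.
new_claim = "Last year education received over 10 billion in government investment."
scores = bm25.get_scores(new_claim.lower().split())

# Rank by similarity; a sufficiently high top score suggests the claim
# may already be resolved by an existing fact-check.
for score, claim in sorted(zip(scores, fact_checked_claims), reverse=True):
    print(f"{score:.2f}  {claim}")
```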
Buster.ai, the winning team, fine-tuned a RoBERTa [Liu et al., 2019b] model on the task, having first fine-tuned it on other fact-checking datasets [Bouziane et al., 2020]. Team UNIPI-NLE, achieving performance close to the winning team, performed two cascaded fine-tunings of a Sentence-BERT [Reimers and Gurevych, 2019] model [Passaro et al., 2020].

As a component of the automated fact-checking pipeline, claim validation is formulated as 'the assignment of a truth value to a claim made in a particular context' [Vlachos and Riedel, 2014]. In order to fulfill the task of claim validation, two major approaches to verification have emerged: 1) the claim is verified against textual references such as documents from Wikipedia [Thorne et al., 2018a, Thorne et al., 2019]; 2) the claim is verified against existing knowledge bases [Shi and Weninger, 2016, Syed et al., 2019]. Both approaches assume their references are reliable. The first approach may limit evidence to trusted resources such as Wikipedia, fact-checking websites, peer-reviewed academic papers, and government documents, achieving substantial coverage of information. The second approach, however, faces bigger challenges in terms of coverage of reliable information. Existing knowledge bases tend to be too small to cover sufficient information for claim validation purposes [Mendes et al., 2012, Azmy et al., 2018, Pellissier Tanon et al., 2020]. Attempts have been made to automatically populate knowledge bases [Nakashole and Weikum, 2012, Adel, 2018, Balog, 2018, Mesquita et al., 2019], but this approach risks introducing further unreliable noise and makes the knowledge bases harder to maintain. Due to its maturity and reliability, our survey focuses on the first approach.

There have been a number of shared tasks focused on claim validation in slightly different ways. One of the major differences is whether the final verification step relies on previously identified pieces of evidence (such as Wikipedia documents or scientific articles) or instead on the stances expressed by users (for example by aggregating supporting or opposing stances towards a story on social media). Of those relying on evidence, well-known shared tasks include FEVER [Thorne et al., 2018a] and SCIVER [Wadden et al., 2020], both of which perform different forms of evidence retrieval first and then perform claim validation based on that evidence. On the other hand, both UKP Snopes [Hanselowski et al., 2019] and RumourEval [Derczynski et al., 2017a, Gorrell et al., 2018] propose to tackle the task by retrieving texts relevant to a story, then determining the stance of those texts, to ultimately classify the veracity of the story.

The NLP community has developed valuable datasets to progress research in automated claim validation, though with the common issues of being synthetic and imbalanced. As shown in Table 2, recent datasets are not only growing in size, but they also attempt to capture naturally occurring sentences, include context and metadata, cover different domains, and offer evidence chains.

Evidence retrieval is conventionally addressed in two steps: document retrieval and rationale selection. Document retrieval is the task of retrieving relevant documents that support the prediction of a claim's veracity. Rationale selection is the task of selecting directly relevant sentences out of the retrieved documents, yielding the final supporting evidence for claim verification.
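A minimal sketch of this two-step evidence retrieval process is given below, using plain TF-IDF cosine similarity for both steps; the documents, claim and choices of k are invented for illustration, and real systems rely on the stronger retrieval and ranking methods discussed next.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

claim = "The government invested 10 billion in education last year."
documents = [
    "The 2020 budget allocated more than 10 billion to education. Health spending also rose.",
    "The national football team won the cup. Fans celebrated across the country.",
    "Education funding has grown every year since 2015. Last year it passed 10 billion.",
]

# Step 1: document retrieval -- rank documents by TF-IDF cosine similarity to the claim.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents + [claim])
doc_scores = cosine_similarity(doc_matrix[-1], doc_matrix[:-1])[0]
k_docs = 2
top_docs = [documents[i] for i in doc_scores.argsort()[::-1][:k_docs]]

# Step 2: rationale selection -- split the retrieved documents into sentences
# and keep the k most similar sentences as the final evidence.
sentences = [s.strip() for doc in top_docs for s in doc.split(".") if s.strip()]
sent_matrix = vectorizer.transform(sentences + [claim])
sent_scores = cosine_similarity(sent_matrix[-1], sent_matrix[:-1])[0]
k_sents = 2
rationale = [sentences[i] for i in sent_scores.argsort()[::-1][:k_sents]]
print(rationale)
```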
Document Retrieval. Deeply influenced by information retrieval research, the majority of work in the literature addresses document retrieval as a ranking problem consisting in retrieving the top k documents. Various combinations of named entities, noun phrases and capitalised expressions from the claim have been used to query search APIs, such as Google or Wikipedia, and search servers when participating in the FEVER shared task [Thorne et al., 2018b]. Metadata such as page viewership statistics is helpful to rank webpages [Nie et al., 2019]. However, when search engines are not available, such as in the SCIVER shared task, the majority of effort goes into exploring similarity metrics that are used as a proxy to determine the documents' relevance to a claim. TF-IDF similarity is a common baseline [Wadden et al., 2020, Malon, 2018], and BM25 [Robertson et al., 1994] has been demonstrated to be effective [Pradeep et al., 2020]. When dealing with a specific domain, in-domain word embeddings are also a promising option, e.g. BioSentVec [Chen et al., 2019a] for the SCIFACT dataset [Li et al., 2021]. Instead of relying completely on unsupervised methods, improvements have been achieved by supervised re-ranking on top of a large number of retrieved documents [Pradeep et al., 2020].

Table 2: Datasets for claim validation.

Dataset | Size | Domain | Notes
PolitiFact [Vlachos and Riedel, 2014] | 106 claims | Politics | Very small; metadata and evidence of various forms
Emergent [Ferreira and Vlachos, 2016] | 300 claims | News | Very small; 2,595 associated documents
LIAR [Wang, 2017] | 12,836 claims | Politics | Medium; metadata
Snopes [Popat et al., 2017] | 4,956 claims | Snopes website | Medium; 30 Google-retrieved documents for each claim
FEVER [Thorne et al., 2018a] | 185,445 claims | Wikipedia | Big; associated Wikipedia evidence
LIAR-PLUS [Alhindi et al., 2018] | 12,836 claims | Politics | Medium; metadata and 10 Google-retrieved webpages for each claim
Scifact [Wadden et al., 2020] | 1,409 claims | Scientific papers | Small; associated documents
PolitiHop [Ostrowski et al., 2020] | 500 claims | Politics | Very small; evidence chains for multi-hop reasoning
WikiFactCheck-English [Sathe et al., 2020] | 124,821 claims | Wikipedia | Big; context and evidence
Climate-FEVER [Diggelmann et al., 2021] | 1,535 claims | Climate | Medium; 7,675 claim-evidence pairs with climate-related claims verified against Wikipedia evidence
COVID-Fact [Saakyan et al., 2021] | 4,086 claims | COVID-19 | Medium; 1,296 supported claims from the r/COVID19 subreddit and 2,790 automatically generated refuted claims
Vitamin-C [Schuster et al., 2021] | 488,904 pairs | Wikipedia | Big; contrastive evidence from Wikipedia edits
FEVEROUS [Aly et al., 2021] | 87,026 claims | Wikipedia | Biggest; evidence collected from both structured and unstructured information on the whole of Wikipedia

Rationale Selection. Keyword matching, sentence similarity scoring and supervised ranking are common approaches to rationale selection [Thorne et al., 2018b]. As with document retrieval, systems typically use one of these approaches, or a combination of them, to obtain a ranking score and select the top k sentences as the rationale, with the value of k chosen manually [Pradeep et al., 2020]. Most studies in the literature conduct evidence retrieval by addressing document retrieval and rationale selection in a pipeline manner, which ignores valuable information across sentences.

Claim verification is commonly addressed as a text classification task by NLP researchers. Given a claim under investigation and its retrieved evidence, models need to reach a verdict on the claim, which may be 'SUPPORT', 'CONTRADICTION' or 'NOT ENOUGH INFORMATION'.
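To illustrate this classification view of claim verification, the snippet below scores a claim against a piece of retrieved evidence with a publicly available model fine-tuned on MNLI ('roberta-large-mnli'); the model choice, the example texts and the loose mapping of its contradiction/neutral/entailment labels onto CONTRADICTION / NOT ENOUGH INFORMATION / SUPPORT are assumptions made for this sketch rather than part of any particular fact-checking system.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"  # an assumed, publicly available NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

evidence = "The 2020 budget allocated more than 10 billion to education."
claim = "The government invested more than 10 billion in education last year."

# Encode (premise = evidence, hypothesis = claim) and predict the relation.
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

for label_id, label in model.config.id2label.items():
    print(f"{label}: {probs[label_id].item():.2f}")
```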
Some other datasets [Hanselowski et al., 2019] include other labels such as 'mostly-true', 'half-true', 'pants-fire', 'mostly false', 'mostly true' and 'other'; their finer granularity is more difficult to tackle through automated means, and they are sometimes collapsed into fewer labels. An important observation here is the difference in the types of labels used by different studies. Some studies rely on truth values (e.g. true, false, half-true), determining the veracity value of a claim. Others refer to the concept of support instead (i.e. support, contradict), determining whether there is agreement between the claim and the reference. The latter avoids making an explicit connection with truthfulness, looking instead at the alignment of a claim with respect to a given reference.

The task of claim verification may essentially be addressed as a Recognising Textual Entailment (RTE) task, i.e. 'deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from another text' [Dagan et al., 2009], or a Natural Language Inference (NLI) task, i.e. 'characterising and using semantic concepts of entailment and contradiction in computational systems' [Bowman et al., 2015]. Given that claim verification is predominantly addressed as an RTE or NLI task, we present a brief overview of them below. The RTE task, which dates back to 2005, focuses on detecting whether the hypothesis h is entailed by a given text t or not, which corresponds to 'SUPPORT' or not. Proposed models may take a linguistic approach, a statistical approach, a machine learning approach or a hybrid of these. The NLI task, equipped with many large-scale labelled datasets, has made large neural models the dominant approach. State-of-the-art models are large pre-trained language models that are fine-tuned on large NLI datasets.

Recognising Textual Entailment (RTE). In formal semantics, a text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true [Chierchia and McConnell-Ginet, 2000]. This definition, as well as many other formal linguistic theories, is theoretically sound but practically too rigid to handle uncertainty. In a practical NLP context, entailment is defined to include cases where the truth value of hypothesis h is highly plausible given text t, rather than absolutely certain [Dagan et al., 2009]. In other words, 'text t entails hypothesis h if, typically, a human reading t would infer that h is most likely true' [Bar-Haim et al., 2014]. In contrast to a formal theoretical definition, this somewhat informal definition of entailment allows and requires common sense background knowledge.

The task of RTE started as a two-way classification of deciding whether hypothesis h is entailed/supported by text t or not [Bar-Haim et al., 2006, Dagan et al., 2005, Giampiccolo et al., 2007]. After the notion of 'contradiction', i.e. 'the negation of the hypothesis h is entailed by the text t', was introduced [de Marneffe et al., 2008], the RTE task became a three-way classification task of predicting labels of a text pair out of 'ENTAILMENT', 'CONTRADICTION' and 'UNKNOWN' [Giampiccolo et al., 2008]. Driven by a yearly RTE challenge from 2005 to 2011, the NLP community developed some useful datasets for the task, specifically the RTE1-RTE7 datasets. Despite being relatively small and imbalanced, these datasets enabled the development and evaluation of various approaches. While lexical-based and syntax-based approaches struggled to achieve good results [Bar-Haim et al., 2014, Dagan et al., 2009], machine learning approaches achieved reasonable performance, often combined with logical or probabilistic methods.
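The limitations of such shallow approaches are easy to see in a toy word-overlap baseline like the one sketched below; the threshold and examples are arbitrary, and the point is only that surface overlap captures neither paraphrase nor negation.

```python
def lexical_overlap_entails(text, hypothesis, threshold=0.8):
    """A naive entailment heuristic: fraction of hypothesis words that also appear in the text."""
    t_tokens = set(text.lower().split())
    h_tokens = set(hypothesis.lower().split())
    overlap = len(h_tokens & t_tokens) / len(h_tokens)
    return overlap >= threshold

print(lexical_overlap_entails(
    "The government invested 10 billion in education last year.",
    "Education received 10 billion from the government."))  # False: misses the paraphrase
print(lexical_overlap_entails(
    "The government did not invest 10 billion in education.",
    "The government invested 10 billion in education."))     # True: false positive, negation ignored
```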
One of the earlier models attempted to feed deep semantic features, generated by a first-order theorem prover and a finite model builder, into a machine learning classifier to make predictions [Bos and Markert, 2006]. Surprisingly, the deep semantic features failed to outperform shallow semantic features. This is likely due to the model's naïve and rigid representation of sentences, the lack of background knowledge and the flawed sample distribution of the dataset. Another intuitive approach is to first induce representations of text snippets into a hierarchical knowledge representation and then use a sound inferential mechanism to prove semantic entailment [de Salvo Braz et al., 2006]. Despite its sound and tangible system design, this model only achieved an overall accuracy of 65.9%. Furthermore, the NatLog system deals with the problem in three stages: it first conducts linguistic analysis, then aligns the dependency graphs of the text t and the hypothesis h, and finally uses a decision tree classifier to perform entailment inference based on antonyms, polarity, graph structure and semantic relations [Chambers et al., 2007]. The NatLog system trades low recall (31.71 on the RTE3 test set) for higher precision (68.06 on the RTE3 test set). To help address the low recall achieved by first-order rules, the class of pair feature spaces was introduced [Zanzotto et al., 2009]; it allowed the model to enrich the sentence pair with 'placeholders' and then generate first-order rewrite rules to relax this rigidness. This model achieved around 68% overall accuracy on RTE3. Moreover, COGEX developed a system that first transforms the text into three-layered, semantically rich logic form representations, then generates a set of linguistic and world knowledge axioms, and searches for a proof of entailment [Tatu and Moldovan, 2007]. This system achieved an overall accuracy of 72.25%. Overall, many inspiring hybrid models of logical inference and machine learning methods were developed for the RTE challenges. Though they did not achieve perfect performance, we believe they have great potential once equipped with better text representations and more powerful neural models.

Natural Language Inference (NLI). More recently, NLI has been proposed as 'the problem of determining whether a natural language hypothesis h can reasonably be inferred from a given premise p' [Bowman et al., 2015, MacCartney and Manning, 2009]. Noticeably, the definition of NLI is very similar to that of RTE, and researchers tend to mention the two together when addressing the problem. Despite that, NLI datasets show improvements over RTE datasets. Earlier RTE datasets, published before the notion of 'CONTRADICTION' attracted enough attention, only have binary labelling of 'ENTAILMENT' or not. In contrast, NLI datasets all include three-way labelling with 'ENTAILMENT', 'CONTRADICTION' and 'UNKNOWN'. Furthermore, recent NLI datasets are larger, have a more balanced label distribution and cover various domains. A number of NLI datasets are potentially useful for claim verification. With their large size and balanced design, NLI datasets have powered large neural network models, which have become the dominant approach. The common practice is to fine-tune a large pre-trained language model on the target NLI dataset, which may or may not be coupled with small task-specific techniques. Compared with traditional approaches, this improves text representations, achieves better generalisability and allows more complex computation without relying on hand-crafted rules.
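The following is a condensed sketch of that common practice, fine-tuning a pre-trained language model on an NLI dataset with the Hugging Face transformers and datasets libraries; the specific model ('bert-base-uncased'), dataset (SNLI) and hyperparameters are assumptions for illustration, and a real experiment would add proper evaluation, metrics and hyperparameter tuning.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load an NLI dataset (SNLI is an assumed choice) and drop unlabelled pairs.
dataset = load_dataset("snli").filter(lambda example: example["label"] != -1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Encode premise-hypothesis pairs into a single input sequence.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Three labels: entailment / neutral / contradiction.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=3)

args = TrainingArguments(output_dir="nli-finetuned",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset["train"]).train()
```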
Current state-of-the-art models on NLI datasets are BERT [Devlin et al., 2019], RoBERTa [Liu et al., 2019b], MT-DNN [Liu et al., 2019a], ALBERT [Lan et al., 2020] and T5 [Raffel et al., 2020].

In this section, we discuss current progress in each of the components of the automated fact-checking task, as well as highlight the main open challenges.

Conceptual Definition of Claim. The definition of claim check-worthiness remains brief [Allein and Moens, 2020]. Full Fact describes it as "an assertion about the world that can be checked". In contrast, [Konstantinovskiy et al., 2021] argue that this definition is not enough to decide whether a claim is worth checking or not. Similarly, [Berendt et al., 2020] note that not every factual claim will be verified by fact-checkers.

Narrow Domains. Claims in the political domain dominate the interest of journalists and researchers, as can be seen in existing datasets. As an example, [Wright and Augenstein, 2020] investigated the development of a claim check-worthiness detection method that would perform consistently across different domains, in this case rumours on Twitter, Wikipedia citations, and political speeches. However, the method faced important challenges in trying to perform well across domains. Recent research in claim detection has expanded to focus on health claims too, particularly owing to COVID-19.

Annotation Issues. Labelling of sentences as claims or non-claims is generally done manually by non-experts (see Table 1). An alternative to this is to derive labels from previously fact-checked claims collected from fact-checking websites. The main caveat of this approach is that fact-checking websites only list claims, rather than non-claims, which means that one needs to develop models that only leverage instances of the positive class, i.e. positive unlabelled learning [Wright and Augenstein, 2020]. The majority of datasets are imbalanced, with non-check-worthy claims outnumbering check-worthy claims. While this is possibly due to the nature of the task, existing models can have a tendency to overfit due to this imbalance, which calls for more research to tackle the problem. For example, in the CheckThat! Lab 2020, [Williams et al., 2020] attempted to mitigate overfitting by retraining the model on a resampled set of positive instances, augmented through translation between Arabic and English.

Despite the noticeable progress, current automated claim validation systems also face unique challenges and call for improvements in several key aspects: dataset quality, system integrity and model interpretability.

Dataset Quality. State-of-the-art systems rely heavily on training large language models, which require large-scale, high-quality labelled datasets that are expensive to build and may be unrealistic to obtain for specific domains. Despite the great contributions recent datasets have made, they tend to be imbalanced and somewhat synthetic, which is not ideal for model training. We believe future high-quality datasets will continue to help progress the field.

System Scalability and Integrity. Proposed automated claim validation systems cover a range of relevant tasks. Though a few of them try to jointly handle rationale selection and claim verification [Hidey and Diab, 2018, Li et al., 2021], most of them are pipeline systems that deal with subtasks separately. Improved scalability and integrity are desired.
Large pre-trained language models, the currently dominant approach across the relevant tasks, require substantial computing resources for training and inference. The scalability and accessibility of the proposed systems therefore remain limited. Furthermore, increased system integrity is desired. Pipeline designs have an inevitable disadvantage: downstream components can only make inferences on upstream results, and errors accumulate throughout the pipeline. For instance, a claim verification component that takes the retrieved evidence and the detected claim as input will perform poorly with low-quality evidence or claims that are not checkable. Furthermore, the popular three-way label prediction approach is not the best approach for claim verification. Models struggle particularly to predict the contradiction relation due to a lack of training data for this class, which accumulates errors across classes. For example, a model may predict a claim to be "NO INFO" while it should be "CONTRADICT", which makes it a false positive for the "NO INFO" class and a false negative for the "CONTRADICT" class. Preliminary research splitting the three-way classification into two binary classifications [Zeng and Zubiaga, 2021] is likely to help avoid such errors. Moreover, current approaches leave no room for aggregating evidence across sentences. We believe a more compact overall system design is desired for automated claim validation, such that it handles subtasks in a systematic way. A promising direction is to train a model to learn all involved tasks in a multi-task learning manner so that it may optimise for better overall performance.

Model Interpretability. Neural networks are robust but struggle with interpretability and generalisability [Duan et al., 2020], both of which are of particular importance for automated claim validation. Underwhelming model interpretability increases the probability of models making the right prediction based on the wrong evidence. In contrast, symbolic systems, although unfortunately fragile and inflexible, have strong interpretability and abstraction. Naturally, building a neural-symbolic system that integrates neural networks with symbolic logic becomes an interesting direction. In a nutshell, neural-symbolic systems = connectionist machine + logical abstractions [Besold et al., 2017]. Researchers have proposed various architectures that incorporate first-order logic into neural networks. A recent study proposed a general framework capable of enhancing neural networks with declarative first-order logic [Hu et al., 2016]. Another study explored a symbolic intermediate representation for neural surface realisation [Elder et al., 2019] that is similar to first-order logic. Moreover, a recent attempt adapted module networks to model natural logic operations, enhanced with a memory component to model contextual information [Feng et al., 2020]. Furthermore, RuleNN [Sen et al., 2020] was developed to tackle sentence classification with models expressed in first-order logic, achieving performance comparable to some neural models. Neural-symbolic methods have the fascinating potential of attaining interpretability from symbolic models and robustness from neural models. We believe that designing and implementing neural-symbolic methods for the various tasks of automated fact-checking is promising and of particular interest to our society.
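Returning to the decomposition of the three-way verdict mentioned earlier in this section, the sketch below shows one schematic way of chaining two binary decisions; this particular decomposition (first deciding whether there is enough information, then support versus contradiction) and the toy stand-in classifiers are assumptions for illustration, not necessarily the exact formulation of [Zeng and Zubiaga, 2021].

```python
def verify(claim, evidence, has_enough_info, supports):
    """Combine two binary classifiers into a final three-way verdict.

    has_enough_info and supports are any callables (claim, evidence) -> bool.
    """
    if not has_enough_info(claim, evidence):
        return "NOT ENOUGH INFO"
    return "SUPPORT" if supports(claim, evidence) else "CONTRADICT"

# Toy stand-in classifiers, used here only to make the sketch runnable.
has_enough_info = lambda claim, evidence: len(evidence.strip()) > 0
supports = lambda claim, evidence: "10 billion" in evidence

print(verify("The government invested 10 billion in education last year.",
             "The 2020 budget allocated more than 10 billion to education.",
             has_enough_info, supports))  # -> SUPPORT
```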
There are some other popular tasks in natural language processing which are also related to the accuracy, verifiability and credibility of information, which we briefly discuss next as topics recommended for further reading.

Fake News Detection. This is the task of determining whether a news article on the web is accurate or not [Shu et al., 2017]. Proposed classification approaches are typically centred on shallow features of the articles: n-grams, characters, stop words, part-of-speech tags, readability scores, term frequency, etc. Some more advanced approaches use additional metadata. However, these approaches are more likely to merely capture patterns of different article styles, rather than to sensibly distinguish reliable and unreliable articles [Hanselowski, 2020].

Rumour Detection. This is the task of identifying unverified reports circulating on social media. Predictions are typically made based on language subjectivity and social media metadata. Despite the relevance of these features, the truth value of a claim does not directly depend on them.

Clickbait Detection. Being considerably different from automated fact-checking, clickbait detection does not require external evidence. Approaches with relatively shallow linguistic features [Chakraborty et al., 2016, Chen et al., 2015, Potthast et al., 2016] have yielded reasonable performance.

Commonsense Reasoning. To perform commonsense reasoning [Storks et al., 2019a], a model needs to be able to reason beyond the explicit information given in sentence pairs, which is highly valued in automated fact-checking [Thorne and Vlachos, 2018]. As a new frontier of artificial intelligence, novel studies have investigated learned knowledge in pre-trained language models, commonsense integration from external knowledge bases, symbolic knowledge incorporation, etc. However, these directions are still under investigation and the field calls for major breakthroughs. For more information, we refer to a recent survey [Storks et al., 2019b] and a tutorial [Sap et al., 2020].

In this paper, we present a survey on automated fact-checking with special focus on claim detection and claim validation. Substantial progress has been made by applying large pre-trained language models through designed pipelines, but numerous open challenges still need further research. Claim detection faces challenges around its conceptual definition, narrow domains, annotation issues and imbalanced datasets. In addition, improvements in dataset quality, system integrity and model interpretability are desired for claim validation.

References

Dagan, I., Dolan, B., Ferro, L., and Giampiccolo, D. (2006)
Farasa: A fast and furious segmenter for Arabic
Deep learning methods for knowledge base population
Where is your evidence: Improving fact-checking by justification modeling
Checkworthiness in automatic claim detection models: Definitions and analysis of datasets
Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims
Supervised learning of universal sentence representations from natural language inference data
Recognizing textual entailment: Rational, evaluation and approaches
The PASCAL Recognising Textual Entailment Challenge
Finding contradictions in text
An Inference Model for Semantic Entailment in Natural Language
Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours
SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours
BERT: Pre-training of deep bidirectional transformers for language understanding
CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims
Machine reasoning: Technology, dilemma and future
Designing a symbolic intermediate representation for neural surface realization
Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims
Exploring end-to-end differentiable natural logic modeling
Emergent: a novel data-set for stance classification
Fast & furious fact check challenge
A context-aware approach for detecting worth-checking claims in political debates
The Fourth PASCAL Recognizing Textual Entailment Challenge
The third PASCAL recognizing textual entailment challenge
A Machine-Learning-Based Pipeline Approach to Automated Fact-Checking
A richly annotated corpus for different tasks in automated fact-checking
bigIR at CheckThat! 2020: Multilingual BERT for Ranking Arabic Tweets by Check-worthiness
Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media
Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster
Detecting check-worthy factual claims in presidential debates
Claimbuster: The first-ever end-to-end fact-checking system
ClaimBuster: the first-ever end-to-end fact-checking system
Team SWEEPer: Joint sentence extraction and fact checking with pointer networks
Harnessing deep neural networks with logic rules
ClaimRank: Detecting check-worthy claims in Arabic and English
TrClaim-19: The first collection for Turkish check-worthy claim detection with annotator rationales
Toward automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection
ALBERT: A lite BERT for self-supervised learning of language representations
A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification
Multi-task deep neural networks for natural language understanding
An extended model of natural logic
Team papelo: Transformer networks at FEVER
DBpedia: A multilingual cross-domain knowledge base
Real-time population of knowledge bases: Opportunities and challenges
The CLEF-2021 CheckThat! Lab on detecting check-worthy claims, previously fact-checked claims, and fake news
Combining fact extraction and verification with neural semantic matching networks
UNIPI-NLE at CheckThat! 2020: Approaching Fact Checking from a Sentence Similarity Perspective Through the Lens of Transformers
YAGO 4: A Reason-able Knowledge Base
The fake news challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news
Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media
Clickbait Detection
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
COVID-Fact: Fact Extraction and Verification of Real-World Claims on COVID-19 Pandemic
Automated fact-checking of claims from Wikipedia
Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence
Learning explainable linguistic expressions with neural inductive logic programming for sentence classification
That is a known lie: Detecting previously fact-checked claims
Overview of CheckThat! 2020 English: Automatic Identification and Verification of Claims in Social Media
Overview of checkthat! 2020 english: Automatic identification and verification of claims in social media
Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs. Knowledge-Based Systems
Fake news detection on social media: A data mining perspective
Commonsense reasoning for natural language understanding: A survey of benchmarks, resources, and approaches
Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches
Unsupervised Discovery of Corroborative Paths for Fact Validation
COGEX at RTE 3
Automated fact checking: Task formulations, methods and future directions
FEVER: a large-scale dataset for fact extraction and VERification
The fact extraction and VERification (FEVER) shared task
Fact checking: Task definition and dataset construction
Fact or fiction: Verifying scientific claims
"liar, liar pants on fire": A new benchmark dataset for fake news detection
Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models
Claim check-worthiness detection as positive unlabelled learning
A machine learning approach to textual entailment recognition
QMUL-SDS at SCIVER: Step-by-step binary classification for scientific claim verification

Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council (grant EP/V048597/1). Xia Zeng is funded by the China Scholarship Council (CSC). Amani S. Abumansour holds a scholarship from Taif University, Saudi Arabia.