Fake News Data Collection and Classification: Iterative Query Selection for Opaque Search Engines with Pseudo Relevance Feedback

Aviad Elyashar, Maor Reuben, Rami Puzis

December 23, 2020 (arXiv:2012.12498v1 [cs.IR])

Abstract: Retrieving information from an online search engine is the first and most important step in many data mining tasks. Most of the search engines currently available on the web, including all social media platforms, are black-boxes (a.k.a. opaque) supporting short keyword queries. In these settings, retrieving all posts and comments discussing a particular news item automatically and at a large scale is a challenging task. In this paper, we propose a method for generating short keyword queries given a prototype document. The proposed algorithm interacts with the opaque search engine to iteratively improve the query. It is evaluated on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets, showing superior performance compared to the state of the art, and is applied to automatically collect a large-scale dataset for training machine learning classifiers for fake news detection. The classifiers trained on 70,000 labeled news items and more than 61 million associated tweets, automatically collected using the proposed method, obtained impressive performance: an AUC of 0.92 and an accuracy of 0.86.

Every day, millions of people search for information online [1]. Market researchers search for products related to their product or business [2]. Researchers reviewing the academic literature search for works related to their current article [3]. Posts and comments that discuss a news item are retrieved from online social media (OSM) for fake news detection [4]. There are many similar cases in which additional information related to a specific document is required. We will refer to such a document as a prototype. There are multiple methods for retrieving, from a corpus, a set of documents that are similar to a given prototype. Most of these methods represent documents as vectors and calculate the similarity between the prototype and other documents. The basic methods are based on TF-IDF (term frequency-inverse document frequency) and BM25, which treat each document as a bag-of-words [5]. Advanced methods, such as Doc2Vec [6], Skip-thoughts [7], Sent2Vec [8], and others, use neural networks to represent documents as low-dimensional vectors [5]. Once vector representations of documents are available, retrieving the documents that are most similar to the prototype, i.e., closest to it in the embedding space, is straightforward. Retrieval methods that are based on document similarity assume access to the corpus being searched. This assumption is valid for transparent search engines, where the repository and the algorithms are known to the user. However, all the popular search engines, whether general purpose, like Google, or platform specific, such as Twitter search, are opaque, providing very little information about their repositories and algorithms [9]. Other than Google's image search, current search engines do not provide document-based search services. Therefore, we usually resort to the short keyword queries that are a mainstay of anyone using today's search engines [10]. Due to the ambiguity of short keyword queries, they often do not reflect the original intention of the query writer [11].
For example, the keyword "apple" may refer to the fruit or to the technology company. Google and some other engines use the search context to resolve such ambiguity and retrieve the documents that are most relevant for a specific user or case [12]; other engines, such as Twitter, do not use contextual information and retrieve all results exactly matching the specified keywords [13]. Personalized and context-aware information retrieval is out of the scope of this paper. In this paper, we focus on the problem of retrieving the documents that are most similar to a given prototype document from an opaque search engine supporting short keyword queries, such as Twitter. It is possible to manually formulate a search query from the document's content [4], but of course, manual query selection does not scale. We can use the prototype document's title to generate queries [14], or search for its URL if the prototype is a web page [15]. However, these approaches miss many relevant results. We further discuss the pros and cons of existing query selection methods in Section 2. Therefore, in this paper, we suggest a novel iterative approach that selects queries that maximize the number of retrieved relevant results, using limited interaction with an opaque search engine. This approach consists of two components: the iterative query selection (IQS) algorithm and the mean relevance error (MRE) measure. The IQS is a hill climbing algorithm that iteratively optimizes short keyword queries given a prototype document. The MRE is used as pseudo relevance feedback, ranking the results of incumbent queries generated by the IQS algorithm according to their relevance to the prototype document. The details of the proposed method are discussed in Section 3. In the absence of a prototype document, incumbent results are compared to a set of relevant results according to user relevance feedback. We evaluated the proposed methods on the Twitter TREC Microblog 2012 benchmark and the TREC-COVID 2019 dataset assuming relevance feedback (see Section 3.2). In addition, we evaluated the IQS and MRE in pseudo relevance feedback settings using a manually labeled fake news Twitter dataset (see Section 5.2) and retrieved a large-scale fake news dataset from Twitter to be used for training fake news classifiers (see Section 5). The contributions of this paper are:
• We discuss the mean relevance error (MRE) (see Section 3.1), a measure that can be used for pseudo relevance feedback when a prototype document is provided.
• We propose an automated mechanism for optimizing short keyword queries, termed iterative query selection (IQS) (see Section 3.2). The IQS outperforms existing opaque relevance feedback search methods on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets (see Section 4.3).
• We present two fake news datasets, publicly available as collections of Twitter IDs at https://bit.ly/2vd58u6: (1) 20 news items and 1,076 associated tweets, all of which are manually curated (see Section 5.2); and (2) 70,018 news items and approximately 61 million associated tweets automatically collected using the proposed IQS method (see Section 5.3.1).
• We demonstrate the quality of the automatically retrieved fake news dataset through the fake news detection task using well-known supervised machine learning, with an AUC (area under the receiver operating characteristic curve) of 0.92 and an accuracy of 0.86 (see Section 5.3.2).
This paper describes a new approach for selecting the best queries when engaging with opaque search engines. In the next sections, we elaborate on the methods associated with query selection. Then, to demonstrate the fake news detection use case, we provide the necessary background for this domain. Query selection is the task of selecting the most suitable queries for extracting relevant documents from web search engines [16]. In most cases, selecting these queries requires reformulation or expansion of an initial query.
Several methods analyze the underlying corpus of the given search engine and use this valuable information for expanding queries. Dwaipayan et al. [17] selected the terms most similar to a given query for expansion, based on word embeddings trained on the corpus. Their idea was to choose terms that yielded the highest probability of being related to the current query. Kuzi et al. [18], like Dwaipayan et al., used word embeddings trained on the corpus to select expansion terms and suggested centroid- and fusion-based term scoring methods to select them. Xu et al. [19] selected candidate terms for expansion based on context features, such as TF-IDF and co-occurrence of the query terms. Afterward, they used learned term-ranking models to rank the candidate terms. Pang and Du [20] utilized click-through data of old queries to choose terms for expansion and to decide which terms from the initial query to reformulate. All these approaches require knowledge about the underlying corpus of the search engine and are therefore not suitable for opaque search engines. Also, their starting point requires an initial query and initial results that are utilized to expand or reformulate terms. Thus, it is not possible to use a prototype document in these approaches. Since most search engines are opaque, some query selection methods expand or reformulate queries without knowing which documents exist in the underlying corpus. For example, Li et al. [21] implemented a double-loop retrieval system titled ReQ-ReC that consists of a query selector and a classifier that uses relevance feedback to improve the results iteratively. The query selection mechanism suggests queries that return relevant results, and the classifier ranks the obtained results according to their relevance. Another retrieval system that uses relevance feedback was presented by Makki et al. [22]. This system, titled ATR-Vis, is a user-driven visual approach for retrieving content from Twitter. ATR-Vis proposes four active learning strategies (ambiguous retrieval, near-duplicates, leveraging hashtags, and leveraging replies) to decrease user involvement in the process. Zamani et al. [23] treated the task of query expansion as a recommendation task and used matrix factorization to recommend expansion terms. Another approach for query reformulation was introduced by Al-Khateeb et al. [24], where the initial query is reformulated using a genetic algorithm search. The synonyms of the query terms are candidates for the reformulation, and the fitness function is based on the similarity between the query and the results. Nogueira and Cho [25] presented a neural network architecture that reformulates a query to optimize the number of relevant documents returned. Chy et al. [26] proposed a query expansion method that selects effective expansion terms using a learning-to-rank method.
The drawback of this method is that it works in a supervised manner, which requires training a classifier that ranks terms for the specific search engine before the query selection. Albishre et al. [27] suggested a system that uses pseudo-relevance feedback and a topic-based query expansion method. ALMIK is another active retrieval method that tries to achieve both high precision and high recall in collecting event-related tweets [28]. ALMIK is composed of a keyword expansion component and an event-related tweet identification component. In each iteration, the event-related tweet identification component asks the user to label the most uncertain instances. Then, the keyword expansion component selects the K keywords with the highest score for keyword expansion. The keyword score is based on the keyword's relevance, coverage, and evolvement in the event. These approaches, except ATR-Vis, require an initial query, as well as initial retrieved results, for selecting a query. The drawback of ATR-Vis is that it is implemented specifically for the Twitter search engine and requires users to label the retrieved results (relevance feedback). In contrast, our proposed method uses a pseudo-relevance feedback process (no user interaction required) and can be used with any search engine. Over the past years, various solutions have been suggested for estimating the semantic similarity between documents based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge, and distributional semantics. In 2015, Kusner et al. [29] proposed the word mover's distance (WMD), which measures the dissimilarity between two documents as the minimal sum of distances that the word vectors of one document need to "move" to reach the word vectors of the other document. They evaluated their metric in the context of KNN classification on eight benchmark document categorization tasks and concluded that the use of WMD led to the lowest error rates in comparison to other methods. We use the concept of WMD in the proposed mean relevance error (MRE) measure; however, we extend this metric to deal with a collection of returned documents associated with a prototype document instead of a single document as described in [29]. In the same year, Kenter and de Rijke [30] extracted multiple types of meta-features from the comparison of word embeddings for short text pairs. The features representing labeled short text pairs were used to train a supervised learning classifier. Later, they used the trained model to predict the semantic similarity of new, unlabeled pairs of short texts. The dual embedding space model (DESM) proposed by Nalisnick et al. [31] calculates the average cosine distance between each term in the query and the centroid representation of each document using pre-trained word embeddings. The centroid representation of each document is the mean vector of the document's words. In 2016, Guo et al. [32] proposed a deep relevance matching model (DRMM) for determining the relevance of a document given a particular query. The proposed model employs a joint deep architecture at the query term level for relevance matching. Using matching histogram mapping, a feed-forward matching network, and a term gating network, they created a machine learning model. They evaluated their model on two benchmark collections and showed that the DRMM model outperformed well-known baseline retrieval models.
In 2017, Mitra et al. [33] suggested a document ranking model composed of two separate deep neural networks, where the first network matches the query and the document using a local representation, and the second network matches the query and the document using learned distributed representations. The two networks are then jointly trained as part of a single neural network. Mitra et al. showed that this combination performed better than either neural network individually on a web page ranking task and significantly outperformed traditional baselines and other recently proposed models based on neural networks. In many cases, queries for search engines can be considered ambiguous to some extent. Therefore, in order to tackle query ambiguity, search result diversification approaches have been proposed to produce rankings that satisfy the multiple possible information needs underlying a query [34]. In most cases, the diversification of retrieved results implies a trade-off between having more relevant results, which reflect the true intent of the user, and having less redundancy [35]. There are two prominent diversification approaches: implicit and explicit. The implicit approaches assume that similar documents cover similar interpretations or aspects associated with the query and should hence be dismissed in order to achieve a diversified ranking. In particular, an implicit aspect representation relies on features that belong to each document in order to model different aspects, such as the terms contained in the retrieved documents [36], the clicks they received [37], their topic models [38], or clusters [39]. With respect to explicit approaches, the broad topic associated with an ambiguous query can usually be decomposed into its constituent sub-topics. Therefore, we can explicitly search for different aspects of the query to produce a diverse ranking of results. In most cases, explicit approaches rely on features derived from the query itself as candidate aspects, such as different query categories [40] or query reformulations [41]. In this paper, we increase the diversification of the returned documents using two query expansion methods: adding synonyms based on WordNet, and adding the k closest words in the embedding space for each candidate word. Fake news is a long-lasting problem that has drawn significant attention in recent years. It has spread widely on online social media (OSM) [42]. Since the detection of fake news is very challenging, many researchers have suggested different approaches to confronting this issue. Many of them were based on natural language processing [43], investigating the diffusion of news [15], etc. Also, a few papers have attempted to detect fake news solely using social context features [44]. In order to train supervised classifiers for fake news detection, a ground truth dataset containing labeled news items is required. Such news items can be collected from fact checking websites, such as Snopes, PolitiFact, FactCheck, and others [15, 45]. There are two commonly used methods for collecting relevant posts associated with a given claim. The first method is to retrieve posts based on the sources that distributed the claims. For example, Monti et al. [14] used the sources' headlines that exist in fact-checking websites to collect tweets. Vosoughi et al. [15] investigated the diffusion of news based on collected tweets that contained links to the given claims.
However, collecting tweets based on sources may be incomplete, since many posts are associated with a given claim but do not contain a link to the claim's source. Moreover, URL shortening, quotation, and cross references, common in the press as well as among bloggers, lead to a situation where tweets mentioning the same news contain links to different sources. Therefore, collecting tweets based only on links will result in a subset of the tweets relevant to a claim. In addition, the sources' headlines do not always reflect the claim's content well (e.g., in the case of clickbait), which can lead to irrelevant results. These drawbacks limit the ability to collect a quality dataset that contains enough relevant data for accurate classification. The second method people use to collect relevant posts is manual query selection. For example, Zhou et al. [4] demonstrated a real-time news certification system on Sina Weibo using queries provided by the user to gather related posts. Then, they built an ensemble model that combined user-based, propagation-based, and content-based features and evaluated the proposed model on a small dataset of 146 claims. Jin et al. [46] and Wang et al. [47] developed neural network-based methods for fake news detection; to evaluate their proposed methods, they both used two small datasets from Sina Weibo (40k tweets) and Twitter (15k tweets on 52 rumor-related events). Those datasets were created using manual query selection. Selecting queries manually for a large collection of claims requires a lot of human effort and limits the amount of collected data. Due to the limitations of both methods described above, it is clear that fake news detection based on the OSM can benefit from a tool that can automatically select accurate short keyword queries for a given claim (i.e., a news item). In this study, we demonstrate the usefulness of the proposed iterative query selection (IQS) method to automatically retrieve a large-scale fake news dataset from Twitter, as well as to train fake news classifiers using social context features extracted from the tweets. In this paper, we propose a novel iterative approach for optimizing short keyword queries given a prototype document through interaction with an opaque search engine. First, we describe the mean relevance error (MRE), a simple measure that estimates the relevance of the results retrieved from an opaque search engine with respect to a given prototype document. This measure is calculated by summing the shortest distances between words in the given prototype document and words in the retrieved results (see Section 3.1). The lower the MRE, the more relevant the retrieved results are. Second, we outline the iterative query selection (IQS) algorithm for finding queries that retrieve results with the lowest MRE score (see Section 3.2). The mean relevance error (MRE) is designed with the purpose of retrieving microblog entries (e.g., tweets) relevant to a prototype document, such as a news item. This measure estimates the minimal distance between word vector representations of the words in a retrieved result and those in the prototype document. The MRE is a special case of the word mover's distance (WMD) suggested by Kusner et al. [29], since it is the mean WMD score over an arbitrary number of results. The intuition is that documents that are close in their semantic space probably discuss the same topic.
Let d denote the prototype document and r denote a short document retrieved using a search engine. The MRE works best when r is shorter than d. The prototype d may be more general than r, for example, covering multiple topics that are not mentioned in r. The MRE relies on semantic word embedding, also known as a low-dimensional vector representation of words. Intuitively, the MRE quantifies the similarity between the words used in the prototype document and in a retrieved result. There are many word embedding methods, such as GloVe [48], Word2Vec [49], fastText [50], etc. We can use any word embedding method where words with a similar meaning are embedded close to each other. Let V be the vocabulary of words embedded in an n-dimensional space using one of the word embedding approaches. Let dist(w_i, w_j) denote the cosine distance between the vector representations of two words (w_i, w_j \in R^n). The cosine distance is defined as 1 - cosineSim(w_i, w_j); thus, the cosine distance ranges from 0 to 2. Let W_d = \{w_{d_1}, w_{d_2}, \ldots, w_{d_l}\} be the set of word vectors in d and W_r = \{w_{r_1}, w_{r_2}, \ldots, w_{r_k}\} be the set of word vectors in r. Neither W_d nor W_r contains stop words. The distance between a word w and a document d is the minimal distance between w and all the words in W_d (see Equation 1):

dist(w, d) = \min_{w_j \in W_d} dist(w, w_j)    (1)

This distance reflects the relevance of a word w_{r_i} in the result document r to the prototype document: the smaller the distance, the higher the relevance of the word w_{r_i} to the prototype document d. Given a result document r, let the relevance error (RE) of r with respect to the prototype document d be the average distance of all words w_{r_i} \in W_r to the document d (see Equation 2):

RE(r, d) = \frac{1}{|W_r|} \sum_{w_{r_i} \in W_r} dist(w_{r_i}, d)    (2)

where |W_r| is the number of words in W_r, excluding stop words. It is important to mention that stop word removal does not impact the rationale of the proposed method, since mutual stop words in the result and prototype documents do not indicate that the two documents are semantically similar. Note that, although the RE is a function defined on pairs of documents, it is not a distance metric. The RE is not symmetric, and RE(r, d) = 0 does not mean that r and d are equal in any sense. Rather, the RE is similar to a fuzzy version of set inclusion (\subseteq), where RE(r, d) = 0 \implies W_r \subseteq W_d. If r contains only words in d or their synonyms, RE(r, d) will be close to zero. Up to this point, the RE measure is identical to the word mover's distance presented by Kusner et al. [29]. In the final step, we define the MRE measure to estimate the relevance of the results retrieved from an opaque search engine to a given prototype document. Let R be a set of short documents retrieved from a search engine. We define the MRE as the mean RE of all results r \in R with respect to the prototype d (see Equation 3):

MRE(R, d) = \frac{1}{|R|} \sum_{r \in R} RE(r, d)    (3)

The MRE outputs a score between 0 and 2. The lower the MRE score, the more relevant the retrieved results R are to the prototype document d. The MRE defined above is designed to measure only one aspect of query performance: the relevance of the results. Other important aspects, for example, the number of results, are intentionally not captured by the MRE. The quality of the MRE is affected by the quality of the underlying word embedding model. For general purpose query evaluation, it is recommended to use word embedding models trained on large, non-domain-specific datasets.
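For concreteness, Equations 1-3 can be implemented in a few lines of Python. The following is a minimal sketch of ours, not the authors' released code; the helper names and the representation of each document as a list of embedding vectors for its non-stop-word tokens are our assumptions:

    import numpy as np

    def cosine_distance(u, v):
        # dist(w_i, w_j) = 1 - cosineSim(w_i, w_j); ranges from 0 to 2
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def word_to_doc_distance(w_vec, doc_vecs):
        # Equation 1: minimal distance between a word and all words of a document
        return min(cosine_distance(w_vec, d_vec) for d_vec in doc_vecs)

    def relevance_error(result_vecs, proto_vecs):
        # Equation 2: average distance of the result's (non-stop) words to the prototype
        return sum(word_to_doc_distance(w, proto_vecs) for w in result_vecs) / len(result_vecs)

    def mre(results_vecs, proto_vecs):
        # Equation 3: mean RE over all retrieved results
        return sum(relevance_error(r, proto_vecs) for r in results_vecs) / len(results_vecs)

In practice, the vectors would be obtained by looking up each non-stop-word token in a pre-trained embedding model (e.g., fastText), skipping out-of-vocabulary tokens.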
The proposed iterative query selection (IQS) method is based on a local search algorithm, which selects the queries that maximize the relevance of the corresponding results retrieved from an opaque search engine. We use the hill climbing algorithm [51], since querying the search engine is resource intensive and we need to find a local optimum within only a few iterations. Let d be a prototype document and W_d be the set of words in d, as in the previous subsection. W_d does not contain stop words. Best practice (and our preliminary experiments) suggests that W_d should contain only nouns, adjectives, verbs, or numbers. In addition, named entities, e.g., "Michael Jordan," are considered a single term if they are found in the vocabulary of the word embedding approach used as the basis for the MRE. We advise against stemming for query selection. Let V_d denote the vocabulary of terms from which possible queries q \subseteq V_d are selected. V_d may be equal to W_d or expanded using any query expansion approach. We consider two query expansion methods: (1) adding synonyms based on WordNet [52] for each word in W_d, later referred to as Synonyms; and (2) adding the k closest words in the embedding space for each candidate word in W_d, later referred to as KNN. The IQS searches through the space of possible queries q \subseteq V_d. It starts with a random subset of words from V_d. For efficiency, the query size is limited by two control variables, minq and maxq, which are the minimal and maximal number of words in a query. In every iteration, we randomly modify the query using one of the following three actions: ADDWORD(q, V_d) randomly adds to q a word from V_d that is not yet in q; REMOVEWORD(q, V_d) removes a random word from the query q, decreasing its size; SWAPWORDS(q, V_d) exchanges a random word in q with a random word in V_d that was not already in q. Possible actions are chosen to ensure the query size constraints. After modifying the query q using one of the three actions, we evaluate the MRE of its results R_q from the search engine se. Due to computational and network performance considerations, it is important to limit the number of results retrieved from se in each iteration of the algorithm. Usually, this limit, further referred to as rlimit (rlimit >= |R_q|), is defined by the search engine interface and is set to the number of results on a single page. The larger the rlimit, the more accurate MRE(R_q, d) will be in each iteration of the IQS. However, the rlimit is also the primary factor (linearly) affecting the time of an iteration. The hill climbing IQS algorithm is implemented as described in Algorithm 1. It receives as input a prototype document d, an opaque search engine se, the maximal and minimal number of words in a query (maxq and minq, respectively), the maximal number of iterations itr, and the number of result documents (rlimit) retrieved from se in each iteration. During the algorithm, we keep only queries that decrease the MRE score. If a query returns no results, we assume MRE(R_q, d) to be the maximal score of 2, and we do not add a word in the next iteration. The IQS returns an ordered set of queries. Some search engines allow words from the query to be missing in the results, while others retrieve only documents containing all the keywords in the query. Twitter is an example of the latter, a boolean search engine. In the case of a boolean search engine, it is important to run multiple slightly modified queries in order to retrieve as many relevant results as possible. This is the main reason the IQS returns a list of queries and not only a single best query.

Algorithm 1: Iterative Query Selection (IQS). Input: d, se, minq, maxq, itr, rlimit.
1: queries <- empty ordered set
2: q_best <- random subset of V_d
3: R_{q_best} <- se(q_best, rlimit)
4: calculate MRE(R_{q_best}, d)
5: q_new <- q_best
6: loop itr times:
7:     action <- random(actions)
8:     q_new <- action(q_best, V_d)
9:     R_{q_new} <- se(q_new, rlimit)
10:    if MRE(R_{q_new}, d) < MRE(R_{q_best}, d) then
11:        q_best <- q_new
12:        queries.add(q_new, MRE(R_{q_new}, d))
13: return queries
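A runnable Python sketch of Algorithm 1 against a simulated boolean engine follows. It reuses the mre function from the previous sketch; boolean_search, mutate, and the wiring of the se callback are illustrative assumptions of ours and not the paper's implementation (which, for example, also avoids ADDWORD immediately after an empty result set):

    import random

    def boolean_search(corpus_tokens, query, rlimit=20):
        # Simulated opaque boolean engine (Twitter-style AND semantics):
        # a document is returned only if it contains every query word.
        q = set(query)
        hits = [tokens for tokens in corpus_tokens if q <= set(tokens)]
        return hits[:rlimit]  # only the first "page" of results is visible

    def mutate(q, vocab, minq, maxq):
        # Apply one random action (ADDWORD, REMOVEWORD, or SWAPWORDS),
        # respecting the minq/maxq query size constraints.
        q, outside = set(q), sorted(vocab - q)
        actions = []
        if q and outside:
            actions.append("swap")
        if len(q) < maxq and outside:
            actions.append("add")
        if len(q) > minq:
            actions.append("remove")
        action = random.choice(actions)
        if action in ("remove", "swap"):
            q.remove(random.choice(sorted(q)))
        if action in ("add", "swap"):
            q.add(random.choice(outside))
        return q

    def query_mre(query, proto_vecs, se, embed, rlimit):
        results = se(query, rlimit)
        vecs = [[embed[w] for w in r if w in embed] for r in results]
        vecs = [r for r in vecs if r]
        return mre(vecs, proto_vecs) if vecs else 2.0  # no results -> maximal MRE of 2

    def iqs(proto_vecs, vocab, se, embed, itr=15, minq=1, maxq=6, rlimit=20):
        # Hill climbing over the query space; keeps only MRE-decreasing queries.
        # Example wiring: se = lambda q, lim: boolean_search(tokenized_corpus, q, lim)
        q_best = set(random.sample(sorted(vocab),
                                   random.randint(minq, min(maxq, len(vocab)))))
        best = query_mre(q_best, proto_vecs, se, embed, rlimit)
        queries = [(q_best, best)]
        for _ in range(itr):
            q_new = mutate(q_best, vocab, minq, maxq)
            score = query_mre(q_new, proto_vecs, se, embed, rlimit)
            if score < best:
                q_best, best = q_new, score
                queries.append((q_best, best))
        return queries  # ordered list of improving (query, MRE) pairs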
In this section, we conduct a series of experiments that evaluate our iterative query selection (IQS) method. Since the IQS requires a loss function that evaluates each generated query, we first measure the ability of the MRE measure to distinguish between relevant and irrelevant results (see Section 4.2). Then, we examine how the MRE's performance correlates with the informativeness of the prototype document. After evaluating the MRE, we examine the performance of the IQS using the MRE as an active retrieval method for an opaque search engine (see Section 4.3). Finally, we evaluate the IQS's ability to find relevant queries for a prototype document on an opaque search engine without the user's help. In the following experiments, we used the Twitter TREC Microblog 2012 dataset and the TREC-COVID 2019 dataset. The Twitter TREC Microblog 2012 dataset consists of 59 topics (used as initial queries) and 73K judgments (relevant and irrelevant tweets) for those topics [53]. The TREC-COVID 2019 dataset consists of 35 topics and 20.7K judgments. In this dataset, the topics also contain an initial query, a question, and a search narrative. The purpose of the MRE measure is to rank documents' relevance with respect to a prototype document. However, the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets include topic definitions (initial queries) that cannot be used as prototype documents, due to their rather short length. Therefore, we iteratively construct such prototype documents for each topic using a relevance feedback process. In addition, in this experiment, we examine the correlation between the informativeness of the prototype document and the relevance of the results. Note that in the case of TREC-COVID 2019, we can use the search narrative as a prototype document, which we discuss in Section 4.4. The constructed prototype document should reflect the topic being searched. We use the following general process to iteratively build a prototype document for each topic and improve the results retrieved using the RE. First, we use the initial topic definition as the prototype document. We calculate the RE between the topic definition and each tweet in the dataset. Although the topic definition is too short to qualify as a good prototype document, some relevant tweets can be found using the MRE. Second, we retrieve the top k results and request relevance feedback from a user (or from an oracle, if ground truth is provided for evaluation purposes). Next, we expand the prototype document using the content of the relevant retrieved results and run the second step again. It is important to note that the relevance feedback should be saved to avoid labeling the same result multiple times. The process stops after n labeled results for each topic in the dataset (or when the user is satisfied with the results). In this experiment, we set the top k results to 10 and n to 300. For the TREC Microblog 2012 dataset, we discard query 76, since it does not contain any relevant judgments. As baselines, we use the following: Okapi BM25 [54], latent semantic analysis (LSA) [55], and TF-IDF. Also, we use the dual embedding space model (DESM), using the same pre-trained word embeddings used for the RE [31]. In this comparison, we use only unsupervised document similarity measures, since our goal is for the final IQS to run on any search engine without any training.
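The prototype-construction loop described above can be sketched as follows, reusing relevance_error from the earlier MRE sketch; the oracle callback, the tokenize helper, and the dictionary-shaped corpus are hypothetical stand-ins for the user (or ground-truth) judgments and the dataset:

    def build_prototype(topic_text, corpus_texts, oracle, embed, tokenize, k=10, n=300):
        # Iteratively expand a prototype document via relevance feedback.
        prototype, labeled = topic_text, {}
        while len(labeled) < n:
            proto_vecs = [embed[w] for w in tokenize(prototype) if w in embed]
            scores = {}
            for tid, text in corpus_texts.items():
                if tid in labeled:
                    continue  # never request the same label twice
                vecs = [embed[w] for w in tokenize(text) if w in embed]
                if vecs:
                    scores[tid] = relevance_error(vecs, proto_vecs)
            if not scores:
                break  # corpus exhausted
            for tid in sorted(scores, key=scores.get)[:k]:  # lowest RE = most relevant
                labeled[tid] = oracle(tid)  # user judgment, or ground truth in evaluation
                if labeled[tid]:
                    prototype += " " + corpus_texts[tid]  # grow the prototype
        return prototype, labeled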
The mean average precision (MAP) and R-precision results are summarized in Table 1. As can be seen, the MRE outperforms the other methods evaluated on both datasets in terms of MAP and R-precision. These results emphasize the effectiveness of the MRE as a good indicator for distinguishing between relevant and irrelevant documents. This finding strengthens the case that pre-trained word vectors are very useful for detecting similar words in two documents. This calculation is automatic and does not require a cold start, unlike learning-to-rank measures, which require the user to train the ranking measure on the search engine. Another important aspect we want to examine is the effect of the prototype document's informativeness on the MAP score of each relevance measure. This aspect is important because it examines whether the relevance measure estimates the results' relevance according to the prototype or not. The results are presented in Figure 1. The trends in the results show how the MRE better utilizes the information found in the prototype document to rank the results. This finding indicates that the MRE is the best candidate as a loss function for our query selection method. In this experiment, we evaluate the full iterative query selection (IQS) pipeline based on a process that mimics interaction with Twitter's search engine. Twitter uses a boolean retrieval model, meaning that the returned results must contain all the words in the query. We assume an opaque search engine (se) and access the corpus through the boolean search only. We used a boolean retrieval model such as Twitter's, since the IQS's final purpose is to retrieve relevant data for a fake news classification task from the Twitter search engine. Similarly to the previous experiment, we dynamically construct a prototype document for each topic using relevance feedback. First, we use the topic definition as the prototype document. Second, we run the first iteration of the IQS, calculating the MRE score between the prototype document and the retrieved results. Third, before proceeding to the next iteration of the IQS, we sort the retrieved results in ascending order according to their MRE score (a lower score means greater relevance to the topic). Fourth, we take the top k = 10 results and ask a user (or an oracle) to label them. Fifth, we expand the prototype document using the content of the relevant results identified by the user (or the oracle) and proceed to the next iteration of the IQS. The stopping condition is the same as in the previous experiment: labeling n = 300 tweets. We set the minimal and maximal number of words in a query to 1 and 6, respectively (minq = 1, maxq = 6). This parameter directly influences the number of retrieved results. For a query containing a single word, many results are expected to be retrieved; however, most of them are not directly relevant to the prototype document. For example, assume that the prototype document is "Crude oil production in the U.S." and the query contains only the word oil. In this case, the Twitter search engine is expected to retrieve many tweets that include this word in their text but are not directly related to the oil industry in the U.S.
(for example, tweets that focus on oil painting or oil production in Russia). In the same manner, a high number of words decreases the number of retrieved results, but most of these tweets are expected to be relevant to the given prototype document. Lastly, we set the number of returned results to 20 (rlimit = 20) to simulate the number of documents retrieved per page by a standard web search engine. In order to choose the best hyper-parameters for the IQS algorithm, we tested the following ranges: a maximal number of iterations between 10 and 45, a number of runs between 1 and 3, and a number of output queries between 5 and 50. For our final evaluation, we used the hyper-parameter configuration that yielded the best results on both datasets. The best parameters found for the IQS are itr = 15, runs = 3, minq = 1, maxq = 6, and numQueries = 40. We compared the performance of the IQS to the ReQ-ReC implementation available on GitHub, with the same settings: the top 10 documents are labeled by the user (k = 10), the algorithm stops after 300 labels for each topic (n = 300), and a boolean search engine is used. In addition, we ran the ALMIK method, a state-of-the-art active retrieval method proposed by Zheng and Sun [28], on the TREC Microblog 2012 and TREC-COVID 2019 datasets. We implemented the ALMIK method based on the description presented in their paper. Again, we limited the number of label requests from the user to 300 and used the same search engine mechanism. We also conducted hyper-parameter tuning for ALMIK in order to achieve the best results. The best results were achieved when ALMIK conducted 3 rounds of active learning phases. Between the phases, we conducted a keyword expansion phase and used the new results in the next round. In each active learning phase, we conducted 10 iterations of 10 label requests for the most uncertain tweets from the user (a total of 100 in each active learning phase). We report the performance of the IQS, ReQ-ReC, and ALMIK over all topics in both datasets in Table 2. The proposed IQS method outperformed ReQ-ReC and ALMIK in terms of MAP and R-precision on both datasets. The results show that the IQS can retrieve more relevant results from a boolean opaque search engine using relevance feedback, given a short initial query. In the above experiments, we used the vanilla IQS without any additional query expansion methods. Next, we evaluated the IQS with a query expansion based on WordNet, denoted as IQS+Synonyms, and the IQS with a query expansion based on the k nearest neighbors in the embedding space, denoted as IQS+KNN. In addition, we evaluated both query expansion methods together, denoted as IQS+Synonyms+KNN. All the IQS variations were run using the best hyper-parameters found (itr = 15, runs = 3). The MAP of the IQS on the Twitter TREC Microblog 2012 dataset over 40 relevance feedback iterations is presented in Figure 2. All the IQS variations perform the same as or worse than the vanilla implementation. Since we use a local search algorithm for query reformulation, it is expected that when there are more candidate terms, the algorithm requires more search iterations to find the most suitable terms. Since we want to keep a low number of interactions with the search engine, the vanilla IQS is the best option. Note that the query expansion is applied to the prototype document terms and not to each query.
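For reference, the two expansion strategies (Synonyms and KNN, Section 3.2) can be sketched as follows; this is our illustration, assuming NLTK's WordNet interface and a gensim KeyedVectors model, neither of which the paper names explicitly:

    from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

    def expand_synonyms(words):
        # "Synonyms" expansion: add WordNet lemmas for each candidate word
        expanded = set(words)
        for w in words:
            for syn in wordnet.synsets(w):
                expanded.update(lemma.name().replace("_", " ") for lemma in syn.lemmas())
        return expanded

    def expand_knn(words, keyed_vectors, k=5):
        # "KNN" expansion: add the k nearest neighbors in the embedding space
        expanded = set(words)
        for w in words:
            if w in keyed_vectors:
                expanded.update(nb for nb, _ in keyed_vectors.most_similar(w, topn=k))
        return expanded

Both functions return an expanded vocabulary V_d over which the IQS then searches; as reported above, the larger candidate set requires more search iterations, which is why the vanilla IQS performed best in our budgeted setting.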
As a final experiment, we evaluated the IQS's ability to find relevant documents for a given prototype document without the user's help. For this experiment, we used the search narrative of each topic in TREC-COVID 2019 as the prototype document. We used the same hyper-parameters as in the previous experiments for the IQS. Since the TREC-COVID 2019 dataset was published as part of a Kaggle competition, we also submitted our results as a late submission. The IQS achieved a MAP and R-precision of 0.019 and 0.068 on the TREC-COVID 2019 dataset, which are not impressive. These results imply that there is a mismatch between the terms used in the topic narratives and the terms of the judged documents. To verify this assumption, we evaluated the MAP and R-precision after one active learning iteration (10 labels), and the results reached a MAP and R-precision of 0.317 and 0.367, respectively. This strengthens our assumption that there is a high vocabulary mismatch between the topic narrative and the judged documents; thus, the topic narrative does not represent the discussed topic well. The results on the Kaggle platform were calculated using the Normalized Discounted Cumulative Gain (NDCG). Although we expected the IQS to have poor performance due to the vocabulary mismatch, the IQS reached fifth place on the public leaderboard, obtaining an NDCG of 0.133, and fourth place on the private leaderboard, obtaining an NDCG of 0.475. These results suggest that even though there is a vocabulary mismatch, some of the relevant documents are still ranked high. As opposed to the previous evaluations, which focused on relevance feedback, in many real-world problems it is impossible to work in such a manner. In many cases, it is not practical or scalable to continuously ask users for feedback. Therefore, here, we operate in a pseudo relevance feedback mode. This means that we apply the iterative query selection (IQS) to the Twitter search engine and treat the obtained mean relevance error (MRE) score for the returned tweets as pseudo relevance feedback. In this section, we demonstrate the importance of the IQS as a necessary link in the pipeline of fake news detection. Using the proposed IQS and MRE, we demonstrate an automated collection of relevant tweets associated with given labeled news items (ground truth). Later, using these relevant tweets, we demonstrate fake news detection using supervised machine learning classifiers. In the following section, we describe the background of fake news detection on online social media (OSM), including the data collection process. Afterward, we present the dataset obtained using the IQS for fake news detection on OSM. Finally, we train a classifier on the collected data and present its performance. The task of fake news classification has been studied intensely in recent years. Along with the growth of online news, many non-traditional news sources, such as blogs, have evolved in order to respond to users' "appetite for information." In many cases, however, these sources are operated by amateurs whose reporting is often subjective, misleading, or unreliable [56]. This "everyone is a journalist" phenomenon [57], coupled with the flood of unverified news and the absence of quality control procedures to prevent potential deception, has contributed to the growing problem of fake news dissemination [58]. The spread of misinformation, propaganda, and fabricated news has potentially harmful effects, even including a significant impact on real-world events [59]. In recent years, it has weakened public trust in democratic governments and their activities, such as the "Brexit" referendum and the 2016 U.S. election [60].
World economies are also not immune to the impact of fake news; this was demonstrated when a false claim regarding an injury to President Obama caused the stock markets to plunge (dropping 130 billion dollars) [61]. In recent years, due to the threats to democracy, journalistic integrity, and economies, researchers have been motivated to develop solutions for this serious problem [60], proposing approaches for the detection of fake news based on natural language processing [43], investigating the diffusion of news [15], etc. First, it is important to verify that the RE identifies posts from Twitter that are relevant to a given news item. To measure the RE score, we collected the News dataset. The News dataset consists of 20 news items that were retrieved manually from the fact checking website Snopes. The manual query selection for each news item includes the following steps: First, read the given news item in order to understand its subject. Second, assign queries that express the meaning of the given news item, omitting stop words. Similar to Zhang's suggestions [62], extract 3-5 queries from the title and description of a given news item. Third, use synonyms to expand and reformulate the queries in order to retrieve a high number of posts relevant to the given news item [63]. For example, for the news item titled "Did Donald Trump Scare a Group of Schoolchildren?," there are several synonyms that can be used: Donald Trump - President of the U.S., scare - frighten, schoolchildren - youngsters, etc. Fourth, after the queries are determined, use them in the search engine and read a few of the returned results in order to understand whether they are relevant to the given news item. Moreover, the number of retrieved posts is important: if only a few posts are retrieved, it is a good indication that more synonyms should be used as query keywords (see Algorithm 2).

Algorithm 2: Manual Query Selection
1: Read the news item's title and description
2: If necessary, read the full report
3: Assign queries that express the meaning of the news item
4: Provide 3-5 alternative queries
5: Use synonyms
6: Verify the relevance of the queries using the search engine
7: Read a few of the retrieved results
8: Check relevance
9: Record the number of retrieved results

Since the queries assigned manually provided mostly relevant results, we used a TF-IDF query generator to obtain more negative samples. The TF-IDF query generator generates a query by selecting the keywords that obtain the highest TF-IDF score; the TF-IDF score of each word of a news item is calculated relative to all the other news items in the dataset. Then, we used the Twitter API to collect associated tweets using the assigned keywords. In total, we collected 1,173 tweets and labeled them manually. For the labeling process, we used three annotators (students) who were required to read the news item's title and description and the retrieved tweets associated with it. Each annotator then labeled each tweet with one of the following labels: Relevant, in case the given tweet is associated with the given news item; Irrelevant, in the opposite case; and Unknown, in case the annotator is not sure whether the tweet is related or not. Among the 1,173 retrieved tweets, we used only the tweets on which the majority of the annotators agreed (1,076 tweets). For an example of a news item with relevant and irrelevant tweets, see Table 3.
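The TF-IDF query generator described above admits a compact sketch using scikit-learn; the function name and the num_keywords default are our assumptions, loosely following the 3-5 keyword guideline:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_query(news_items, target_index, num_keywords=5):
        # Score each word of the target news item relative to all other news items
        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform(news_items)  # one row per news item
        row = matrix[target_index].toarray().ravel()
        vocab = vectorizer.get_feature_names_out()
        top = row.argsort()[::-1][:num_keywords]       # highest-TF-IDF terms first
        return [vocab[i] for i in top if row[i] > 0]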
After collecting the News dataset, we calculated the proposed RE score for each tweet using its news item. As a prototype document, we concatenated the news item's title and description. Next, for word vector representations, we used a pre-trained fastText [64] model which was trained on Common Crawl and Wikipedia using the fastText library. For evaluation, we calculated the AUC based on the RE scores of the retrieved tweets. Based on the 1,076 labeled tweets, we obtained an AUC score of 0.9 (see Figure 3). This observation shows the effectiveness of the RE score for differentiating between tweets that are relevant and irrelevant to a given news item. In this section, we describe the construction of a large dataset for the task of fake news detection on OSM using the iterative query selection (IQS). First, we crawled news items from fact checking websites, such as Snopes, Gossip Cop, and PolitiFact. For Snopes and PolitiFact news items, there are five fine-grained labels: true, mostly-true, false, mostly-false, and pants-on-fire. Similarly to Rasool et al. [65], we converted the classification problem into a binary one by categorizing news items labeled true and mostly-true as true, and news items labeled false, mostly-false, and pants-on-fire as false. For Gossip Cop, we categorized news items with scores from 0 to 3 as false and news items with scores from 7 to 10 as true. Since the majority of the news items in fact checking websites are false, we added news items from 10 well-known news sources (Times of Israel, CNN News, ABC News, BBC News, The New York Times, The Jerusalem Post, The American Conservative, MSNBC, Fox News, and Politico) as true news items. A large number of studies have exploited reliable news sources as a proxy for true news items [14]. In total, we collected 70,018 news items (16,212 false, 53,806 true). For each news item, we set the IQS algorithm to run on the Twitter API three times, with the following parameters: five final queries returned (numQueries = 5), a maximum of 15 iterations (itr = 15), between 3 and 6 words in a query (minq = 3, maxq = 6), and 20 results returned per query (rlimit = 20). Then, after obtaining the top five final queries from all three runs, we used them to retrieve the most relevant tweets for each news item, while limiting the number of tweets returned for each query to 500. To make the fake news detection more realistic, we collected only the tweets that were posted before the fact checker assigned a label to the given news item. Finally, utilizing this approach, we constructed a large fake news dataset containing about 70,000 news items and about 61 million corresponding posts, the distribution of which is shown in Table 4.

Table 4: Fake news dataset statistics.

To classify the news items, we extracted author- and post-based features. For the author-based features, we applied aggregation functions to various aspects of author demographics, such as registration age, number of followers, number of followees, number of tweets published by the user, etc. The post-based features include aggregations of post metadata, such as retweet count, text length, the time interval between the oldest and newest post, etc. For all the features extracted from the posts, we removed stop words. Also, regarding the posts' data, we extracted the following features: sentiment, temporal (post diffusion patterns), LDA (variations of the posts' topics), TF-IDF, and word embedding. For the latter, we used the GloVe Wikipedia pre-trained model with 300 dimensions. As aggregation functions, we used the mean, median, max, min, standard deviation, kurtosis, and skewness. For the classification, we tried many combinations of supervised machine learning algorithms and feature subsets. All classifiers were trained using 10-fold cross-validation, and we averaged the results obtained over all the folds. We determined that the best performing classifier on the test set was the Random Forest with 100 estimators and a max depth of 10. This classifier, with 100 features, obtained an AUC and accuracy of 0.92 and 0.86, respectively. This result is evidence that an application that detects false news based on the OSM can benefit from our proposed approach.
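A condensed sketch of this training setup is shown below; the column names and the numeric feature list are illustrative placeholders (the actual feature set also includes sentiment, temporal, LDA, TF-IDF, and embedding features), while the Random Forest hyper-parameters follow those reported above:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    def aggregate_features(posts: pd.DataFrame) -> pd.DataFrame:
        # posts: one row per tweet, keyed by news_item_id (hypothetical schema).
        # Apply the aggregation functions to author/post statistics;
        # kurtosis can be added similarly via scipy.stats.kurtosis.
        numeric = ["followers_count", "friends_count", "statuses_count",
                   "retweet_count", "text_length"]
        aggs = ["mean", "median", "max", "min", "std", "skew"]
        features = posts.groupby("news_item_id")[numeric].agg(aggs)
        features.columns = ["_".join(col) for col in features.columns]
        return features

    def evaluate_classifier(features, labels):
        # Random Forest with the best hyper-parameters reported above,
        # evaluated with 10-fold cross-validation (AUC and accuracy averaged).
        clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
        scores = cross_validate(clf, features, labels, cv=10,
                                scoring=["roc_auc", "accuracy"])
        return {metric: scores[f"test_{metric}"].mean()
                for metric in ("roc_auc", "accuracy")}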
Table 5: Features listed by Gini importance (e.g., max followers count of a claim's authors: 0.033).

Also, we analyzed the most influential features of the best classifier (see Table 5). The most important feature was the number of verified authors, with a Gini importance of 0.162. Comparing the distribution of these authors with respect to fake and true news items, we can see that the number of verified authors within true news items is three times higher than within false news items. These differences were found to be statistically significant (a p-value of 0.0). Based on this result, we conclude that verified authors are important actors for the detection of fake news: the higher their participation, the higher the reliability of the online discussion. According to the Gini importance, the second, fifth, and sixth most influential features were aggregations over the news item's authors. This strengthens the conclusion of Castillo et al. [66] that author-based features are very relevant for fake news detection within the OSM. In addition, the third and fourth features are aggregations over the word embeddings of the news item's posts. These features indicate that fixed-length vector representations of the words of the online discussions can capture the truthfulness of given news items. These results show that our proposed algorithm can be utilized for solving real-world problems (e.g., the detection of fake news). In addition, the machine learning classifiers trained on this large dataset collected using the IQS obtained impressive results. This strengthens the claim that our method can be very useful for detecting fake news while retrieving relevant data automatically. Collecting information from OSM has raised ethical concerns in recent years. To minimize the potential risks of such activities, this study follows the recommendations presented in [67], which deal with the ethical challenges of OSM and Internet communities. In this study, we proposed a method which selects queries for a given prototype document to retrieve the maximal number of relevant documents. To evaluate the proposed method, we used the Twitter search engine to retrieve tweets associated with the given prototype document. This service collects tweets published by accounts that agreed to share their information publicly.
In this paper, we proposed an automated iterative query selection (IQS) algorithm for improving information retrieval from opaque search engines. This method consists of two components: the mean relevance error (MRE), which estimates the relevance of documents to a given prototype document, and the iterative algorithm, which selects suitable queries based on the MRE estimation. We demonstrated our methods as a retrieval system with relevance feedback on the Twitter TREC Microblog 2012 and TREC-COVID 2019 datasets. Our proposed method outperformed all the other methods on both datasets. Finally, we demonstrated the use of the proposed IQS method as part of a fake news detection pipeline and obtained impressive results: an AUC of 0.92 and an accuracy of 0.86. We draw the following conclusions: First, the MRE score was found to be a successful measure for differentiating between documents that are relevant and irrelevant with respect to a given prototype document (see Section 5.2). Second, the IQS algorithm can find a large number of relevant results from an opaque search engine (see Section 4.3). Third, the IQS was found suitable and useful for an automated fake news detection system. In contrast to other data collection approaches, our method is generic and can obtain a larger subset of relevant posts. For example, an approach that only collects posts containing the news item's URL reaches a smaller subset of all the relevant documents and requires the user to provide the appropriate source URL. One possible future work could demonstrate the proposed approach on different OSM platforms, such as Reddit, Quora, etc. Another might compare the effectiveness of a fake news detection system using data collected via a source URL versus data collected using our query selection method.

References
Introduction to Information Retrieval
Product recommendation based on search keywords
Who should I cite: learning literature search models from citation behavior
Real-time news certification system on Sina Weibo
A review of word embedding and document similarity algorithms applied to academic text
Distributed representations of sentences and documents
Skip-thought vectors
Unsupervised learning of sentence embeddings using compositional n-gram features
A case for interaction: A study of interactive information retrieval behavior and effectiveness
Personalized query expansion for the web
Predicting query performance
Placing search in context: The concept revisited
Fake news detection on social media using geometric deep learning
The spread of true and false news online
Query selection techniques for efficient crawling of structured web sources
Using word embeddings for automatic query expansion
Query expansion using word embeddings
Improving pseudo-relevance feedback with neural network-based word representations
Query expansion and query fuzzy with large-scale click-through data for microblog retrieval
ReQ-ReC: High recall retrieval with query pooling and interactive classification
ATR-Vis: Visual and interactive information retrieval for parliamentary discussions in Twitter
Pseudo-relevance feedback based on matrix factorization
Query reformulation using WordNet and genetic algorithm
Task-oriented query reformulation with reinforcement learning
Query expansion for microblog retrieval focusing on an ensemble of features
Effective pseudo-relevance for microblog retrieval
Collecting event-related tweets from Twitter stream
From word embeddings to document distances
Short text similarity with word embeddings
Improving document ranking with dual word embeddings
A deep relevance matching model for ad-hoc retrieval
Learning to match using local and distributed representations of text for web search
Search result diversification
An axiomatic approach for result diversification
The use of MMR, diversity-based reranking for reordering documents and producing summaries
Learning optimally diverse rankings over large document collections
Probabilistic models of ranking novel documents for faceted topic retrieval
Result diversification based on query-specific cluster ranking
Diversifying search results
Exploiting query reformulations for web search result diversification
This analysis shows how viral fake election news stories outperformed real news on Facebook
Fake news detection via NLP is vulnerable to adversarial attacks
Fake news detection on social media: A data mining perspective
"Liar, liar pants on fire": A new benchmark dataset for fake news detection
Multimodal fusion with recurrent neural networks for rumor detection on microblogs
EANN: Event adversarial neural networks for multi-modal fake news detection
GloVe: Global vectors for word representation
Efficient estimation of word representations in vector space
Enriching word vectors with subword information
The Algorithm Design Manual
WordNet: A lexical database for English
Overview of the TREC-2012 Microblog track
The probabilistic relevance framework: BM25 and beyond
Indexing by latent semantic analysis
The reconstruction of American journalism
Danger, trauma, and verification: eyewitnesses and the journalists who view their material
Automatic deception detection: Methods for finding fake news
Social media and fake news in the 2016 election
Fake news: Fundamental theories, detection strategies and challenges
Can 'fake news' impact the stock market?
Automatic keyword extraction from documents using conditional random fields
Query expansion using lexical-semantic relations
Advances in pretraining distributed word representations
Multi-label fake news detection using multi-layered supervised learning
Information credibility on Twitter
Ethical considerations when employing fake identities in online social networks for research