ARTICLE
Automated Fake News Detection in the Age of Digital Libraries
Uğur Mertoğlu and Burkay Genç
INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2020
https://doi.org/10.6017/ital.v39i4.12483
Uğur Mertoğlu (umertoglu@hacettepe.edu.tr) is a PhD Candidate, Hacettepe University. Burkay Genç (bgenc@cs.hacettepe.edu.tr) is Assistant Professor, Hacettepe University. © 2020.

ABSTRACT
The transformation of printed media into the digital environment and the extensive use of social media have changed the concept of media literacy and people’s habits of news consumption. While online news is faster, easier, comparatively cheaper, and more convenient to access, it also speeds up the dissemination of fake news. Because large amounts of data are freely produced and consumed, fact-checking systems powered by human effort are not enough to question the credibility of the information provided or to prevent it from spreading rapidly, like a virus. Libraries, long known as sources of trusted information, are facing challenges caused by misinformation, as mentioned in studies about fake news and libraries.1 Considering that libraries are undergoing digitization processes all over the world and are providing digital media to their users, it is very likely that unverified digital content will be served by the world’s libraries. The solution is to develop automated mechanisms that can check the credibility of digital content served in libraries without manual validation. For this purpose, we developed an automated fake news detection system based on Turkish digital news content. Our approach can be adapted to any other language for which labelled training material is available. This model can be integrated into libraries’ digital systems to label served news content as potentially fake whenever necessary, preventing uncontrolled falsehood dissemination via libraries.
INTRODUCTION
Collins Dictionary, which chose the term “fake news” as its “Word of the Year 2017,” describes news as the actual and objective presentation of a current event, information, or situation that is published in newspapers and broadcast on radio, television, or online.2 We are in an era where everything goes online, and news is no exception. Many people today prefer to read their daily news online, because it is a cost-effective and convenient way to remain up to date. Although this convenience has clear benefits for society, it can also have harmful side effects. Having access to news from multiple sources, anytime, anywhere has become an irresistible part of our daily routines. However, some of these sources may provide unverified content which can easily be delivered right to your mobile device. Most importantly, potential fake news content delivered by these sources may mislead society and cause social disturbances such as triggering violence against ethnic minorities and refugees, causing unnecessary fear related to health issues, or sometimes even resulting in crises, devastating riots, and strikes.

Unlike news, fake news has no settled definition; in the literature it is often defined according to the data used or the limited perspective of the study. For example, DiFranzo and Gloria-Garcia defined fake news as “false news stories that are packaged and published as if they were genuine.”3 On the other hand, Guess et al.
see the term as “a new form of political misinformation” within the domain of politics, whereas Mustafaraj is more direct and defines it as “lies presented as news.”4 A comprehensive list of 12 definitions can be found in Egelhofer and Lecheler.5 In simplified terms, news which is created to deceive or mislead readers can be called fake news. However, the concept of fake news is quite a broad one that needs to be specified meticulously. Fake news is created for many purposes and emerges in many different types. Most of these types, which have an interwoven structure, are shown in figure 1. Although it is not easy to cluster these types into separate groups, they can be categorized according to information quality or according to intention (whether or not the news is created to deceive deliberately), as Rashkin et al. did.6 We propose the following classification, where the two dimensions represent the potential impact and the speed of propagation.

Figure 1. The volatile distribution of the fake news types (clustered in four regions: sr, Sr, sR, SR) with respect to two dimensions: speed of propagation and potential impact.

The four regions visualized are clustered according to their dangerousness. First of all, it should be noted that ordering the types of fake news with stable precision is quite a challenging task. The variations within the field depend heavily on dynamic factors such as timespan, actors, and the echo-chamber effect. Hence, this figure should be considered a clustering effort. The types may intersect within the regions. We will now give examples for two regions, “sr” and “SR.” The SR grouping shows characteristics of high risk levels and fast dissemination.
This includes varieties of fake news such as propaganda, manipulation, misinformation, hate news, provocative news, etc. We usually encounter this in the domain of politics. This kind of news may cause critical and nonrecoverable results in politics, the economy, etc., in a short period of time. The rise of the term fake news itself can also be attributed to this kind of news. On the other hand, the relatively less severe group (sr) of fake news, comprising satire, hoax, click-bait, etc., has low risk levels and a slow speed of dissemination. A frequently used type in this group, click-bait, is a sensational headline or link that urges the reader to click on a post, link, article, image, or video. These kinds of news have a repetitive style. It can be said that readers become aware of the falsehood after encountering it a few times. So, the risk level is lower, and dissemination is slower.

Vosoughi et al. stated that “Falsehood diffuses significantly farther, faster, deeper, and more broadly than the truth.”7 So indeed, just one piece of fake news may affect many more people than thousands of true news items do, because of the dramatic circulation of fake news. In their recent survey about fake news, Zhou and Zafarani highlighted that fake news is a major concern for many different research disciplines, especially information technologies.8

Having long been a trusted source of information, libraries will play an important role in fighting the fake news problem. Kattimani et al.
claim that the modern librarian must be equipped with the necessary digital skills and tools to handle both printed collections and newly emerging digital resources.9 Similarly, we foresee that digital libraries, which can be defined as collections of digital content licensed and maintained by libraries, can be a part of the solution as an authority service with a collective effort. Connaway et al. point to the key role of information professionals such as librarians, archivists, journalists, and information architects in helping society use the products and services related to news in a convenient way.10 As libraries all over the world are transitioning into digital content delivery services, they should implement mechanisms, under the guidance of information professionals, to avoid fake and misleading content being disseminated through them.

To lay out proper future directions for the solution strategy, the interaction between the library and information science (LIS) community and fake news must be clearly understood. Sullivan states that the LIS community has been affected deeply in the aftermath of the 2016 US presidential elections.11 Moreover, he quotes many other scientists, emphasizing libraries’ and librarians’ role in the fight against fake news. For example, Finley et al. say that libraries are the direct antithesis of fake news; the American Library Association (ALA) called fake news an anathema to the ethics of librarianship in 2017; Rochlin emphasizes the role of librarians in this fight and talks about the need to adopt fake news as a central concern in librarianship; and many other researchers place librarians on the front lines of the fight against fake news.12

Today, the struggle to detect fake news and prevent its spread is so popular that competitions are being organized (e.g., http://www.fakenewschallenge.org/) and conferences are being held (e.g., Bobcatsss 2020).
The struggle against fake news can be classified under three main venues:
• Reader awareness
• Fact-checking organizations and websites
• Automated detection systems

The first item requires awareness of individuals against fake news and a collective conscience within society against spreading fake news. To this end, visual and textual checklists, frameworks, and guidance lists are being published by official organizations, such as IFLA’s13 (International Federation of Library Associations) infographic, which contains eight steps to spot fake news. The RADAR framework and the Currency, Relevance, Authority, Accuracy, and Purpose (CRAAP) test are some of the efforts trying to increase reader awareness of fake news.14 Unfortunately, due to the nature of fake news and the clever way it is created to trigger people’s hunger to spread sensational information, it is very difficult to achieve full control via this strategy. Some studies have shown explicitly that humans are prone to confusion when it comes to spotting lies or deciding whether a news item is fake or not.15 Furthermore, people often overlook facts that conflict with their current beliefs, especially in politics and controversial social issues.16

The second strategy focuses on third-party, manually driven systems for checking and labelling content as fake or valid. Recently, we have seen many examples of offline and online organizations trying to work according to this strategy, such as a growing body of fact-checking organizations, start-ups (Storyzy, Factmata, etc.), and other projects with similar purposes.17 Unfortunately, these manually powered systems cannot cope with the huge amounts of digital content being steadily produced. Therefore, they focus only on a subset of digital content that they classify as having higher priority.
Even for this subset of content, their reaction speed is much slower than the speed at which fake information spreads. Therefore, automated and verified systems emerge as an inevitable last option.

The third strategy offers automated fact-checking systems which, once trained, can deliver content labelling at unprecedented speeds. Today, many researchers are investigating automated solutions and building models with different methodologies.18 Notwithstanding the latest studies, there is still a lot to do in the realm of automated fake news detection. Automated fact-checking systems will be detailed in the rest of the paper.

Thanks to the internet, the collections of digital content served by digital libraries can be accessed by a great number of users without distance and time limits. Therefore, we propose a solution to the problem by positioning digital libraries as automated fact-checking services, which label digital news content as fake or valid as soon as or before it is served through library systems. The main reason we associate this approach with digital libraries is their access to a wide variety of digital content which can be used to train the proposed mathematical models, as well as their role in society as publishers of trusted information.

To this end, we develop a mathematical model that is trained using existing news content served by digital libraries and is capable of labelling news content as fake or valid with high accuracy (96.81% for the best model; see table 3). The proposed solution uses machine learning techniques with an optimized set of extracted features and annotated labels of existing digital news content. Our study mainly contributes (a) a new set of features highly applicable for agglutinative languages, (b) the first hybrid model combining a lexicon/dictionary-based approach with machine learning methods to detect fake news, and (c) a benchmark dataset prepared in Turkish for fake news detection.
LITERATURE REVIEW
Contemporary studies have indicated that social, economic, and political events in recent years, especially after the 2016 US presidential elections, are increasingly associated with the concept of fake news.19 Since then, fake news has begun to be used as a tool in many domains. On the other hand, researchers motivated by finding automated solutions started to make use of machine learning, deep learning, hybrid models, and other methodologies for their solutions.

Although computational deception detection studies applying NLP (Natural Language Processing) operations are not new, textual deception in the context of text-based news is a new topic for the field of journalism.20 Accordingly, we believe that there is a hidden body language of news text, which has linguistic clues indicating whether the news is fake or not. Thus, lexical, syntactic, semantic, and rhetorical analysis, when used with machine learning and deep learning techniques, offers encouraging directions.

Textual deception spreads over a wide spectrum, and the studies have utilized many different techniques. There are some prominent studies which treated the problem as a binary classification problem utilizing linguistic clues.21

Although it is still early to say the linguistic characteristics of fake news are fully understood, research into fake news detection in English-language texts is relatively advanced compared to that in other languages. In contrast, agglutinative languages such as Turkish have been little researched when it comes to fake news detection. Agglutinative languages enable the construction of words by adding various morphemes, which means that words that are not practically in use may exist theoretically.
For example, “gerek-siz-leş-tir-ebil-ecek-leri-miz-den-dir” is a theoretically possible word that means “it is one of the things that we will be able to make redundant,” but it is not a practical one.

Shu et al. classified the models for the detection of fake news in their study.22 According to this study, automated approaches can focus on four types of attributes to detect fake news: knowledge based, style based, stance based, or propagation based. Among these, it can be said that the most useful approaches are the ones which focus on the textual news content. The textual content can be studied by an automated process to extract features that can be very helpful in classifying content as fake or valid.

Many scholars have tried to build models for automatic detection and prediction of fake news using machine learning algorithms, deep learning algorithms, and other techniques. These scholars approach the detection of fake news from many different perspectives and domains. For example, in one of the studies, scientific news and conspiracy news were used.23 In Shu et al.’s study based on the credibility of news, the headlines were used to determine whether the article was clickbait or not. In another study, Reis et al. worked on Buzzfeed articles linked to the 2016 US election using machine learning techniques with a supervised learning approach.24

Studies which try to detect satire and sarcasm can be considered subcategories of fake news detection.25 Our observation, in line with the general view, is that satire is not always recognizable and can be mistaken for real news.26 For this reason, we included satirical news in our dataset. It should be noted that although satire or sarcasm can be classified by automated detection systems, experts should still evaluate the results of the classification.

While some scholars used specific models focusing on unique characteristics, some others such as Ruchansky et al.
proposed hybrid deep models for fake news detection making use of multiple kinds of features, such as temporal engagement between users and news articles over time, and generated a labelling methodology based on those features.27

In related studies, many features such as automatically extracted features, hand-crafted features, social features, network information, visual features, and some others such as psycholinguistic features are applied by researchers.28 In this work, we focused on news content features; however, social context features can also be adapted using different tiers, such as user activity patterns, analysis of user interaction, profile metadata, and social network/graph analysis, to extract features. Our data also contain some of these features, but because we lack quantitative ground truth for them, we avoided using them.

METHODOLOGY
In this section, we present our motivation for this work, which we visualized in a framework and named Global Library and Information Science (GLIS_1.0). Subsequently, we discuss the construction of the automated detection system as the key element of the GLIS_1.0 framework. We explain the framework, model, dataset, features, and the techniques used in this section.

Framework
The main structure of the proposed framework is shown in figure 2. This framework consists of highly cohesive but flexible layers.

Figure 2. The GLIS_1.0 framework main structure.

In the presentation layer one can find the different sources of news that are publicly available. These sources can be accessed directly using their websites or can be searched for via search engines. The news is received by fact-checking organizations, which classify it manually; digital libraries, which archive and serve it; and automated detection systems (ADS), which classify it automatically.
Digital libraries work together with fact-checking organizations and ADSs to present clean and valid news to the public. Moreover, search engines use digital library systems to label their results as fake or valid. Fact-checking organizations should also benefit from the output of ADSs: instead of manually checking heaps of news content, they could focus on news labeled as potentially fake by an ADS. Through GLIS, ADSs make the life of fact-checking organizations and digital libraries much easier, all the while increasing the quality of news served to the public.

Since figure 2 gives only a high-level overview of the structure, there may be many other components, mechanisms, or layers, but the key elements of this structure are automated detection systems and digital libraries.

A critical question about this framework is why we need such an authority mechanism. The answer is quite simple: technological progress is not the only solution. On the contrary, tech giants have already been subject to regulatory scrutiny for how they handle personal information.29 Also, their policy related to political ads has been questioned. Furthermore, they are often blamed for failing to fight fake news. Indeed, there is a more urgent need than ever for global action.

Digital libraries are much more than a technological advancement. Hence, they should be considered institutions or services which can act as a great authority service for providing news to society, since printed media is disappearing day by day. The threats caused by fake news are real and dangerous, but only recently have researchers from different disciplines been trying to find possible solutions, whether educational, technological, regulatory, or political.
Digital librarianship can be the intersection of all these solutions for promoting information/media literacy. Hence, digital librarianship will make use of many automated detection systems (ADS) to serve qualified news. In the following section, we discuss ADS in detail.

Model
An overview of our automated detection system model, which is critical for the framework, is shown in figure 3. Our fake news detection model consists of two phases. The first is Language Model/Lexicon Generation and the second is Machine Learning Integration. In this work, we used machine learning algorithms via supervised learning techniques, which learn from labeled news data (training) and help us predict outcomes for unseen news data (test).

Figure 3. Integrated fake news detection model with main phases combining language-model based approach with machine learning approach.

Dataset
We collected our data from three sources:
• The primary source is the GDELT (Global Database of Events, Language and Tone) Project (https://www.gdeltproject.org/), a massive global news media archive offering free access to news text metadata for researchers worldwide. It can almost be considered a digital library of news in its own right. However, GDELT does not provide the actual news text and only serves processed metadata along with the URL of the news item. GDELT normally does not check the validity of any news items. However, we have only used news from approved news agencies and completely ignored news from local and lesser-known sources to maximize the validity of the news we have automatically obtained through GDELT. Moreover, we have post-processed the obtained texts by cross-validating with teyit.org data to clean any potential fake news obtained through GDELT links.
• The second source is teyit.org, which is a fact-checking organization based in Turkey, compliant with the principles of IFCN (International Fact-Checking Network) and aiming to prevent the spreading of false information through online channels. Manually analyzing each news item, they tag it as fake, true, or uncertain. We used their results to automatically download and label each news text.
• Lastly, our team collected manually curated and verified fake and valid news obtained from various online sources and named it MVN (Manually Verified News). This set includes fake and valid news that we have manually accumulated over time during our studies and that does not overlap with the news obtained from the GDELT and teyit.org sources.

We named our dataset TRFN. In Phase 2, the data is very similar to the data we used in Phase 1. However, to see the effectiveness of the model, we made modifications to exclude old news from before 2017 and added new items from 2019. The news in our dataset spans the time frame 2017–2019 and is uniformly distributed.

Table 1 outlines the dataset statistics, namely where the news text comes from, its class (fake or valid), the number of distinct texts, and the corresponding data collection method. It can be seen from the table that most of our valid news comes from the GDELT source, whereas teyit.org, a fact-checking organization, contributes only fake news.

Table 1. TRFN Dataset Summary after cleaning and duplicate removal.

| Dataset   | Class    | Size of Processed Data | Collection Method |
|-----------|----------|------------------------|-------------------|
| GDELT     | NON-FAKE | 82708                  | Automated         |
| Teyit.org | FAKE     | 1026                   | Automated         |
| MVN       | NON-FAKE | 1049                   | Manual            |
| MVN       | FAKE     | 400                    | Manual            |

All news items were processed through Zemberek (http://code.google.com/p/zemberek), the Turkish NLP engine, to extract different morphological properties of the words within the texts.
After this processing phase, all obtained features were converted into tabular format and made available for future studies. This dataset is now available for scholarly studies upon request.

In a study of this nature, the verifiability of the data used is important. As we have already mentioned, most of the data we used comes from verified sources such as mainstream news agencies accessed through GDELT and teyit.org archives, which are verified by teyit.org staff. All data used in training the mathematical models, which are explained in the rest of the paper, are either directly or indirectly verified.

Another important issue was the generalizability of the dataset, which determines whether the results of the study are applicable only to specific domains or to all available domains. Although focusing on a specific news domain would clearly improve our accuracies, we preferred to work in the general domain and included news from all specific domains. The distribution of domains in our dataset is visualized in figure 4. This distribution closely matches the distribution one would experience reading daily news in Turkey. Hence, we have no domain-specific bias in our training dataset.

Figure 4. The distribution of domains in the dataset. (SciTechEnvWetNatLife = Science, Technology, Environment, Weather, Nature, Life. EduCultureArtTourism = Education, Culture, Art, Tourism.)

Moreover, during the exploratory data analysis we obtained highly correlated evidence showing syntactic similarities with other NLP studies in Turkish. For example, the results of a study by the Zemberek developers (http://zembereknlp.blogspot.com/2006/11/kelime-istatistikleri.html) to find the most common words in Turkish, conducted on over five million words, are compatible with the most common words in our corpus.
This evidence supports the representativeness of our dataset.

The last issue worth discussing is the imbalanced nature of the dataset. An imbalanced dataset occurs in a binary classification study when the frequency of one of the classes dominates the frequency of the other class in the dataset. In our dataset, the amount of valid news far surpasses the amount of fake news. This generally results in difficulties in applying conventional machine learning methods to the dataset. However, it is a frequently observed phenomenon, due to the disparity between classes in these kinds of real-world problems. To avoid potential problems due to the imbalanced nature of the dataset, we used SMOTE (Synthetic Minority Over-sampling Technique), which is an over-sampling method.30 It creates synthetic samples of the minority class that are relatively close in the feature space to the existing observations of the minority class.

Features
In this study, we discarded some features because of their relatively low impact on overall performance during the exploratory data analysis and subsequently in the training phase. The most effective features we decided on are shown in table 2.

Table 2.
Main Features

| Features        | Group                   | Definition                                           |
|-----------------|-------------------------|------------------------------------------------------|
| nRootScore      | Language Model Features | The news score calculated according to the Root Model |
| nRawScore       | Language Model Features | The news score calculated according to the Raw Model  |
| SpellErrorScore | Extracted Features      | Spelling errors per sentence                          |
| ComplexityScore | Extracted Features      | The score of the complexity/readability of the news   |
| Source          | Labels                  | The URL or identifier of the news                     |
| MainCategory    | Labels                  | The category of the news                              |
| NewsSite        | Labels                  | The unique address of the news                        |

The language model features nRootScore and nRawScore are features that we have borrowed from our earlier study on fake news detection.31 In that study, we focused on constructing a fake news dictionary/lexicon based on different morphological segments of the words used in news texts. These two scores were found to be the most successful ones in determining the fakeness/validity of a news text, one considering the raw form of the words, the other considering the root form.

The extracted features are ComplexityScore and SpellErrorScore. ComplexityScore basically represents the readability of the text. Studies for determining a good readability metric exist for the Turkish language.32 We used a modified version of the Gunning-Fog metric, which is based on word length and sentence length.33 Since Turkish is an agglutinative language, we used word length instead of the syllable count. We also made some modifications to normalize the scores. The average number of syllables per word in Turkish is 2.6, so we defined a word as a long word if it has more than 9 letters.34 For a given news text T, the Complexity Score (CS) can be computed by equation 1.

\[ T_{CS} = \left( \frac{\dfrac{\mathit{WordCount}}{\mathit{SentenceCount}} + \dfrac{\mathit{LongWordCount} \times 100}{\mathit{WordCount}}}{10} \right) \tag{1} \]

The second extracted feature is SpellErrorScore. We expect that there may be many more errors in fake news than in valid news. We calculated the spelling error counts using the Turkish Spellchecker class of Zemberek.
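As a rough illustration of equation 1, the sketch below computes a complexity score for a text in plain Python. It is not the authors' implementation: the regular-expression sentence and word splitters are simplifying assumptions (the paper relies on Zemberek for Turkish text processing), the function names are hypothetical, and the division by 10 is assumed to apply to the whole sum.

```python
import re

# "A long word has more than 9 letters", as stated in the text.
LONG_WORD_MIN_LETTERS = 10


def split_sentences(text):
    # Naive splitter on ., ! and ? (an assumption, not the paper's tokenizer).
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text):
    # Words are maximal runs of Unicode letters, so Turkish characters are kept.
    return re.findall(r"[^\W\d_]+", text)


def complexity_score(text):
    """Modified Gunning-Fog style score following equation 1 (sketch)."""
    words = split_words(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) >= LONG_WORD_MIN_LETTERS)
    avg_sentence_length = len(words) / len(sentences)
    long_word_percentage = long_words * 100 / len(words)
    # Assumption: the normalizing factor 10 divides the whole sum.
    return (avg_sentence_length + long_word_percentage) / 10


sample = "Bu kısa bir cümle. Gereksizleştirebileceklerimizdendir uzun bir kelime."
print(complexity_score(sample))
```

For the sample text there are 8 words in 2 sentences and 1 long word, so the score is (4 + 12.5) / 10 = 1.65.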
Because the length of news texts varies, we calculate the ratio per sentence. For a given news text T, SE (Spell Error Score) is calculated as shown in equation 2.

\[ T_{SE} = \left( \frac{\mathit{SpellErrorCount}}{\mathit{SentenceCount}} \right) \tag{2} \]

Finally, we included the metadata categories Source, MainCategory, and NewsSite as additional identifiers for the learning process.

Then, we combined features extracted from text representation techniques with the features shown in table 2 and trained the model with different classifiers. For text representation, we followed two directions in the experiments. First, we converted text into structured features with the Bag of Words (BOW) approach, in which text data is represented as the multiset of its words. Second, we experimented with N-grams, which represent sequences of n words; in other words, the text is split into chunks of N words.

In the BOW model, documents in TRFN are represented as a collection of words, ignoring grammar and even word order, but preserving multiplicity. In a classic BOW approach, each document can be represented as a fixed-length vector with length equal to the vocabulary size. This means each dimension of this vector corresponds to the occurrence of a word in a news item. We customized the generic approach by reducing variable-length documents to fixed-length vectors, so that texts of varying lengths can be used with many machine learning models.

Figure 5. An overview of BOW (Bag of Word) Approach.

Because we ignore word order, we reduce each document to a fixed-length histogram of counts, as seen in figure 5.
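The two text representations described above can be sketched in a few lines of plain Python. This is a minimal illustration on toy English sentences, not the authors' pipeline: the whitespace tokenizer, the function names, and the sample documents are all assumptions made for the example (the paper processes Turkish text with Zemberek and runs its experiments with Scikit-learn).

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a stand-in for real morphological analysis.
    return text.lower().split()


def ngrams(tokens, n):
    # Every run of n consecutive tokens, joined with a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, n_values=(1,)):
    # Sorted list of all n-gram terms seen in the corpus (fixes the vector length).
    terms = set()
    for doc in documents:
        tokens = tokenize(doc)
        for n in n_values:
            terms.update(ngrams(tokens, n))
    return sorted(terms)


def bow_vector(document, vocabulary, n_values=(1,)):
    # Fixed-length count vector: word order is ignored, multiplicity is kept.
    tokens = tokenize(document)
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return [counts[term] for term in vocabulary]


docs = ["the news is fake", "the news is the news"]  # hypothetical toy corpus
vocab = build_vocabulary(docs)
print(vocab)                       # ['fake', 'is', 'news', 'the']
print(bow_vector(docs[1], vocab))  # [0, 1, 2, 2]
print(ngrams(tokenize(docs[1]), 2))
```

Passing `n_values=(1, 2)` to both functions yields a combined unigram and bigram representation over a single vocabulary.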
Assuming N is the number of news documents and W is the number of possible words in the corpus, it should be noted that the N×W count matrix is generally large but sparse: we have many news documents, but most words do not occur in any given document. This rarity of terms is a drawback of the approach. Therefore, we modified the model to compensate for the rarity problem by weighting the terms using the TF-IDF measure, which evaluates how important a word is to a document in a collection.

The other technique we used, the N-gram model, takes its name from the generic term for a string of words in computational linguistics, and it is extensively used in text mining and NLP tasks. The prefix that replaces the n indicates the number of consecutive words in the string. So, a unigram is one word, a bigram is two words, and an n-gram is n words.

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the experimental process and the results are presented. All experiments are performed using the Scikit-learn library. To evaluate the performance of the model and the proposed features, we employed the precision, recall, F1 score (the harmonic mean of precision and recall), and accuracy metrics. We ran many experiments using different combinations of features.

Several classification models have been trained. These are as follows: K-Nearest Neighbor, Decision Trees, Gaussian Naive Bayes, Random Forest, Support Vector Machine, ExtraTrees Classifier, and Logistic Regression.

To be effective, a classifier should be able to correctly classify previously unseen data. To this end, we tuned the parameter values for all the classification models used. Then, models were trained and evaluated on the TRFN dataset using 10-fold cross-validation. In table 3, we present the best scores of the proposed model.
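For readers less familiar with the evaluation metrics named above, the following sketch shows how per-class precision and recall, F1 score, and accuracy can be computed from true and predicted labels. It is a plain-Python illustration with hypothetical function names and made-up labels; the experiments in this paper use the Scikit-learn library, and the mapping of the labels 0 and 1 to the valid and fake classes is an assumption here.

```python
def confusion_counts(y_true, y_pred, positive):
    # True positives, false positives, false negatives for one class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred, positive):
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    # Fraction of items whose predicted label equals the true label.
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels for illustration only (assumed: 1 = fake, 0 = valid).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
for label in (0, 1):
    print(label, precision_recall_f1(y_true, y_pred, positive=label))
print("accuracy", accuracy(y_true, y_pred))
```

Reporting precision and recall separately for each class, as in the "(0,1)" columns of table 3, matters for imbalanced data, where accuracy alone can hide poor performance on the minority class.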
The results are encouraging and illustrate how useful automated detection systems can be as a key component of the integrated solution framework in figure 2. We compared the algorithms on three final feature sets, which were chosen because they gave consistent results relative to the other feature set combinations. Set1 stands for bigram+FOpt (Optimized Features), Set2 stands for BOWModified+FOpt, and Set3 stands for unigram+bigram+FOpt. The results show that there is a relative consistency in terms of performance across the models. In almost all models, the combination of unigram+bigram and the optimized feature set (FOpt) gives better results than the other combinations.

The ExtraTrees Classifier model is chosen as the best due to its higher performance. This model, also known as the Extremely Randomized Trees Classifier, is an ensemble learning technique that aggregates the results of multiple decision trees collected in a “forest” to output its classification result. It is very similar to the Random Forest Classifier and differs only in the manner in which the decision trees are constructed. Accordingly, the results of these two classifiers are close to each other.

Table 3. Results. Evaluation results of all combinations of features and classification models.
| Model | Feature Set | Precision % (0) | Precision % (1) | Recall % (0) | Recall % (1) | Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| Gaussian Naive Bayes | Set1 | 93.32 | 93.96 | 93.92 | 93.36 | 93.64 | 93.62 |
| Gaussian Naive Bayes | Set2 | 93.37 | 94.02 | 93.98 | 93.42 | 93.70 | 93.68 |
| Gaussian Naive Bayes | Set3 | 93.95 | 94.21 | 94.19 | 93.97 | 94.08 | 94.07 |
| K-Nearest Neighbour | Set1 | 93.70 | 93.50 | 93.52 | 93.69 | 93.60 | 93.61 |
| K-Nearest Neighbour | Set2 | 93.66 | 94.05 | 94.03 | 93.68 | 93.85 | 93.84 |
| K-Nearest Neighbour | Set3 | 94.42 | 94.21 | 94.22 | 94.41 | 94.31 | 94.32 |
| ExtraTrees Classifier | Set1 | 94.15 | 94.92 | 94.88 | 94.19 | 94.53 | 94.51 |
| ExtraTrees Classifier | Set2 | 94.09 | 94.94 | 94.90 | 94.14 | 94.51 | 94.49 |
| ExtraTrees Classifier | Set3 | 97.90 | 95.72 | 95.81 | 97.86 | 96.81 | 96.85 |
| Support Vector Machine | Set1 | 89.61 | 88.92 | 88.99 | 89.54 | 89.26 | 89.30 |
| Support Vector Machine | Set2 | 89.70 | 88.96 | 89.04 | 89.62 | 89.33 | 89.37 |
| Support Vector Machine | Set3 | 90.85 | 91.26 | 91.22 | 90.89 | 91.05 | 91.03 |
| Logistic Regression | Set1 | 91.56 | 92.28 | 92.23 | 91.62 | 91.92 | 91.89 |
| Logistic Regression | Set2 | 91.50 | 92.28 | 92.22 | 91.56 | 91.89 | 91.86 |
| Logistic Regression | Set3 | 92.25 | 92.90 | 92.86 | 92.30 | 92.57 | 92.55 |
| Random Forest | Set1 | 93.71 | 94.44 | 94.40 | 93.75 | 94.07 | 94.05 |
| Random Forest | Set2 | 93.87 | 95.00 | 94.94 | 93.94 | 94.44 | 94.41 |
| Random Forest | Set3 | 94.77 | 95.14 | 95.12 | 94.79 | 94.96 | 94.95 |
| Decision Trees | Set1 | 93.95 | 94.59 | 94.56 | 93.99 | 94.27 | 94.25 |
| Decision Trees | Set2 | 94.05 | 95.08 | 95.03 | 94.11 | 94.57 | 94.54 |
| Decision Trees | Set3 | 94.94 | 95.24 | 95.23 | 94.95 | 95.09 | 95.08 |

Every ADS in the GLIS_1.0 framework may use its own method to detect fake news. An open-source ADS may improve with feedback. Hybrid models and other techniques, such as deep neural networks, can also be used, depending on the data, the language of the news, and the news features related to both social context and news content.

CONCLUSION AND FUTURE WORK

In this study, we presented a novel framework that offers a practical architecture of an integrated system for identifying fake news. We have tried to illustrate how digital libraries can be a service authority to promote media literacy and fight against fake news. Because librarians are trained to critically analyze information sources, their contributions to our proposed model are critical.
Accordingly, we see this work as an encouraging step toward future collaborative studies between the LIS and CS (computer science) communities. We think that there is an immediate need for LIS professionals to participate in and contribute to automated solutions that can help detect inaccurate and unverified information. In the same manner, we believe the collaboration of LIS professionals, computer scientists, fact-checking organizations, and pioneering technology platforms is the key to providing qualified news within a real-time framework to promote information literacy. Moreover, we place the reader, in the role of feed reader while consuming news, at the core of the framework.

In terms of automated detection systems, we proposed a fake news detection model that integrates a dictionary-based approach with machine learning techniques and offers optimized feature sets applicable to agglutinative languages. We comparatively analyzed the findings with several classification models. We demonstrated that machine learning algorithms, when used together with dictionary-based findings, yield high scores for both precision and recall. Consequently, we believe that, once operational in the field, the proposed workflow can be extended in the future to support other news elements such as photographs and videos. With the help of Social Network Analysis (SNA), it may be possible to stop or slow down the spread of fake news as it emerges.

Over the course of our experiments, this work also highlighted several tasks as future research directions, such as:
• The studies can be deepened to mathematically categorize the fake news types, and the dissemination characteristics of each type can be analyzed.
• The workflow has the potential to provide an automated verification platform for all news content existing in digital libraries to promote media literacy.

ENDNOTES

1 M. Connor Sullivan, “Why Librarians Can’t Fight Fake News,” Journal of Librarianship and Information Science 51, no.
4 (December 2019): 1146–56, https://doi.org/10.1177/0961000618764258.
2 “Definition of 'News',” available at: https://www.collinsdictionary.com/dictionary/english/news
3 Dominic DiFranzo and Kristine Gloria-Garcia, “Filter Bubbles and Fake News,” XRDS: Crossroads, The ACM Magazine for Students 23, no. 3 (April 2017): 32–35, https://doi.org/10.1145/3055153.
4 Andrew Guess, Brendan Nyhan, and Jason Reifler, “Selective Exposure to Misinformation: Evidence from the Consumption of Fake News during the 2016 US Presidential Campaign,” European Research Council 9, no. 3 (2018): 4; Eni Mustafaraj and P. Takis Metaxas, “The Fake News Spreading Plague: Was It Preventable?” Proceedings of the 2017 ACM on Web Science Conference, (June 2017): 235–39, https://doi.org/10.1145/3091478.3091523.
5 Jana Laura Egelhofer and Sophie Lecheler, “Fake News as a Two-Dimensional Phenomenon: A Framework and Research Agenda,” Annals of the International Communication Association 43, no. 2 (2019): 97–116, https://doi.org/10.1080/23808985.2019.1602782.
6 Hannah Rashkin et al., “Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking,” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, (2017): 2931–37.
7 Soroush Vosoughi, Deb Roy, and Sinan Aral, “The Spread of True and False News Online,” Science 359, no. 6380 (2018): 1146–51, https://doi.org/10.1126/science.aap9559.
8 Xinyi Zhou and Reza Zafarani, “A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities,” ACM Computing Surveys (CSUR) 53, no. 5 (2020): 1–40, https://doi.org/10.1145/3395046.
9 S. F. Kattimani, Praveenkumar Kumbargoudar, and D. S.
Gobbur, “Training of the Library Professionals in Digital Era: Key Issues” (2006), https://ir.inflibnet.ac.in:8443/ir/handle/1944/1234.
10 Lynn Silipigni Connaway et al., “Digital Literacy in the Era of Fake News: Key Roles for Information Professionals,” Proceedings of the Association for Information Science and Technology 54, no. 1 (2017): 554–55, https://doi.org/10.1002/pra2.2017.14505401070.
11 Matthew C. Sullivan, “Libraries and Fake News: What’s the Problem? What’s the Plan?,” Communications in Information Literacy 13, no. 1 (2019): 91–113, https://doi.org/10.15760/comminfolit.2019.13.1.7.
12 Wayne Finley, Beth McGowan, and Joanna Kluever, “Fake News: An Opportunity for Real Librarianship,” ILA Reporter 35, no. 3 (2017): 8–12; American Library Association, “Resolution on Access to Accurate Information,” 2018; Nick Rochlin, “Fake News: Belief in Post-Truth,” Library Hi Tech 35, no. 3 (2017): 386–92, https://doi.org/10.1108/LHT-03-2017-0062; Linda Jacobson, “The Smell Test: In the Era of Fake News, Librarians Are Our Best Hope,” School Library Journal 63, no. 1 (2017): 24–29; Angeleen Neely-Sardon and Mia Tignor, “Focus on the Facts: A News and Information Literacy Instructional Program,” The Reference Librarian 59, no. 3 (2018): 108–21, https://doi.org/10.1080/02763877.2018.1468849; Claire Wardle and Hossein Derakhshan, “Information Disorder: Toward an Interdisciplinary Framework for Research and Policy Making,” Council of Europe report 27 (2017).
13 IFLA, “How to Spot Fake News,” 2017.
14 Jane Mandalios, “Radar: An Approach for Helping Students Evaluate Internet Sources,” Journal of Information Science 39, no. 4 (2013): 470–78, https://doi.org/10.1177/0165551513478889; Sarah Blakeslee, “The CRAAP test,” LOEX Quarterly 3, no. 3 (2004): 4.
15 Victoria L. Rubin and Niall Conroy, “Discerning Truth from Deception: Human Judgments and Automation Efforts,” First Monday 17, no. 5 (2012), https://doi.org/10.5210/fm.v17i3.3933; Verónica Pérez-Rosas et al., “Automatic Detection of Fake News,” arXiv preprint arXiv:1708.07104 (2017).
16 Justin P. Friesen, Troy H. Campbell, and Aaron C. Kay, “The Psychological Advantage of Unfalsifiability: The Appeal of Untestable Religious and Political Ideologies,” Journal of Personality and Social Psychology 108, no. 3 (2015): 515–29, https://doi.org/10.1037/pspp0000018.
17 Tanja Pavleska et al., “Performance Analysis of Fact-Checking Organizations and Initiatives in Europe: A Critical Overview of Online Platforms Fighting Fake News,” Social Media and Convergence 29 (2018).
18 Yasmine Lahlou, Sanaa El Fkihi, and Rdouan Faizi, “Automatic Detection of Fake News on Online Platforms: A Survey,” (paper, 2019 1st International Conference on Smart Systems and Data Science (ICSSD), Rabat, Morocco, 2019), https://doi.org/10.1109/ICSSD47982.2019.9002823; Christian Janze and Marten Risius, “Automatic Detection of Fake News on Social Media Platforms,” (paper, Pacific Asia Conference on Information Systems (PACIS), 2017); Torstein Granskogen, “Automatic Detection of Fake News in Social Media Using Contextual Information” (master’s thesis, Norwegian University of Science and Technology (NTNU), 2018).
19 Jacob L. Nelson and Harsh Taneja, “The Small, Disloyal Fake News Audience: The Role of Audience Availability in Fake News Consumption,” New Media & Society 20, no. 10 (2018): 3720–37, https://doi.org/10.1177/1461444818758715; Philip N. Howard et al., “Social Media, News and Political Information During the US Election: Was Polarizing Content Concentrated in Swing States?,” arXiv preprint arXiv:1802.03573 (2018); Alexandre Bovet and Hernán A. Makse, “Influence of Fake News in Twitter During the 2016 US Presidential Election,” Nature Communications 10, no. 7 (2019): 1–14, https://doi.org/10.1038/s41467-018-07761-2.
20 Lina Zhou et al., “Automating Linguistics-Based Cues for Detecting Deception in Text-Based Asynchronous Computer-Mediated Communications,” Group Decision and Negotiation 13, no. 1 (2004): 81–106, https://doi.org/10.1023/B:GRUP.0000011944.62889.6f; Myle Ott et al., “Finding Deceptive Opinion Spam by Any Stretch of the Imagination,” arXiv preprint arXiv:1107.4557 (2011); Rada Mihalcea and Carlo Strapparava, “The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language,” (paper, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, (2009): Association for Computational Linguistics, 309–12); Julia B. Hirschberg et al., “Distinguishing Deceptive from Non-Deceptive Speech,” (2005), https://doi.org/10.7916/D8697C06.
21 Victoria L. Rubin, Yimin Chen, and Nadia K. Conroy, “Deception Detection for News: Three Types of Fakes,” Proceedings of the Association for Information Science and Technology 52, no. 1 (2015): 1–4, https://doi.org/10.1002/pra2.2015.145052010083; David M. Markowitz and Jeffrey T. Hancock, “Linguistic Traces of a Scientific Fraud: The Case of Diederik Stapel,” PLoS ONE 9, no. 8 (2014): e105937, https://doi.org/10.1371/journal.pone.0105937; Jing Ma et al., “Detecting Rumors from Microblogs with Recurrent Neural Networks,” (paper, Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), (2016): 3818–24), https://ink.library.smu.edu.sg/sis_research/4630.
22 Kai Shu et al., “Fake News Detection on Social Media: A Data Mining Perspective,” ACM SIGKDD Explorations Newsletter 19, no. 1 (2017): 22–36, https://doi.org/10.1145/3137597.3137600.
23 Eugenio Tacchini et al., “Some Like It Hoax: Automated Fake News Detection in Social Networks,” arXiv preprint arXiv:1704.07506 (2017).
24 Julio C.S. Reis et al., “Supervised Learning for Fake News Detection,” IEEE Intelligent Systems 34, no. 2 (2019): 76–81, https://doi.org/10.1109/MIS.2019.2899143.
25 Victoria L. Rubin et al., “Fake News or Truth?
Using Satirical Cues to Detect Potentially Misleading News,” (paper, Proceedings of the Second Workshop on Computational Approaches to Deception Detection, (2016): 7–17); Francesco Barbieri, Francesco Ronzano, and Horacio Saggion, “Is This Tweet Satirical? A Computational Approach for Satire Detection in Spanish,” Procesamiento del Lenguaje Natural, no. 55 (2015): 135–42; Soujanya Poria et al., “A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks,” arXiv preprint arXiv:1610.08815 (2016).
26 Lei Guo and Chris Vargo, “‘Fake News’ and Emerging Online Media Ecosystem: An Integrated Intermedia Agenda-Setting Analysis of the 2016 US Presidential Election,” Communication Research 47, no. 2 (2020): 178–200, https://doi.org/10.1177/0093650218777177.
27 Natali Ruchansky, Sungyong Seo, and Yan Liu, “CSI: A Hybrid Deep Model for Fake News Detection,” Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, (November 2017): 797–806, https://doi.org/10.1145/3132847.3132877.
28 Yaqing Wang et al., “EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection,” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (2018): 849–57, https://doi.org/10.1145/3219819.3219903; James W. Pennebaker, Martha E. Francis, and Roger J. Booth, “Linguistic Inquiry and Word Count: LIWC 2001,” Mahwah: Lawrence Erlbaum Associates 71, no. 2001 (2001).
29 “Facebook, Twitter May Face More Scrutiny in 2019 to Check Fake News, Hate Speech,” accessed May 17, 2020, available: https://www.huffingtonpost.in/entry/facebook-twitter-may-face-more-scrutiny-in-2019-to-check-fake-news-hate-speech_in_5c29c589e4b05c88b701d72e.
30 Nitesh V. Chawla et al., “Smote: Synthetic Minority Over-Sampling Technique,” Journal of Artificial Intelligence Research 16, (2002): 321–57, https://doi.org/10.1613/jair.953.
31 Uğur Mertoğlu and Burkay Genç, “Lexicon Generation for Detecting Fake News,” arXiv preprint arXiv:2010.11089 (2020).
32 Burak Bezirci and Asım Egemen Yilmaz, “Metinlerin Okunabilirliğinin Ölçülmesi Üzerine Bir Yazilim Kütüphanesi Ve Türkçe Için Yeni Bir Okunabilirlik Ölçütü,” Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 12, no. 3 (2010): 49–62, https://dergipark.org.tr/en/pub/deumffmd/issue/40831/492667.
33 Robert Gunning, “The technique of clear writing,” Revised Edition, New York: McGraw Hill, 1968.
34 Ender Ateşman, “Türkçede Okunabilirliğin Ölçülmesi,” Dil Dergisi 58, no. 71–74 (1997).