Automated Fake News Detection in the Age of Digital Libraries


ARTICLE 

Uğur Mertoğlu and Burkay Genç 

 
INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2020  
https://doi.org/10.6017/ital.v39i4.12483 

 
Uğur Mertoğlu (umertoglu@hacettepe.edu.tr) is a PhD Candidate, Hacettepe University. Burkay 
Genç (bgenc@cs.hacettepe.edu.tr) is Assistant Professor, Hacettepe University. © 2020. 

ABSTRACT 

The transformation of printed media into the digital environment and the extensive use of social 
media have changed the concept of media literacy and people’s habits of news consumption. While 
online news is faster, easier, comparatively cheaper, and offers convenience in terms of people's 
access to information, it speeds up the dissemination of fake news. Due to the free production and 
consumption of large amounts of data, fact-checking systems powered by human efforts are not 
enough to question the credibility of the information provided, or to prevent its rapid dissemination 
like a virus. Libraries, long known as sources of trusted information, are facing challenges caused by 
misinformation as mentioned in studies about fake news and libraries.1 Considering that libraries are 
undergoing digitization processes all over the world and are providing digital media to their users, it 
is very likely that unverified digital content will be served by the world’s libraries. The solution is to
develop automated mechanisms that can check the credibility of digital content served in libraries 
without manual validation. For this purpose, we developed an automated fake news detection system 
based on Turkish digital news content. Our approach can be modified for any other language if there 
is labelled training material. This model can be integrated into libraries’ digital systems to label 
served news content as potentially fake whenever necessary, preventing uncontrolled falsehood 
dissemination via libraries. 

INTRODUCTION 

Collins Dictionary, which chose the term “fake news” as the “Word of the Year 2017,” describes
news as the actual and objective presentation of a current event, information, or situation that is
published in newspapers and broadcast on radio, television, or online.2 We are in an era where
everything goes online, and news is no exception. Many people today prefer to read their daily
news online because it is a cost-effective and convenient way to remain up to date. Although this
convenience has clear benefits for society, it can also have harmful side effects. Having access
to news from multiple sources, anytime, anywhere has become an irresistible part of our daily
routines. However, some of these sources may provide unverified content, which can easily be
delivered right to a reader’s mobile device. Most importantly, potential fake news content delivered
by these sources may mislead society and cause social disturbances, such as triggering violence
against ethnic minorities and refugees, causing unnecessary fear related to health issues, or even
resulting in crises, devastating riots, and strikes.

Unlike news, fake news lacks a settled definition; in the literature it is often defined according to
the data used or the limited perspective of the study. For example, DiFranzo and Gloria-Garcia
defined fake news as “false news stories that are packaged and published as if they were
genuine.”3 On the other hand, Guess et al. see the term as “a new form of political
misinformation” within the domain of politics, whereas Mustafaraj is more direct and defines it as
“lies presented as news.”4 A comprehensive list of 12 definitions can be found in Egelhofer and
Lecheler.5 In simplified terms, news that is created to deceive or mislead readers can be called
fake news. However, the concept of fake news is quite broad and needs to be specified
meticulously.

Fake news is created for many purposes and emerges in many different types. These types are
interwoven, and most of them are shown in figure 1. Although it is not easy to cluster these types
into separate groups, they can be categorized according to information quality or according to
intention, that is, whether or not the content is created to deceive deliberately, as Rashkin et al.
did.6 We propose the following classification where the two dimensions represent the potential
impact and the speed of propagation.

 
Figure 1. The volatile distribution of the fake news types (clustered in four regions: sr, sR, Sr, SR) with
respect to two dimensions: speed of propagation and potential impact.

The four regions visualized are clustered according to how dangerous they are. First of all, it
should be noted that ordering the types of fake news with stable precision is quite a challenging
task. The variations within the field depend heavily on dynamic factors such as timespan, actors,
and the echo-chamber effect. Hence, this figure should be considered a clustering effort. The types
may overlap within the regions. We will now give examples for two regions, “sr” and “SR.”

The SR grouping shows characteristics of high risk levels and fast dissemination. This includes
varieties of fake news such as propaganda, manipulation, misinformation, hate news,
provocative news, etc. We usually encounter this in the domain of politics. This kind of news may
cause critical and irreversible results in politics, the economy, etc., in a short period of time.
The rise of the term fake news itself can also be attributed to this kind of news. On the other hand,
the relatively less severe group (sr) of fake news, comprising satire, hoaxes, click-bait, etc., has
low risk levels and a slow speed of dissemination. A frequently used type in this group, click-bait,
is a sensational headline or link that urges the reader to click on a post, link, article, image, or
video. These kinds of news have a repetitive style, so readers tend to become aware of the
falsehood after encountering it a few times. As a result, the risk level is lower and dissemination
is slower.

Vosoughi et al. reported that “Falsehood diffuses significantly farther, faster, deeper, and more
broadly than the truth.”7 Indeed, because fake news circulates so dramatically, a single piece of
fake news may affect many more people than thousands of true news items do.

In their recent survey about fake news, Zhou and Zafarani highlighted that fake news is a major
concern for many different research disciplines, especially information technologies.8 Having long
been a trusted source of information, libraries will play an important role in fighting the fake
news problem. Kattimani et al. claim that the modern librarian must be equipped with the
necessary digital skills and tools to handle both printed collections and newly emerging
digital resources.9 Similarly, we foresee that digital libraries, which can be defined as collections of
digital content licensed and maintained by libraries, can be part of the solution as an authority
service with a collective effort. Connaway et al. point to the key role of information professionals
such as librarians, archivists, journalists, and information architects in helping society use the
products and services related to news in a convenient way.10 As libraries all over the world are
transitioning into digital content delivery services, they should, under the guidance of information
professionals, implement mechanisms to prevent fake and misleading content from being
disseminated through them.

To lay out proper future directions for the solution strategy, the interaction between the library
and information science (LIS) community and fake news must be clearly understood.
Sullivan states that the LIS community has been affected deeply in the aftermath of the 2016 US
presidential elections.11 Moreover, he quotes many other scholars, emphasizing libraries’ and
librarians’ role in the fight against fake news. For example, Finley et al. say that libraries are the
direct antithesis of fake news; the American Library Association (ALA) called fake news an
anathema to the ethics of librarianship in 2017; Rochlin emphasizes the role of librarians in this
fight and talks about the need to adopt fake news as a central concern in librarianship; and many
other researchers place librarians on the front lines of the fight against fake news.12

Today, the struggle to detect fake news and prevent its spread is so prominent that competitions
are being organized (e.g., http://www.fakenewschallenge.org/) and conferences are being held
(e.g., Bobcatsss 2020). The struggle against fake news can be classified under three main strategies:

• Reader awareness 
• Fact-checking organizations and websites 
• Automated detection systems 

The first strategy requires awareness of individuals against fake news and a collective conscience
within society against spreading fake news. To this end, visual and textual checklists,
frameworks, and guidance lists are being published by official organizations, such as the
infographic from IFLA13 (International Federation of Library Associations), which contains eight
steps to spot fake news. The RADAR framework and the Currency, Relevance, Authority, Accuracy,
and Purpose (CRAAP) test are some of the efforts trying to increase reader awareness of fake news.14
Unfortunately, because fake news is cleverly crafted to trigger people’s hunger to spread
sensational information, it is very difficult to achieve full control via this strategy. Some studies
explicitly showed that humans are prone to confusion when it comes to spotting lies or deciding
whether a news item is fake or not.15 Furthermore, people often overlook facts that conflict with
their current beliefs, especially in politics and controversial social issues.16

The second strategy focuses on third-party manually driven systems for checking and labelling 
content as fake or valid. Recently, we have seen many examples of offline and online organizations 
trying to work according to this strategy, such as a growing body of fact-checking organizations, 
start-ups (Storyzy, Factmata, etc.), and other projects with similar purposes.17 Unfortunately,
these manually powered systems cannot cope with the huge amounts of digital content being
steadily produced. They therefore focus only on a subset of digital content that they classify as
having higher priority. Even for this subset, their reaction is much slower than the spread of the
fake information. Consequently, automated and verified systems emerge as an
inevitable last option.

The third strategy offers automated fact-checking systems, which, once trained, can deliver content
labelling at unprecedented speeds. Today, many researchers are investigating automated solutions
and building models with different methodologies.18 Notwithstanding the latest studies, there is
still a lot to do in the realm of automated fake news detection. Automated fact-checking systems
will be detailed in the rest of the paper.

Thanks to the internet, the collections of digital content served by digital libraries can be accessed 
by a great number of users without distance and time limits. Therefore, we propose a solution to 
the problem by positioning digital libraries as automated fact-checking services, which label 
digital news content as fake or valid as soon as or before it is served through library systems. The
main reason we associate this approach with digital libraries is their access to a wide variety of
digital content, which can be used to train the proposed mathematical models, as well as their role
in society as publishers of trusted information. To this end, we develop a mathematical
model that is trained using existing news content served by digital libraries, and capable of 
labelling news content as fake or valid with unprecedented accuracy. The proposed solution uses 
machine learning techniques with an optimized set of extracted features and annotated labels of 
existing digital news content. Our study mainly contributes (a) a new set of features highly 
applicable for agglutinative languages, (b) the first hybrid model combining a lexicon/dictionary-
based approach with machine learning methods to detect fake news, and (c) a benchmark dataset 
prepared in Turkish for fake news detection.  

LITERATURE REVIEW 

Contemporary studies have indicated that social, economic, and political events in recent years, 
especially after the 2016 US presidential elections, are increasingly associated with the concept of 
fake news.19 Since then, fake news has begun to be used as a tool in many domains. On the other 
hand, researchers motivated by finding automated solutions started to make use of machine 
learning, deep learning, hybrid models, and other methodologies for their solutions.  

Although computational deception detection studies applying NLP (Natural Language Processing) 
operations are not new, textual deception in the context of text-based news is a new topic for the 
field of journalism.20 Accordingly, we believe that there is a hidden body language of news text, 
which has linguistic clues indicating whether the news is fake or not. Thus, lexical, syntactic, 
semantic, and rhetorical analysis when used with machine learning and deep learning techniques 
offers encouraging directions. 

Textual deception spans a wide spectrum, and studies have utilized many different
techniques. Some prominent studies treated the problem as a binary classification
problem utilizing linguistic clues.21 Although it is still early to say that the linguistic
characteristics of fake news are fully understood, research into fake news detection in
English-language texts is relatively advanced compared to that in other languages. In contrast,
agglutinative languages such as Turkish have been little researched when it comes to fake news
detection. Agglutinative languages enable the construction of words by adding various morphemes,
which means that words that are not practically in use may exist theoretically. For example,
“gerek-siz-leş-tir-ebil-ecek-leri-miz-den-dir” is a theoretically possible word that means “it is one
of the things that we will be able to make redundant,” but it is not a practical one.

Shu et al. classified the models for the detection of fake news in their study.22 According to this 
study, the automated approaches can focus on four types of attributes to detect fake news: 
knowledge based, style based, stance based, or propagation based. Among these, it can be said that 
the most useful approaches are the ones which focus on the textual news content. The textual
content can be studied by an automated process to extract features that can be very helpful in 
classifying content as fake or valid. 

Many scholars have tried to build models for automatic detection and prediction of fake news 
using machine learning algorithms, deep learning algorithms, and other techniques. These 
scholars approach the detection of fake news from many different perspectives and domains. For 
example, in one of the studies, scientific news and conspiracy news were used.23 In Shu et al.’s 
study based on credibility of news, the headlines were used to determine whether the article was 
clickbait or not. In another study, Reis et al. worked on Buzzfeed articles linked to the 2016 US 
election using machine learning techniques with a supervised learning approach.24  

Studies which try to detect satire and sarcasm can be regarded as subcategories of fake news
detection.25 Our observation, in line with the general view, is that satire is not always recognizable
and can be mistaken for real news.26 For this reason, we included satirical news in our
dataset. It should be noted that although satire or sarcasm can be classified by automated
detection systems, experts should still evaluate the results of the classification.

While some scholars used specific models focusing on unique characteristics, others such as
Ruchansky et al. proposed hybrid deep models for fake news detection that make use of multiple
kinds of features, such as temporal engagement between users and news articles over time, and
generated a labelling methodology based on those features.27

In related studies, researchers have applied many kinds of features, such as automatically
extracted features, hand-crafted features, social features, network information, visual features,
and others such as psycholinguistic features.28 In this work, we focused on news content features;
however, social context features can also be adapted using different tiers such as user activity
patterns, analysis of user interaction, profile metadata, social network/graph analysis, etc. to
extract features. We also have some of these features in our data but, lacking quantitative ground
truth for them, we avoided using them.

METHODOLOGY 

In this section, we present our motivation for this work, which we visualized in a framework
named Global Library and Information Science (GLIS_1.0). Subsequently, we discuss the
construction of the automated detection system as the key element of the GLIS_1.0 framework. We 
explain the framework, model, dataset, features, and the techniques used in this section. 

Framework 
The main structure of the proposed framework is shown in figure 2. This framework consists of 
highly cohesive but flexible layers.  

 
Figure 2. The GLIS_1.0 framework main structure. 

In the presentation layer, one can find the different sources of news that are publicly available.
These sources can be accessed directly through their websites or can be searched for via search
engines. The news is received by fact-checking organizations, which classify it manually; digital
libraries, which archive and serve it; and automated detection systems (ADS), which classify
it automatically. Digital libraries work together with fact-checking organizations and ADSs to
present clean and valid news to the public. Moreover, search engines use digital library systems
to label their results as fake or valid.


Fact-checking organizations should also benefit from the output of ADSs, as instead of manually 
checking heaps of news content, they could now focus on news labeled as potentially fake by an 
ADS. Through GLIS, ADSs make the life of fact-checking organizations and digital libraries much 
easier, all the while increasing the quality of news served to the public.  

Because figure 2 gives only a high-level overview of the structure, there may be many other
components, mechanisms, or layers, but the key elements of this structure are the automated
detection systems and the digital libraries. One might critically ask why such an authority
mechanism is needed. The answer is quite simple: technological progress alone is not
the solution. On the contrary, tech giants have already been subject to regulatory scrutiny for
how they handle personal information.29 Their policies on political ads have also been
questioned, and they are often blamed for failing to fight fake news. Indeed, global action is
needed more urgently than ever. Digital libraries are much more than a
technological advancement. Hence, they should be considered institutions or services that
can act as an authority for providing news to society as printed media disappears day
by day.

The threats caused by fake news are real and dangerous, but only recently have researchers from
different disciplines begun trying to find possible solutions, whether educational, technological,
regulatory, or political. Digital librarianship can be the intersection of all these solutions for
promoting information/media literacy. Hence, digital librarianship will make use of many
automated detection systems (ADS) to serve quality news. In the following section, we discuss
ADS in detail.

Model 
An overview of our automated detection system model, which is critical for the framework, is
shown in figure 3. Our fake news detection model consists of two phases: the first is
Language Model/Lexicon Generation and the second is Machine Learning Integration. In this
work, we used machine learning algorithms with supervised learning techniques, which learn from
labeled news data (training) and help us predict outcomes for unseen news data (test).

Dataset  

We collected our data from three sources:  

• The primary source is the GDELT (Global Database of Events, Language and Tone) Project 
(https://www.gdeltproject.org/), a massive global news media archive offering free access 
to news text metadata for researchers worldwide. It can almost be considered a digital 
library of news in its own right. However, GDELT does not provide the actual news text and 
only serves processed metadata along with the URL of the news item. GDELT normally does 
not check for the validity of any news items. However, we have only used news from 
approved news agencies and completely ignored news from local and lesser-known 
sources to maximize the validity of the news we have automatically obtained through 
GDELT. Moreover, we have post-processed the obtained texts by cross-validating with 
teyit.org data to clean any potential fake news obtained through GDELT links. 

 
Figure 3. Integrated fake news detection model with main phases combining language-model based 
approach with machine learning approach. 

• The second source is teyit.org, a fact-checking organization based in Turkey that complies
with the principles of the IFCN (International Fact-Checking Network) and aims to
prevent the spread of false information through online channels. Manually analyzing each
news item, they tag it as fake, true, or uncertain. We used their results to automatically
download and label each news text.

• Lastly, our team manually curated and verified a collection of fake and valid news obtained
from various online sources and named it MVN (Manually Verified News). This set includes
fake and valid news that we have manually accumulated over time during our studies and that
does not overlap with the news obtained from the GDELT and teyit.org sources.

We named our dataset TRFN. In Phase 2, the data is very similar to the data we used in Phase 1.
However, to see the effectiveness of the model, we made modifications to exclude old news from
before 2017 and added new items from 2019. The news in our dataset spans the time frame
2017–2019 and is uniformly distributed. Table 1 outlines the dataset statistics, namely where the
news text comes from, its class (fake or valid), the number of distinct texts, and the corresponding
data collection method. It can be seen from the table that most of our valid news comes from the
GDELT source, whereas teyit.org, a fact-checking organization, contributes only fake news.

Table 1. TRFN Dataset Summary after cleaning and duplicate removal.

Dataset      Class       Size of Processed Data   Collection Method
GDELT        NON-FAKE    82708                    Automated
Teyit.org    FAKE        1026                     Automated
MVN          NON-FAKE    1049                     Manual
MVN          FAKE        400                      Manual

 
All news items were processed through Zemberek (http://code.google.com/p/zemberek), the 
Turkish NLP engine for extracting different morphological properties of words within texts. After 
this processing phase, all obtained features were converted into tabular format and made 
available for future studies. This dataset is now available for scholarly studies upon request.  

In a study of this nature, the verifiability of the data used is important. As we have already 
mentioned, most of the data we used comes from verified sources such as mainstream news 
agencies accessed through GDELT and teyit.org archives which are verified by teyit.org staff. All  
data used in training the mathematical models which are to be explained in the rest of the paper 
are either directly or indirectly verified. 

Another important issue was generalizability of the dataset, which determines whether the results 
of the study are only applicable to specific domains or to all available domains. Although focusing 
on a specific news domain would clearly improve our accuracies, we preferred to work in the 
general domain and included news from all specific domains. The distribution of domains in our 
dataset is visualized in figure 4. This distribution closely matches the distribution one would 
experience reading daily news in Turkey. Hence, we have no domain-specific bias in our training
dataset. 

 
Figure 4. The distribution of domains in the dataset. (SciTechEnvWetNatLife = Science, 
Technology, Environment, Weather, Nature, Life.  EduCultureArtTourism = Education, Culture, Art, 
Tourism.) 

Moreover, during the exploratory data analysis, we obtained evidence showing strong syntactic
similarities with other NLP studies in Turkish. For example, the results of a study by the
Zemberek developers (http://zembereknlp.blogspot.com/2006/11/kelime-istatistikleri.html) to
find the most common words in Turkish, conducted on over five million words, are compatible
with the most common words in our corpus. This evidence supports the representativeness of our
dataset.

The last issue worth discussing is the imbalanced nature of the dataset. An imbalanced dataset
occurs in a binary classification study when the frequency of one of the classes dominates the
frequency of the other class in the dataset. In our dataset, the amount of valid news far exceeds
the amount of fake news. This generally results in difficulties in applying
conventional machine learning methods to the dataset. However, it is a frequently observed
phenomenon in real-world problems of this kind, where the classes are naturally unevenly
represented. To avoid potential problems due to the imbalanced nature of the dataset, we used
SMOTE (Synthetic Minority Over-sampling Technique), which is an over-sampling method.30 It
creates synthetic samples of the minority class that are relatively close in the feature space to the
existing observations of the minority class.
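The interpolation idea behind SMOTE can be illustrated with a minimal Python sketch. This is not the implementation used in the study, which is not specified in the text; the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and an established implementation (for example the one in the imbalanced-learn package) would be preferable in practice.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority: List[Vector], n_new: int, k: int = 5,
                 seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority samples by interpolation (illustrative only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # A random position on the line segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay close to the observed minority class in the feature space, which is the property described above.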

Features 
In this study, we discarded some features because of their relatively low impact on overall 
performance during the exploratory data analysis and subsequently in the training phase. The 
most effective features we decided on are shown in table 2. 

Table 2. Main Features

Features          Group                     Definition
nRootScore        Language Model Features   The news score calculated according to the Root Model
nRawScore         Language Model Features   The news score calculated according to the Raw Model
SpellErrorScore   Extracted Features        Spelling errors per sentence
ComplexityScore   Extracted Features        The score of the complexity/readability of the news
Source            Labels                    The URL or identifier of the news
MainCategory      Labels                    The category of the news
NewsSite          Labels                    The unique address of the news

 
The language model features nRootScore and nRawScore are features that we have borrowed 
from our earlier study on fake news detection.31 In that study, we focused on constructing a fake 
news dictionary/lexicon based on different morphological segments of the words used in news 
texts. These two scores were found to be the most successful ones in determining the 
fakeness/validity of a news text, one considering the raw form of the words, the other considering 
the root form. 

The extracted features are ComplexityScore and SpellErrorScore. ComplexityScore basically
represents the readability of the text. Studies for determining a good readability metric exist for
the Turkish language.32 We used a modified version of the Gunning-Fog metric, which is based on
word length and sentence length.33 Since Turkish is an agglutinative language, we used word
length instead of the syllable count. We also made some modifications to normalize the
scores. The average number of syllables per word in Turkish is 2.6, so we defined a word
as a long word if it has more than 9 letters.34 For a given news text T, the Complexity Score (CS)
can be computed by equation 1.

\[
T_{CS} = \frac{\dfrac{\mathit{WordCount}}{\mathit{SentenceCount}} + \dfrac{\mathit{LongWordCount} \times 100}{\mathit{WordCount}}}{10} \tag{1}
\]
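As an illustration, the complexity score can be sketched in Python. The sketch assumes that equation 1 divides the sum of the average sentence length and the long-word percentage by 10, and it uses a naive regular-expression tokenizer in place of Zemberek; the function name and the sentence-splitting rule are assumptions made for this example.

```python
import re

LONG_WORD_MIN_LETTERS = 10  # "more than 9 letters", as defined in the text


def complexity_score(text: str) -> float:
    """Modified Gunning-Fog style score, following one reading of equation 1."""
    # Naive sentence split on ., ! and ? (an assumption; the study uses Zemberek).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w is Unicode-aware for str patterns, so Turkish letters are matched.
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if len(w) >= LONG_WORD_MIN_LETTERS]
    average_sentence_length = len(words) / len(sentences)
    long_word_percentage = 100 * len(long_words) / len(words)
    return (average_sentence_length + long_word_percentage) / 10
```

For a text with 5 words, 2 sentences, and 1 long word, this gives (5/2 + 100 * 1/5) / 10 = 2.25.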

The second extracted feature is SpellErrorScore. We expect that there may be many more spelling
errors in fake news than in valid news. We calculated the spelling error counts using the Turkish
Spellchecker class of Zemberek. Because the text length of news items varies, we calculate the
ratio of errors to the number of sentences. For a given news text T, SE (Spell Error Score) is
calculated as shown in equation 2.

\[
T_{SE} = \frac{\mathit{SpellErrorCount}}{\mathit{SentenceCount}} \tag{2}
\]
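The ratio in equation 2 can be sketched as follows. The spell checker is passed in as a function so that any checker can be plugged in; the tiny word list used here is a hypothetical stand-in and does not reproduce the Zemberek spell checker used in the study.

```python
import re
from typing import Callable


def spell_error_score(text: str, is_correct: Callable[[str], bool]) -> float:
    """Number of misspelled words divided by the number of sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    words = re.findall(r"\w+", text.lower())
    errors = sum(1 for word in words if not is_correct(word))
    return errors / len(sentences)


# Hypothetical toy dictionary standing in for a real Turkish spell checker.
TOY_DICTIONARY = {"bu", "bir", "haber", "doğru", "değil"}
score = spell_error_score("Bu bir habr. Bu doğru değil.", TOY_DICTIONARY.__contains__)
```

In this toy example, only "habr" is flagged, so one error over two sentences gives a score of 0.5.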

Finally, we included the metadata categories Source, MainCategory, and NewsSite as additional 
identifiers for the learning process.  

Then, we combined features extracted from text representation techniques with the features
shown in table 2 and trained the model with different classifiers. For text representation, we
followed two directions for the experiments. First, we converted text into structured features with
the Bag of Words (BOW) approach, in which text data is represented as the multiset of its words.
Second, we experimented with N-grams, which represent sequences of N consecutive words; in
other words, the text is split into chunks of N words.

In the BOW model, documents in TRFN are represented as a collection of words, ignoring
grammar and even word order, but preserving multiplicity. In a classic BOW approach, each
document can be represented as a fixed-length vector with length equal to the vocabulary size.
This means each dimension of this vector corresponds to the number of occurrences of a word in
a news item. We customized the generic approach by reducing variable-length documents to
fixed-length vectors so that documents of varying lengths can be used with many machine
learning models.


 
Figure 5. An overview of the BOW (Bag of Words) approach.

Because we ignore word order, we reduced each document to a fixed-length histogram of word counts, as seen in figure 5. Assuming N is the number of news documents and W is the number of possible words in the corpus, it should be noted that the N×W count matrix is generally large but sparse: we have many news documents, but most words do not occur in any given document. This rarity of terms is a drawback of the approach. Therefore, we modified the model to compensate for the rarity problem by weighting the terms with the TF-IDF measure, which evaluates how important a word is to a document in a collection.
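
A minimal sketch of TF-IDF weighting is given below. It uses one common textbook variant (term frequency normalized by document length, and IDF as the natural logarithm of N divided by the document frequency); the exact variant used in the study is not stated, so this choice is an assumption. Library implementations may differ numerically; for example, scikit-learn's `TfidfVectorizer` applies a smoothed IDF by default.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            tok: (count / len(toks)) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical).
docs = ["sahte haber sahte", "gerçek haber"]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus the word "haber" occurs in every document and therefore receives a weight of zero, while "sahte", which is specific to the first document, receives a positive weight.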

The other technique we used is the N-gram model. N-gram is the generic term in computational linguistics for a string of consecutive words, and it is extensively used in text mining and NLP tasks. The prefix that replaces the n indicates the number of consecutive words in the string: a unigram is one word, a bigram is two words, and an n-gram is n words.
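
Word n-grams as defined above can be extracted with a short sliding window. The sketch below assumes whitespace tokenization and a toy sentence; the function name is illustrative and not taken from the study.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in the text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()
    # A text with fewer than n tokens simply yields no n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "sahte haber hızla yayılır"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
# A combined unigram+bigram representation is the concatenation of both lists.
combined = unigrams + bigrams
print(bigrams)
```

Unlike the BOW representation, bigrams and longer n-grams retain some local word order, at the cost of a much larger feature space.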

EXPERIMENTAL RESULTS AND DISCUSSION  

In this section, the experimental process and the results are presented. All experiments were performed using the Scikit-learn library. To evaluate the performance of the model and the proposed features, we employed the precision, recall, F1 score (the harmonic mean of precision and recall), and accuracy metrics. We ran many experiments using different combinations of features.
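
For reference, the four metrics can be computed from true and predicted labels as sketched below. The function name and the toy labels are assumptions for illustration; the `positive` argument selects the class for which precision and recall are reported, so calling the function once per class gives per-class values.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 (for one class) and overall accuracy."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels (hypothetical): one item of class 1 is misclassified as class 0.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
metrics_1 = classification_metrics(y_true, y_pred, positive=1)
metrics_0 = classification_metrics(y_true, y_pred, positive=0)
print(metrics_1)
print(metrics_0)
```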


Several classification models have been trained. These are as follows: K-Nearest Neighbor, 
Decision Trees, Gaussian Naive Bayes, Random Forest, Support Vector Machine, ExtraTrees 
Classifier, and Logistic Regression.  

To be effective, a classifier should be able to correctly classify previously unseen data. To this end, we tuned the parameter values of all the classification models used. The models were then trained and evaluated on the TRFN dataset using 10-fold cross-validation.
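
The idea behind 10-fold cross-validation can be sketched without any library. The code below is an illustrative sketch, not the procedure used in the experiments: the shuffling seed, the helper names, and the `fit_and_score` callback are assumptions, and whether the original folds were stratified is not stated. In Scikit-learn, `KFold` and `cross_val_score` provide this functionality.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx


def mean_cv_score(n_samples, fit_and_score, k=10):
    """Average the score returned by a (hypothetical) callback over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=25, k=10))
print(len(splits), [len(test) for _, test in splits])
```

Because every document is held out exactly once, the averaged score estimates performance on previously unseen data rather than on the training material.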

In table 3, we present the best scores of the proposed model. The results are encouraging and illustrate how useful automated detection systems can be as a key component of the integrated solution framework in figure 2. We compared the algorithms on three final feature sets, chosen because they gave relatively consistent results compared with the other feature set combinations. Set1 stands for bigram+FOpt (Optimized Features), Set2 stands for BOWModified+FOpt, and Set3 stands for unigram+bigram+FOpt. The results show a relative consistency in performance across the models. In almost all models, the combination of unigram+bigram and the optimized feature set (FOpt) gives better results than the other combinations. The ExtraTrees Classifier is chosen as the best model due to its higher performance. This model is also known as the Extremely Randomized Trees Classifier, an ensemble learning technique that aggregates the results of multiple decision trees collected in a “forest” to output its classification result. It is very similar to the Random Forest Classifier and differs only in the way the decision trees are constructed. Accordingly, the results of these two classifiers are close.
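
The difference in tree construction can be illustrated at the level of a single node split. The sketch below is a simplified illustration under stated assumptions, not the implementation used in the experiments: a Random-Forest-style node searches the candidate cut-points of a feature for the lowest Gini impurity, whereas an Extra-Trees-style node draws its cut-point at random between the minimum and maximum feature value. The function names and toy data are hypothetical, labels are assumed to be 0 or 1, and real implementations additionally consider a random subset of features at each node.

```python
import random


def gini(labels):
    """Gini impurity for binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)


def split_impurity(values, labels, threshold):
    """Size-weighted impurity of the two children produced by a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def best_threshold(values, labels):
    """Random-Forest-style: try every midpoint and keep the purest split."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees-style: draw the cut-point uniformly at random."""
    return rng.uniform(min(values), max(values))


# Toy feature column and labels (hypothetical).
values = [1, 2, 3, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
optimal = best_threshold(values, labels)
randomized = random_threshold(values, random.Random(0))
print(optimal, split_impurity(values, labels, optimal))
print(randomized, split_impurity(values, labels, randomized))
```

A single random cut-point is usually worse than the optimized one, but averaging many such randomized trees reduces variance, which is consistent with the two ensembles performing similarly.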

Table 3. Results. Evaluation results of all combinations of features and classification models. 

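| Model | Feature Set | Precision % (0) | Precision % (1) | Recall % (0) | Recall % (1) | Accuracy | F1 Score |
|---|---|---|---|---|---|---|---|
| Gaussian Naive Bayes | Set1 | 93.32 | 93.96 | 93.92 | 93.36 | 93.64 | 93.62 |
| Gaussian Naive Bayes | Set2 | 93.37 | 94.02 | 93.98 | 93.42 | 93.70 | 93.68 |
| Gaussian Naive Bayes | Set3 | 93.95 | 94.21 | 94.19 | 93.97 | 94.08 | 94.07 |
| K-Nearest Neighbour | Set1 | 93.70 | 93.50 | 93.52 | 93.69 | 93.60 | 93.61 |
| K-Nearest Neighbour | Set2 | 93.66 | 94.05 | 94.03 | 93.68 | 93.85 | 93.84 |
| K-Nearest Neighbour | Set3 | 94.42 | 94.21 | 94.22 | 94.41 | 94.31 | 94.32 |
| ExtraTrees Classifier | Set1 | 94.15 | 94.92 | 94.88 | 94.19 | 94.53 | 94.51 |
| ExtraTrees Classifier | Set2 | 94.09 | 94.94 | 94.90 | 94.14 | 94.51 | 94.49 |
| ExtraTrees Classifier | Set3 | 97.90 | 95.72 | 95.81 | 97.86 | 96.81 | 96.85 |
| Support Vector Machine | Set1 | 89.61 | 88.92 | 88.99 | 89.54 | 89.26 | 89.30 |
| Support Vector Machine | Set2 | 89.70 | 88.96 | 89.04 | 89.62 | 89.33 | 89.37 |
| Support Vector Machine | Set3 | 90.85 | 91.26 | 91.22 | 90.89 | 91.05 | 91.03 |
| Logistic Regression | Set1 | 91.56 | 92.28 | 92.23 | 91.62 | 91.92 | 91.89 |
| Logistic Regression | Set2 | 91.50 | 92.28 | 92.22 | 91.56 | 91.89 | 91.86 |
| Logistic Regression | Set3 | 92.25 | 92.90 | 92.86 | 92.30 | 92.57 | 92.55 |
| Random Forest | Set1 | 93.71 | 94.44 | 94.40 | 93.75 | 94.07 | 94.05 |
| Random Forest | Set2 | 93.87 | 95.00 | 94.94 | 93.94 | 94.44 | 94.41 |
| Random Forest | Set3 | 94.77 | 95.14 | 95.12 | 94.79 | 94.96 | 94.95 |
| Decision Trees | Set1 | 93.95 | 94.59 | 94.56 | 93.99 | 94.27 | 94.25 |
| Decision Trees | Set2 | 94.05 | 95.08 | 95.03 | 94.11 | 94.57 | 94.54 |
| Decision Trees | Set3 | 94.94 | 95.24 | 95.23 | 94.95 | 95.09 | 95.08 |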
 


Every ADS in the GLIS_1.0 framework may use its own method to detect fake news. The open-source ADS may improve with feedback. Hybrid models and other techniques, such as deep neural networks, can also be used depending on the data, the language of the news, and the news features related to both social context and news content.

CONCLUSION AND FUTURE WORK 

In this study, we presented a novel framework that offers a practical architecture for an integrated system for identifying fake news. We have tried to illustrate how digital libraries can act as a service authority to promote media literacy and fight fake news. Because librarians are trained to critically analyze information sources, their contributions to our proposed model are critical. Accordingly, we see this work as an encouraging step toward future collaborative studies between the LIS and CS (computer science) communities.

We think that there is an immediate need for LIS professionals to participate in and contribute to automated solutions that can help detect inaccurate and unverified information. In the same manner, we believe the collaboration of LIS professionals, computer scientists, fact-checking organizations, and pioneering technology platforms is the key to providing qualified news within a real-time framework to promote information literacy. Moreover, we put the reader at the core of the framework, in the feed reader position, while consuming news.

In terms of automated detection systems, we proposed a fake news detection model that integrates a dictionary-based approach with machine learning techniques and offers optimized feature sets applicable to agglutinative languages. We comparatively analyzed the findings with several classification models. We demonstrated that machine learning algorithms, when used together with dictionary-based findings, yield high scores for both precision and recall.

Consequently, we believe that, once operational in the field, the proposed workflow can be extended in the future to support other news elements such as photographs and videos. With the help of Social Network Analysis (SNA), it may be possible to stop or slow down the spread of fake news as it emerges. Over the course of our experiments, this work also highlighted several tasks as future research directions, such as:

• The studies can be deepened to mathematically categorize the fake news types and the 
dissemination characteristics of each type can be analyzed.  

• The workflow has the potential to provide an automated verification platform for all news 
content existing in digital libraries to promote media literacy.  

