key: cord-0466492-y32wawf2 authors: Chalkiadakis, Manolis; Kornilakis, Alexandros; Papadopoulos, Panagiotis; Markatos, Evangelos P.; Kourtellis, Nicolas title: The Rise and Fall of Fake News sites: A Traffic Analysis date: 2021-03-16 journal: nan DOI: nan sha: d742fe754b082cdf1e8ea3ad8f3c101d1da631e7 doc_id: 466492 cord_uid: y32wawf2

Over the past decade, we have witnessed the rise of misinformation on the Internet, with online users constantly falling victim to fake news. A multitude of past studies have analyzed fake news diffusion mechanics and detection and mitigation techniques. However, there are still open questions about their operational behavior, such as: How old are fake news websites? Do they typically stay online for long periods of time? Do such websites synchronize their up and down time with each other? Do they share similar content through time? Which third parties support their operations? How much user traffic do they attract, in comparison to mainstream or real news websites? In this paper, we perform a first-of-its-kind investigation to answer such questions regarding the online presence of fake news websites and characterize their behavior in comparison to real news websites. Based on our findings, we build a content-agnostic ML classifier for automatic detection of fake news websites that are not yet included in manually curated blacklists.

Several prior works have studied the spread of fake news on social networks. For example, Shao et al. [12] studied the spread of fake news by social bots. Also, Fourney et al. [13] conducted a traffic analysis of websites known for publishing fake news in the months preceding the 2016 US presidential election. Despite these important existing works, we still know little about the network characteristics of fake-news-distributing websites: What is the lifetime of these websites? What is the volume of traffic they receive, and how engaged is their audience? How do they connect with other marked-as-fake news sites?
In this study, we take the first step towards answering these questions. We collect a dataset of 283 websites tagged by known fact-checking lists as delivering fake news and perform traffic and network analysis of these websites. In particular, and contrary to related work (e.g., [13]), we study and compare the user engagement of fake and real news sites by analyzing traffic-related metrics. Additionally, we explore the lifetime of fake news sites and their typical uptime periods over a time range of more than 20 years, and we propose a methodology to detect websites that synchronize not only their uptime periods but also the content they serve during these periods. Based on our findings, we design a content-agnostic ML classifier for the automatic detection of fake news websites.

Contributions. In summary, this paper makes the following main contributions: (i) We conduct a first-of-its-kind temporal and traffic analysis of the network characteristics of fake news sites, aiming to shed light on the user engagement, lifetime and operation of these special-purpose websites. We compose an annotated dataset of 283 fake news sites indicating when such websites are alive or dormant, which we provide open-sourced 1 . (ii) We propose a methodology to study how websites may synchronize their alive periods, and even serve the exact same content for months at a time. We detect numerous clusters of websites synchronizing their uptime and content for long periods of time within the USA presidential election years 2016-2017. (iii) We study the third-party websites embedded in the different types of news sites (real and fake), and we find that during the aforementioned election years there is a significant increase in the use of analytics in the fake news sites, but not an increase in the use of ad-related third parties. Additionally, domains like doubleclick, googleadservices and scorecardresearch tend to have higher presence in real news sites than in fake ones.
On the contrary, facebook and quantserve have higher presence in fake news sites. (iv) We build a novel, content-agnostic ML classifier for automatic detection of fake news websites that are not yet included in manually curated blacklists. We tested various supervised and unsupervised ML methods for our classifier, which achieved an F1 score up to 0.942 and AUC of ROC up to 0.976.

To perform this study, we collect data from different sources, and in this section we describe our data in detail. First, we obtain lists with manually curated news sites, categorized as "fake" and "real". We then use the "fake" news sites list as input for crawling historical data from the Wayback Machine to annotate the state of each website. Finally, to explore the web traffic characteristics of the two categories of news sites and how their audiences behave, we collect data from SimilarWeb [14] and CheckPageRank [15].

For this study, we compose two manually curated lists of news sites: one with sites that are marked as "fake" and one with sites marked as "real". For the fake news sites, we utilize the domains repository provided by the opensources.co website [16]. The repository contains 834 biased news sites, of which 283 domains are manually checked and flagged as "fake". This is a well-accepted list, and has been used in studies related to the fake news ecosystem of the 2016 US elections [17, 18], as well as in fake news detection tools such as the B.S. Detector [19]. Additionally, we compose a second list of the same size for "real" news sites, by taking the top Alexa news sites (and ensuring that none of them are marked as fake in the repository).

Fig. 1. CDF of traffic sources for real and fake news sites. The median fake news site is being accessed mostly directly, while the median real news site is being accessed mostly via search engines.
To assess user engagement on the websites of our dataset, we collect web traffic data from popular data services like SimilarWeb and CheckPageRank (date of crawl: June '20). SimilarWeb provides Web and other traffic-related data per website, while CheckPageRank provides search-engine-related information. In summary, we analyze the volume of user visits, where the user visits come from, their duration and what subdomains they browse, the number of users who bounce off a domain, as well as the Web connectivity of websites with respect to the number and type of incoming or outgoing links from and to other sites.

Next, we focus on the 834 news sites flagged for spreading misinformation (i.e., marked as fake or biased), and to identify their different states across time, we collect historical data from the Wayback Machine [20]. Specifically, we first query the Wayback CDX server for each such news site in our list, and we get an index of the available timestamps for the particular domain. Then, we proceed with downloading the landing page of each timestamp and storing it locally for further processing. In total, we downloaded the content of these websites over the last 23 years, and we render each snapshot on screen for manual annotation. In each case, a dialogue box is prompted, and we categorize the timestamped website as one of the following: (i) alive: We consider a website "alive" when it is offering news content. (ii) zombie: We consider a website a "zombie" when it is offering content other than news (e.g., e-marketing or other news-irrelevant content). (iii) dead: If the timestamped content of a website is none of the above (e.g., no HTML content was returned, or HTTP errors were returned), the website is declared "dead". Given that a website may have been archived multiple times per month on Wayback, we aggregate the state of the website per month, making the assumption that if it had at least one "alive" timestamp in a given month, then it was "alive" for the entire month.
Similarly, it was in a "zombie" state if it had at least one such state in that month, or "dead" if none of the above applied. Finally, if there is a month for which Wayback does not have a state for a website, that timestamp is marked as "missing" for said website.

As a first step, we set out to analyze and compare various user traffic-related metrics for the fake and real news sites in our lists, in an attempt to understand how the audience behavior differs between these two categories of websites. In particular, we focus on (1) where users come from to land on such websites, (2) how many pages they visit within a website, (3) their visit's duration and what sub-domains they browse, (4) the number of users who bounce off a domain, and (5) the Web connectivity of websites with respect to the number and type of incoming or outgoing links to other sites.

Where do users come from? In Figure 1, we study the different sources that drive traffic to fake and real news sites. As we can see, the median fake site is accessed mostly directly (the user navigates directly to the website), or via links in social media and search engines. On the other hand, the median real news site is accessed mostly via search engines, with direct traffic following. The remaining sources drive similar traffic to fake and real news sites.

How many pages do users visit? In Table 1, we present the mean, standard deviation, median and 90th percentile of our user engagement metrics across all fake and real news sites. If we focus on the average number of pages per visit (in a time window of 6 months), we see that the median real news site tends to have a larger number of pages (i.e., 2.18 pages, on average) visited per user than the median fake news site (i.e., 1.72 pages, on average).
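The Wayback snapshot indexing and monthly state aggregation described earlier can be sketched as follows. The endpoint and query parameters follow the public Wayback Machine CDX API; the helper names and the choice of returned fields are our own illustrative assumptions, not the authors' code:

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint used to index available captures.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_url(domain, from_year=1996, to_year=2020):
    """Build a CDX query URL listing all captures of a domain."""
    params = {
        "url": domain,
        "output": "json",              # JSON: a header row + one row per capture
        "from": str(from_year),
        "to": str(to_year),
        "fl": "timestamp,statuscode",  # only the fields we need here
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

def parse_cdx_response(body):
    """Parse a JSON CDX response into (timestamp, statuscode) tuples."""
    rows = json.loads(body)
    return [tuple(row) for row in rows[1:]]  # rows[0] is the header row

def month_state(snapshot_labels):
    """Aggregate the manual per-snapshot labels of one month into a single
    monthly state, following the paper's rule: any 'alive' snapshot makes
    the month 'alive'; otherwise any 'zombie' makes it 'zombie'; otherwise
    'dead'. A month with no archived snapshots at all is 'missing'."""
    if not snapshot_labels:
        return "missing"
    if "alive" in snapshot_labels:
        return "alive"
    if "zombie" in snapshot_labels:
        return "zombie"
    return "dead"
```

The downloaded landing pages themselves were labelled manually in the paper; `month_state` only encodes the per-month aggregation rule stated in the text.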
Considering the 90th percentile, however, we see that there are fake news sites that have more pages visited on average (up to 3.54) than the corresponding real news sites (i.e., 3.49 pages visited, on average).

How long do users stay per visit? In Figure 2, we present the distribution of the average duration of user visits per news site. This duration is the time elapsed between the beginning of the first and the end of the last page visit (sessions are considered closed after 30 minutes of user inactivity [22]). As we can observe, in real news sites the visit duration follows a distribution close to a power law. Additionally, as also presented in Table 1, visits last longer (i.e., 198.8 seconds) in real news sites than in fake news sites (i.e., 163.4 seconds), with the visits of the 90th percentile lasting around 423.40 seconds in real news sites and 284.20 seconds in fake news sites, on average.

Website Bounce Rate. In Figure 3, we present the percentage of visitors who enter a site and then leave after visiting only the first page (also known as the bounce rate). This metric is calculated by dividing the single-page sessions by all sessions [23], and reflects how well a site retains its visitors. A very high bounce rate is generally a warning that people are not willing to stick around and explore the website, and instead choose to leave. As we can see in the figure, the median real news site has a significantly lower bounce rate (i.e., 66.55%) compared to the median fake news one (i.e., 72.49%), with the corresponding rates for the 90th percentile being 78.33% and 85.08%, respectively. We deduce that fake news sites probably provide content of lower quality, which is less engaging or interesting compared to that of real news sites.

Website Backlinks & Referrals. In Figure 4, we draw the distribution of the number of backlinks for fake and real news sites.
A backlink (also called a citation or inbound/incoming link) of a website is a link from some other website (i.e., the referrer) to that website (i.e., the referent). As we can see in the figure, backlinks of fake news sites follow a power-law distribution and are significantly fewer compared to the backlinks of real news sites. In particular, the median fake news site in our dataset scores 4.7K backlinks, whereas the median real news site scores 23.2M backlinks! The 90th percentile of fake news sites has 1.12M backlinks, whereas the corresponding real news site scores as high as 53M backlinks. This difference is no doubt caused by the lack of trust that a large portion of websites show towards fake-news-distributing websites. Similarly, in Figure 4, we see that fake news sites have about two orders of magnitude fewer referring domains than real news sites 2 .

Finally, we study a particular class of backlinks and referring domains: those from EDU or GOV domains, which could provide more authority and trust to a website when it is referenced or linked to. In Figure 5, we plot the portion of backlinks and referring domains related to EDU/GOV domains for fake and real news sites. We see that fake news sites have clearly lower portions of EDU backlinks and referrals, as well as GOV backlinks, than the real news sites. (2 Example: think of the referring domain as a phone number and backlinks as the number of times you have gotten a call from that particular number.)

In this section, we focus on the fake news ecosystem and perform a historical analysis by studying the following questions: (1) What is the lifetime of a fake news site? (2) Are there any such websites that synchronize their uptime and reproduce the same content through time? (3) Which third-party trackers were persistently embedded in such websites through time?

We use three terms to study the lifetime of a fake news site. First, we define as "lifespan" the upper limit for which a website may have existed on the Web. This is computed as the time difference between the first and last timestamp with "alive" state.
Furthermore, the terms "alive time" and "zombie time" define the number of timestamps (e.g., months) for which the website under study has been tagged as "alive" or "zombie", respectively. Consequently, the timestamps for which Wayback does not provide any data are considered "dead". During the lifetime of a website, various problems could arise, such as the owner not paying for the domain for some months, or the website being offline due to technical issues, etc. During such periods, and due to the crawling nature of the Wayback Machine, not all websites are archived at the same rate, and therefore we may not have snapshots of websites for all timestamps studied. In an attempt to infer what the state of a website was in such un-archived or "missing" timestamps, we use a 2-phase interpolation process.

Phase 1. In the first phase (P1), we identify for each website any gaps between two timestamps with the same label l (l ∈ {alive, zombie}). Thus, when the two timestamps are non-consecutive, and there is no other labelled timestamp between them, we proceed with propagating label l to all "missing" timestamps of that gap. For example, if a website was found alive in timestamps t_i and t_j, with m "missing" timestamps in-between them (i.e., t_j = t_i + m + 1), and no other state was captured between t_i and t_j, then we assume that the website was alive for all the timestamps between t_i and t_j. A similar process was applied if the website was a zombie. This interpolation process can be applied for increasingly larger gaps, i.e., for m = 1, 2, .... Therefore, we applied it for increasingly larger m, and we stopped at 3-year gaps (i.e., m = 36), since beyond that no more corrections were made to the dataset.

Phase 2. In the second phase (P2), we identify gaps in the output of P1 between two "alive" timestamps up to three years apart, allowing for up to 12 "non-alive" (i.e., "zombie" or "dead") timestamps between them. The "missing" timestamps between these two alive ones were also labelled as "alive".
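The two interpolation phases above can be sketched as a pair of small functions, assuming each website's history is a list of monthly states ("alive", "zombie", "dead", "missing"); the function names and exact loop structure are our own, not the authors' implementation:

```python
def phase1(states, max_gap=36):
    """P1: if two months share a label ('alive' or 'zombie') and only
    'missing' months lie between them (at most max_gap of them, i.e., up
    to 3 years), propagate that label across the gap."""
    out = list(states)
    for label in ("alive", "zombie"):
        i = 0
        while i < len(out):
            if out[i] != label:
                i += 1
                continue
            j = i + 1
            while j < len(out) and out[j] == "missing":
                j += 1                      # skip over the 'missing' gap
            if j < len(out) and out[j] == label and j - i - 1 <= max_gap:
                for k in range(i + 1, j):
                    out[k] = label          # fill the gap with the label
            i = j
    return out

def phase2(states, max_span=36, max_non_alive=12):
    """P2: between two 'alive' months at most max_span apart, if no more
    than max_non_alive intervening months are 'zombie'/'dead', relabel
    the intervening 'missing' months as 'alive'."""
    out = list(states)
    alive_idx = [i for i, s in enumerate(out) if s == "alive"]
    for a, b in zip(alive_idx, alive_idx[1:]):  # consecutive alive months
        non_alive = sum(1 for s in out[a + 1:b] if s in ("zombie", "dead"))
        if b - a <= max_span and non_alive <= max_non_alive:
            for k in range(a + 1, b):
                if out[k] == "missing":
                    out[k] = "alive"
    return out
```

For example, `phase1(["alive", "missing", "missing", "alive"])` fills the two missing months as "alive", while `phase2` additionally bridges short "zombie"/"dead" interruptions between alive periods.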
In Figure 6, we present a histogram of the states of the fake news sites in our dataset through the examined period of more than 20 years. On the y-axis, we show the number of domains in each state. In the histogram, we also show the results from interpolation phases 1 and 2, and how the plot smooths out, as expected. In this plot, at a given timestamp, we show the number of alive websites (including our interpolation phases) and the number of zombie websites, while the remainder of our list is considered dead. Interestingly, we can see the rise in fake news activity during the USA presidential election years 2016 and 2017, and the sudden fall afterwards. During that fall, a portion of these websites turned into a zombie state, while the great majority of them were shut down, especially in the last two years. This could have happened either because they fulfilled their purpose (i.e., causing polarization or political bias [24]) or after being included in fake news lists and tools for blocking sources of misinformation.

In Figure 7, we plot the CDF of the lifespan, alive time and zombie time of the fake news sites studied. The lifespan is the absolute maximum time that such websites were found to exist on the web. As we can see, their lifespan is about 4 years in median value, whereas alive and zombie times are lower, with median values of only 2 and 0.08 years, respectively!

As a next step, we set out to explore whether fake news sites appear to synchronize (i) the times they are available on the Web, and (ii) the content they serve. Uptime synchronization. To investigate the possible synchronization of their uptime, we assume that each website's sequence of alive or zombie states represents a binary time series, and we focus on the last 5 years of fake news activity (i.e., 2015-2020).
To retrieve a cleaner signal and differentiate time series that synchronize across websites, we perform an aggregation at the quarter level (i.e., 3-month granularity) instead of the monthly level. Thus, the final time series reflects quarters, each with 3 possible values.

Content synchronization. To investigate how fake news sites may synchronize their content (in the same time window: 2015-2020), we developed a pipeline to compare pairs of fake news sites with respect to the content they publish. First, using the Beautiful Soup [25] library, we extract the text from each website. (There are more modern techniques for article content extraction, such as Newspaper3k [26] or Readability.js [27], but they are not applicable in our case because they are optimized with heuristics that extract content from full articles rather than landing pages.) After performing text pre-processing on the extracted content (i.e., tokenization, removal of stop-words and lemmatization), we vectorize the documents using a typical TFIDF process [28]. Such vectors were created for each website and each timestamp for which it had content available. To compare these vectors, we use the cosine similarity metric [29], and we set a threshold of 0.5 to select pairs that appear to have high similarity. With this threshold, we ended up with 22 distinct pairs of websites. Upon manually inspecting the pairs at their matched timestamps, we make the following observations regarding fake news site content synchronization: (3) [usatoday.com.co, washingtonpost.com.co, drudgereport.com.co]: this group was synchronized on 07/2015. We observe that several of the pairs, and even portions of the groups above, overlap with the uptime synchronization study presented earlier.
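The content-comparison pipeline described above can be approximated with a small, self-contained sketch: a hand-rolled TF-IDF with a smoothed IDF and a cosine-similarity pairing step. The 0.5 threshold comes from the text; the paper's actual pre-processing also includes stop-word removal and lemmatization, which this sketch skips:

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (term frequency x smoothed inverse document
    frequency) for a list of documents, as sparse {term: weight} dicts."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * (1.0 + math.log(n / df[t])) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_pairs(site_texts, threshold=0.5):
    """Flag site pairs whose landing-page texts exceed the similarity threshold."""
    names = sorted(site_texts)
    vecs = dict(zip(names, tfidf_vectors([site_texts[n] for n in names])))
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(vecs[a], vecs[b]) >= threshold]
```

Two sites serving identical landing-page text score a similarity of 1.0 and are flagged as a pair; sites with no vocabulary overlap score 0.0.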
As a consequence, we believe our proposed methodology of studying the synchronization of content and uptime of websites can enable a fake news detection process to select websites that have suspiciously high similarity in their uptime and content for further examination, and even blocking, if needed.

Contrary to most popular news sites, which progressively move towards paywalling their high-quality content [30], fake news sites rely on ads to make a profit. Indeed, some of these websites were even created with the sole purpose of luring ad clicks by publishing clickbait content [31, 32]. To understand which third-party advertising entities provide tracking and other ad-related functionality to fake news sites, we study the third-party domains embedded in these sites through time. Specifically, we parse all collected HTML content per fake news site for each timestamp in our dataset, and by using the AdblockPlus blacklist [33], we identify 55 such third-party domains in the HTML body of at least one website for at least one timestamp. In Figure 8, we show the number of fake news sites that each of the top 10 third-party domains was embedded in, per timestamp (i.e., month). These top 10 third-party domains were selected based on their cumulative appearance across fake news sites and across all timestamps. Evidently, analytics and ad entities dominate the top 10 list, residing in 91.8% of all the fake news sites in our dataset. Interestingly, during the aforementioned peak of 2016-2017, we do see a significant increase in the use of analytics (i.e., google-analytics and googlesyndication) but not an increase in the use of ad-related third parties. This phenomenon shows that the majority of the marked-as-fake news sites created within this time window (the US pre-election period) had purposes other than monetizing their published content (e.g., to polarize, deliver misinformation, etc.). Next, in Figure 9, we use the data provided by whotracks.me
[34] to compare the most embedded third parties on the web with the ones found in the fake and real news sites of our dataset for the period 2016-2017. Interestingly, we see Google's doubleclick and googleadservices residing in less than 6% and 2% of the fake news sites, respectively, while they have presence in more than 27% and 8% of the real news sites of our dataset, respectively. Similarly, scorecardresearch (the third biggest web-beacons-based tracking service, owned by ComScore [35]) is present in less than 6% of the fake news sites but in more than 19% of the real news sites of our dataset. On the other hand, facebook and quantserve (the second biggest web-beacons-based tracking service, owned by Quantcast [36]) are present in more marked-as-fake news sites than real ones.

Table 3. Performance metrics from ML binary classification of websites as showing fake or real news.

Our earlier network traffic analysis of such websites revealed that some of these features may be good at distinguishing the nature of a news website, such as the number of visits, bounce rate, backlinks, etc. Thus, we were inspired to build an automated tool that performs the following tasks: (1) retrieves network data for each website from common sources such as SimilarWeb or CheckPageRank; (2) preprocesses the data and extracts related features on network traffic activity; (3) applies a machine learning (ML) model that classifies the given website as serving fake or real news. In order to build this ML classifier, we performed a fresh crawl (February 2021) of our previously mentioned lists of fake and real news websites, and we trained and evaluated our envisioned ML classifier. Based on the previously mentioned network traffic metrics (summarized in Table 2), we train different ML classifiers for automatic classification of news websites as "real" or "fake". As a basic preprocessing step, we removed features with very little to zero variability.
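The low-variance filtering mentioned as a preprocessing step can be sketched as a small helper; the threshold value and function name are our own illustrative choices:

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=1e-8):
    """Remove feature columns whose population variance across all samples
    is at or below the threshold, i.e., near-constant features that carry
    no signal for classification. Returns the filtered rows and the names
    of the surviving features."""
    columns = list(zip(*rows))           # transpose: one tuple per feature
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names
```

A constant column (e.g., a metric that every crawled site reports identically) has zero variance and is dropped before training.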
Our dataset for training and testing is fairly balanced, with 278 "real" news websites and 239 "fake" ones. The difference from the previous numbers lies in the fact that we took extra steps to remove websites that did not have scores across all metrics. We applied 10-fold cross-validation on the available data, and trained and tested various techniques. We measured standard ML performance metrics such as True Positive and False Positive Rates, Precision and Recall, F1 score and Area Under the Receiver Operating Characteristic curve (AUC). The scores were weighted to take into account individual performance metrics per class weight. Table 3 shows the results achieved with different basic classifiers when the whole dataset is used (upper part, classifiers #1-#4). We find that the typical Random Forest classifier performs very well across the board, with high True Positive and low False Positive rates, and higher Precision and Recall than the other ML methods. Given that the amount of traffic and other features used here are naturally correlated with each other (e.g., a highly ranked website should attract more visits, etc.), we also test the scenario where we split our dataset into two major groups of ranked websites (highly popular with rank <10K, and the rest), to check if the ML classification is still possible among similarly ranked websites. The results, shown in Table 3 (lower part, classifiers #5 and #6), demonstrate that it is possible to achieve very good performance even when controlling for the rank of websites. Furthermore, classifier row #7 checks the scenario where data from the lower-ranked websites (i.e., rank >10K) are used to train a classifier that is then tested on data from higher-ranked websites (i.e., rank <10K). Interestingly, the performance still remains high, showing that examples of fake news websites from lower ranks can be useful for distinguishing such websites even at higher ranks.
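As an illustration of the training setup (10-fold cross-validation of a Random Forest over traffic features), here is a minimal sketch on synthetic data. The feature distributions below are invented stand-ins, not the paper's dataset; real feature values would come from the SimilarWeb/CheckPageRank crawls described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200  # samples per class

# Synthetic stand-ins for the traffic features (log-scaled visits and
# backlinks, bounce rate); the gaps between the two classes loosely echo
# the differences reported in the traffic analysis.
real = np.column_stack([
    rng.normal(6, 1, n),    # log10(visits): real sites attract more traffic
    rng.normal(66, 5, n),   # bounce rate (%): lower for real sites
    rng.normal(7, 1, n),    # log10(backlinks): far higher for real sites
])
fake = np.column_stack([
    rng.normal(4, 1, n),
    rng.normal(72, 5, n),
    rng.normal(4, 1, n),
])
X = np.vstack([real, fake])
y = np.array([0] * n + [1] * n)  # 0 = real, 1 = fake

# 10-fold cross-validation of a Random Forest, scored with a weighted F1
# as in the paper's evaluation.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
print(round(float(scores.mean()), 3))
```

On real crawled features, the same cross-validation loop is what produces per-classifier rows like those in Table 3.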
Numerous studies have attempted to explore the characteristics of fake news and its spread. In [13], the authors conducted a traffic analysis of websites known for publishing fake news in the months preceding the 2016 US presidential election. Although that study also includes features such as traffic sources and temporal trends, our work diverges from it significantly: we analyze a much larger set of websites, we compare fake with real news websites, and we do not focus on social networks or the elections. In [37], the authors perform a 3-year-long study (2014-2016) on fake news websites. Their analysis includes time-series modelling for causality testing, making it one of the few studies that include time-series analysis; however, the method differs from ours in substantial ways. As important as understanding fake news dynamics is, we cannot avoid mentioning the significant efforts to identify and flag fake news stories. Based on a survey [8], fake news can be identified with content-based, feedback-based and intervention-based methods. In [38], the authors characterize detection as knowledge-based, stance-based, style-based and propagation-based. While there is a large number of publications in the fake news detection area, we will only give specific examples. Check-it [39] is an ensemble method that combines different signals to generate a flag for a news article or social media post; these signals include the domain name, linguistic features, reputation score and others. NELA [18] creates and combines credibility scores for the news article and the news source. Although it is possible to combine different methods to solve this problem, most papers focus on a single, more narrowed-down approach. For example, in a different publication, Shu [40] also argues for the role of social context in fake news detection. In [3], the authors provide a comprehensive overview of existing research on the false information ecosystem.
In [4], the authors show that fake news aims to affect the emotions of readers and ultimately to deceive them; they create a neural network capable of detecting false news from such effects. In [13], it is shown that aggregate voting patterns were strongly correlated with the average daily fraction of users visiting websites serving fake news. In [17], the authors dive into the dynamics and influence of fake news on Twitter during the 2016 US presidential election. With a dataset of 30 million tweets and the opensources.co list, they find that 25% of these tweets spread either fake or extremely biased news. Based on [6], online social network ecosystems seem to interplay, and information shared in one network affects the information flow in another. A lot of work has been done on Twitter disinformation. In [7], the authors present a thorough analysis of rumour tweets from the followers of two presidential candidates. It is also shown in [9] that many trolls have been sponsored externally for ultimate goals. [12] shows that bots play a key role in the Twitter misinformation ecosystem, targeting influential users for misinformation spreading. As shown by [10], the vast majority of fake news is spread by an extremely small number of sources, forming clusters. That study also sheds light on the target groups, which are conservative-leaning, older, and highly engaged with political news. Following a different trajectory, [5] points to images as crucial content for the news verification process.

In this paper, we performed a first-of-its-kind investigation of the fake news ecosystem. We studied and compared user engagement by analyzing traffic-related metrics for fake and real news websites.
Additionally, we explored the lifetime of fake news sites and their typical uptime periods over a time range of more than 20 years, and we proposed a methodology to study how websites may synchronize their uptime periods and the content they serve during these periods. Our findings can be summarized as follows:

• The median real news site tends to have a larger number of pages (i.e., 2.18 pages, on average) visited per user than the median fake news site (i.e., 1.72 pages, on average).

• On average, visits last longer (i.e., 198.8 seconds) in real news sites than in fake news websites (i.e., 163.4 seconds).

• The median fake news site is accessed mostly directly, while the corresponding real news site is accessed mostly via search engines.

• The median real news site has a significantly lower bounce rate compared to the median fake news one.

• The median fake news site in our dataset scores 4.7K backlinks, while the median real news website scores more than 23.2M backlinks! Fake news sites have lower portions of EDU backlinks and referrals, as well as GOV backlinks, than the real news sites.

• Fake news sites have about 2 orders of magnitude fewer referring domains than real news sites.

• The median alive and zombie times of fake news sites are as low as 2 and 0.08 years, respectively.

• There was a significant rise in fake news activity during the USA presidential election years 2016-2017, followed by a rapid fall.

• We detect numerous clusters of websites synchronizing their uptime and content for long periods of time.

• During this period, we see a significant increase in the use of analytics but not in the use of ad-related third parties by the fake news sites. This shows that the majority of the fake news sites created within this time window had purposes other than monetizing their published content (e.g., to polarize, deliver misinformation, etc.).
• Domains like doubleclick, googleadservices and scorecardresearch tend to have higher presence in real news sites than in marked-as-fake ones. On the contrary, facebook and quantserve have higher presence in fake news sites.

Our findings enabled us to characterize the traffic and behavior of fake news sites, and to build a novel, content-agnostic machine learning (ML) classifier for automatic detection of fake news websites that are not yet included in manually curated blacklists. We tested various supervised and unsupervised ML methods for our classifier, which achieved very good performance: an F1 score up to 0.942 and AUC of ROC up to 0.976. In the future, we plan to investigate how such an ML model can be updated in pseudo-real time, with data collection that happens at regular intervals or upon the discovery of a news website. This effort can be done in a crowd-sourced fashion across multiple online users, by deploying the ML pipeline envisioned earlier as a browser plugin. The plugin can then 1) perform the crawling of network metadata for the website visited by its user, and 2) apply the ML model we provide. The plugin can also report these metadata per website to a centralized location for updating our ML model. In case user privacy is at stake, privacy-preserving methodologies can be used that employ Federated Learning techniques for training the ML model, coupled with local differential privacy applied at the user devices.
The research leading to these results received funding from the EU H2020 Research and Innovation programme.

References

- When fake news stories make real news headlines
- Some real news about fake news
- The web of false information: Rumors, fake news, hoaxes, clickbait, and various other shenanigans
- An emotional analysis of false information in social media and news articles
- Novel visual and statistical image features for microblogs news verification
- The web centipede: Understanding how web communities influence each other through the lens of mainstream and alternative news sources. In Proceedings of the 2017 Internet Measurement Conference, IMC '17
- Combating fake news: A survey on identification and mitigation techniques
- Disinformation warfare: Understanding state-sponsored trolls on Twitter and their influence on the web
- Fake news on Twitter during the 2016 US presidential election
- The science of fake news
- The spread of fake news by social bots
- Geographic and temporal trends in fake news consumption during the 2016 US presidential election
- SimilarWeb: Website traffic statistics & analytics (www.similarweb.com)
- CheckPageRank: Check your PageRank free! (www.checkpagerank.net)
- Influence of fake news in Twitter during the 2016 US presidential election
- Assessing the news landscape: A multi-module toolkit for evaluating the credibility of news
- B.S. Detector
- Internet Archive
- Puppeteer: Headless Chrome Node API
- SimilarWeb: average visit duration
- SimilarWeb: bounce rate
- Stop tracking me bro! Differential tracking of user demographics on hyper-partisan websites
- Beautiful Soup documentation
- Newspaper3k: Article scraping & curation
- Mozilla Foundation. Readability.js: A standalone version of the readability library used for Firefox Reader View
- Using TF-IDF to determine word relevance in document queries
- Cosine similarity
- Keeping out the masses: Understanding the popularity and implications of Internet paywalls
- The secret players behind Macedonia's fake news sites
- BBC Future: I was a Macedonian fake news writer
- EasyList filter list project
- Rémi Berson and Josep M. Pujol. WhoTracks.me: Shedding light on the opaque world of online tracking
- ScorecardResearch (ComScore): What is it and what does it do
- Quantserve (Quantcast): What is it and what does it do
- The agenda-setting power of fake news: A big data analysis of the online media landscape from 2014 to 2016
- Fake news detection on social media: A data mining perspective
- Check-it: A plugin for detecting and reducing the spread of fake news and misinformation on the web
- Beyond news contents: The role of social context for fake news detection