Linnea Frangén
Computational Literacy
Final project

1. Research Question and Dataset

In this project, I experiment with different computational methods to analyse the linguistic features of fake news. My research question is how language complexity differs between fact-checked fake and real news. Previous research has connected fake news with shorter words, higher lexical redundancy (and thus lower lexical diversity), and shorter article length in comparison to real news (Horne and Adali 2017). Other features that relate to language complexity are the number of prepositions, exclusion words and conjunctions, sentence length, and the number of words with six or more letters (Tausczik and Pennebaker 2010). I have selected a number of these features, which I analyse using Voyant (Sinclair and Rockwell 2016), Excel and Python. Deceptive speech has been connected with reduced complexity, and the level of complexity can also indicate whether a text presents multiple perspectives (Tausczik and Pennebaker 2010, 34–35). Based on these findings, I hypothesize that fake news will be linguistically less complex than real news, although the difference may be less distinct since both the real and the fake news come from fact-checking sites.

The fact-checked news data come from the MisInfoText GitHub repository, which contains articles that have been manually verified and labelled by fact-checking websites (Asr and Taboada 2019b, link in the references). Asr and Taboada scraped the websites of two fact-checking sites, Buzzfeed and Snopes, and then manually cleaned and assessed a randomly selected portion of the articles (2019a, 7–8). I chose to use only the Snopes dataset in this study, since the topics of fake and real news in this dataset are more varied and its labels were the most suitable for comparing real and fake news (Asr and Taboada 2019a, 9). The Snopes dataset was also thoroughly cleaned, whereas the larger Buzzfeed dataset consistently lacked whitespace between words, which would have distorted the results.

2. Data Processing

The data was downloaded in CSV format, which I saved as an Excel file for pre-processing. This included filtering the articles based on their label: I deleted articles labelled as “mixture”, “mostly false” and “mostly true”, so that only articles labelled as “true” and “false” remained. I divided the news into separate files according to their label, one containing all the fake news and one all the real news. Additionally, I deleted all unnecessary information, such as URLs, titles and additional notes, so that the files contained only the full text of the original articles. I also manually deleted the text “[Your user agent does not support frames or is currently configured not to display frames. However, you may visit the page menu.]” from the real news dataset, as it is not part of the original body text of the article. I ran the data through OpenRefine, as I initially assumed that the dataset contained duplicates. However, it turned out that there were no exact duplicates in the dataset, but rather separate articles written on the same topic, which had initially looked like duplicates. Since the data had already been cleaned thoroughly, no changes were made to the data in OpenRefine. I then converted the files back into CSV so that they would work more smoothly with the Python code.
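Since the filtering was done manually in Excel, the following is only a minimal sketch of how the same label filtering could be reproduced in Python with the pandas library. The file name snopes.csv and the column names "label" and "text" are illustrative assumptions, not the actual names used in the MisInfoText dataset.

    import pandas as pd

    # Hypothetical file and column names; the real dataset may use different ones.
    data = pd.read_csv("snopes.csv")

    # Keep only articles labelled exactly "true" or "false", dropping
    # "mixture", "mostly false" and "mostly true".
    filtered = data[data["label"].isin(["true", "false"])]

    # Save the body text of each class to its own file.
    filtered.loc[filtered["label"] == "false", "text"].to_csv("fake_news.csv", index=False)
    filtered.loc[filtered["label"] == "true", "text"].to_csv("real_news.csv", index=False)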
Next, I will describe the workflow for each of the different methods used in this study.

Voyant

- I loaded the data into Voyant, which automatically computed the following information:
- The fake news dataset contains 33,712 total words and 6,526 unique word forms.
- The real news dataset contains 54,997 total words and 8,987 unique word forms.
- Vocabulary density (type-token ratio): fake news 19%, real news 16%.
- Average words per sentence: fake news 21.8, real news 19.5.
- I also searched for the occurrences of a selection of conjunctions using the Voyant interface; these were chosen based on a reading of the Longman Student Grammar of Spoken and Written English (Biber, Conrad and Leech 2002, 30–31) and were processed further in Excel.

Excel

- I entered the raw counts of the 14 conjunctions (and, but, or, nor, if, as, after, because, since, although, while, than, that and whether) into Excel, separately for the fake and the real news dataset.
- I calculated the total number of conjunctions.
- I entered the total number of words, which I had previously obtained from Voyant, into the Excel file.
- I calculated the normed rates using the formula: normed rate = (raw count / total word count) × fixed amount of text. In this study, the fixed amount of text was 100 words.
- I rounded the normed rates of the conjunctions to two decimal places.
- The normed rate of the 14 conjunctions in total is 5.76 per 100 words for real news and 5.74 for fake news. The exact rates for all the conjunctions are visible in the table in the next section.

Python

I used Python to compute first the average word lengths in the datasets and then the standard deviation of the word lengths, which enabled me to better evaluate the statistical significance of the difference. A piece of code from Stack Overflow was adjusted to suit the needs of this study; below is a description of what the code does, and the exact code can be found in the GitHub repository for this project.

- The code filters out punctuation and numbers, because otherwise each instance would have been counted as an individual word. The exact characters that were excluded are ".,;!?-0123456789()[]{}".
- I initially intended to simply copy-paste the news data into the code, but because the data had empty lines between each sentence or paragraph, the code was not able to read it beyond the first empty line. Removing the whitespace would have been very time-consuming, so I created a step that fetches the data from the directory where it is stored.
- The code prints out the average word length with all the decimals. Rounded to two decimals, the average lengths are 5.08 letters for fake news and 5.04 for real news.
- To test whether the difference is statistically significant, I included a step that computes the variance of the word lengths.
- Variance for real news: 8.44
- Variance for fake news: 8.56
- The code also imports the "math" library in order to calculate the square root of a number, which is needed for the standard deviation. The code calculates and prints the square root of the variance:
- Standard deviation for real news: 2.91
- Standard deviation for fake news: 2.93
- The same code is run twice, first on the real news data and then on the fake news data.
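As an illustration of what the steps above amount to, a simplified sketch of the computation could look like the following. The file name real_news.txt is an assumption, and the actual code in the GitHub repository differs in its details.

    import math

    # Characters that are stripped so that punctuation and digits
    # are not counted as words or letters.
    EXCLUDED = ".,;!?-0123456789()[]{}"

    # Read the text from a file so that empty lines between paragraphs
    # do not cut the input short.
    with open("real_news.txt", encoding="utf-8") as f:
        text = f.read()

    # Remove the excluded characters and split the text into words.
    cleaned = "".join(ch for ch in text if ch not in EXCLUDED)
    words = cleaned.split()
    lengths = [len(word) for word in words]

    # Average word length.
    average = sum(lengths) / len(lengths)

    # Variance of the word lengths and its square root, i.e. the standard deviation.
    variance = sum((length - average) ** 2 for length in lengths) / len(lengths)
    std_dev = math.sqrt(variance)

    print("Average word length:", average)
    print("Variance:", variance)
    print("Standard deviation:", std_dev)

The same sketch would then be run a second time on the fake news file.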
3. Analysis and Discussion

The analysis shows that the differences in language complexity between the real news dataset and the fake news dataset are minimal. The average word lengths of the two datasets differ by only 0.04 letters, and considering that the standard deviations are 2.91 and 2.93, the difference is most likely not statistically significant. Similarly, the sentence length and conjunction use differ only marginally from each other, as shown in the table “Normed Rates of Conjunctions”. The type-token ratio, which describes vocabulary complexity, would even indicate that the fake news is slightly more complex than the real news, as it contains less repetition. Thus, the hypothesis that real news would be more complex than fake news was incorrect. However, this contradictory result may be caused by defects in the dataset as well as a methodology that was not refined enough.

One aspect that should be improved in this study is the size of the dataset. It was insufficient especially for some of the less frequent conjunctions, such as nor: the normed difference is 0.01, and the actual raw occurrences were 2 in fake news and 0 in real news, which are too low to provide reliable information. Additionally, not all the articles in the MisInfoText dataset fit the criteria generally given to fake news, which define fake news as non-factual content that tries to appear as legitimate news in order to gain the credibility generally given to news media (Tandoc, Lim and Ling 2018, 143; Gelfert 2018, 108). There is much more variation in the types of articles included in the data: the fake news dataset included, for example, an article from RationalWiki, which represents a completely different genre. It is not news and it does not attempt to appear as such, which might have distorted the results.

However, what caused the lack of difference was most likely the fact that both the fake and the real news stories were collected from fact-checking sites, instead of comparing, for example, the most trusted news outlets with counter media. The choice to compare fact-checked real and fake news was made deliberately to avoid possible bias caused by the decision making of fact-checking sites. They have been criticised for biased decision making, since the articles are picked by individual people whose beliefs may influence the process (Asr and Taboada 2019a, 4), and it has been shown that fact-checking sites are more likely to pick up negative ads than neutral or positive ones (Amazeen 2016, 442). Furthermore, fact-checking websites aim to find and reveal misinformation, not to confirm accurate news, which affects the articles they choose to check (Amazeen 2016, 451). Therefore, the articles included in the current dataset are likely to have all been written in a similar style. This suggests that differences in complexity are related not so much to the veracity of the articles as to their news genre.

4. References

Amazeen, Michelle A. 2016. “Checking the Fact-Checkers in 2008: Predicting Political Ad Scrutiny and Assessing Consistency.” Journal of Political Marketing 15, no. 4: 433–464.

Asr, Fatemeh Torabi, and Maite Taboada. 2019a. “Big Data and Quality Data for Fake News and Misinformation Detection.” Big Data & Society 6, no. 1: 1–14.

———. 2019b. MisInfoText. A collection of news articles, with false and true labels. Dataset. Accessed 3 November 2020. https://github.com/sfu-discourse-lab/Misinformation_detection.

Biber, Douglas, Susan Conrad, and Geoffrey Leech. 2002. Longman Student Grammar of Spoken and Written English. Harlow: Pearson Education Limited.

Gelfert, Axel. 2018. “Fake News: A Definition.” Informal Logic 38, no. 1: 84–117.
Horne, Benjamin D., and Sibel Adali. 2017. “This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News.” Accessed 15 October 2020. https://arxiv.org/abs/1703.09398.

Sinclair, Stéfan, and Geoffrey Rockwell. 2016. Voyant Tools. Accessed 3 December 2020. http://voyant-tools.org/.

Tandoc, Edson C. Jr., Zheng Wei Lim, and Richard Ling. 2018. “Defining ‘Fake News’: A Typology of Scholarly Definitions.” Digital Journalism 6, no. 2: 137–153.

Tausczik, Yla R., and James W. Pennebaker. 2010. “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods.” Journal of Language and Social Psychology 29, no. 1: 24–54.