key: cord-0145547-434munpz authors: Caled, Danielle; Carvalho, Paula; Silva, M'ario J. title: MINT -- Mainstream and Independent News Text Corpus date: 2021-08-13 journal: nan DOI: nan sha: d3b7a5dcb95a426daee0370f761ed0f70041f6cf doc_id: 145547 cord_uid: 434munpz Most corpora approach misinformation as a binary problem, classifying texts as real or fake. However, they fail to consider the diversity of existing textual genres and types, which present different properties usually associated with credibility. To address this problem, we created MINT, a comprehensive corpus of news articles collected from mainstream and independent Portuguese media sources, over a full year period. MINT includes five categories of content: hard news, opinion articles, soft news, satirical news, and conspiracy theories. This paper presents a set of linguistic metrics for characterization of the articles in each category, based on the analysis of an annotation initiative performed by online readers. The results show that (i) conspiracy theories and opinion articles present similar levels of subjectivity, and make use of fallacious arguments; (ii) irony and sarcasm are not only prevalent in satirical news, but also in conspiracy and opinion news articles; and (iii) hard news differ from soft news by resorting to more sources of information, and presenting a higher degree of objectivity. The detection of misinformation has been increasingly discussed by the Natural Language Processing (NLP) community, in particularly those concerned with the development of linguistic resources, and methods for identifying false or misleading information, generically known as fake news. Fake news detection focuses predominantly on distinguishing real from fake content [1] , approaching this issue as a dichotomous problem. As most of the resources conceived within the scope of misinformation studies comprise only these two categories, they fail to consider the diversity of existing textual genres and types, including soft news and fictional news stories, created for entertaining purposes. In turn, most misinformation corpora consider only the most extreme cases in the credibility spectrum (e.g., hard news collected from mainstream newspapers vs. news previously labeled as fake by fact-checking agencies), making the automatic classification task deceptively simple and misaligned with reality. However, credibility should be regarded as a complex construct, presenting several dimensions in a continuum. This paper presents the MINT (Mainstream and Independent News Text) corpus, which was specifically developed to address the gaps on misinformation corpora, especially for Portuguese. MINT is composed of more than 20 thousand articles collected from 33 Portuguese mainstream and independent media over a whole year, covering different styles, subjects, and serving different communication purposes. The collected articles are labeled under five categories, namely hard news, opinion, soft news, satire, and conspiracy. Although far from being exhaustive, this list includes categories presenting different properties that must be taken into account in misinformation studies. For example, hard news stories are supposed to involve a neutral and objective reporting, while opinions are characterized by their inherent subjectivity, which is a relevant feature for distinguishing reliable from unreliable news [2, 3] . On the other hand, soft news usually approach light topics, including sensational, disruptive and entertainment-oriented news, which generally resort to clickbait strategies to attract the readers' arXiv:2108.06249v2 [cs. LG] 18 Oct 2021 attention [4] . Some of these characteristics may also be found in non-credible news articles, namely in satirical news, created for humorous purposes, and conspiracy theories, fabricated to deceive the reader [5] . We discuss the main linguistic properties of each collection, and present the preliminary results of an evaluation on crowdsourced annotations by online news readers of Portuguese media. Those annotations, addressing aspects previously associated with news credibility [6] , can help to understand the main differences among the articles published by the media sources, allowing the development of computational models to correctly approach misinformation detection. Binary classification of misinformation has several conceptual problems, including the establishment of a proper definition of real (or credible) news. For example, some authors associate credibility with a greater degree of factuality and a lesser degree of sentiment in the text [7, 8, 9] ; others highlight a variety of aspects, including adherence to journalistic practices/editorial norms, impartial reporting, and the inclusion of statistical data, credible sources, quotes and attributions in text [10, 11] . However, the non-adherence to these standards should not be used by itself as a proxy to infer the text credibility. Accordingly, opinion pieces are expected to be subjective, and to present an emotionally charged tone [12, 11, 6] . A more paradigmatic example involves satire, which, despite being fictional, mimics the tone, style and appearance of factual news, leading some authors to label this type of content as fake [13] . Other efforts, in turn, argue that satire does not intend to deceive readers and, therefore, should be considered as an independent class when classifying misinformation [14, 15] . To address the unrepresented news categories in misinformation classification, Molina et al. organized a taxonomy differentiating real news from a variety of controversial news, namely hoaxes, polarized content, satire, misreporting, opinion, persuasive information, and citizen journalism [11] . Similarly, MINT's news articles are labelled to different categories, among which hard news, opinion, satire, and conspiracy, corresponding to Molina et al.'s real news, opinion, satire, and hoaxes, respectively. In our study, we have extended the MINT collection with a new nuance of real news, the soft news category, representing "light" or "spicy" stories with a "low level of substantive informational value" [16] . Despite the variety of corpora supporting misinformation classification, few linguistic resources are available for Portuguese (e.g., [17, 18] ). The Fake.Br corpus includes news articles in just two categories (true or false) [17] . This corpus comprises articles from four fake news sources, and three major news agencies. However, the Fake.Br corpus is strongly biased regarding text length, typos and sentiment [19] , which makes the analysis simplistic and the classification less challenging. Moura et al. developed another news articles collection focused on Portuguese [18] . This corpus is also a binary resource, with all the genuine articles in this collection scraped from a single source. Therefore, one cannot rely on this collection for misinformation classification due to the lack of representativeness of credible sources. The MINT corpus differs from these two resources by offering greater diversity of news categories and information sources, containing texts from 33 different mainstream and independent media channels. The corpora most similar to MINT are the ones comprising news articles from different topics. The NELA-GT series include articles harvested from different mainstream, hyperpartisan, and conspiracy sources [20] . FacebookHoax comprises information extracted from Facebook pages, scientific news and conspiracy websites [21] . Like MINT, both NELA-GT and FacebookHoax assign the credibility label based on the source-level reliability. Hardalov et al. also assembled a collection consisting of credible, fictitious, and funny news, resorting to similar strategies for building a corpus in a under-resourced language [22] . This work is related to ours since it contains news collected from mainstream, satirical and fictional sources, covering different topics, such as politics and lifestyle. The MINT corpus consists of two different, but complementary, resources, namely, MINT-articles, and MINTannotations. The main resource, MINT-articles corresponds to the entire collection of news articles extracted from mainstream and independent channels. We also provide MINT-annotations as a supplementary resource, containing the manual annotations for a subset of the MINT-articles collection (Subsection 3.2), obtained through a crowdsourcing process. With the insights gained from the annotations, we can therefore understand the specific and shared characteristics of the categories included in the corpus, available for the research community at https://github.com/**hidden-for-blind-review**. The MINT corpus includes 20,278 articles, published from June 1st, 2020 to March 31st, 2021, representing a full year sample of online content published by the Portuguese mainstream and independent media. All articles in the MINT Máscaras faciais representam riscos graves para a saúde corpus were semi-automatically harvested and assigned to a category through the heuristic rules defined bellow. Table 1 presents examples of article headlines from each MINT's category. Hard News (6000 documents): News collected from the politics, society, business, technology, culture, and sports sections from nine mainstream news websites. Since this content is published by reputable news sources, verified by the Portuguese regulatory authority for social communication (ERC 1 ), they were blindly labeled as hard news. Opinion (6000 documents): Articles collected from the opinion section of 10 mainstream and independent newspapers and magazines. In general, the collected articles discuss controversial and contemporary topics related to events with great notoriety in the mainstream media. Soft News (6000 documents): This category comprises soft news extracted from celebrity, fashion, beauty, family, and lifestyle sections of six magazines, tabloids and newspaper supplements. Satire (1029 documents): Articles extracted from two well-known websites, self-declared as fictional, humorous, and/or satirical in their editorial guidelines. They parody the tone and format of traditional news stories, by exploring the use of rhetorical devices, such as irony and sarcasm. For identifying conspiracy stories, we explored websites that had previously published at least five articles supporting conspiracy theories, particularly about the origin, scale, prevention, diagnosis, and treatment of the COVID-19 pandemic. We resorted to the COVID-19 theme as it is recurring issue, addressed both by the mainstream and independent media during the MINT collection period. Thus, we investigated the five conspiracy topics regarding the COVID-19 pandemic previously described by Shahsavari et al. [23] , and manually inspected a set of candidate websites; only six websites met the selection criteria. The topics covered by these sources are diverse, ranging from politics, economics, conflicts, health issues, to technology. In order to understand the readers' ability to distinguish the news articles belonging to different categories, we conducted a human assessment study focused on information content indicators [3] . These indicators are commonly used as proxies for assessing news articles credibility, addressing semantic and discourse dimensions, such as the headline accuracy, the presence of reasoning errors, and sentiment intensity [24] . The survey was disseminated to the Portuguese community through different news outlets, inviting online readers to assess a news article randomly selected from MINT-articles. Together, these annotations compose the MINT-annotations, which includes 750 judgments on 335 different news articles distributed, by category, as follows: 71 hard news, 63 opinion pieces, 66 soft news, 69 satires, and 65 conspiracy articles. Each assessment was carried out by a different reader, following annotation guidelines [24] . The annotators answered two types of questions: i) Dichotomous questions. Yes/No questions aimed at assessing the presence or absence in text of specific properties, namely the ones related to information sources, subjectivity, irony or sarcasm, and the particular use of strategies addressing personal attack or appeal to fear. ii) Five-point Likert scale questions. Questions assessing the overall article credibility, and other dimensions on the news headline and body (e.g., the degree of headlines' accuracy, clickbaitiness, sentiment intensity, reliability of the sources of information mentioned in text, linguistic accuracy, and sarcasm or irony). The information provided by online readers can be used to estimate the main similarities and divergences among the various categories of articles included in the MINT corpus, and understand which features are perceived as the most relevant by readers for assessing news credibility [6] . In this section, we present statistics derived from a set of metrics often used in computational linguistics to characterize the MINT news texts (Subsection 4.1). We also go through some insights obtained from the crowdsourced annotations in Subsection 4.2. Table 2 presents quantitative metrics related to style and text complexity, which estimate the average number of sentences (#s) and words (#w) comprised in the headline and body text. We have also calculated the average number of words per sentence (w/s), which may help distinguishing elementary from complex sentences. We notice that headlines from opinion articles tend to be shorter, while satire headlines are longer, when compared to the headlines belonging to the remaining categories. Despite the wide diversity on the body length of articles in each category, the statistics obtained show that satirical news stories are usually short, comprising a restricted number of simple sentences. This may indicate that the story introduced in the headline is not deeply developed in the body text. In contrast, the most extensive articles are from the conspiracy category, on average, up to three times longer than the articles reporting hard news. This apparently contradicts the previous studies focused on Portuguese stating that false articles are usually shorter than credible articles [17, 18] . When comparing hard with soft news, we can observe that the former tend to be longer, and use more complex linguistic structures. Table 3 provides a set of metrics that have been explored in the research on news credibility [25, 26] . To generate those statistics, texts were tagged with PoS 2 , and the sentiment information was estimated using SentiLex [27] . With regard to sentiment, we only present the information on the headline, since this information did not seem relevant in the characterization of the news body. The results indicate that adjectives are less used in sentences from shorter news texts, namely those belonging to soft news and satire categories, while adverbs are chiefly frequent in satirical news. Globally, these modifiers are mostly used in texts where a higher degree of subjectivity is expected, namely in opinion articles and conspiracy theories. Conversely, the hard news, which should be objective and neutral by principle, use comparatively less personal pronouns (only found in quotations or citations included in the news body), and more numerals, which is critical for attesting the text credibility [28] . Conjunctions and punctuation marks (pausality) are also more recurrent in hard news, corroborating the perception of textual cohesion and the idea that authors opt for more complex linguistic constructions. Additionally, sentiment terms are more frequent in headlines from soft news and conspiracies, which are often sensationalist, and employ a emotionally charged tone. On the other hand, soft news use comparatively fewer modal verbs and indefinite pronouns, which support the idea that they adopt a direct and focused narrative. Finally, the data shown in Table 3 also suggests that opinion and conspiracy articles are quite similar, with the exception of a slightly more pronounced use of indefinite pronouns in opinion articles. of content explores sentiment and emotions, e.g., amor (love), and make use of predicates such as revela (reveal), which are usually found in clickbait titles. On opposite, the most frequent verb in hard news is the declarative form of the verb dizer (say), which is probably used to introduce citations in text. With regard to conspiracies, with the exception of the use of the qualitative adjective grande (big), and the reference to USA (EUA), an important player in the global affairs, the most frequent terms are quite similar to the ones found in the hard news. This aspect is not surprising, since conspiracy approaches track news topics, and try to mimic real news. Figure 1 summarizes the answers to the dichotomous questions under the perspective of online news readers. The result reinforces the similarity between the opinion and conspiracy articles, also observed in Table 3 . The incidence of subjective information, a feature usually observed in opinion articles, also appears as a strong characteristic of conspiracies. Moreover, both categories present a high level of irony and/or sarcasm, and often use fallacies, in particular personal attack (i.e., the author attacks a specific individual or organization rather than attacking the substance of the argument itself). On the other hand, hard news usually follow the journalistic standards and practices, including accuracy (materialized, for instance, by the use of reliable sources of information), objectivity, and impartiality. Those characteristics are also observed in soft news, although to a lesser extent. As expected, users are capable of easily identifying irony and sarcasm in satirical news articles; however, their annotations also demonstrate that this property can be observed in multiple categories, namely in opinion news articles and conspiracy theories, as previously mentioned. Furthermore, the fallacious arguments typically used in conspiracy (namely, personal attacks and appeals to fear) can also be found in satirical and opinion articles. MINT, a corpus comprising news articles published by different Portuguese mainstream and independent sources, fills a gap in misinformation literature, providing annotated resources to enable studies ranging from social sciences to computational journalism. In particular, this corpus can help answering research questions involving the study of news credibility, and support the development of several NLP tasks, including automatic identification of misinformation, authorship attribution, and automatic detection of fallacies (for example, based on the conspiracy theories that surround the new coronavirus pandemic). A forthcoming release of MINT will add other news categories, new sources, and include more annotated articles. Fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection Key dimensions of alternative news media A structured response to misinformation: Defining and annotating credibility indicators in news articles Automatic detection of fake news Situational irony in farcical news headlines What makes a news unreliable? a content analysis of online news articles by journalists and readers An information nutritional label for online documents Corpus of news articles annotated with article level sentiment Protection from 'fake news': The need for descriptive factual labeling for online content News Values: Reciprocal Effects on Journalists and Journalism fake news" is not simply false information: A concept explication and taxonomy of online content Akshat Pant, Priya Shetye, Rusha Shrestha, Alexandra Steinheimer, Aditya Subramanian, and Gina Visnansky. Fake news vs satire: A dataset and analysis Deception detection for news: Three types of fakes This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news A stylometric inquiry into hyperpartisan and fake news Hard news, soft news, 'general' news: The necessity and utility of an intermediate classification Contributions to the study of fake news in portuguese: New corpus and automatic detection results Automated fake news detection using computational forensic linguistics Towards automatically filtering fake news in portuguese NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles Some like it hoax: Automated fake news detection in social networks In search of credible news Conspiracy in the time of corona: automatic detection of emerging COVID-19 conspiracy theories in social media and the news Assessing news credibility: Misinformation content indicators A comparison of classification methods for predicting deception in computer-mediated communication A survey of fake news: Fundamental theories, detection methods, and opportunities Building a sentiment lexicon for social judgement mining Using numbers in news increases story credibility