key: cord-0550748-yb1ag0sa
authors: Chen, Yu-Chieh; Huang, Pei-Yu; Lin, Chun; Huang, Yi-Ting; Chen, Meng Chang
affiliations: Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, United States; Management, Digital Innovation, University of London, Singapore; Institute of Information Science, Academia Sinica, Taipei, Taiwan
title: Headline Diagnosis: Manipulation of Content Farm Headlines
date: 2022-04-25
sha: 6ff6ece48a110c8271259eb83766441f75644387 doc_id: 550748 cord_uid: yb1ag0sa

As technology advances, news increasingly spreads through social media. To attract more readers and earn additional profit, some news agencies republish large volumes of news in a more appealing manner. Therefore, it is essential to accurately predict whether a news article comes from an official news agency. This work develops a headline classifier based on a Convolutional Neural Network to determine the credibility of a news article. The model primarily investigates key factors in headlines: word segmentation, part-of-speech tags, and sentiment features. With these features integrated into the proposed classification model, the evaluation achieves an accuracy of 93.99%.

With the rapid advancement of information technology, news agencies have expanded their platforms from newspapers to social media for greater influence. In the past, people kept up with current events by reading newspapers and magazines. Nowadays, people get new information from computers, phones, or tablets, since publishing news on the Internet saves money and time. As a result, running a news agency requires less capital and fewer technical skills. Some of these websites are created to make a profit. To maximize revenue and increase popularity, they spread massive amounts of advertisements along with articles on the Internet. Companies that engage in such behavior are called "content farms."
To attract more readers, content farms usually republish large numbers of news articles from other media in a more appealing and provocative manner. Furthermore, such news is used as propaganda to influence public opinion, policy, and elections. For example, news agencies rewrite original news headlines into sentimentalized ones. Since the origins of these articles are often unknown or untraceable, they are less credible than those produced by national news agencies. To help people distinguish content farm news by its headlines, a sentence-classification model based on a Convolutional Neural Network is adapted. The main goal is to find the key factors in headlines that determine whether the news is from the official news agency or from a content farm. As a result, word segmentation, part-of-speech tags, and sentiment scores are used as the primary features of the model.

Reference [1] analyzes headlines that are incongruent with their articles. Chesney et al. state that headlines on social media and in the press are often misleading. They note that clickbait makes heavy use of pronouns, adverbs, interrogatives, imperatives, numbers, and celebrity references. Based on this observation, part-of-speech is used as a feature in the model presented later. Additionally, [1] mentions that some headlines that are often classified as clickbait do not provide information, in order to force readers to click on the articles, but use sensational techniques to attract readers. Therefore, sensationalism is taken into consideration when determining whether headlines are from a content farm. Besides [1], [2] finds that forward-reference, a stylistic technique used to attract more viewers, takes eight different manifestations in headlines.
These manifestations include demonstrative pronouns, personal pronouns, adverbs, interrogatives, definite articles, ellipsis of obligatory arguments, imperatives, and general nouns with implicit discourse references. To analyze headlines, headline generation models are examined in order to understand the features of headlines in reverse. Reference [3] focuses on generating headlines from the context of articles with a Stylistic Headline Generation (SHG) model. It generates headlines with controlled style features, including humor, romance, and clickbait. In addition to [3], [4] focuses on content and style when generating text. The content features include theme and sentiment. The style aspects include descriptive, length, personal, and professional. However, its sentiment and professional features rely on human labelling. As mentioned above, sensationalism is an aspect of both sentence generation and clickbait headline analysis. Therefore, a Traditional Chinese sentiment dictionary, the Augmented NTU Sentiment Dictionary (ANTUSD), is adopted to provide sentiment scores [5]. The final model, which is used to separate content farm news from national agency news, is adapted from [6]. Reference [6] studies a sentence-classification Convolutional Neural Network and focuses on the effects of filter sizes, number of feature maps, pooling strategies, and regularization.

The basis of the model follows the sentence classification model in [6], which introduces a general sentence classifier for NLP tasks. It is a one-layer Convolutional Neural Network with three different kernel sizes, (2, 3, 4), and two filters for each kernel size. 1D max pooling extracts one scalar from the feature map of each filter. The resulting three 2×1 feature vectors are concatenated and flattened into a 6×1 vector. Finally, an activation function is applied to predict the binary category. The word embedding is from [7].
It contains 655,000 words with 400-dimensional vectors, collected from Wikipedia in 2014. The overview of the model is shown in fig. 1. From the baseline model to the final model, three features are proposed. In [6], Zhang and Wallace consider word embeddings as a lexical feature. However, to incorporate information beyond embeddings, the final model uses additional features related to the headline: word segmentation, part-of-speech (POS) tags, and sentiment scores. In the example in fig. 1, "韓支持率揭曉" (Han's approval rate has been revealed) is segmented into "韓" (Han's), "支持率" (approval rate), and "揭曉" (has been revealed). The part-of-speech tags are provided as well and are one-hot encoded into vectors. Besides word segmentation and part-of-speech tags, words are assigned sentiment scores. In the example above, "揭曉" (has been revealed) has a sentiment score of 0.05. The details of the features are introduced in the next section.

In the model, three features are used to determine the source of the news from headlines. Two features come from CKIP Tagger and the third comes from ANTUSD. CKIP Tagger is used to provide stylistic features, namely word segmentation and part-of-speech tags, and ANTUSD provides sentiment scores. Part-of-speech tags and sentiment scores are modified to fit the model. The detailed process is described in the following sections.

CKIP Tagger
CKIP Tagger is an open-source system for word segmentation (WS), part-of-speech tagging (POS), and named-entity recognition (NER) in Traditional Chinese. It is developed and maintained by Li and Ma [8]. It outperforms CKIPWS and Jieba-zh_TW on word segmentation. CKIP Tagger separates news headlines into words and assigns their part-of-speech tags. It offers 68 different categories of tags.
For example, the word segmentation and part-of-speech tags for the headline "華南金上半年獲利年減逾8成 每股賺0.12元" (HNFHC's first-half profit fell by more than 80% year over year; earnings per share are 0.12 NTD) are "華南金"(Nb) "上"(Nes) "半"(Neqa) "年"(Nf) "獲利"(VH) "年"(Na) "減逾"(VH) "8成"(Neqa) " "(WHITESPACE) "每"(Nes) "股"(Nf) "賺"(VC) "0.12"(Neu) "元"(Nf).

ANTUSD
ANTUSD (Augmented NTU Sentiment Dictionary) collects its vocabulary from six Chinese dictionaries: NTUSD, the NTCIR MOAT task dataset, Chinese Opinion Treebank, ACBiMA, CopeOpi, and E-HowNet [5]. It contains 27,221 words, and each word has a sentiment score from CopeOpi together with the numbers of positive, negative, neutral, not-a-word, and non-opinion annotations. The sentiment scores range from -1 (the most negative attitude) to 1 (the most positive one).

The dataset is collected from two news media, Central News Agency (CNA) and Mission (MISS). CNA is a national news agency in Taiwan, and its news is written in a relatively objective manner [9]. Among all other news agencies, MISS is chosen as the non-national media [10]. The reason for choosing MISS is that it collects news from more than 200 news agencies, including ChinaTimes, Yahoo, and ETtoday. According to [9], these three agencies are ranked the 51st, 5th, and 3rd top sites in Taiwan, respectively. In addition, MISS is known as a content farm for the following reasons. In the collected dataset, MISS modifies about 30% of headlines; 50% remain the same, 12% contain broken links, and 8% do not have links. MISS changes the original headlines to a more sentimental style (table I) and uses punctuation more frequently (table II). The collected CNA dataset contains 22,312 news articles from June 3rd, 2020 to August 4th, 2020, and the MISS dataset contains 43,327 news articles from January 4th, 2019 to August 7th, 2020. The information scraped from their websites includes the headline, date, passage, publisher, and original link of each article.

B. 
Preprocessing
1) CKIP Tagger
CKIP Tagger is used to separate the headlines into words with their corresponding part-of-speech tags. However, it provides 68 different categories of tags. To lower the dimensionality of the part-of-speech feature, the tags are simplified into 11 categories. For instance, Caa (coordinate conjunction), Cab (conjunction such as "et cetera"), Cba (conjunction such as "in that case"), and Cbb (correlative conjunction) are combined into a customized category, conjunction (C). DE (的 "of", or the word used after an adjective / 得 "-ly"), SHI (是 "is"), and FW (foreign word) are grouped as "OTHER," and all punctuation marks are labeled as "PUNCT." Table III shows the customized part-of-speech tags with their abbreviations. As illustrated in fig. 1, part-of-speech tags are transformed into 11×1 vectors. For example, "韓" (Han's) and "支持率" (approval rate) are nouns (N), so their part-of-speech vectors are both [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. "揭曉" is a verb (V). Therefore, its part-of-speech vector is [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]. The feature vectors are shown in table III.

The preprocessing of the headlines is shown in table IV. The headline is first separated into a list of words, which are then joined into a sentence with a space between each pair of words. Since whitespace is used as the filter of the tokenizer, the whitespace that appears in the original headline must be distinguished from the whitespace used to separate the segmented words. The original whitespace is therefore replaced with a placeholder, as shown in the column "Replacement of Whitespace" in table IV.

2) ANTUSD
The sentiment score is obtained by matching the segmented words of the headlines against the ANTUSD library. However, CKIP Tagger does not always segment headlines into words that can be found in the dictionary. Since about 60% of the entries in ANTUSD are two characters long, character bigrams are used as an additional word segmentation.
Those two-character words are matched with the library again to get their corresponding sentiment scores. Then, the original segmented words are separated into individual characters that carry their original sentiment scores. To fill in the missing scores, the bigrams that match the library and do not appear among the original segmented words are selected. The new scores are then calculated back onto the original words, as the following example illustrates.

For instance, the original headline is "日本單日增1584例確診 創新高 東京將設武漢肺炎醫院" (Japan reaches its highest record of 1584 confirmed cases a day; Tokyo will construct COVID-19 hospital) and its word segmentation from CKIP Tagger is "日本 單日 增 1584 例 確診 創新 高 東京 將 設 武漢 肺炎 醫院". At first, the word "創新" is matched with a sentiment score of 0.382573 from the ANTUSD library. By using bigrams, the words "日增" (0.0381147) and "新高" (0.0) are found with their sentiment scores. The process of generating new scores is shown in table V. In the end, it results in "日本(0) 單日(0.019057) 增(0.0381147) 1584(0) 例(0) 確診 (0) 創新(0.286929) 高(0.0) (0) 東京(0) 將(0) 設(0) 武漢(0) 肺炎(0) 醫院(0)." Note that a word marked with an "X" in the original sentence does not match ANTUSD and does not have a sentiment score.

Table IV. Preprocessing of a headline (example)
Word segmentation: ["華南金", "上", "半", "年", "獲利", "年", "減逾", "8成", " ", "每", "股", "賺", "0.12", "元"]
Replacement of Whitespace: 華南金 上 半 年 獲利 年 減逾 8成 每 股 賺 0.12 元
Part-of-speech tags: "Nb Nes Neqa Nf VH Na VH Neqa WHITESPACE Nes Nf VC Neu Nf"
Simplified part-of-speech tags: "N", "N", "N", "N", "V", "N", "V", "N", "WHITESPACE", "N", "N", "V", "N", "N"

The baseline model has only one input, the word embedding. Then, part-of-speech tags are added to support the model. Finally, sentiment scores are taken into consideration as a third attribute. Under the assumption that word choices differ between CNA and MISS, the baseline model takes a single input, the segmented words of the headline. For each headline, the maximum length is set to 100 words. Each word is embedded into 400 dimensions, resulting in a 100×400 matrix.
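The baseline input described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `embed_headline` and the toy embedding values are hypothetical, and mapping padding positions and out-of-vocabulary words to all-zero rows is an assumption, since the text does not say how either case is handled.

```python
from typing import Dict, List

MAX_LEN = 100  # maximum headline length in words, as stated in the text
EMB_DIM = 400  # embedding dimensionality, as stated in the text


def embed_headline(tokens: List[str],
                   embeddings: Dict[str, List[float]]) -> List[List[float]]:
    """Return a MAX_LEN x EMB_DIM matrix for one segmented headline.

    Assumption: padding positions and out-of-vocabulary words become
    all-zero rows; the paper does not specify either case.
    """
    zero = [0.0] * EMB_DIM
    # Truncate to MAX_LEN words and look up each word's vector.
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:MAX_LEN]]
    # Pad with zero rows up to MAX_LEN.
    rows.extend(list(zero) for _ in range(MAX_LEN - len(rows)))
    return rows


# Toy embedding table (hypothetical values, not the pretrained vectors of [7]).
toy_embeddings = {
    "韓": [0.1] * EMB_DIM,
    "支持率": [0.2] * EMB_DIM,
    "揭曉": [0.3] * EMB_DIM,
}

matrix = embed_headline(["韓", "支持率", "揭曉"], toy_embeddings)
print(len(matrix), len(matrix[0]))  # 100 400
```

Fixing the number of rows at 100 gives every headline the same input shape regardless of how many words the segmenter produces.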
Because punctuation appears more often in MISS headlines, the model also adopts a second input, part-of-speech tags. They are simplified from 68 categories into 11 and one-hot encoded into an 11×1 vector for each word. This vector is concatenated with the word embedding, resulting in a 100×411 matrix for each headline.

To improve the model, different features are added as inputs. At first, only the word embedding is used. Later, part-of-speech tags and sentiment scores are added. Besides adding inputs, different activation functions are tested to maximize accuracy. Binary cross-entropy is used as the loss function. The model with word embedding, part-of-speech tags, and sentiment scores outperforms the other two models, with the highest testing accuracy of 0.9399 and the lowest loss of 0.1968, as shown in table VI. The model improves the most when part-of-speech tags are added. Among the activation functions, sigmoid outperforms the others. While tanh and sigmoid differ little in testing accuracy, the loss with tanh is more than twice the loss with sigmoid, as shown in table VII.

The purpose of this research is to determine from its headline whether a news article is from the national news agency, CNA, or the content farm, MISS, and thereby to inform readers whether the news is credible. To classify the news automatically, a Convolutional Neural Network headline classifier is proposed. It is a one-layer model with three kernel sizes and two filters for each kernel size. The inputs of the model are the word embeddings in 400 dimensions, the part-of-speech tags in 11 dimensions, and a one-dimensional sentiment score. Word segmentation is obtained from CKIP Tagger. CKIP Tagger also provides 68 different categories of part-of-speech tags.
To reduce the dimensionality, the tags are combined into 11 groups: adjective, conjunction, adverb, interjection, noun, preposition, verb, auxiliary word, punctuation, whitespace, and others. Besides word segmentation and part-of-speech tags, the sentiment score is used as a feature in the model. The sentiment scores are looked up in the ANTUSD library. However, most of the words do not match the library. Therefore, bigrams are used to split headlines into two-character words, whose scores are looked up in the library again. The resulting scores are then averaged. In the end, with the sigmoid activation function and three inputs, the model achieves its highest accuracy of 93.99%.

The sentiments and styles of the news, including formats and semantics, change over time. Those changes are often the result of changes in the environment, such as in the political, medical, technical, and financial fields. The biggest difference is the way people read the news. In the past, people usually relied on newspapers to understand what was happening in the world. Nowadays, however, people receive new information through their computers or phones, and news headlines may change accordingly. For example, because information is now so convenient and abundant to spread, news headlines are shortened so that readers can absorb the news more easily. Given these future uncertainties, it is hard to rely on the same model and features to determine the credibility of news articles from their headlines. To continue the investigation, analysis of contemporary news headlines is required.
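As a worked illustration of the bigram averaging summarized in the conclusions above, the following Python sketch reproduces the scores of the "創新高" example from the preprocessing section. The exact rules are not spelled out in the text, so the four rules in the docstring are a reconstruction chosen because they match the reported numbers, not the authors' stated method. The function name `backfill_scores` is hypothetical, and the three-entry lexicon is only the excerpt of ANTUSD quoted in the example.

```python
from typing import Dict, List


def backfill_scores(words: List[str], lexicon: Dict[str, float]) -> List[float]:
    """Score each segmented word by character-level averaging.

    Assumed rules (reconstructed from the worked example, not stated in the text):
      1. each character of a word found in the lexicon inherits that word's score;
      2. each character of a lexicon bigram that is not itself a segmented word
         also receives the bigram's score;
      3. a character's score is the mean of the scores it received (0 if none);
      4. a word's score is the mean of its characters' scores.
    """
    # One list of candidate scores per character, grouped by segmented word.
    char_scores: List[List[List[float]]] = [[[] for _ in word] for word in words]

    # Rule 1: scores from the original segmentation.
    for w_idx, word in enumerate(words):
        if word in lexicon:
            for slot in char_scores[w_idx]:
                slot.append(lexicon[word])

    # Rule 2: slide a two-character window over the whole headline.
    positions = [(w_idx, c_idx, ch)
                 for w_idx, word in enumerate(words)
                 for c_idx, ch in enumerate(word)]
    segmented = set(words)
    for (w1, c1, ch1), (w2, c2, ch2) in zip(positions, positions[1:]):
        bigram = ch1 + ch2
        if bigram in lexicon and bigram not in segmented:
            char_scores[w1][c1].append(lexicon[bigram])
            char_scores[w2][c2].append(lexicon[bigram])

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    # Rules 3 and 4: average per character, then per word.
    return [mean([mean(slot) for slot in slots]) for slots in char_scores]


words = "日本 單日 增 1584 例 確診 創新 高 東京 將 設 武漢 肺炎 醫院".split()
lexicon = {"創新": 0.382573, "日增": 0.0381147, "新高": 0.0}  # excerpt only
scores = backfill_scores(words, lexicon)
for word, score in zip(words, scores):
    print(word, round(score, 6))
```

Under these assumed rules the sketch yields, up to rounding, the values reported in the example: 0.019057 for "單日", 0.0381147 for "增", and 0.286929 for "創新", with all other words at 0.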
[1] Incongruent Headlines: Yet Another Way to Mislead Your Readers
[2] Click Bait: Forward-reference as Lure in Online News Headlines
[3] Hooks in the Headline: Learning to Generate Headlines with Controlled Styles
[4] Controlling Linguistic Style Aspects in Neural Language Generation
[5] ANTUSD: A Large Chinese Sentiment Dictionary
[6] A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
[7] 詞向量 Word Embedding
[8] CkipTagger
[9] Top Sites in Taiwan - Alexa