key: cord-0044759-79y1qlgg
authors: Zhang, Yan; Zhou, Yue; Yao, JingTao
title: Feature Extraction with TF-IDF and Game-Theoretic Shadowed Sets
date: 2020-05-18
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50146-4_53
sha: c7d8cf44fc48c1b343c5b083d3aa31b10d1bf550
doc_id: 44759
cord_uid: 79y1qlgg

TF-IDF is one of the most commonly used weighting metrics for measuring the relationship of words to documents. It is widely used for word feature extraction. In much research and many applications, the thresholds of TF-IDF for selecting relevant words are based only on trial or experience. Some cut-off strategies have been proposed in which the thresholds are selected based on Zipf's law or on feedback from model performance. However, the existing approaches are restricted to specific domains or tasks, and they ignore the imbalance in the number of representative words across different categories of documents. To address these issues, we apply the game-theoretic shadowed set model to select word features given TF-IDF information. Game-theoretic shadowed sets determine the thresholds of TF-IDF using game theory and a repetition learning mechanism. Experimental results on a real-world news category dataset show that our model not only outperforms all baseline cut-off approaches, but also speeds up the classification algorithms.

Term Frequency-Inverse Document Frequency, or TF-IDF, is one of the most commonly used weighting metrics for measuring the relationship of words and documents. It has been applied to word feature extraction for text categorization and other NLP tasks. The words with higher TF-IDF weights are regarded as more representative and are kept, while the ones with lower weights are less representative and are discarded. An appropriate selection of word features can speed up the information retrieval process while preserving model performance. However, in many works the cutoff values or thresholds of TF-IDF for selecting relevant words are based only on guesses or experience [6, 11, 12]. Zipf's law has been used to select words whose IDF exceeds a certain value [10]; Lopes et al. [9] proposed a cut-off policy that balances the precision and recall of the model performance. Despite their success, these cut-off policies have certain issues. A cut-off policy in which the number of words to keep is determined by looking at the precision and recall scores of the model can be restricted to specific domains or tasks. In addition, the number of relevant words may vary across categories of documents in certain domains. For instance, there exists an imbalance between the number of representative positive words and negative words in many sentiment classification tasks. Thus, a cut-off policy that is able to capture such imbalance is needed.

To address these issues, we employ game-theoretic shadowed sets (GTSS) to determine the thresholds for feature extraction. GTSS, proposed by Yao and Zhang, is a recent promising model for decision making in the shadowed set context [22]. We calculate the difference of TF-IDF for each word between documents as the measurement of relevance, and then use GTSS to derive the asymmetric thresholds for word extraction. The GTSS model aims to determine and explain the thresholds from a tradeoff perspective. The words with a difference of TF-IDF less than β or greater than α are selected. We regard the words whose difference of TF-IDF lies between α and β as neutral.
These words can be safely removed since they do not contribute much to text classification. The results of our experiments on a real-world news category dataset show that our model achieves significant improvement compared with different TF-IDF based cut-off policies. In addition, we show that our model can achieve performance comparable to the model using all words' TF-IDF as features, while greatly speeding up the classification algorithms.

TF-IDF is the most commonly used weighting metric for measuring the relationship of words and documents. By considering the word or term frequency (TF) in a document as well as how unique or infrequent (IDF) a word is in the whole corpus, TF-IDF assigns higher values to topic-representative words while devaluing common words. There are many variations of TF-IDF [19, 20]. In our experiments, we use the basic form of TF-IDF and follow the notation given in [7]. The TF-IDF weighted value w_{t,d} for the word t in the document d is thus defined as

    w_{t,d} = tf_{t,d} × log(N / df_t),

where tf_{t,d} is the frequency of word t in the document d, N is the total number of documents in the collection, and df_t is the number of documents in which word t occurs.

TF-IDF measures how relevant a word is to a certain category of documents, and it is widely used to extract the most representative words as features for text classification and other NLP tasks. The extraction is often done by selecting the top n words with the largest TF-IDF scores or by setting a threshold below which words are regarded as irrelevant and discarded. But an issue arises about how to choose such a cut-off point or threshold so as to preserve the most relevant words. Many works choose the threshold based only on trial or experience [6, 11, 12, 23]. On the other hand, some approaches do address the issue. Zipf's law has been used to select the words whose IDF exceeds a certain value in order to speed up information retrieval algorithms [4, 10]. Lopes et al. [9] proposed a cut-off policy which determines the number of words to keep by balancing precision and recall on downstream tasks. However, such cut-off points should not be backward induced from the performance of a downstream task; rather, the thresholds should be derived before feeding the extracted words to the classifier, so that the model is sped up without reducing its performance. In addition, in certain domains the number of relevant words may vary across categories of documents. For instance, the number of words relevant to positive articles and the number of words relevant to negative articles are often imbalanced in many sentiment analysis tasks. Therefore, the cut-off points or thresholds may also vary across categories. Observing these drawbacks, we attempt to find asymmetric thresholds of TF-IDF for feature extraction by using game-theoretic shadowed sets.

In this section, we introduce our model in detail. Our model aims to extract relevant words and discard less relevant words based on TF-IDF information, so as to speed up learning algorithms while preserving model performance. We first calculate the difference of TF-IDF for each word between documents as a single score measuring the degree of relevance, and then use game-theoretic shadowed sets to derive the asymmetric thresholds for word extraction.
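Before defining the relevance score, a minimal Python sketch of the basic TF-IDF weighting above may be helpful. It is illustrative only, not the authors' code; the function name and the list-of-token-lists input format are our own assumptions.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Basic TF-IDF: w_{t,d} = tf_{t,d} * log(N / df_t).

    documents: list of tokenized documents (lists of words).
    Returns one {word: weight} dict per document.
    """
    n_docs = len(documents)
    # df_t: number of documents in which word t occurs
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf_{t,d}: frequency of word t in document d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

Note that a word occurring in every document receives a weight of zero, reflecting how IDF devalues common words.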
Consider a binary text classification task with a set of two document classes C = {c_1, c_2}. For each word t, we calculate the difference of its TF-IDF weighted values between documents c_1 and c_2 as

    DW_t = w_{t,c_1} − w_{t,c_2}.

We use DW_t to measure how relevant or representative the word t is to the document classes. The greater the magnitude of DW_t, the more representative the word t is for distinguishing the document categories. A large positive value of DW_t indicates that the word t is not a common word and is more relevant to document c_1, while a significantly negative value shows that the word t is representative of document c_2. If DW_t is close to zero, we regard the word t as neutral. In the next section, we choose the cut-off thresholds for selecting the most representative word features by using the game-theoretic shadowed sets method. For convenience, we normalize DW_t here with a min-max linear transformation.

A shadowed set S in the universe U maps the membership grades of the objects into the set {0, [0, 1], 1} [13]. Shadowed sets are viewed as three-valued constructs induced by fuzzy sets, in which the three values are interpreted as full membership, full exclusion, and uncertain membership [15]. Shadowed sets can capture the essence of fuzzy sets while reducing the uncertainty from the unit interval to a shadowed region [15]. The shadowed set based three-value approximation is defined as a mapping from the universe U to a three-value set {0, σ, 1}, where a single value σ (0 ≤ σ ≤ 1) is chosen to replace the unit interval [0, 1] of the shadowed sets; that is [3],

    T_{(α,β)}(μ_A(x)) = 1 if μ_A(x) ≥ α;  σ if β < μ_A(x) < α;  0 if μ_A(x) ≤ β.

The membership grade μ_A(x) of an object x indicates the degree to which the object x belongs to the concept A, or the degree to which the concept A applies to x [21]. Given a concept A and an element x in the universe U, if the membership grade μ_A(x) of this element is greater than or equal to α, the element x is considered to belong to the concept A. An elevation operation elevates the membership grade μ_A(x) to 1, which represents a full membership grade [14]. If the membership grade μ_A(x) is less than or equal to β, the element x is not considered to belong to the concept A. A reduction operation reduces the membership grade μ_A(x) to 0, which represents a null membership grade [14]. If the membership grade μ_A(x) is between α and β, the element x is put in a shadowed area, which means it is hard to determine whether x belongs to the concept A. In this case μ_A(x) is mapped to σ, which represents the highest uncertainty; that is, we are not confident about either including the element in or excluding it from the concept A. The membership grades between σ and α are reduced to σ, and the membership grades between β and σ are elevated to σ. We thus get two elevated areas, E_1(μ_A) and E_σ(μ_A), and two reduced areas, R_0(μ_A) and R_σ(μ_A), shown as the dotted areas and lined areas in Fig. 1(a). Figure 1(b) shows the shadowed set based three-value approximation after applying the elevation and reduction operations to all membership grades. The vagueness is localized in the shadowed area, as opposed to fuzzy sets, where the vagueness is spread across the entire universe [5, 16].

Shadowed set based three-value approximations use two operations, the elevation and reduction operations, to approximate the membership grades μ_A(x) to the three-value set {0, σ, 1}. Given an element x with the membership grade μ_A(x), the elevation operation changes the membership grade μ_A(x) to 1 or σ, and the reduction operation changes the membership grade μ_A(x) to 0 or σ.
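The mapping T_{(α,β)} above is straightforward to express in code. A minimal sketch, with naming of our own choosing and assuming 0 ≤ β ≤ σ ≤ α ≤ 1:

```python
def three_value_approx(mu, alpha, beta, sigma):
    """Map a membership grade mu in [0, 1] to the three-value set {0, sigma, 1}."""
    if mu >= alpha:
        return 1.0    # elevated to full membership
    if mu <= beta:
        return 0.0    # reduced to full exclusion
    return sigma      # shadowed area: mapped to sigma, the highest uncertainty
```

In the feature extraction setting, mu plays the role of a word's normalized DW_t value.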
These two operations change the original membership grades and produce the elevated and reduced areas, which show the differences between the original membership grades and the mapped values 1, σ, and 0, as shown in Fig. 1(a). These areas can be viewed as the elevation and reduction errors, respectively. The elevation operation produces two elevation errors, E_1(μ_A) and E_σ(μ_A); the reduction operation produces two reduction errors, R_0(μ_A) and R_σ(μ_A). That is:

- The elevation error E_1 is produced when the membership grade μ_A(x) is greater than or equal to α (i.e., μ_A(x) ≥ α) and the elevation operation elevates μ_A(x) to 1. We have E_1(μ_A(x)) = 1 − μ_A(x).
- The elevation error E_σ is produced when β < μ_A(x) < σ and the elevation operation elevates μ_A(x) to σ. We have E_σ(μ_A(x)) = σ − μ_A(x).
- The reduction error R_0 is produced when μ_A(x) ≤ β and the reduction operation reduces μ_A(x) to 0. We have R_0(μ_A(x)) = μ_A(x).
- The reduction error R_σ is produced when σ < μ_A(x) < α and the reduction operation reduces μ_A(x) to σ. We have R_σ(μ_A(x)) = μ_A(x) − σ.

The total elevation error E_{(α,β)}(μ_A) is the sum of the two elevation errors produced by the elevation operation; the total reduction error R_{(α,β)}(μ_A) is the sum of the two reduction errors produced by the reduction operation. For a discrete universe of discourse, we have a collection of membership values, and the total elevation and reduction errors are calculated as [22]

    E_{(α,β)}(μ_A) = Σ_{μ_A(x) ≥ α} (1 − μ_A(x)) + Σ_{β < μ_A(x) < σ} (σ − μ_A(x)),   (4)

    R_{(α,β)}(μ_A) = Σ_{μ_A(x) ≤ β} μ_A(x) + Σ_{σ < μ_A(x) < α} (μ_A(x) − σ).   (5)

Given a fixed σ, the elevation and reduction errors change when the thresholds (α, β) change. No matter which threshold changes and how it changes, the elevation and reduction errors always change in opposite directions [22]. The decrease of one type of error inevitably brings an increase of the other type. A balanced shadowed set based three-value approximation is expected to represent a tradeoff between the elevation and reduction errors.

Game-theoretic shadowed sets (GTSS) use game theory to determine the thresholds in the shadowed set context. The obtained thresholds represent a tradeoff between the two different types of errors [22]. GTSS use a game mechanism to formulate games between the elevation and reduction errors. The strategies performed by the two players are changes of the thresholds. The two game players compete with each other, each trying to improve its own payoff. A repetition learning mechanism is adopted to approach a compromise between the two players by modifying the game formulations repeatedly. The resulting thresholds are determined based on analysis of the game equilibria and on selected stopping criteria.

Game Formulation. Three elements should be considered when formulating a game G: the game player set O, the strategy profile set S, and the utility functions u, i.e., G = (O, S, u) [8, 17]. The game players are the total elevation and reduction errors, denoted by E and R, i.e., O = {E, R}. The strategy profile set is S = S_E × S_R, where S_E = {s_1, s_2, ..., s_{k1}} is a set of possible strategies for player E and S_R = {t_1, t_2, ..., t_{k2}} is a set of possible strategies for player R. We select (σ, σ) as the initial threshold values, which represents that we do not have any uncertainty on the membership grades and the shadowed area is the smallest. Starting from (σ, σ), we gradually move α and β further apart from each other and increase the shadowed area. c_E and c_R are two constant change steps, denoting the quantities by which the two players E and R change the thresholds, respectively.
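The total errors of Eqs. (4) and (5) also serve as the players' payoffs introduced below. A minimal sketch over a collection of membership grades (function and variable names are our own):

```python
def total_errors(memberships, alpha, beta, sigma):
    """Total elevation error E_(alpha,beta) and reduction error R_(alpha,beta)
    of Eqs. (4) and (5) over a discrete collection of membership grades."""
    E = R = 0.0
    for mu in memberships:
        if mu >= alpha:
            E += 1.0 - mu        # E_1: elevated to 1
        elif mu <= beta:
            R += mu              # R_0: reduced to 0
        elif mu < sigma:
            E += sigma - mu      # E_sigma: beta < mu < sigma, elevated to sigma
        else:
            R += mu - sigma      # R_sigma: sigma <= mu < alpha, reduced to sigma
    return E, R
```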
For example, suppose the initial threshold values are (α, β) = (0.5, 0.5). Player E increases α and player R decreases β. When we set c_E = 0.01 and c_R = 0.02, we have S_E = {α no change, α increases 0.01, α increases 0.02} and S_R = {β no change, β decreases 0.02, β decreases 0.04}.

The payoffs of the players are u = (u_E, u_R), where u_E and u_R denote the payoff functions of players E and R, respectively. The payoff functions u_E(α, β) and u_R(α, β) are defined by the elevation and reduction errors, respectively; that is,

    u_E(α, β) = E_{(α,β)}(μ_A),   u_R(α, β) = R_{(α,β)}(μ_A),

where E_{(α,β)}(μ_A) and R_{(α,β)}(μ_A) are defined in Eqs. (4) and (5). We try to minimize the elevation and reduction errors, so both players try to minimize their payoff values. We use payoff tables to represent the two-player games. Table 1 shows an example payoff table in which both players have 3 strategies.

Repetition Learning Mechanism. Each player tries to improve its own payoff in the formulated games, but a player's payoff is affected by the strategy performed by the other player. The balanced solution, or game equilibrium, is a strategy profile from which both players benefit. This game equilibrium represents the two players reaching a compromise or tradeoff on the conflict. The strategy profile (s_i, t_j) is a pure-strategy Nash equilibrium if, for players E and R, s_i and t_j are best responses to each other [17]; that is, since the payoffs here are errors to be minimized,

    u_E(s_i, t_j) ≤ u_E(s, t_j) for all s ∈ S_E, and u_R(s_i, t_j) ≤ u_R(s_i, t) for all t ∈ S_R.

These conditions describe a strategy profile from which no player would like to deviate: a player would lose benefit by changing its strategy unilaterally, provided it has knowledge of the other player's strategy. The equilibrium of the currently formulated game means that the threshold pair corresponding to this equilibrium is the best choice within the current strategy sets. We still have to check whether some threshold pairs near the current equilibrium are better than the current one; thus we repeat the game with updated initial thresholds. We may be able to find more suitable thresholds through this repeated modification of the thresholds. We define stopping criteria so that the iteration of games stops at a proper time. There are many possible stopping criteria, for example: the payoff of each player goes beyond a specific value; the thresholds (α, β) violate the constraint 0 ≤ β ≤ σ ≤ α ≤ 1; the current game equilibrium does not improve the payoffs gained by both players under the initial thresholds; or no equilibrium exists. In this research, we compare the payoffs of both players under the initial thresholds with their payoffs under the thresholds corresponding to the current equilibrium. We set the stopping criterion as one of the players increasing its payoff value, or no pure-strategy Nash equilibrium existing in the current game.

We now select the most representative words by applying the thresholds (α, β) derived above. The words whose normalized DW_t is greater than the upper threshold α or less than the lower threshold β are kept as word features for text classification, while the remaining words are discarded.
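Putting the pieces together, the following sketch shows one way the repetition learning loop could be implemented, reusing total_errors from the sketch above. The 3 × 3 games, the equilibrium check, and the stopping criteria follow the description in this section, but the exact loop structure and all names are our assumptions, not the authors' code.

```python
def nash_equilibrium(payoffs):
    """Pure-strategy Nash equilibrium of a 3x3 minimization game.
    payoffs[i][j] = (u_E, u_R) for the strategy profile (s_i, t_j).
    Returns (i, j), or None if no pure-strategy equilibrium exists."""
    for i in range(3):
        for j in range(3):
            u_E, u_R = payoffs[i][j]
            if (all(u_E <= payoffs[k][j][0] for k in range(3)) and
                    all(u_R <= payoffs[i][k][1] for k in range(3))):
                return i, j
    return None

def gtss_thresholds(memberships, sigma, c_E=0.01, c_R=0.02):
    """Repetition learning: repeat games from (sigma, sigma), with player E
    raising alpha by 0, c_E, or 2*c_E and player R lowering beta by
    0, c_R, or 2*c_R, until a stopping criterion fires."""
    alpha, beta = sigma, sigma
    while True:
        base_E, base_R = total_errors(memberships, alpha, beta, sigma)
        payoffs = [[total_errors(memberships, alpha + i * c_E, beta - j * c_R, sigma)
                    for j in range(3)] for i in range(3)]
        eq = nash_equilibrium(payoffs)
        if eq is None or eq == (0, 0):          # no equilibrium, or no change
            return alpha, beta
        i, j = eq
        if payoffs[i][j][0] > base_E or payoffs[i][j][1] > base_R:
            return alpha, beta                  # a player's payoff increased
        new_alpha, new_beta = alpha + i * c_E, beta - j * c_R
        if not (0.0 <= new_beta <= sigma <= new_alpha <= 1.0):
            return alpha, beta                  # thresholds violate the constraint
        alpha, beta = new_alpha, new_beta       # repeat from the equilibrium
```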
We evaluate our approach on the HuffPost news category dataset [18]. This dataset consists of 200,853 news headlines with short descriptions from the HuffPost website, covering the years 2012 to 2018. It contains 31 categories of news, such as politics, entertainment, business, healthy living, arts, and so forth. We use the two largest categories, 32,739 politics news items and 14,257 entertainment news items, as the binary text classification data in our experiments. The news text is obtained by concatenating the news headline and the corresponding short description. We extract 381,449 words from the selected news items. We use 80% of the data for training and 20% for testing, and adopt accuracy and F1 scores as the metrics for model evaluation.

We first normalize DW_t using the min-max linear transformation. The distribution of the normalized DW_t is shown in Fig. 2. Almost 80% of the words have a normalized DW_t value of 0.548054, so we set σ = 0.5481, aiming to minimize the errors produced by mapping all DW_t values to the three values {0, σ, 1} via the game-theoretic shadowed set model. If we set σ to any other value, mapping this large mass of DW_t values at 0.548054 to σ would certainly produce more errors.

Table 2 shows the payoff table. The cell at the bottom right corner is the game equilibrium, whose strategy profile is (α increases 0.02, β decreases 0.04). The payoffs of the players are (17689, 8742). We set the stopping criterion as one of the players' payoffs increasing. When the thresholds change from (0.55, 0.54) to (0.57, 0.5), the elevation error decreases from 17933 to 17689 and the reduction error decreases from 10472 to 8742. We repeat the game by setting (0.57, 0.5) as the initial thresholds. The competitive games are repeated four times; the results are shown in Table 3. In the fourth iteration, the payoff value of player E increases, so the repetition of games stops, and the final result is the initial thresholds of the fourth game, (α, β) = (0.61, 0.42). This means we keep the words with normalized DW_t greater than 0.61 or less than 0.42, and discard the words with DW_t between 0.42 and 0.61.

We calculate the DW_t value for each single word and bi-gram, and then use a Support Vector Machine (SVM) [1, 2] as the sole classifier to compare our approach with: (1) ALL, in which we keep all words with no feature extraction; (2) Sym-Cutoff, in which symmetric cut-off values are drawn purely from a simple observation of the statistical distribution of DW_t; (3) Sym-N-Words, in which we select 2n words, taking the n smallest DW_t values and the n largest DW_t values, such that 2n is approximately equal to the total number of words extracted with our approach.

We show the model performance of the different extraction approaches on the news category dataset in Table 4; our model is named "Asym-GTSS-TH". From the results, we observe the following. (1) Our approach achieves superior performance compared with Sym-Cutoff, which is based purely on a guess given the TF-IDF distribution. This verifies our claim that GTSS can better capture the pattern of TF-IDF and provide a more robust range for selecting relevant words for text classification. (2) The Sym-N-Words approach achieves performance close to ours, because it takes advantage of the number of words to keep derived from our thresholds. However, our approach still outperforms Sym-N-Words, since the latter selects relevant words evenly for classes c_1 and c_2. This indicates that there exists an imbalance of representative words between different categories of documents, which is better captured by our model. (3) Compared with using all words' TF-IDF scores as input, our model discards more than 52% of the words and speeds up the classification process while preserving performance.
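For completeness, here is a compact sketch of the extraction step as evaluated above. We assume w1 and w2 are per-class TF-IDF weight dicts for c_1 and c_2 (how class-level weights are aggregated is not fully specified in the paper, so this layout is our assumption); the default thresholds are the values learned in the experiments.

```python
def extract_word_features(w1, w2, alpha=0.61, beta=0.42):
    """Compute DW_t = w1[t] - w2[t], min-max normalize, and keep the words
    outside the shadowed region [beta, alpha] (the neutral words are dropped)."""
    vocab = set(w1) | set(w2)
    dw = {t: w1.get(t, 0.0) - w2.get(t, 0.0) for t in vocab}
    lo, hi = min(dw.values()), max(dw.values())
    norm = {t: (v - lo) / (hi - lo) for t, v in dw.items()}  # min-max normalization
    return {t for t, v in norm.items() if v > alpha or v < beta}
```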
In this paper, we propose a feature extraction approach based on TF-IDF and game-theoretic shadowed sets, in which the asymmetric thresholds for selecting relevant words are derived by repetition learning on the difference of TF-IDF for each word between document classes. Our model can explore the pattern of the TF-IDF distribution as well as capture the imbalance in the number of representative words across categories. The experimental results on the news category dataset show that our model achieves improvement compared to other cut-off policies and speeds up the information retrieval process. In the future, we will explore the consistency of our model's performance on more real-world datasets and test the generalization ability of our GTSS model on other metrics that measure the relevance of words, such as BNS and Chi-square.

References

[1] Support vector clustering
[2] Support-vector networks
[3] Decision-theoretic three-way approximations of fuzzy sets
[4] Information Retrieval: Algorithms and Heuristics
[5] Fuzzy number approximation via shadowed sets
[6] Improved feature selection approach TFIDF in text mining
[7] Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
[8] Essentials of Game Theory: A Concise Multidisciplinary Introduction
[9] Evaluation of cutoff policies for term extraction
[10] Introduction to Information Retrieval
[11] Automatic term extraction and document similarity in special text corpora
[12] Text categorization with class-based and corpus-based keyword selection
[13] Shadowed sets: representing and processing fuzzy sets
[14] Shadowed sets: bridging fuzzy and rough sets
[15] From fuzzy sets to shadowed sets: interpretation and computing
[16] Granular computing with shadowed sets
[17] Games and Information: An Introduction to Game Theory. Blackwell
[18] News category dataset from HuffPost website
[19] Term-weighting approaches in automatic text retrieval
[20] Introduction to Modern Information Retrieval
[21] Toward extended fuzzy logic - a first step
[22] Game theoretic approach to shadowed sets: a three-way tradeoff perspective
[23] Grooming detection using fuzzy-rough feature selection and text classification