Feature Selection in Text Clustering Applications of Literary Texts: A Hybrid of Term Weighting Methods

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 2, 2020, www.ijacsa.thesai.org

Abdulfattah Omar
College of Science and Humanities, Prince Sattam Bin Abdulaziz University, Saudi Arabia
Department of English, Faculty of Arts, Port Said University

Abstract—Recent years have witnessed an increasing use of automated text clustering approaches, and more particularly Vector Space Clustering (VSC) methods, in the computational analysis of literary data, including genre classification, theme analysis, stylometry, and authorship attribution. In spite of the effectiveness of VSC methods in resolving different problems in these disciplines and providing evidence-based research findings, the problem of feature selection remains a challenging one. For reliable text clustering applications, a clustering structure should be based on only and all the most distinctive features within a corpus. Although different term weighting approaches have been developed, identifying the most distinctive variables within a corpus remains challenging, especially in document clustering applications of literary texts. For this purpose, this study proposes a hybrid of statistical measures comprising variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA) for selecting only and all the most distinctive features, with the aim of generating more reliable document clustering structures that can be usefully employed in authorship attribution tasks. The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions.
Results indicate that the proposed model proved effective in extracting the most distinctive features within the datasets and thus generating reliable clustering structures that can be usefully employed in different computational applications of literary texts.

Keywords—Feature selection; frequency; PCA; term weight; text clustering; TF-IDF; variance; VSC

I. INTRODUCTION

With the increasing access to e-texts and the availability and power of computational tools, there has been an increasing amount of humanities computing literature on text analysis and interpretation. Studies of this kind are generally classified under the broad heading of computer-assisted text analysis (CATA). CATA includes numerous applications, including authorship attribution, stylometric analysis, theme analysis, the use of imagery, genre classification, characterization, and textual analysis [1-4]. In spite of the effectiveness of VSC methods in resolving different problems in these disciplines and providing evidence-based research findings, the problem of feature selection remains a challenging one. For reliable text clustering applications, a clustering structure should be based on only and all the most distinctive features within a corpus. For this purpose, this study proposes a hybrid of statistical measures applied successively: variance analysis, term frequency-inverse document frequency (TF-IDF), and Principal Component Analysis (PCA). The aim is to select only and all the most distinctive features for generating more reliable document clustering structures that can be usefully employed in authorship attribution tasks. The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions.

II. LITERATURE REVIEW

The literature suggests that text clustering (simply putting similar texts together) is central in almost all CATA applications [5, 6].
It is used as a starting point for many CATA applications, including thematic analysis, genre classification, stylometry, and authorship attribution [5, 7-14]. Studies in these disciplines have traditionally been carried out using non-computational methods. With the development of computational approaches, however, critics and researchers have come to consider how effective computational approaches are in identifying meanings within texts. It is now often assumed that computational approaches are effective in developing a better understanding of the texts in question [15]. This is best described as a process of decoding meanings within texts [16]. Despite the relative success of studies of this kind, they have been met with a strong wave of objections from a number of critics and scholars, who maintain that whatever success such methods have had in the interpretation of texts, they are still far from detecting what a text is exactly about [17, 18]. This can be attributed to the unfamiliarity of the world of computational theory and methodology to literary scholars. Ramsay [19] suggests that "the inability of computing humanists to break into the mainstream of literary critical scholarship may be attributed to the prevalence of scientific methodologies and metaphors in humanities computing research" [19, p. 167]. One might even suggest that this unfamiliarity with computational and mathematical approaches has generated in literary scholars the belief that all computational and statistical approaches are somehow antithetical to literary critical approaches.
This would explain the gap we see between literary critical theory on the one hand and computer-based text analysis and quantitative approaches on the other: the majority of critical theory researchers have never argued the need for using computational and mathematical approaches to supplement widely used critical approaches [20-22].

(Paper Submission Date: January 19, 2020; Acceptance Notification Date: February 12, 2020)

Critics of the involvement of computational methods in literary criticism argue that human reasoning is crucial and can never be replaced in understanding and interpreting texts. They argue that so far there is no computer-assisted system capable of accounting for all the linguistic and meta-linguistic features of texts. Defenders of computational text analysis, on the other hand, argue that the use of a computational framework in literary studies is objective, quantifiable, and methodologically consistent [23-27]. Hockey asserts that computational tools are useful adjuncts to literary criticism, contending that without them critics have only human reading, intuition, and serendipity to rely on. Many of the defenders go even further, arguing that "without the computer, the interpreter is nothing more than some Romantic Aeolian harpist drowning in the phenomenological abyss of their own impressions" [19, p. 168]. This is reflected in the significant increase in the application of computational methods in literary studies over recent years. In numerous thematic reviews of different literary texts, text clustering is central to thematic analysis applications. This is the arrangement of texts by topic with the purpose of investigating thematic interrelationships within texts [7, 9, 14, 28, 29].
The main assumption is that text clustering methods are effective in identifying what a text is about. Consequently, thematic hypotheses can be based on clustering results. It is even argued that computational techniques are effective in generating new insights and interpretative ideas in thematic reviews of different literary texts [14, 28]. Likewise, Ramsay [13] indicates that genre classification, which remained distant from computational and mathematical applications for a long time, is now making use of computational technologies, and more specifically text clustering approaches, to adjudicate genre classification problems and objectively assign literary texts to appropriate genres. With the rapid development of text clustering algorithms and methods, genre classification studies draw more heavily on computational methods for more accurate results and better performance [13, 30-35]. Interestingly, the works of Shakespeare have been the subject of many computer-based genre classifications [13, 34, 36]. Using cluster analysis methods, Jockers classified 37 Shakespearean plays into three main clusters (comedy, history, and tragedy), as shown in Fig. 1. The literature also suggests that text clustering methods are now used in stylometry (the investigation of the quantitative properties of an author's style) and authorship attribution [33, 37-45]. The claim is that results based on computer-based methods are accepted by many as more accurate than those based on conventional non-computational methods. In spite of the potential of computational approaches and text clustering methods, especially their capacity for analyzing large quantities of data and generating results that are objective and replicable, there are still many problems and challenges with these approaches that may affect their reliability and acceptability [46-49].
One main problem is the ability of text classifiers to identify and extract only and all the most distinctive features or variables within a corpus for generating clustering structures that can be usefully employed in different applications. Although the issue has been extensively investigated in different disciplines, including data mining and information retrieval, very little has been done on the problem of feature selection in text clustering applications on literary texts. This study addresses this gap in the literature by proposing a model that combines three statistical methods, namely variance, TF-IDF, and PCA.

Fig. 1. Jockers' Genre Classification of 37 Shakespearean Plays.

III. METHODOLOGY

A. Methods

For the purposes of the study, an experimental design is used in which different term-weighting methods are tried in order to develop a model that best identifies and extracts only and all the most distinctive variables within datasets. Term weighting is a pre-processing step in text clustering applications in which each term is assigned an appropriate weight in every document within a corpus, with the purpose of enhancing text clustering performance [50-52]. Term frequency is still one of the most widely used term weighting approaches in text clustering applications [53-57]. However, term frequency approaches alone are unsuitable for the text clustering of literary texts. This study therefore experiments with a combination of term weighting methods: variance, TF-IDF, and PCA.

1) Variance: Document clustering depends on there being variation in the characteristics of interest to the research question; if there is no variation, the documents are identical and cannot be classified relative to one another [57-60].
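This variance criterion can be sketched in a few lines. The following is a minimal illustration (not the authors' actual code), using a hypothetical document-term count matrix in which two terms never vary across documents:

```python
import numpy as np

# Toy document-term matrix: rows = documents, columns = term frequencies.
# (Hypothetical data for illustration only.)
X = np.array([
    [5, 0, 3, 2],
    [5, 4, 0, 2],
    [5, 1, 4, 2],
])

# Variance of each term (column) across the documents.
variances = X.var(axis=0)

# Terms whose frequency never varies (columns 0 and 3 here) carry no
# information for distinguishing documents and can be removed.
keep = variances > 0
X_reduced = X[:, keep]

print(variances)        # columns 0 and 3 have zero variance
print(X_reduced.shape)  # (3, 2)
```

In practice a threshold higher than zero would be used, retaining only the variables whose variance is significant, as the paper describes.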
The assumption is that variables describing the characteristics of interest are only useful for clustering if there is significant variation in the values they take. The intuition is that a word whose frequency varies considerably across the documents in a collection is more likely to be important for distinguishing them than a word whose frequency hardly varies at all [53]. Accordingly, documents can be clustered on the basis of variance. The implication is that variables with significant variation can be retained and variables with little or no variation can be removed. Although variance is an important factor in the assessment of variable importance, retaining the variables with significant variance is no guarantee that the data matrix is built up of the most distinctive vectors. Consequently, variance should be used along with other term-weighting methods.

2) TF-IDF: TF-IDF is currently the most common term weighting method. It is widely used in information retrieval and text mining for identifying the most important variables within datasets. Numerous studies have concluded that TF-IDF works well, though they do not explain why [51, 59, 61-64]. IDF was developed by Karen Spärck Jones in 1972 with the publication of her article "A statistical interpretation of term specificity and its application in retrieval". Spärck Jones [65] was the first to propose the measure of term specificity, which later came to be known as Inverse Document Frequency (IDF). The underlying principle of specificity is the selection of particular terms, or rather the adoption of a certain set of effective vocabulary, that collectively characterizes the set of documents. In statistical terms, specificity is a statistical property of index terms, and it is explained in relation to term frequency.
It is based on counting the number of documents in the collection being searched which contain the query term [61, 65]. Given that the term frequency of a document is the number of terms it contains, the specificity of a term is the number of documents to which it pertains [65]. Logically, if descriptions are longer, terms will be used more often. This may lead to the assumption that if a query term is frequently repeated in a document, the document is related to the query. This assumption can, however, be falsified. Spärck Jones [65] argues that a query term that occurs in many documents is not necessarily a good discriminator, and should be given less weight than one which occurs in few documents. Spärck Jones' specificity, or inverse document frequency (IDF), was later coupled with term frequency and has since been used extensively in many term weighting schemes [61, 66, 67]. In TF-IDF, the most discriminant terms are those with the highest TF-IDF values. The weight combines the two factors: a high TF-IDF weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents [51, 59, 66, 67]. The implication for document clustering is that once the highest TF-IDF variables, taken to be the most discriminant terms, are identified, unimportant variables can be deleted and data dimensionality reduced.

3) PCA: PCA is one of the basic geometric tools used to produce a lower-dimensional description of the rows and columns of a multivariate data matrix [50, 68-70]. The main function of PCA is to find the most informative vectors within a data matrix. Jolliffe [71] explains: "The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set" [71]. It can thus be described as a data reduction technique [69].
To put it simply, PCA performs two complementary tasks: (1) organizing sets of data and (2) reducing the number of variables without much loss of information. In many text clustering applications, PCA is used along with cluster analysis so that clustering is based on the most distinctive vectors within data sets. The literature suggests that PCA is used a great deal in text clustering applications prior to performing cluster analysis. The link between cluster analysis and PCA is that both are concerned with finding patterns in data. It is sometimes advised that cluster analysis be based on PCA results so that the clustering structure is built on uncorrelated vectors. In spite of the mathematical nature of PCA, this section is only concerned with the idea of data reduction. The main assumption behind PCA is that a matrix with huge data sets can be reduced so that the most distinctive vectors are identified, with the purpose of best expressing the data and revealing hidden structures. Although some of the discarded variables can be important for clustering, PCA works to perform a 'good' dimensionality reduction with no great loss of information. The underlying principle of PCA is that it transforms correlated variables within datasets into uncorrelated components that describe the covariance relationships among these variables. Fielding [72] explains that PCA "transforms an original set of variables (strictly continuous variables) into derived variables that are orthogonal (uncorrelated) and account for decreasing amounts of the total variance in the original set" [72, p. 16]. The process is carried out by computing principal component scores from all the variables in the data set.
In so doing, the variables that have the highest loading or weight are identified as principal components and the other variables are discarded. The resulting principal components can then be used in subsequent analyses. Given a two-dimensional vector space with dimensions x and y, shown in Fig. 2A, it is possible to transform the distribution of the data into an orthogonal linear representation, as shown in Fig. 2B. The data vector coordinates are then recalculated relative to the new basis. This has the effect of generating a highly correlated 2-dimensional vector space, as shown in Fig. 3. Finally, the data vector coordinates are computed on a given principal component. The variables are weighted in such a way that the resulting components account for a maximal amount of variance in the dataset. This is shown in Fig. 4.

(a) A Two-Dimensional Space of the Data. (b) An Alternative Orthogonal Basis for Data.
Fig. 2. A Representation of a 2-Dimensional Space in Two different Ways.
Fig. 3. A Highly Correlated 2-Dimensional Vector Space.
Fig. 4. Testing Variance in Data using TF-IDF.

As seen in Fig. 4, X' captures almost all the variation in the data, and Y' only a small amount. If Y' is simply disregarded, then the data can be restated in just one rather than the original two dimensions with minimal loss of information, and the data dimensionality has been reduced. The idea extends to any data dimensionality: given a data matrix of 100 rows and 1000 columns, the matrix can be re-described in a lower number of dimensions provided there is redundancy among the variables; that is, they overlap with one another in terms of the information they present. One of the main issues in PCA, however, is determining the number of meaningful principal components (PCs).

B. Data

The study is based on a corpus of 74 novels written by 18 novelists representing different literary traditions. These were alphabetically ordered and coded as shown in Table I.

C. Procedures

For text clustering purposes, a data matrix M was built, including all 74 novels. Three pre-processing steps were carried out. First, all non-alphabetical characters and punctuation marks were removed, and the texts were converted into what is called a bag of words (BOW). Second, stemming was carried out so that only lexical types were retained. Third, the texts were normalized in terms of length so that variation in text length would have no negative impact on the reliability of the text clustering results. A matrix M was thus generated consisting of 74 rows (the number of texts) and 37435 columns (all the lexical types in the texts). One major problem with this matrix is data dimensionality: the matrix is composed of so many variables that no text clustering system could generate reliable clustering structures from it. In the face of this problem, a model of three term weighting methods was applied. First, a variance analysis test using ANOVA was carried out on the matrix M74, 37435. It was found that 1000 variables showed by far the greatest variance, so variables 1-1000 were retained and variables 1001-37435 were removed. This is shown in Fig. 5. Second, a TF-IDF analysis was carried out. Based on the TF-IDF test shown in Fig. 6, only the variables with the highest TF-IDF values were retained; it was decided that the 200 variables with the highest TF-IDF values would be kept. At this stage, the matrix was composed of only 200 variables (M74, 200).

TABLE I.
A LIST OF THE SELECTED NOVELS AND SHORT STORIES

Code | Title of the novel/short story | Author
M01 | A Daughter of Isis | Nawal El-Saadawi
M02 | A Portrait of the Artist as a Young Man | James Joyce
M03 | A Shabby Genteel Story | Thackeray
M04 | Adventures of Huckleberry Finn | Mark Twain
M05 | Aisha | Ahdaf Soueif
M06 | Arabian Jazz | Diana Abu Jaber
M07 | Basil | Wilkie Collins
M08 | Beloved | Toni Morrison
M09 | Bird Summons | Leila Aboulela
M10 | Birds of Paradise | Diana Abu Jaber
M11 | Catherine | Thackeray
M12 | Colored Lights | Leila Aboulela
M13 | Daisy Miller | Henry James
M14 | David Copperfield | Charles Dickens
M15 | Dubliners | James Joyce
M16 | Elsewhere, Home | Leila Aboulela
M17 | Emma | Jane Austen
M18 | Far From the Madding Crowd | Thomas Hardy
M19 | God Help the Child | Toni Morrison
M20 | Hard Times | Charles Dickens
M21 | Home | Toni Morrison
M22 | I Think of You | Ahdaf Soueif
M23 | In Love and Trouble: Stories of Black Women | Alice Walker
M24 | In the Eye of the Sun | Ahdaf Soueif
M25 | Jude the Obscure | Thomas Hardy
M26 | Lady Chatterley's Lover | D. H. Lawrence
M27 | Memoirs of a Woman Doctor | Nawal El-Saadawi
M28 | Meridian | Alice Walker
M29 | Minaret | Leila Aboulela
M30 | Mrs. Dalloway | Virginia Woolf
M31 | My Name is Salma | Fadia Faqir
M32 | Nisanit | Fadia Faqir
M33 | Northanger Abbey | Jane Austen
M34 | Oliver Twist | Charles Dickens
M35 | Origin | Diana Abu Jaber
M36 | Orlando: A Biography | Virginia Woolf
M37 | Paradise | Toni Morrison
M38 | Persuasion | Jane Austen
M39 | Pillars of Salt | Fadia Faqir
M40 | Pride and Prejudice | Jane Austen
M41 | Sandpiper | Ahdaf Soueif
M42 | Sense and Sensibility | Jane Austen
M43 | Song of Solomon | Toni Morrison
M44 | Sons and Lovers | D. H. Lawrence
M45 | Sula | Toni Morrison
M46 | Tar Baby | Toni Morrison
M47 | Tess of the D'Urbervilles | Thomas Hardy
M48 | The Bluest Eye | Toni Morrison
M49 | The Captain's Doll | D. H. Lawrence
M50 | The Cask of Amontillado | Edgar Allan Poe
M51 | The Celebrated Jumping Frog of Calaveras County | Mark Twain
M52 | The Color Purple | Alice Walker
M53 | The Fox | D. H. Lawrence
M54 | The Gilded Age | Mark Twain
M55 | The Luck of Barry Lyndon | Thackeray
M56 | The Map of Love | Ahdaf Soueif
M57 | The Mayor of Casterbridge | Thomas Hardy
M58 | The Moonstone | Wilkie Collins
M59 | The Portrait of a Lady | Henry James
M60 | The Rainbow | D. H. Lawrence
M61 | The Raven | Edgar Allan Poe
M62 | The Tell-Tale Heart | Edgar Allan Poe
M63 | The Translator | Leila Aboulela
M64 | The Voyage Out | Virginia Woolf
M65 | The Waves | Virginia Woolf
M66 | The Woman in White | Wilkie Collins
M67 | To the Lighthouse | Virginia Woolf
M68 | Ulysses | James Joyce
M69 | Under the Greenwood Tree | Thomas Hardy
M70 | Vanity Fair | Thackeray
M71 | Washington Square | Henry James
M72 | Willow Trees Don't Weep | Fadia Faqir
M73 | Women in Love | D. H. Lawrence
M74 | Zeina | Nawal El-Saadawi

Fig. 5. Variance Analysis Test of the Matrix M74, 37435 using ANOVA.
Fig. 6. TF-IDF Test of the Data Matrix M74, 1000.

As a final step, PCA was carried out in order to extract only the most distinctive variables within the matrix M74, 200. Based on the PCA test shown in Fig. 7, only the first 50 variables were retained. The matrix was thus reduced to the 50 variables thought to be the most distinctive features within the corpus.

Fig. 7. A PCA of the Data Matrix M74, 200.

IV. ANALYSIS

In order to test the effectiveness of the proposed model, cluster analysis is used. This is a technique whereby similar texts are grouped together, on the assumption that there is a strong association between members of the same group or cluster in that they share the same characteristics. The closer texts are to each other, the more similar they are, and vice versa. These should be texts that can be classified under a given genre and/or written by the same author. K-means clustering, one of the simplest and most popular cluster analysis methods, is used for the task [73-75].
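A bare-bones sketch of this clustering step may be useful here. The following is a generic implementation of the K-means procedure on hypothetical two-dimensional data, not the software used in the study:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center by
    Euclidean distance, recompute the centers as cluster means, and
    repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # Recompute each center; keep the old one if a cluster emptied.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels

# Six hypothetical "texts" as 2-D feature vectors, forming two clear groups.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```

In the study itself each "point" is a novel represented by its 50 retained feature values rather than two coordinates, but the assignment-recomputation loop is the same.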
In this process, every data point (in our case, the novels) is assigned to the closest center, or nearest mean, based on its Euclidean distance. Then new centers are calculated and the assignments of the data points are updated. This process continues until no further changes occur within the clusters, as seen in Fig. 8. Using K-means clustering, the texts (data points) of the matrix M74, 50 were assigned to three groups, as seen in Fig. 9. This is based on the number of centroids within the clustering structure. It should be noted, however, that the identification of the number of classes can differ from one researcher to another. In order to validate the results of the clustering performance, hierarchical cluster analysis is used. Hierarchical clustering is as simple as K-means clustering, and it results in a clustering structure consisting of nested partitions. The results can be seen in Fig. 10. To test the clustering performance based on our proposed model, the results of the K-means clustering are compared to those of the hierarchical cluster analysis. Results indicate that there is complete agreement between the members of each cluster/group in the two clustering structures. In both clustering structures, there are three main distinct classes, as follows. Group 1 includes 20 novels and short stories. The most distinctive lexical features of this group are words like Islam, veil, marriage, obedience, exile, young, woman, and virginity. Texts included in this cluster are Ahdaf Soueif's Aisha, I Think of You, In the Eye of the Sun, Sandpiper, and The Map of Love; Diana Abu Jaber's Arabian Jazz, Birds of Paradise, and Origin; Fadia Faqir's My Name is Salma, Nisanit, Pillars of Salt, and Willow Trees Don't Weep; Leila Aboulela's Bird Summons, Colored Lights, Elsewhere, Home, Minaret, and The Translator; and Nawal El-Saadawi's A Daughter of Isis, Memoirs of a Woman Doctor, and Zeina.
These texts can be suggested to belong to a class of literature known as Anglophone Arabic literature [76-78].

Fig. 8. K-Means Clustering.
Fig. 9. K-Means Clustering of the Data Matrix M74, 50.
Fig. 10. A Cluster Analysis of the Data Matrix M74, 50.

Group 2 is the biggest one, as it includes 49 novels and short stories. These include Charles Dickens' Bleak House, David Copperfield, Great Expectations, Hard Times, and Oliver Twist; Thomas Hardy's Jude the Obscure, Far From the Madding Crowd, Tess of the D'Urbervilles, The Mayor of Casterbridge, and Under the Greenwood Tree; Henry James' Washington Square; D. H. Lawrence's Sons and Lovers; and Virginia Woolf's Mrs. Dalloway and The Waves. It can be seen that these texts share some features, such as the portrayal of the world as we know it and the discussion of realistic problems. This cluster includes the novels that can be described as realistic novels. Within Cluster 2, however, we can identify four sub-clusters or subclasses. The first subclass includes the texts written by Charles Dickens, Thomas Hardy, William Thackeray, and Wilkie Collins. These are described as social realistic novels [79, 80]. The second subclass includes the texts written by the American writers of the Victorian era: Henry James, Mark Twain, and Edgar Allan Poe. Poe's texts are, however, distant from those of James and Twain, as Poe adopts a different style, the Gothic tradition, in addressing some realistic problems. The third subclass includes the novels and short stories best described as modernist: the books written by James Joyce, D. H. Lawrence, and Virginia Woolf. The fourth subclass includes 11 novels.
These are Toni Morrison's novels Beloved, God Help the Child, Home, Paradise, Song of Solomon, Sula, Tar Baby, and The Bluest Eye, and Alice Walker's In Love and Trouble: Stories of Black Women, Meridian, and The Color Purple. These texts are similar to the other members of the same group (Cluster 2) in the sense that they all address realistic problems. However, they form a distinct class by themselves, focusing more on the problems of Black communities. Group 3 includes only 5 novels: Emma, Northanger Abbey, Persuasion, Pride and Prejudice, and Sense and Sensibility. These are all written by Jane Austen and belong to the same literary tradition of what is referred to as Romanticism [81-83]. It is also clear that the four texts Emma, Persuasion, Pride and Prejudice, and Sense and Sensibility are very close to each other, forming one subclass, while Northanger Abbey represents a separate subclass. This hints that the first four texts are thematically similar to each other while Northanger Abbey has a different theme. It is obvious that the intra-cluster similarity is high; that is, members of each group are similar to each other. It is also clear that each cluster holds information that is not similar to that of the other clusters. It can be claimed, then, that the clustering performance based on our proposed model generated a distinct structure, even though different interpretations can be suggested.

V. CONCLUSION

This study addressed the problem of feature selection in the text clustering applications of literary texts. It proposed an integrated model for extracting the most distinctive features within datasets. The proposed model combines three different term weighting methods: variance, TF-IDF, and PCA.
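The three-stage model can be illustrated end to end. The sketch below uses hypothetical data and simplified NumPy helpers (a classic TF-IDF weighting and an SVD-based PCA); these stand in for, and are not, the statistical tools used in the study:

```python
import numpy as np

def tf_idf(counts):
    """Classic TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)             # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))  # inverse document frequency
    return counts * idf

def pca(X, n_components):
    """PCA via SVD of the mean-centered matrix: project the documents
    onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy document-term count matrix (hypothetical data, not the paper's corpus).
counts = np.array([
    [2, 0, 1, 4, 4],
    [0, 3, 1, 4, 4],
    [1, 1, 5, 4, 4],
    [4, 0, 0, 4, 4],
], dtype=float)

# Stage 1: drop zero-variance terms (the last two columns never vary).
counts = counts[:, counts.var(axis=0) > 0]

# Stage 2: re-weight the surviving terms by TF-IDF.
weighted = tf_idf(counts)

# Stage 3: project onto the leading principal components.
reduced = pca(weighted, n_components=2)
print(reduced.shape)  # (4, 2)
```

The reduced matrix (74 x 50 in the study, 4 x 2 in this toy example) is what is then passed to K-means and hierarchical clustering.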
In order to test the proposed model, a corpus of 74 novels and short stories was designed. Using VSC methods, the selected texts were classified into three distinct classes. It can be concluded that the proposed model is successful in extracting the most distinctive features within datasets. The findings of this study support the claim that traditional or conventional term weighting methods based solely on frequency are not sufficient for extracting the most distinctive features within datasets. The proposed model can be usefully employed in CATA applications for its high accuracy in grouping similar texts together.

ACKNOWLEDGMENT

I take this opportunity to thank Prince Sattam Bin Abdulaziz University in Saudi Arabia, alongside its Scientific Deanship, for all the technical support it has unstintingly provided towards the fulfillment of the current research project.

REFERENCES

[1] R. Popping, Computer-assisted Text Analysis. London: SAGE, 2000.
[2] G. Wiedemann, Text Mining for Qualitative Data Analysis in the Social Sciences: A Study on Democratic Discourse in Germany. Springer Fachmedien Wiesbaden, 2016.
[3] D. N. Bengston and U.S. Forest Service North Central Research Station, Applications of Computer-aided Text Analysis in Natural Resources. U.S. Department of Agriculture, Forest Service, North Central Research Station, 2000.
[4] B. D. Hirsch, Digital Humanities Pedagogy: Practices, Principles and Politics. Open Book Publishers, 2012.
[5] B. Yu, "An Evaluation of Text Classification Methods for Literary Study," Lit Linguist Computing, vol. 23, no. 3, pp. 327-343, 2008.
[6] C. Crompton, R. J. Lane, and R. Siemens, Doing Digital Humanities: Practice, Training, Research. Taylor & Francis, 2016.
[7] B. Yu and J. Unsworth, "Toward Discovering Potential Data Mining Applications in Literary Criticism," presented at Digital Humanities, Paris-Sorbonne, 5-9 July 2006.
[8] J.
Unsworth, "Scholarly Primitives: What Methods do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?," Symposium on Humanities Computing: Formal Methods, Experimental Practice, King's College, London, 13 May 2000.
[9] S. Argamon and M. Olsen, "Toward Meaningful Computing," Communications of the ACM, vol. 49, no. 4, pp. 33-35, 2006.
[10] G. Tambouratzis and M. Vassiliou, "Employing Thematic Variables for Enhancing Classification Accuracy Within Author Discrimination Experiments," Lit Linguist Computing, vol. 22, no. 2, pp. 207-224, 2007.
[11] C. Labbe and D. Labbe, "A Tool for Literary Studies: Intertextual Distance and Tree Classification," Lit Linguist Computing, vol. 21, no. 3, pp. 311-326, 2006.
[12] J. Nakamura and J. Sinclair, "The World of Woman in the Bank of English: Internal Criteria for the Classification of Corpora," Lit Linguist Computing, vol. 10, no. 2, pp. 99-110, 1995.
[13] S. Ramsay, "In Praise of Pattern," TEXT Technology: the Journal of Computer Text Processing, vol. 14, no. 2, pp. 177-190, 2005.
[14] T. Horton, C. Taylor, B. Yu, and X. Xiang, "'Quite Right, Dear and Interesting': Seeking the Sentimental in Nineteenth Century American Fiction," presented at Digital Humanities, Paris-Sorbonne, France, 5-9 July 2006.
[15] G. Rockwell, "What is Text Analysis, Really?," Lit Linguist Computing, vol. 18, no. 2, pp. 209-219, 2003.
[16] P. Boot, "Decoding Emblem Semantics," Lit Linguist Computing, vol. 21, no. suppl_1, pp. 15-27, 2006.
[17] T. Rommel, "Literary Studies," in A Companion to Digital Humanities, S. Schreibman, R. Siemens, and J. Unsworth, Eds. Oxford: Blackwell, 2004, pp. 88-97.
[18] T. N. Corns, "Computers in the Humanities: Methods and Applications in the Study of English Literature," Lit Linguist Computing, vol. 2, no. 2, pp. 127-130, 1987.
[19] S.
Ramsay, "Special Section: Reconceiving Text Analysis: Toward an Algorithmic Criticism," Lit Linguist Computing, vol. 18, no. 2, pp. 167-174, 2003.
[20] T. W. Machan, "Late Middle English Texts and the Higher and Lower Criticisms," in Medieval Literature: Texts and Interpretation (Medieval and Renaissance Texts and Studies), T. W. Machan, Ed. Binghamton, New York, 1991, pp. 3-16.
[21] R. Siemens, "A New Computer-assisted Literary Criticism?," Computers and the Humanities, vol. 36, no. 3, pp. 259-267, 2002.
[22] R. Cohen, The Future of Literary Theory. New York: Routledge, 1989.
[23] S. M. Hockey, Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford University Press, 2000.
[24] M. Terras, J. Nyhan, and E. Vanhoutte, Defining Digital Humanities: A Reader. Taylor & Francis, 2016.
[25] T. H. Howard-Hill, Literary Concordances: A Complete Handbook for the Preparation of Manual and Computer Concordances. Elsevier Science, 2014.
[26] M. L. Jockers, Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.
[27] N. Dershowitz and E. Nissan, Language, Culture, Computation: Computing - Theory and Technology: Essays Dedicated to Yaacov Choueka on the Occasion of His 75th Birthday, Part 1. Springer Berlin Heidelberg, 2014.
[28] C. Plaisant, J. Rose, and B. Yu, "Exploring Erotics in Emily Dickinson's Correspondence with Text Mining and Visual Interfaces," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06), Chapel Hill, North Carolina, 11-15 June 2006.
[29] R. Horton, M. Olsen, G. Roe, and R. Voyer, "Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie," presented at Digital Humanities 2007, Urbana-Champaign, Illinois, 2-8 June 2007.
[30] Z. Xiao and A.
McEnery, "Two Approaches to Genre Analysis: Three Genres in Modern American English," Journal of English Linguistics, vol. 33, no. 1, pp. 62-82, 2005.
[31] M. Koppel, S. Argamon, and A. R. Shimoni, "Automatically Categorizing Written Texts by Author Gender," Lit Linguist Computing, vol. 17, no. 4, pp. 401-412, 2002.
[32] M. Wolters and M. Kirsten, "Exploring the Use of Linguistic Features in Domain and Genre Classification," in Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, 1999.
[33] D. I. Holmes, "The Evolution of Stylometry in Humanities Scholarship," Lit Linguist Computing, vol. 13, no. 3, pp. 111-117, 1998.
[34] M. L. Jockers, "Machine-Classifying Novels and Plays by Genre," 2009. [Online]. Available: https://www.stanford.edu/~mjockers/cgi-bin/drupal/node/27 (accessed 16 March 2010).
[35] B. Kessler, G. Nunberg, and H. Schütze, "Automatic Detection of Text Genre," in Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, 1997.
[36] S. Ramsay, "Algorithmic Criticism," in A Companion to Digital Literary Studies (Blackwell Companions to Literature and Culture), R. G. Siemens and S. Schreibman, Eds. Malden, MA: Blackwell Publishers, 2007.
[37] J. F. Burrows, "Modal Verbs and Moral Principles: An Aspect of Jane Austen's Style," Lit Linguist Computing, vol. 1, no. 1, pp. 9-23, 1986.
[38] J. F. Burrows, "Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style," Lit Linguist Computing, vol. 2, pp. 60-71, 1987.
[39] J. F. Burrows, Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method.
Oxford: Clarendon Press, 1987.
[40] J. F. Burrows, "'An ocean where each kind...': Statistical Analysis and Some Major Determinants of Literary Style," Computers and the Humanities, vol. 23, no. 4, pp. 309-321, 1989.
[41] R. A. J. Matthews and T. V. N. Merriam, "Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher," Lit Linguist Computing, vol. 8, no. 4, pp. 203-209, 1993.
[42] M. Q. Patton, Qualitative Research & Evaluation Methods, 3rd ed. London: Sage, 2002.
[43] R. S. Forsyth and D. I. Holmes, "Feature-finding for Text Classification," Lit Linguist Computing, vol. 11, no. 4, pp. 163-174, 1996.
[44] D. I. Holmes, "Authorship Attribution," Computers and the Humanities, vol. 28, pp. 87-106, 1994.
[45] D. I. Holmes and R. S. Forsyth, "The Federalist Revisited: New Directions in Authorship Attribution," Lit Linguist Computing, vol. 10, no. 2, pp. 111-127, 1995.
[46] M. W. A. Smith, "Shakespeare, Stylometry and 'Sir Thomas More'," Studies in Philology, vol. 89, no. 4, pp. 434-444, 1992.
[47] M. W. A. Smith, "An Investigation of Morton's Method to Distinguish Elizabethan Playwrights," Computers and the Humanities, vol. 19, no. 1, pp. 3-21, 1985.
[48] C. Delcourt, "About the Statistical Analysis of Co-occurrence," Computers and the Humanities, vol. 26, no. 1, pp. 21-29, 1992.
[49] T. Sing, S. Siraj, R. Raguraman, P. Marimuthu, and K. Nithiyananthan, "Cosine Similarity Cluster Analysis Model Based Effective Power Systems Fault Identification," International Journal of Advanced and Applied Sciences, vol. 4, no. 1, pp. 123-130, 2017.
[50] M. W. Berry, Survey of Text Mining: Clustering, Classification, and Retrieval. Springer New York, 2013.
[51] I. Zelinka, P. Vasant, V. H. Duy, and T. T. Dao, Innovative Computing, Optimization and Its Applications: Modelling and Simulations. Springer International Publishing, 2017.
[52] S.
Sirmakessis, Text Mining and its Applications: Results of the NEMIS Launch Conference. Springer Berlin Heidelberg, 2012.
[53] T. Jo, Text Mining: Concepts, Implementation, and Big Data Challenge. Springer International Publishing, 2018.
[54] C. C. Aggarwal and C. X. Zhai, Mining Text Data. Springer New York, 2012.
[55] K. L. Du and M. N. S. Swamy, Neural Networks and Statistical Learning. Springer London, 2019.
[56] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications. CRC Press, 2018.
[57] R. Nisbet, G. Miner, and K. Yale, Handbook of Statistical Analysis and Data Mining Applications. Elsevier Science, 2017.
[58] S. M. Weiss, N. Indurkhya, T. Zhang, and F. Damerau, Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer New York, 2010.
[59] S. M. Weiss, N. Indurkhya, and T. Zhang, Fundamentals of Predictive Text Mining. Springer London, 2015.
[60] C. Bouveyron, G. Celeux, T. B. Murphy, and A. E. Raftery, Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge University Press, 2019.
[61] S. Robertson, "Understanding Inverse Document Frequency: On Theoretical Arguments for IDF," Journal of Documentation, vol. 60, no. 5, pp. 503-520, 2004.
[62] D. H. Kraft, E. Colvin, and G. Marchionini, Fuzzy Information Retrieval. Morgan & Claypool Publishers, 2017.
[63] B. Mitra and N. Craswell, An Introduction to Neural Information Retrieval. Now Publishers, 2018.
[64] M. Gopal, Applied Machine Learning. McGraw-Hill Education, 2019.
[65] K. Spärck Jones, "A Statistical Interpretation of Term Specificity and its Application in Retrieval," Journal of Documentation, vol. 28, pp. 11-21, 1972.
[66] G. Salton and C. Buckley, "Term-weighting Approaches in Automatic Text Retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.
[67] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Cornell University, 1987.
[68] W. Härdle and L.
Simar, Applied Multivariate Statistical Analysis. Berlin; New York: Springer, 2003.
[69] J. E. Jackson, A User's Guide to Principal Components (Wiley Series in Probability and Mathematical Statistics). New York: Wiley, 1991.
[70] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: with Applications in R. Springer New York, 2013.
[71] I. T. Jolliffe, Principal Component Analysis, 2nd ed. (Springer Series in Statistics). Berlin; London: Springer, 2002.
[72] A. Fielding, Cluster and Classification Techniques for the Biosciences. Cambridge, UK; New York: Cambridge University Press, 2007.
[73] A. Khan, S. Baseer, and S. Javed, "Perception of Students on Usage of Mobile Data by K-mean Clustering Algorithm," International Journal of Advanced and Applied Sciences, vol. 4, no. 2, pp. 17-21, 2017.
[74] P. Kaur, S. Singla, and S. Singh, "Detection and Classification of Leaf Diseases Using Integrated Approach of Support Vector Machine and Particle Swarm Optimization," International Journal of Advanced and Applied Sciences, vol. 4, no. 8, pp. 79-83, 2017.
[75] Z. Ullah, S. Lee, and M. Fayaz, "Enhanced Feature Extraction Technique for Brain MRI Classification Based on Haar Wavelet and Statistical Moments," International Journal of Advanced and Applied Sciences, vol. 6, no. 7, pp. 89-98, 2019.
[76] L. Al Maleh, Arab Voices in Diaspora: Critical Perspectives on Anglophone Arab Literature. Rodopi, 2009.
[77] G. Nash, The Anglo-Arab Encounter: Fiction and Autobiography by Arab Writers in English. Peter Lang, 2007.
[78] Z. Halabi, The Unmaking of the Arab Intellectual: Prophecy, Exile and the Nation. Edinburgh University Press, 2017.
[79] E. Freedgood, Worlds Enough: The Invention of Realism in the Victorian Novel. Princeton University Press, 2019.
[80] D. David, Ed., The Cambridge Companion to the Victorian Novel. Cambridge University Press, 2001.
[81] C. Lamont and M. Rossington, Romanticism's Debatable Lands. Palgrave Macmillan UK, 2007.
[82] S. Ailwood, Jane Austen's Men: Rewriting Masculinity in the Romantic Era. Taylor & Francis, 2019.
[83] M. Ferber, Romanticism: A Very Short Introduction. OUP Oxford, 2010.