DOI: 10.31703/glr.2020(V-I).17
URL: http://dx.doi.org/10.31703/glr.2020(V-I).17
Citation: Khan, A., & Rasul, S. (2020). Extraction of Semantic Domains Through Corpus Tools. Global Language Review, V(I), 153-168. doi:10.31703/glr.2020(V-I).17

Extraction of Semantic Domains through Corpus Tools

Azka Khan* Sarwet Rasul†

p-ISSN: 2663-3299 e-ISSN: 2663-3841 L-ISSN: 2663-3299
Vol. V, No. I (Winter 2020) Pages: 153 – 168

Introduction

The widespread use of computer technology in the last decade of the previous century drastically increased the number and scope of computer-aided studies in corpus linguistics. Surprisingly, despite the availability of huge amounts of machine-readable text and numerous automatic text analysers, computer-aided text analysis is still not a common approach in the subfields of the social sciences and humanities. This article endeavours to show the benefits and hurdles of using (semi-)automatic text analysis technologies for qualitative studies in digital humanities. It does not suggest that the hindrances or limitations have been completely removed; rather, it proposes that there is a pressing need to unlock the potential of these technologies by encouraging innovative researchers in digital humanities to explore, adapt and modify the newly developed approaches for the large volume of digital texts now available. The article also voices the concerns that arise in extracting the dominant semantic domain from a fictional discourse with the help of corpus tools, and it presents a systematic account of selected features of three software packages that make qualitative research easier. For many years now, computer-aided text analysis has not been limited to counting words: many newer corpus tools help researchers explore the qualitative aspects of the data too.
Abstract: The increased interest in the techniques of corpus linguistics in the first decade of the 21st century rested on a premise that remains valid today: the investigation of larger datasets in less time. This article compares the results of different corpus techniques employed for exploring the dominant semantic domains in a corpus. These techniques include word clouds, frequency lists and KWIC. The study uses a work of fiction by Kamila Shamsie, Broken Verses (2005), to illustrate the corpus methodology. In addition to the corpus techniques, the study compares the usability of different corpus software for this purpose, namely AntConc (3.2.4), NVivo 11 and Sketch Engine. The article offers a starting point for researchers exploring a text in any field of corpus linguistics and digital humanities.

Key Words: CADS, Digital Humanities, E-Humanities, KWIC, Lemma, Semantic Fields, Stemmed.

Having a clear corpus methodology for the extraction of semantic domains is important in two ways: for language researchers, it helps to understand the meaning of a text in less time; for computational linguists, it helps to go beyond the simple counting of the most frequent words towards a more complex understanding of human language by computer systems. Thus, a conceptual understanding of the context bridges the gap between quantitative and qualitative research designs, which can eventually lead to more sophisticated automatic extraction of "meaning" from a discourse.

* PhD Scholar, Department of English, Fatima Jinnah Women University, Rawalpindi, Punjab, Pakistan. Email: azkakhan80s@gmail.com
† Associate Professor, Department of English, Fatima Jinnah Women University, Rawalpindi, Punjab, Pakistan.
The development of computer-automated systems has helped to overcome many challenges faced by researchers in digital humanities, but working with natural languages is still not free from ambiguity and complexity, and the extraction of semantic domains remains a challenge for social scientists even now.

Aim and objectives

The current research has two main aims. First, it applies and compares different corpus techniques for establishing the dominant semantic domains in a discourse. A novel by Kamila Shamsie titled Broken Verses (2005) is used as an example to illustrate the findings, but the methodology is applicable to any corpus in digital humanities and the social sciences. The corpus techniques used in this research include word clouds, frequency lists of both stemmed and synonymous words, and KWIC. Secondly, it points out the potential benefits of different corpus software for extracting dominant semantic domains in a discourse, mainly by discussing three software packages: AntConc (3.2.4), NVivo 11 and Sketch Engine. This research is guided by the following research questions.

1. How can we extract dominant semantic domains from a literary text by using corpus techniques?
2. Which features of the selected computer software help in this context?

Structure of the Current Research

This article is structured in three distinct parts. The first part reviews related research in digital humanities, focusing on corpus assisted discourse studies (henceforth CADS) as an example. This part also explains the need for a replicable corpus methodology in extracting semantic domains from the selected text. The second part discusses three methods for extracting the dominant semantic domains, namely frequency lists, word clouds and KWIC.
This part also discusses the limitations and reliability of the corpus software used. The last part of the article consists of concluding remarks about the three methods employed for the extraction of semantic domains.

E-Humanities/Digital Humanities and Computer-Aided Researches in Social Sciences

Digital humanities (generally abbreviated as DH) is an emerging field of study at the intersection of digital technologies, mainly computers, and different sub-disciplines of the humanities. In DH, the development of scholarship involves collaboration in transdisciplinary research and demands the teaching and publication of computationally engaged research (Terras, 2011). The production and use of new computer applications and techniques allow DH researchers to experiment with new teaching techniques and adapted research approaches (Burdick et al., 2012). Thus, the cultivation of a two-way collaborative relationship between the humanities and the digital results in the development of a new scholarship. Corpus linguistics is one such sub-discipline of DH, and it is flourishing rapidly through the use of innovative research methodologies. On one side, it involves computational linguists in the development of computer software; on the other, it relies on corpus linguists to verify and validate that software. Historically, digital humanities has been associated with fields other than linguistics, such as humanistic computing, media studies and social computing, but since the turn of the century corpus linguistics has gained a prestigious position owing to the innovative research carried out in it.

Methodological Scepticism and Semantic Ambiguity in Computer-aided Analysis

Using innovative modes gives researchers new insights but poses methodological problems too. The distribution of immense amounts of informative data on the World Wide Web and in emails, blogs, memos, articles etc. demands that useful information be extracted quickly and at a low cost.
Text mining, topic modelling, computational content analysis (CCA) and Computer Assisted Qualitative Data Analysis (CAQDAS) are some of the areas which focus on the refinement of automated computational methods for dealing with the enormous amount of knowledge in DH (Pollak et al., 2011). The biggest hindrance in dealing with natural language texts is the ambiguity of meaning and semantic uncertainty. Very few automated text analysis packages can claim to extract semantically correct information from linguistic texts. Extracting linguistic information requires knowledge of lexemes and lemmas, a sound grip on the specific syntax of the texts and an understanding of the context (Pollak et al., 2011). Although syntactic parsing is used to address lexical ambiguity, the problem is still not solved completely. The point is illustrated by two examples given by Wiedemann (2013). Consider the following two sentences.

1. I have put the baby in the pen.
2. He runs the company.

Syntactic processing (POS tagging) will help the computer system determine that the word pen belongs to the noun category of lexemes. Similarly, the word runs is categorized as a verb. However, when the software tries to extract the semantic information of these two words, semantic ambiguity and uncertainty cause a problem. There are at least three possible meanings of the word pen: a writing tool, a female swan, or an enclosure where babies can be left. Similarly, the word run has two meanings: the activity of controlling something or a physical action. A reliable automated text analyser should be able to resolve such semantic ambiguity correctly. So far, the automated text analysers available are not reliable in dealing with such semantic ambiguities of natural languages.
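The gap between syntactic and semantic processing can be made concrete with a toy example. The sketch below is illustrative only: the sense inventory SENSES and the helper names are hypothetical and hand-written for the two sentences above; they are not the output of any real tagger or lexicon. It simply shows that knowing the part of speech still leaves several candidate senses.

```python
# Toy illustration: a POS tag fixes the category of a word, not its sense.
# SENSES is a hypothetical, hand-made inventory based on the two example
# sentences; a real system would need a lexicon and contextual evidence.
SENSES = {
    ("pen", "NOUN"): ["writing tool", "female swan", "enclosure for a baby"],
    ("run", "VERB"): ["manage or control", "move quickly on foot"],
}


def candidate_senses(lemma: str, pos: str) -> list[str]:
    """Return every sense that is still possible once the POS tag is known."""
    return SENSES.get((lemma, pos), [])


def is_ambiguous(lemma: str, pos: str) -> bool:
    """A word remains ambiguous if more than one sense survives POS tagging."""
    return len(candidate_senses(lemma, pos)) > 1


for lemma, pos in [("pen", "NOUN"), ("run", "VERB")]:
    senses = candidate_senses(lemma, pos)
    print(f"{lemma}/{pos}: {len(senses)} candidate senses -> {senses}")
```

Choosing among the surviving senses requires contextual knowledge (a baby is put in an enclosure, a company is managed), which is the step that still calls for human intervention.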
Thus, using computational techniques for the extraction of semantic domains in DH is not without problems and demands human intervention to avoid misleading results. Therefore, the studies made in this field are relatively small in scale. Secondly, the conclusions of such studies cannot be generalised to a broader scale. Thirdly, experts in natural languages need fuller explanations of the stepwise statistical methods adopted in computer-based studies, even more so if they want to replicate the methodological framework. In the next section I have mainly drawn examples from corpus linguistics and discourse studies to review the research available in the interdisciplinary field of CADS (Corpus Assisted Discourse Studies), which employs computer software.

Current Trends in Discourse and Corpus Linguistics

Corpus linguistic techniques help to reveal and analyse the recurrent linguistic patterns in a discourse in a way that is not possible intuitively. In the last decade of the 20th century, corpus stylistics established itself as a new field of interest (Sinclair 1991, Stubbs 1996). One early influence on corpus stylistic analyses is Halliday (1971), who suggested that analysing the use of transitive and intransitive verbs in The Inheritors by Golding can help to infer literary meanings from the text. Halliday demonstrated that the distinctive usage of a grammatical feature influences the meaning and message of a literary text. The link between a grammatical feature and the hidden message, in other words the link between form and content, is almost imperceptible to intuition. Corpus techniques can help researchers to analyse a large sample of writing by a single author in little time and therefore provide empirical evidence for the analysis of form/structure, which eventually helps in understanding the content/theme.
Halliday (1971) concluded his research by suggesting that the excessive use of intransitive verbs for describing a Neanderthal tribe helped the writer to highlight its passivity and lack of innovativeness. These traits made the survival of the tribe impossible in the course of evolution. However, Halliday’s analysis has been strongly criticised by Hoover (1999) because the research methodology is difficult for later researchers to replicate. Hoover (1999) considers that Halliday’s methodology lacks the explicit documentation and transparency of analysis that other analysts would need for their own work. Burrows (1987) extracts the literary meanings of discourse from linguistic data by discussing the relationship between the idiolects used by protagonists and their personality traits. Examples of corpus stylistic analyses include Burgess (1999), Hardy & Durian (2000) and Tribble (2000). All of them adopted Burrows’ (1987) methodology to understand the relationship between the usage of lexical and grammatical words in literary discourse and the meaning of the data.

[Figure labels: Critical Discourse Analysis (CDA) (as theoretical framework); Extraction of Semantic Domains (Using LD as Sample); Corpus Linguistics (CL) (as methodology)]

While analysing a discourse, the linguistic sample under investigation needs to be understood in relation to the accompanying context. This is one main reason why, so far, discourse analysis has not been defined as a universal set of procedures that could be formalised into a computer package (Antaki et al., 2003), and why it poses new problems. Nevertheless, the use of corpus techniques for analysing discourse is termed a methodological synergy by Baker (2006). This methodological shift allowed corpus linguists and discourse analysts to access large-scale data and generate more quantitative evidence than the small-scale data used previously could provide.
Corpus techniques allow the exploration not only of traditional texts such as newspaper articles, editorials and speeches but also of newer mediated texts, for example Facebook comments and tweets. So far, fictional texts have not been explored much with them. The main reason for this neglect seems to be certain methodological problems faced by language researchers. Firstly, a corpus tool cannot differentiate between reported and reporting speech. Recently, a tool called CLiC has been introduced to analyse local textual functions in fiction, but its use is limited to searching a Dickens corpus and a few other 19th-century reference corpora (Mahlberg et al., 2016), and the interface does not allow a new text to be uploaded. Secondly, a corpus tool cannot identify which pronoun is used for which fictional character. Thirdly, figures of speech such as metaphor, irony and puns, which are of great importance for meaning making in fictional discourse, cannot be identified by a corpus tool. The gap is still there, and literary texts are used as samples mostly in the field of corpus stylistics. The next section discusses the research already available in the field of CADS.

Need for a Systematic and Replicable Linguistic Analytical Framework for Extraction of Semantic Domains

Because so little research has used corpus techniques in this way, there is an increased need to fill the gap by proposing replicable and systematic methodologies, especially to resolve the issue of semantic ambiguity.

[Figure label: Need for new Methodologies]
Fig 1: Corpus Methodologies for Extraction of Semantic Domains

Sally Hunt (2015) is one of the few researchers who have analysed the representation of gender and agency in the Harry Potter series by using corpus techniques. Hunt (2015) focused on the words used for the body parts of the social actors in this series. Since the field of CADS is in its incipient years, the choice of literary text for such research is very important.
Fischer-Starcke’s work on Pride and Prejudice (2009) and Stubbs’s work on Heart of Darkness (2004 & 2005) are discussed here as examples because both offer an important rationale for the choice of text. Fischer-Starcke (2009) states that the novel was chosen deliberately because it has been widely discussed and analysed by numerous critics for nearly two hundred years. This makes the novel an especially attractive text for developing and verifying new corpus methodologies, since it enables a comparison between the findings of traditional methods of text analysis and the findings of corpus-based analysis. This helps the researcher to evaluate the effectiveness of the corpus techniques employed on the novel. The analyst can also focus on the linguistic/discursive processes used by the writer to construct meaning. Following the same principle, Stubbs (2004 & 2005) used a century-old novel, Heart of Darkness, for a corpus stylistic analysis in which he tried to show that the cultural and literary aspects of the novel can be brought out with the help of frequency lists and the distribution of words and recurrent phrases. This analysis also helped to identify important linguistic features which are usually missed by literary critics.

Extraction of Semantic Domains

Semantic domains, as defined by Brinton (2001), are groups of lexemes that share a common semantic property. Mostly these fields are defined by commonality of subject matter, such as landforms, colours, names of food items, or kinship relations. Computer-aided extraction of semantic domains from large amounts of text can be useful in all the fields of Digital Humanities. Establishing the credibility or high precision of a methodology demands checking the credibility of the tools and software available for corpus analysis.
Extraction of semantic domains requires a three-step method:

(i) syntactically categorizing the lexemes (POS tagging);
(ii) recognizing the lexemes that belong to the same semantic fields;
(iii) clarifying semantic ambiguities (if any) to understand the relation between the selected lexemes, and categorizing them semantically (semantic tagging).

The reliability of some of the corpus techniques for extracting semantic domains that are available to DH researchers is discussed in the next section.

Employing Frequency Lists of Stemmed Words and Synonyms for Extracting Semantic Domains

An important principle on which corpus studies are founded is the assumption that the most frequent lexical items are the most significant ones for establishing the dominant semantic fields and understanding the discourse structures (Sinclair 1991). On this view, the frequency of lexical items is directly related to the structure and the content of the discourse. On the basis of this assumption, the first corpus linguistic tool used in this research to establish the dominant semantic fields is the frequency list. A novel by Shamsie titled Broken Verses is used as a sample in this research. The study corpus is abbreviated as SCBV (study corpus Broken Verses). While generating the frequency lists, function words are not taken into account, on the belief that the main semantic load is carried by the content words. NVivo 11 is used for generating the frequency lists because of its unique features, discussed in the next section.

Unique Features of NVivo 11

The unique features of NVivo 11 include the ease of uploading corpus files. NVivo 11 (Edhlund & McDougall, 2019) is a powerful software package for qualitative data analysis which can open .pdf, .txt, .rtf and other files, including those containing visuals and graphics. Unlike AntConc, it does not require the study corpus (SC) to be converted into TXT format before it is uploaded.
Another important feature of NVivo 11 is that, when generating frequency lists, it automatically removes the function words from the SC (Table 01 and 02). This way the researcher can focus only on the semantically loaded words, which are the content words. The software provides two types of settings for generating frequency lists.

1. Frequency lists may be generated by treating all the stemmed words as one entry, e.g., like, likes, liked, liking etc. For ease of reference, this list is termed the Stemmed Freq. List in this research (see table 01). The advantage of this setting is that it gathers all the word forms of a lexeme into a single category. Thus the stemmed freq. list can be helpful in identifying the most frequent lexeme in a corpus (see table 01). The most frequent lexeme in SCBV is mother. Among the top twenty entries, this is the only word which tells us something about the thematic content of the novel. The plot of SCBV revolves around the most important character in the novel, Samina Akram. Her daughter Aasmani is the narrator of the novel, and she uses the word mother very frequently for Samina Akram. None of the other words gives the researcher any clue for further exploration.

Table 1. Stemmed Freq. List of SCBV (top 20 entries)

Word | Length | Count | Weighted Percentage (%) | Similar Words
mothers | 7 | 468 | 0.92 | mother, mother’, mothers, mothers’
ones | 4 | 430 | 0.85 | one, ones
just | 4 | 383 | 0.76 | just
looked | 6 | 363 | 0.72 | look, looked, looking, looks
knowing | 7 | 339 | 0.67 | know, knowing, knowingly, knows
hands | 5 | 275 | 0.54 | hand, handed, handful, handing, hands
back | 4 | 256 | 0.51 | back, backed, backing, backs
years | 5 | 240 | 0.47 | year, years
even | 4 | 237 | 0.47 | even, evening, evenings
time | 4 | 235 | 0.46 | time, timed, times, times’, timing
poet | 4 | 234 | 0.46 | poet, poet’, poets
lovely | 6 | 234 | 0.46 | love, loved, lovely, loves, loving, loving’
days | 4 | 220 | 0.43 | day, days
think | 5 | 210 | 0.41 | think, think’, thinking, thinks
way | 3 | 202 | 0.40 | way, ways
away | 4 | 201 | 0.40 | away
want | 4 | 194 | 0.38 | want, wanted, wanting, wants
knew | 4 | 191 | 0.38 | knew
never | 5 | 185 | 0.37 | never
now | 3 | 185 | 0.37 | now

2. The second setting used for generating frequency lists through NVivo 11 treats all the synonymous words present in the text as one entry. For example, the most common word in SCBV under this setting is look, and NVivo can categorise all its synonyms under one head. Some of the words included in entry 01 of Table 02 carry a very different semantic shade. To illustrate this point, some words from the beginning of the list of synonyms are compared with words from the end of the list. Words such as appear, count, front, smell, sound and await have many different shades of meaning (Table 02). The head word look may be used as a synonym for each of these words, but they are very different in meaning from one another. For example, the word appear has a completely different meaning from the word search, and wait has a completely different meaning from the word smell. This holds true for all ten entries listed in table 02. Therefore, relying solely on the synonym freq. list does not help much in the extraction of semantic fields. For the sake of brevity, only the top ten entries have been included in table 02.
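Both settings group several surface forms under one head word, and both can over-group: in Table 1, for instance, even is listed together with evening and evenings. The article does not describe how NVivo 11 derives its stems, so the following is only a minimal sketch under stated assumptions. It uses a deliberately naive suffix stripper (the SUFFIXES rule set and the function names are hypothetical, not NVivo's algorithm) to show how a rule-based stemmed frequency list can merge unrelated words.

```python
from collections import Counter, defaultdict

# Hypothetical rule set, longest suffix first. NOT NVivo's actual algorithm.
SUFFIXES = ("ings", "ing", "ed", "ly", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_frequency(tokens):
    """Count tokens per stem and record which surface forms were merged."""
    counts = Counter()
    members = defaultdict(set)
    for token in tokens:
        token = token.lower()
        stem = naive_stem(token)
        counts[stem] += 1
        members[stem].add(token)
    return counts, members


# A handful of forms taken from the "Similar Words" column of Table 1.
tokens = ["even", "evening", "evenings", "mother", "mothers",
          "looked", "looking", "looks"]
counts, members = stemmed_frequency(tokens)

for stem, count in counts.most_common():
    print(stem, count, sorted(members[stem]))
```

In this sketch the adverb even and the noun evening end up under the same stem, which mirrors the grouping seen in Table 1 and suggests why a stemmed list still needs manual checking before semantic fields are read off it.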
In the list of synonyms given against each entry of Table 02, the words whose meaning differs sharply from the head word are set in bold.

Table 2. Synonym Freq. List SCBV (top 10 entries)

Word | Length | Count | Weighted Percentage (%) | Similar Words
looked | 6 | 1158 | 1.33 | appear, appearance, appeared, appearing, appears, aspect, attend, await, awaiting, bet, count, counted, counting, depended, depending, depends, expect, expectant, expectation, expectations, expected, expecting, express, expressed, expresses, expressing, expression, expressions, face, faced, faces, facing, feel, feeling, feelings, feelings’, feels, front, fronts, look, looked, looking, looks, search, searched, searching, see, seeing, seem, seemed, seemingly, seems, sees, smell, smells, sound, sounded, sounding, sounds, spirited, tone, tones, wait, waited, waiting
mother | 6 | 701 | 1.15 | engender, father, fathers, fuss, generate, generated, generation, generation’, generations, get, gets, getting, maternal, mother, mother’, mothers, mothers’
know | 4 | 914 | 1.12 | acknowledge, acknowledged, acknowledgement, acknowledgements, bang, banged, banging, bed, experience, experiment, experimenting, humps, intent, intention, intentions, intently, intents, jazz, know, knowing, knowingly, knowledge, knows, learn, learned, learning, learns, letter, lettering, letters, live, live’, lived, lives, living, love, loved, lovely, loves, loving, loving’, recognize, recognized, screw, wit, witness, witnessed, witnesses’
going | 5 | 1416 | 1.04 | adam, become, becomes, becoming, belong, belonged, belongs, break, breaking, breaks, choke, choked, crack, cracked, cracks, departed, departure, die, died, dies, dying, endure, enduring, exit, exited, exiting, extended, extending, fail, failed, failing, failings, fit, fitted, fitting, flings, get, gets, getting, going, last, lasted, lead, leading, leads, leave, leaves, leaving, live, live’, lived, lives, living, loss, move, moved, moves, moving, moving’, offer, offered, offering, offerings, offers, operate, operating, operators, pass, passed, passing, plumpness, proceeded, proceedings, release, released, run, running, sound, sounded, sounding, sounds, spell, start, started, starting, starts, survive, survived, surviving, tour, touring, travel, traveller, travellers, travels, turn, turned, turning, turns, whirling, work, worked, working, workings, works
just | 4 | 703 | 1.03 | bare, barely, exact, exacted, exacting, exactly, fair, fairly, good, goods, hard, hardly, just, justice, justify, mere, merely, precise, precisely, precision, right, righted, rightful, rightly, rights, scarcely, simply, upright
think | 5 | 889 | 0.99 | believe, believe’, believed, believing, conceive, consider, considered, considering, guess, guess’, guessed, imagination, imaginations, imagine, imagined, imagining, intelligence, intelligent, intend, intended, mean, meaning, means, reason, reasonable, reasonably, reasons, recall, recalled, recalling, remember, remembered, remembering, remembers, retrieve, retrieved, suppose, supposed, supposing, think, think’, thinking, thinks, thought, thoughtful, thoughts
one | 3 | 467 | 0.88 | one, ones, single, unity
make | 4 | 1104 | 0.86 | attained, brand, build, building, buildings, cause, caused, causing, clear, cleared, clearly, clears, constitute, constitution, constitutional, construct, constructed, construction, cook, cooked, cooking, create, created, creates, creating, devised, draw, drawing, draws, earned, fashion, fashioned, fashions, fix, fixed, fixedly, fixing, form, formed, forming, forms, gain, gained, gains, get, gets, getting, give, gives, giving, hit, hitting, hold, holding, holdings, holds, make, makes, making, name, named, names, naming, piss, preparation, prepare, prepared, preparing, pretend, pretended, pretending, produce, produced, producer, producers, produces, producing, puddle, reach, reached, reaching, ready, realization, realize, realized, realizes, score, scored, scores, seduce, seduced, seduces, shit, shit’, shuffled, stools, take, takes, taking, throw, throwing, throws, urine, work, worked, working, workings, works
hand | 4 | 496 | 0.73 | custody, deal, dealing, fist, fistful, fists, give, gives, giving, hand, handed, handful, handing, hands, handwriting, men, pass, passed, passing, paws, reach, reached, reaching, script, scripts
years | 5 | 493 | 0.72 | age, aged, ages, classes, day, days, year, years

For the sake of brevity, the complete frequency lists are not included here. Nevertheless, the top twenty entries in the stemmed freq. list (table 01) and the top ten entries in the synonym freq. list (table 02) make it evident that some other corpus technique is needed for the extraction of semantic fields. For this purpose, the reliability of word clouds is discussed in the next section.

Employing word clouds as a beginning point to Extract Semantic Domains

A word cloud is commonly defined as a visualization of the most prominent and frequent content words in a corpus. Word clouds are generated from frequency lists by breaking the whole text into its component words. Function words are not added to word clouds, as they reveal little about the semantic content of the corpus. Word clouds provide a faster and lower-cost alternative to coding. The font size assigned to each word is directly proportional to the frequency of the word in the corpus. Word clouds have some benefits as well as some inadequacies as a corpus technique for revealing the semantic content of a corpus. They reveal only the essential information and provide an overall sense of the text. They have visual appeal and are more engaging than data in tabular form. The visual representation of a word cloud generates interest but stimulates more questions than it answers.
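The proportional sizing rule described above can be sketched in a few lines. This is a minimal illustration and not a description of how NVivo renders its clouds: the function name font_sizes and the 72-point maximum are assumptions made for the example, while the three counts are taken from Table 1.

```python
def font_sizes(frequencies: dict[str, int], max_points: float = 72.0) -> dict[str, float]:
    """Scale each word's font size linearly with its frequency.

    The most frequent word receives max_points (an arbitrary choice here);
    every other word receives a size proportional to its count.
    """
    top = max(frequencies.values())
    return {word: round(max_points * count / top, 1)
            for word, count in frequencies.items()}


# Head words and counts from Table 1 (stemmed freq. list of SCBV).
counts = {"mothers": 468, "looked": 363, "poet": 234}
sizes = font_sizes(counts)

for word, size in sizes.items():
    print(f"{word}: {size} pt")
```

Because poet occurs exactly half as often as mothers, it is drawn at half the size. A strictly linear rule like this one makes low-frequency words very small, which is one reason why a cloud gives an overall impression rather than precise information.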
A word cloud can be a good entry point into a discussion about the data. The drawback of word clouds for extracting semantic fields is that they can be misleading to interpret. At times the apparent size of equally frequent words is affected by the number of letters in a word or by the size and shape of the glyphs. Randomly assigned colours can also mislead, as some colours stand out more than others. Decorative fonts may have visual appeal, but they sacrifice communication.

[Figure 2: Word clouds of SCBV based on stemmed freq. list and synonym freq. list. Panels: Word Cloud based on stemmed freq. list; Word Cloud based on synonym freq. list]

Two word clouds are generated for SCBV: one is based on the stemmed freq. list, while the other is based on the synonym freq. list. Just like the frequency lists, the word clouds reveal little about the dominant thematic content of SCBV. In the next section, the reliability and efficiency of key words in context (KWIC) for the extraction of semantic fields is discussed.

Employing KWIC (Key Words in Context) for Extracting Semantic Domains

A list of keywords in context (KWIC) is different from a simple frequency list. Phillips (1985) suggests that keywords function to indicate the ‘aboutness’ of the corpus. The keywords may not be the most frequent words of the study corpus, yet they are the most significant ones. Analysing the keyword list and categorizing the words according to their meaning reveals the dominant thematic content of the corpus. Scott (2002) and, more recently, Rayson (2008) and Culpeper (2009) have used this approach to reveal the meaning contained in various corpora.

Creating a Reference Corpus

Unlike frequency lists, word clouds, collocation lists and lists of concordance lines, generating KWIC requires a reference corpus (RC) in addition to the study corpus (SC). The keyness of an SC can be established only by comparing it to another body of data.
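How such a comparison produces a keyness score can be sketched as follows. The article does not state which statistic produced the scores reported in Table 3; the sketch assumes the "simple maths" keyness score that Sketch Engine documents as its default, that is, the ratio of the two relative frequencies per million tokens after a smoothing constant N (here 1) has been added to each. The function names are hypothetical. Under this assumption the jalaibee row of Table 3 (F = 4 in SCBV, Ref F = 0) can be reproduced without knowing the size of the reference corpus, because its reference frequency is zero.

```python
def per_million(freq: int, corpus_size: int) -> float:
    """Relative frequency: occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size


def simple_maths_keyness(fpm_study: float, fpm_ref: float, n: float = 1.0) -> float:
    """Assumed score: (fpm in study corpus + n) / (fpm in reference corpus + n)."""
    return (fpm_study + n) / (fpm_ref + n)


SC_TOKENS = 133_829  # total tokens in SCBV, as reported in the article

# "jalaibee": 4 occurrences in SCBV, 0 in the reference corpus (Table 3).
fpm_sc = per_million(4, SC_TOKENS)
score = simple_maths_keyness(fpm_sc, 0.0)
print(round(score, 2))  # 30.89, the score listed for jalaibee in Table 3
```

A word that is equally common in both corpora scores 1, and the score rises as the word becomes relatively more frequent in the study corpus. This also suggests why names that are rare in general English obtain very high scores in a novel set in Pakistan.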
Some researchers (for example Sperberg-McQueen 1988) suggest that the keyword calculation for a sample text is somewhat affected by the RC chosen by the researcher. Others, such as Baker (2006) and Stubbs (2005), suggest that a keyword list free of bias can be generated by making the RC three times the size of the SC. Researchers have two options: they can use an available large corpus as the reference corpus, or they can build their own RC and feed it into software such as AntConc 3.5.8. Some software, such as Sketch Engine and NVivo 11, has a built-in RC. In this research, the English Web 2013 corpus (enTenTen13), which is available in Sketch Engine, is used to generate KWIC. Identification of the frequently occurring content-bearing lexemes in KWIC helped me derive the gist, the aboutness, or the dominant thematic content of SCBV. The KWIC are indeed only the tip of the iceberg of meaning, but they still provide reliable indications and manageable data for a detailed analysis of the main themes in the corpus. Instead of simple frequency lists, only the KWIC are used for the extraction of semantic fields in this section. The reason is that word frequency lists are usually very long (reaching up to 2,041 items in SCBV) and the manual extraction of semantically relevant terms requires a lot of time. In order to keep the length of the target list manageable, the cut-off point is set at 100 words. Table 03 contains the first 100 keywords of SCBV. The words which scored highest in keyness are the proper nouns. This is understandable because in Broken Verses most of the characters have Pakistani names that do not appear very frequently in the RC; thus these words obtain a high keyness score. The proper names do not tell us much about the semantic content of the corpus.
Therefore, the names of the characters have been manually deleted from the list, and the resulting top 100 keywords have been categorised and colour coded in Table 03.

Table 3. The Top 100 KWIC from SCBV

F = frequency in SCBV; Ref F = frequency in the reference corpus. Row 1 of the original table is the header row, so the entries are numbered from 2.

| No. | Single-word | Score | F | Ref F |
|---|---|---|---|---|
| 2 | Karachi | 176.72 | 60 | 35,063 |
| 3 | Laila | 146.26 | 24 | 5,295 |
| 4 | STD | 132.12 | 49 | 40,431 |
| 5 | Ramzan | 125.58 | 18 | 1,796 |
| 6 | Mama | 121.63 | 87 | 98,938 |
| 7 | Urdu | 120.52 | 30 | 19,735 |
| 8 | Qais | 101.35 | 14 | 956 |
| 9 | grazia | 97.77 | 13 | 86 |
| 10 | minion | 95.05 | 34 | 38,260 |
| 11 | Iblis | 91.73 | 13 | 1,587 |
| 12 | Macbeth | 83.71 | 22 | 22,179 |
| 13 | Hilal | 80.78 | 12 | 2,782 |
| 14 | Eid | 79.39 | 20 | 20,340 |
| 15 | Aadam | 74.75 | 10 | 296 |
| 16 | Archivist* | 68.10 | 16 | 17,510 |
| 17 | Fugue | 64.09 | 11 | 6,775 |
| 18 | Frass | 58.70 | 8 | 806 |
| 19 | Inqalab* | 53.28 | 7 | 13 |
| 20 | Ghazal | 48.47 | 7 | 2,266 |
| 21 | Shawl | 47.89 | 18 | 41,579 |
| 22 | Beloved | 47.81 | 12 | 20,372 |
| 23 | Lathi | 44.69 | 6 | 583 |
| 24 | Hikmet | 44.69 | 6 | 583 |
| 25 | Kabab | 44.57 | 6 | 643 |
| 26 | Nimue | 44.38 | 6 | 743 |
| 27 | Maulana | 42.80 | 8 | 9,550 |
| 28 | Zia | 41.90 | 8 | 10,237 |
| 29 | reshoot | 41.86 | 6 | 2,156 |
| 30 | schoolmaster | 41.53 | 7 | 6,444 |
| 31 | hoax | 39.10 | 16 | 47,352 |
| 32 | bougainvillea | 38.92 | 6 | 4,038 |
| 33 | Sadequain | 38.28 | 5 | 48 |
| 34 | Rafael | 38.22 | 17 | 53,415 |
| 35 | Hudood* | 37.61 | 5 | 454 |
| 36 | schoolfriend | 37.35 | 5 | 613 |
| 37 | mirage | 36.35 | 8 | 15,277 |
| 38 | Fata | 35.94 | 5 | 1,528 |
| 39 | Islamabad | 35.27 | 12 | 35,692 |
| 40 | mediaeval | 34.33 | 6 | 7,620 |
| 41 | crossword | 34.17 | 10 | 27,635 |
| 42 | Dad | 33.83 | 73 | 344,417 |
| 43 | ān (Quran) | 33.56 | 5 | 3,249 |
| 44 | dialled | 33.39 | 5 | 3,385 |
| 45 | impassioned | 33.22 | 8 | 18,855 |
| 46 | calligraphy | 32.27 | 9 | 25,342 |
| 47 | Amma | 32.10 | 5 | 4,436 |
| 48 | Morgana | 31.92 | 5 | 4,586 |
| 49 | stepmother | 31.51 | 7 | 15,725 |
| 50 | haiku | 31.15 | 7 | 16,165 |
| 51 | jalaibee | 30.89 | 4 | 0 |
| 52 | encrypt | 30.88 | 24 | 110,022 |
| 53 | captor | 30.79 | 8 | 22,143 |
| 54 | seekh | 30.67 | 4 | 163 |
| 55 | Sprezzatura | 30.64 | 4 | 187 |
| 56 | falsa | 30.56 | 4 | 248 |
| 57 | maulana | 30.54 | 4 | 259 |
| 58 | resent | 30.20 | 15 | 62,384 |
| 59 | decrypt | 29.83 | 7 | 17,889 |
| 60 | Weep* | 29.81 | 25 | 120,480 |
| 61 | strangeness | 29.50 | 6 | 12,581 |
| 62 | iftar | 29.02 | 4 | 1,462 |
| 63 | absurdly | 29.00 | 7 | 19,051 |
| 64 | fizz | 28.68 | 6 | 13,598 |
| 65 | grandness | 28.18 | 4 | 2,189 |
| 66 | Ajar (open) | 27.93 | 5 | 8,491 |
| 67 | aur | 27.78 | 4 | 2,541 |
| 68 | punchline | 27.56 | 5 | 8,908 |
| 69 | Multan | 27.52 | 4 | 2,778 |
| 70 | kurta | 27.38 | 4 | 2,914 |
| 71 | ummah | 27.32 | 4 | 2,970 |
| 72 | mother | 27.15 | 465 | 2,886,782 |
| 73 | kameez | 26.65 | 4 | 3,614 |
| 74 | policewoman | 26.62 | 4 | 3,641 |
| 75 | Tyrant | 26.07 | 4 | 4,202 |
| 76 | unforgivable | 26.00 | 5 | 10,811 |
| 77 | newsreader | 25.55 | 4 | 4,748 |
| 78 | couplet | 25.46 | 5 | 11,516 |
| 79 | Gonzales | 25.37 | 7 | 25,034 |
| 80 | Bhutto | 25.33 | 5 | 11,696 |
| 81 | bookshelf | 25.08 | 7 | 25,578 |
| 82 | resentful | 24.40 | 6 | 19,962 |
| 83 | postmark | 23.48 | 5 | 14,402 |
| 84 | FUGUES | 23.41 | 3 | 5 |
| 85 | Nashaa | 23.41 | 3 | 8 |
| 86 | variedness | 23.40 | 3 | 21 |
| 87 | seventeen | 23.39 | 13 | 72,642 |
| 88 | Raqeeb | 23.38 | 3 | 40 |
| 89 | Frass | 23.35 | 3 | 69 |
| 90 | IMPRISONED* | 23.33 | 3 | 85 |
| 91 | sixteen | 23.33 | 22 | 138,422 |
| 92 | Mohtarma | 23.27 | 3 | 143 |
| 93 | chowkidar | 23.26 | 3 | 157 |
| 94 | calligraphed | 23.11 | 3 | 299 |
| 95 | Leucippus | 23.07 | 3 | 340 |
| 96 | Aashiq | 23.07 | 3 | 341 |
| 97 | unnaturalness | 23.00 | 3 | 408 |
| 98 | KDA | 22.97 | 3 | 441 |
| 99 | EXILE | 22.93 | 3 | 484 |
| 100 | gesture | 22.87 | 43 | 297,602 |

The numbers of types and tokens of the first 100 KWIC in SCBV are calculated as follows. The top 100 keywords account for 1,550 tokens in total. SCBV as a whole contains 133,829 tokens and 10,288 types. The number of types and tokens of the top 100 keywords in each semantic domain, together with their percentage, is given in Table 04.

Table 4. Percentage of Tokens of the Top 100 KWIC in SCBV

| No. | Semantic Fields / Topic Indicators | No. of Types in KWIC | No. and % of Tokens in KWIC (of 1550) | Definition and Comment | Most Frequent Examples from the Novel |
|---|---|---|---|---|---|
| 2 | Geographical locations | 6 | 133 (8.5%) | To show the setting of the novel, there is frequent reference to Karachi and to a studio (STD). | Karachi, Fata, Islamabad, Multan, KDA |
| 3 | Marriage and family life | 7 | 674 (43%) | Familial ties and the institution of marriage are a recurrent theme in SCBV. | Dad, mother, mama, beloved, stepmother, amma |
| 4 | Words from regional languages | 18 | 150 (9.6%) | This category consists of words mainly from Urdu and Punjabi. | Nashaa, Raqeeb, Mohtarma, Chowkidar, Aashiq, aur, Kurta, kameez, ghazal, Shawl, Lathi, Hikmet, kabab, Laila, Ramzan, Urdu, Qais |
| 5 | Political setup | 11 | 63 (4%) | Many words in this category were assigned to it only after their context had been reviewed. | Zia, Exile, captor, imprisoned, Tyrant, Bhutto, archivist, Inqalab |
| 6 | An atmosphere of gloom and hopelessness | 6 | 38 (2.5%) | The words in this category refer to negative feelings experienced by different characters, but on the whole this group does not signify any one theme. | unforgivable, resentful, unnaturalness, resent, absurdly |
| 8 | Miscellaneous | 29 | 220 (14%) | Keywords not indicating any category. | |

The KWIC analysis helped to identify eight semantic fields, of which three categories (negative feelings, natural environment and miscellaneous) did not signify a single theme. Figure 03 shows a graphic representation of the most dominant and the less dominant themes.

Fig 3: The Most Dominant and the Least Dominant Themes in SCBV (legend: Religion, Family & marriage, Politics, Negative emotions, Historical allusions, Geographical location, Regional languages, Miscellaneous words)

It needs to be made clear that some keywords overlap in terms of their thematic content. For example, a word which refers to an indigenous place can be put under geographical locations or taken as a historical reference. Similarly, the name of a regional language can be used for discussing a literary allusion. Therefore, Figure 03 does not represent very clear boundaries; nevertheless, it does give an idea of the dominant themes in SCBV. It also shows the limitations of corpus techniques in terms of the aboutness of the discourse.
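The percentage column of Table 04 is each field's token count divided by the 1,550 tokens of the top 100 keywords. A minimal Python sketch of that arithmetic follows, using only the six rows listed in Table 04. The dictionary keys and the function name `token_share` are illustrative labels, not part of the original study. Because the sketch keeps one decimal place, its output can differ slightly from the rounded figures printed in the table.

```python
TOTAL_KEYWORD_TOKENS = 1550  # tokens of the top 100 keywords, as reported in the article

# Token counts per semantic field, copied from the rows listed in Table 04.
field_tokens = {
    "Geographical locations": 133,
    "Marriage and family life": 674,
    "Words from regional languages": 150,
    "Political setup": 63,
    "An atmosphere of gloom and hopelessness": 38,
    "Miscellaneous": 220,
}


def token_share(tokens: int, total: int = TOTAL_KEYWORD_TOKENS) -> float:
    """Return a field's share of the keyword tokens as a percentage."""
    return 100 * tokens / total


shares = {field: token_share(count) for field, count in field_tokens.items()}

# List the fields from the most dominant to the least dominant.
for field, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{field}: {field_tokens[field]} tokens, {share:.1f}%")

# The six listed rows do not add up to the full 1,550 tokens.
print("Tokens covered by the listed rows:", sum(field_tokens.values()))
```

Sorting by share reproduces the ordering that Figure 03 is meant to convey, with marriage and family life as the most dominant of the listed fields.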
It is found that KWIC lists can give only a vague idea and a blurred picture of the thematic content of the discourse, and that detailed collocation or concordance analysis is essential for understanding the full picture.

Some Methodological Concerns

Extracting semantic fields from SCBV through the top 100 KWIC yielded the following methodological insights.

1. At the stage of categorizing the KWIC into different semantic fields, I realised that I could not rely on the KWIC alone and that the broader context of the words needs to be examined before they are assigned to semantic domains. Two examples are given to illustrate this point. The word heaven occurs 8 times in SCBV. Superficially, this word seems to belong to the domain of religion, but when the broader context is analysed, the findings are contrary to the initial expectations. The word is used two times to continue the conversation in the phrase for heaven’s sake. Similarly, the word God in SCBV is used in expressions such as thank God, for God’s sake and God forbid. The researcher needs to note that these words do not actually refer to religion. On the other hand, some words, such as terror/ism, fundamentalist, radical and extremist, do not belong directly to the semantic field of religion, but when their broader context of occurrence is observed through concordance lines and paragraph retrieval, it is found that they actually refer to religion.

2. Some words do not fit into any category. The category of words named miscellaneous in Table 04 does not signify any one theme.

3. Some words with a negative connotation (shown in grey in Tables 03 and 04 and in Figure 03) occur in the corpus but do not fit any one theme. After concordance analysis of these words, it is still possible to conclude that the plot line is tragic or shows a gloomy atmosphere.

4. The code words used by Asmani (one of the main protagonists in SCBV) are recognised by the software as keywords because of their uniqueness, but they do not reveal anything about the semantic content of the corpus, so they are excluded from the lists. These code words are Ikrfb, fyfno, efac, Smaani and Anonkoh.

Despite these methodological concerns, the use of KWIC for the extraction of semantic domains from a novel proved to be the most helpful of all the methods employed in this research.

Conclusion

This article demonstrated the use of three corpus techniques for the extraction of dominant semantic domains from a corpus. For this purpose, the fictional discourse produced by Shamsie, titled Broken Verses, has been used. The first two techniques, namely frequency lists and word clouds, can be used as starting points for entering the data, but they are not helpful in extracting the dominant semantic domains. The unique feature of NVivo 11, which produces a frequency list based on synonyms, also proved to be of little help because of the vast difference in the semantic shades of the words. The third method consisted of manually categorizing the top 100 KWIC to extract the semantic domains. This method proved to be the most useful for the purpose of discourse analysis. The dominant semantic domains identified in SCBV through KWIC analysis are the same as those pointed out by literary critics after close reading of the text.

References

Baker, P. (2006). Glossary of corpus linguistics. Edinburgh University Press.
Brinton, L. J. (2000). The structure of modern English: A linguistic introduction. John Benjamins Publishing.
Brinton, L. J. (Ed.). (2001). Historical Linguistics 1999: Selected papers from the 14th International Conference on Historical Linguistics, Vancouver, 9-13 August 1999 (Vol. 215). John Benjamins Publishing.
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital_Humanities. MIT Press.
Edhlund, B., & McDougall, A. (2019). NVivo 12 Essentials. Lulu.com.
Hu, C. (2015). Using Wmatrix to Explore Discourse of Economic Growth. English Language Teaching, 8(9), 146-156.
Hunt, S. (2015). Representations of gender and agency in the Harry Potter series. In Corpora and Discourse Studies (pp. 266-284). Palgrave Macmillan, London.
Knowles, G., & Don, Z. M. (2004). The notion of a “lemma”: Headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1), 69-81.
Mahlberg, M., Stockwell, P., Joode, J. D., Smith, C., & O'Donnell, M. B. (2016). CLiC Dickens: novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora, 11(3), 433-463.
Pollak, S., Coesemans, R., Daelemans, W., & Lavrac, N. (2011). Detecting contrast patterns in newspaper articles by combining discourse analysis and text mining. Pragmatics, 21(4), 647-
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519-549.
Rayson, P. (2009). Wmatrix: a web-based corpus processing environment.
Rayson, P., Archer, D. E., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of the Corpus Linguistics conference: CL2007.
Rayson, P., Archer, D., Piao, S., & McEnery, A. M. (2004). The UCREL semantic analysis system.
Sharoff, S. (2004, May). Towards Basic Categories for Describing Properties of Texts in a Corpus. In LREC.
Stubbs, M. (2004). Conrad, concordance, collocation: heart of darkness or light at the end of the tunnel? The Third Sinclair Open Lecture.
Stubbs, M. (2005). Conrad in the computer: examples of quantitative stylistic methods. Language and Literature, 14(1), 5-24.
Terras, M. (2011). Quantifying digital humanities. UCL Centre for Digital Humanities.
Wiedemann, G. (2013). Opening up to big data: Computer-assisted analysis of textual data in social sciences. Historical Social Research/Historische Sozialforschung, 332-357.