key: cord-0495026-du29yblw
authors: Baker, Kirk
title: XSTEM: An exemplar-based stemming algorithm
date: 2022-05-09
journal: nan
DOI: nan
sha: 068a49a131de107b36fdf8c9ca95050bf8e0db5d
doc_id: 495026
cord_uid: du29yblw

Stemming is the process of reducing related words to a standard form by removing affixes from them. Existing algorithms vary with respect to their complexity, configurability, handling of unknown words, and ability to avoid under- and over-stemming. This paper presents a fast, simple, configurable, high-precision, high-recall stemming algorithm that combines the simplicity and performance of word-based lookup tables with the strong generalizability of rule-based methods to avert problems with out-of-vocabulary words.

Stemming is the process of reducing related words to a standard form by removing affixes from them. For example, eating, eats, and eaten can be reduced to the standard form eat by removing the -ing, -s, and -en from each. Stemming is a foundational step in many text processing pipelines, including information retrieval and language modeling [2], and also serves as a feature reduction step for classification or lexical transfer learning tasks [1].

Most stemming algorithms are either rule-based or corpus-based. Rule-based stemmers use a set of manually created rules to transform words to their base form, typically in conjunction with a dictionary for handling exceptions or modulating their output in some fashion. Corpus-based stemmers typically employ statistical machine learning methods to equate word forms based on distributional regularities derived from large amounts of text. Although researchers have demonstrated impressive results with corpus-based approaches (e.g., [2, and references therein]), rule-based stemming implementations are more widely available through software distributions such as Apache Lucene 1 or Python NLTK 2.
Stemming algorithms differ with respect to whether they aim to produce a real-word form (often called a lemma) or simply a truncated version of the input (often called a root). For example, a word-oriented stemmer might produce polite as the output for politely, whereas a truncating algorithm might produce polit.

Common problems encountered by stemming algorithms are over-stemming and under-stemming. Over-stemming occurs when unrelated words are reduced to the same form, such as reducing both rations and rational to rat. Under-stemming occurs when related words are reduced to distinct forms, such as reducing recognize to recogn and recognition to recognit. Another common problem is failing to handle words that are not contained in a referenced lexical resource.

This paper presents a rule-based, word-oriented stemming algorithm for English that is simultaneously high precision and high recall. It does not over- or under-stem, handles a wide range of irregular word classes, and generalizes accurately to unknown words. The algorithm is fast, straightforward to implement, easy to configure for different levels of stemming, and comes with an open-source reference implementation. The stemmer described herein overcomes each of the problems noted for previous approaches in the section below.

This section describes previous approaches to English stemming that are widely used today or were fundamental to the development of those widely used algorithms.

Lovins Stemmer

The Lovins stemmer uses multiple steps to remove suffixes, rewrite truncated forms, and conflate similar forms. Each stage relies on a set of ordered rules (294 suffix rules, 29 conditions by which they are applied, 34 transformation rules, and 17 conflation rules) to produce a final result [10, 5, 6, 11]. Suffixation rules are based on the longest match found. The Lovins algorithm is an aggressive stemmer (e.g., nationally stems to nat [5]) and does not aim to produce real-word forms.
It has been noted to improperly conflate distinct words (e.g., probe and probate are respectively stemmed to probe and prob, and then conflated by partial matching [11]).

Dawson Stemmer

The Dawson stemmer [3] extends the work of Lovins by including a larger set of about 1200 suffixes [6]. Suffixes are reverse-indexed by length and last character. Like Lovins [10], the Dawson stemmer is a single-pass algorithm that removes the longest matching suffix in a single step and recodes the remaining stem into a valid word using a mapping table [4]. Previous authors have noted that the Dawson algorithm is complex to implement and lacks a reference implementation [5, 6].

Porter Stemmer

The Porter algorithm successively removes suffixes from a predefined set using a multi-step approach [13, 9, 6]. Within each of the algorithm's five steps, suffix rules are tried until one is accepted; the next step is then applied to the output of the previous step. The resulting truncated form is returned at the end of the fifth step. The Porter algorithm also does not aim to produce real-word forms, and may conflate words with different meanings to the same stem while shortening related words to distinct stems. For example, generous, generalization, and generic all stem to gener, while recognize and recognition are stemmed to recogn and recognit, respectively [5].

Krovetz Stemmer

The Krovetz stemmer [9] modifies the Porter algorithm to check words against a dictionary prior to performing each suffixation step. If the output of a prior step is found in the dictionary, the algorithm terminates and that output is returned. For example, if the dictionary contains generalization, the stemming of the plural form generalizations will terminate after removal of the final -s rather than proceeding to subsequent steps that would result in further truncation.
For word forms that are not in the dictionary, the Krovetz stemmer generally returns heavily stemmed output resembling the output of the Lovins algorithm. The Krovetz stemmer is explicitly designed to return complete words and to be highly configurable, providing six supplemental files (exceptions, countries, proper nouns, primary dictionary, supplemental dictionary, and direct conflations) that can be customized for desired stemming results [8].

The main drawback of the Krovetz stemmer is its inability to handle words that are not in the lexicon [5]. Because its lookup tables rely on full-word matching, it is unable to generalize to classes of desired stemming behavior, instead requiring each specific example to be recorded. For example, the Krovetz algorithm reduces viruses to viruse, which means that even after stemming, the singular and plural forms of virus would fail to match. Although we can configure a mapping from viruses to virus, this configuration will not modify the output for compound words ending in -virus, such as retroviruses, adenoviruses, coronaviruses, etc. A PubMed title search 3 for *viruses yields 450 distinct compounds of this type, illustrating the limitations of dictionary-based stemming methods.

Paice Stemmer

The Paice stemmer [12] is an iterative, rule-based stemmer that groups rules by the final character of the suffix to which they apply. For example, rules pertaining to words ending in -s form a group; rules applying to words ending in -ing are grouped under the letter g, and so forth. Each rule consists of the following components:
• the suffix to be matched, stored in reverse order
• the number of characters to remove, which may be zero
• the new suffix to be appended, which may be null
• a continuation symbol, which indicates whether to continue stemming the intermediate form
• an optional flag that controls stemming of intact vs. partially stemmed word forms

In addition to these rules, the Paice stemmer requires a set of constraints that limit improper stemming of certain words such as rant, rice, and river, among others, although Paice notes that a lookup table may be a preferable solution [12, p. 58].

This section describes our proposed stemming algorithm and compares it to previous work, showing how it builds upon those earlier methods and mitigates problems noted for each.

XSTEM is a fast, configurable, high-precision, high-recall stemming algorithm that combines the performance and interpretability of word-based lookup tables with the strong generalizability of rule-based methods. Like Lovins [10] and Dawson [3], XSTEM is based on the longest-matching suffix principle. Suffix rules are stored in a reverse character trie, a lookup structure whose cost depends on the length of the input word rather than on the number of stored rules, and which is therefore comparable in practice to retrieving hashed strings from a map. Dawson [3] and Paice [12] utilized similar reverse storage mechanisms to limit the number of iterations within an applicable class of rules. The components of an XSTEM suffix rule are analogous to those of Paice, and specify 1) the suffix to match, 2) the number of characters to remove from the end of the input (this number may be zero), and 3) a replacement suffix to be appended (this replacement may be null).

Like Krovetz [9], XSTEM produces valid words as output. Unlike Krovetz, XSTEM does not rely on a lexicon 4 to determine its stemming behavior. By eliminating the dependency on a dictionary file as a requirement for producing real-word output, XSTEM overcomes the out-of-vocabulary problem documented for Krovetz in [5]. Rather than relying on strict lexical lookup, XSTEM generalizes from a pool of exemplar suffixes to handle regular and irregular forms with equal proficiency.

Also like Krovetz, but in contrast to Dawson [5, 6], XSTEM is modular and highly configurable.
For the Krovetz stemmer, these properties are most notably reified by its supplemental vocabulary files, which are separated into lexical classes (such as countries, proper nouns, and direct conflations) for ease of modification [9]. XSTEM follows this paradigm and utilizes a list of proper nouns (names whose structure otherwise makes them eligible for stemming, such as Denning or Maldives) and a list of exceptions (primarily short words that would otherwise be ineligible for stemming, such as tied/tie, or irregular verbs, such as brought/bring). The exceptions list is relatively short; the majority of irregular suffixes are handled by the suffix trie and do not require separate configuration. Finally, XSTEM requires a specification of suffix rules, which are also contained in an editable text file. This file currently contains approximately 1500 rules; each rule consists of one, two, or three fields representing the suffix to match, the number of characters to remove (if any), and the new suffix to be appended (if any).

Related to its configurability, XSTEM is implemented as a multi-pass stemming algorithm, a feature shared with Porter [13] and Paice [12]. Each step outputs real-word forms, and any given step is optional. This design allows XSTEM to be configured as anything from a light stemmer (i.e., plural only) to a more aggressive one with minimal effort.

Compared to previous iterative rule-based approaches, XSTEM's algorithm is simple: a lookup function performed by trie traversal. Rules that attempt to capture the subtlety and complexity of English stemming behavior, which is a mishmash of productive and semiproductive processes and historic borrowings from other languages with their own idiosyncrasies, are complex and arbitrary; they typically refer to specific combinations of consonants and vowels, syllable structure, or specific characters in certain positions to make the rule work.
In our view, such rules can only be written by gathering and examining enough data and counterexamples to refine them; it is more productive to omit the intermediate step of converting exemplars to rules and instead use a model that generalizes from the exemplars directly.

To summarize, the overall advantages of XSTEM compared with previous approaches are:
• it is fast, with performance equivalent to direct lookup methods
• it is configurable, both in terms of controlling the behavior of specific word classes and the aggressiveness of the stemmer
• it generalizes to unknown words, providing high recall and largely eliminating problems with out-of-vocabulary items
• it handles irregular and regular forms with the same mechanism, providing high-precision stemming
• it is simple, utilizing longest matching lookup rather than conditional stemming rules
• an open-source reference implementation is provided.

This section describes how XSTEM works in more detail. The example below considers the different stemming behaviors of words ending in -elves, such as selves, delves, and pelves (the Latin plural form of pelvis, which is commonly encountered in biomedical text). We begin with a very general pattern rule that works for a large number of words, "remove the final -s". This exemplar is stored in a reverse character trie as shown below.

[Figure: reverse character trie with a single branch, s]

A path from the root through the letter s terminates in a suffix operation that specifies truncating the input by one character and appending a null suffix (i.e., drop the final -s and do nothing). This model produces the following output for the three words in our example

selves → selve
delves → delve
pelves → pelve

resulting in two non-words (selve and pelve) that did not stem to their base forms (self and pelvis, respectively). We add an exemplar that generally works for words ending in -lves, "replace the final -ves with -f". The rule is added to the trie as shown below.
A path from the root through the letters sevl (reverse traversal of the input) terminates in a suffix operation that specifies truncating the input by three characters and appending a new suffix -f. This model produces the following output for the three words in our example

selves → self
delves → delf
pelves → pelf

resulting in two non-words (delf and pelf) that did not stem to their base forms (delve and pelvis, respectively). By providing more specific exemplars for delves and pelves, the model is able to correctly resolve each of the three words in our example to their correct base form, and defaults to -s removal for all other matching words. The updated model produces the following example output

selves → self
delves → delve
pelves → pelvis

Although the transformation of word final -lves to singular -is or -f is no longer a productive morphological process in English 5, such words actively participate in the formation of compound nouns such as aardwolves, beewolves, coywolves, midpelves, hemipelves, micropelves, etc. Other non-productive inflectional patterns, such as the -osis/-oses alternation common in medical terminology derived from Greek, participate even more freely in the formation of novel compound nouns, such as thrombosis/thromboses, microthrombosis/microthromboses, dermatosis/dermatoses, genodermatosis/genodermatoses, etc. Examples such as these illustrate the necessity of a stemming model that can generalize beyond a dictionary to handle novel input.

An open-source 6 implementation of XSTEM written in Java is available at https://github.com/kirkbaker/xstem. The implementation does not require any third-party libraries and favors clarity over any particular optimizations for speed or memory, but is nonetheless highly performant. At the time of writing, the reference implementation handles 11 suffix classes, and contains approximately 1500 exemplar suffixes that are distributed across these classes.
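The exemplar lookup walked through above for selves, delves, and pelves can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the Java reference implementation: the names `SuffixTrie`, `TrieNode`, `add`, and `stem` are hypothetical, and the exact operations attached to the two specific exemplars (delves and pelves) are assumptions chosen to reproduce the outputs reported in the text.

```python
from typing import Dict, Optional, Tuple

Rule = Tuple[int, str]  # (characters to remove, replacement suffix)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.rule: Optional[Rule] = None


class SuffixTrie:
    """Reverse character trie: exemplar suffixes are stored last letter first."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, suffix: str, remove: int, append: str = "") -> None:
        node = self.root
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, TrieNode())
        node.rule = (remove, append)

    def stem(self, word: str) -> str:
        node = self.root
        best: Optional[Rule] = None
        for ch in reversed(word):
            node = node.children.get(ch)
            if node is None:
                break
            if node.rule is not None:
                best = node.rule  # deeper nodes correspond to longer suffixes
        if best is None:
            return word  # no exemplar matched; leave the word unchanged
        remove, append = best
        return word[: len(word) - remove] + append


trie = SuffixTrie()
trie.add("s", 1)             # general exemplar from the text: drop the final -s
trie.add("lves", 3, "f")     # exemplar from the text: replace final -ves with -f
trie.add("delves", 1)        # assumed specific exemplar: delves -> delve
trie.add("pelves", 2, "is")  # assumed specific exemplar: pelves -> pelvis

for word in ["selves", "delves", "pelves", "aardwolves", "midpelves"]:
    print(word, "->", trie.stem(word))
```

Because the longest matching path wins, the two specific exemplars override the general -lves exemplar, while unseen compounds such as aardwolves and midpelves still resolve through whichever exemplar matches their ending. The lookup visits at most one trie node per character of the input, so its cost does not grow with the number of stored exemplars.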
The reference implementation also contains an exceptions file (approximately 250 short words with irregular morphology) and a proper names file (approximately 22,400 proper nouns derived from PubMed author names and geographic location names). Not all of these proper names are strictly required to limit proper suffixation, but are included as a general resource. As with the Krovetz stemmer, it is not strictly necessary to utilize separate exceptions and name files [8]; they are kept distinct for ease of modification. Each suffix class is stored in a separate trie for configurability; any of these may be omitted from XSTEM in order to produce lighter stemming behavior. Each of the stemming modules is described below.

plural suffixation

In addition to handling English regular and irregular plurals, XSTEM's plural module is notable for the wide range of biomedical Greek and Latin plurals it handles. Examples of Greek vs English plural reduction include alternations such as osteoscleroses/osteosclerosis or aponeuroses/aponeurosis vs primroses/primrose or rockroses/rockrose. Examples of Latin plural handling include alternations such as nanomatrices/nanomatrix or oropharynges/oropharynx vs prices/price or lynxes/lynx.

-er suffixation

XSTEM takes a light-handed approach to -er suffixation, focusing on removing the inflectional suffix (e.g., higher/high or healthier/healthy) but keeping it intact as a derivational suffix (e.g., renter is not reduced to rent, nor reporter to report).

5 A productive morphological process is one that speakers of a language actively apply to new words.
6 Apache License, Version 2.0

-ness suffixation

XSTEM removes -ness when used as a suffix (e.g., wooziness/woozy or boldness/bold) but not when it can be considered part of the root (e.g., harness or witness).

-ly suffixation

XSTEM removes adverbial usages of -ly such as necessarily/necessary or steadily/steady, but not when it is part of the root, such as apply or firefly.
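The modular design described so far (one rule set per suffix class, plus exceptions and proper-noun lists, with any module optional) can be sketched as follows. This is only an illustration under stated assumptions: the rule entries are a handful of hypothetical exemplars built from the examples in the text, the order of the checks and passes is assumed, and a plain dictionary scan over candidate suffixes stands in for the per-class tries.

```python
from typing import Dict, Iterable, Tuple

Rule = Tuple[int, str]  # (characters to remove, replacement suffix)

# Hypothetical exemplars for three modules; the real rule file is far larger.
MODULES: Dict[str, Dict[str, Rule]] = {
    "plural": {"s": (1, ""), "sses": (2, ""), "ss": (0, "")},
    "ness": {"ness": (4, ""), "iness": (5, "y"),
             "harness": (0, ""), "witness": (0, "")},
    "ly": {"ly": (2, ""), "ily": (3, "y"),
           "apply": (0, ""), "firefly": (0, "")},
}
EXCEPTIONS = {"tied": "tie", "brought": "bring"}  # examples from the text
PROPER_NOUNS = {"Denning", "Maldives"}            # examples from the text


def apply_module(word: str, rules: Dict[str, Rule]) -> str:
    """Apply the longest matching exemplar suffix (dict scan in place of a trie)."""
    for start in range(len(word)):  # longest candidate suffix first
        rule = rules.get(word[start:])
        if rule is not None:
            remove, append = rule
            return word[: len(word) - remove] + append
    return word


def stem(word: str, enabled: Iterable[str] = ("plural", "ness", "ly")) -> str:
    # Assumed order: proper nouns, then exceptions, then each enabled module.
    if word in PROPER_NOUNS:
        return word
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for name in enabled:
        word = apply_module(word, MODULES[name])
    return word


print(stem("boldnesses", enabled=("plural",)))  # light configuration
print(stem("boldnesses"))                       # all three modules
print(stem("wooziness"), stem("harness"), stem("steadily"), stem("apply"))
```

With only the plural module enabled, boldnesses is reduced to boldness; enabling the -ness module as well reduces it further to bold. Words such as harness and apply are protected by zero-removal exemplars rather than by a separate conditional rule, which is one way to realize the exemplar-over-rule preference argued for above.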
-ity suffixation

XSTEM also takes a fairly hands-off approach to removing -ity, restricting it to cases where the derived and root forms are closely related, such as viscosity/viscous or obesity/obese, and keeping it for common roots such as community or quality.

-ize suffixation

-ize is also removed judiciously in cases of closely related words such as homogenize/homogenous or randomize/random, and is not removed when doing so would produce a root that is commonly used in a distinct sense. For example, organize is not stemmed to organ and polarize is not stemmed to polar. Ultimately, judgments of relatedness are those of the author, and users are encouraged to configure XSTEM to their needs if the defaults do not provide the desired results for a given domain.

-ing suffixation

XSTEM is careful to remove only inflectional occurrences of -ing, such as evading/evade or attaining/attain, while retaining it for roots such as offspring or starling.

-al suffixation

The suffix -al is removed primarily from closely related words of Latin or Greek origin such as corneal/cornea or esophageal/esophagus and is kept for words such as biomaterial or fiscal.

-ion suffixation

As with other derivational suffix modules in XSTEM, -ion removal focuses on alternations that produce closely related words such as transcription/transcribe or inclusion/include, while preserving distinctions such as foundation vs found or portion vs port.

-ic suffixation

XSTEM removes -ic from closely related word pairs such as algorithmic/algorithm, leukemic/leukemia, proteomic/proteome, and theoretic/theory, among others. It retains -ic when it is closely associated with the root, as in mimic, epidemic, or classic.

past tense suffixation

The past tense module handles various forms of -ed and -d removal and irregular verbs (and their compounds), such as misunderstood/misunderstand, overtook/overtake, and the various forms of be, among many others.
The past tense module does not remove -ed or -d when they are part of the stem, as in infrared, naked, or hundred.

This paper presented a fast, simple, configurable, high-precision, high-recall stemming algorithm that combines the simplicity and performance of word-based lookup tables with the strong generalizability of rule-based methods to avert problems with out-of-vocabulary words. The model described here overcomes issues that have been previously noted for other approaches, and an open-source reference implementation is available.

BERT-based transfer-learning approach for nested named-entity recognition using joint labeling
HPS: High precision stemmer. Information Processing & Management
Suffix removal and word conflation
Dawson stemming. In Artificial Intelligence for Big Data
Stemming algorithms - a case study for detailed evaluation
A comparative study of stemming algorithms
Decisions and mechanisms in exemplar-based phonology
Kstem documentation
Viewing morphology as an inference process
Development of a stemming algorithm
A survey of stemming algorithms in information retrieval
An algorithm for suffix stripping. Program: electronic library and information systems