ARTICLES

Evaluating the Impact of the Long-S upon 18th-Century Encyclopedia Britannica Automatic Subject Metadata Generation Results

Sam Grabus

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2020
https://doi.org/10.6017/ital.v39i3.12235

Sam Grabus (smg383@Drexel.edu) is an Information Science PhD Candidate at Drexel University's College of Computing and Informatics, and a Research Assistant at Drexel's Metadata Research Center. This article is the 2020 winner of the LITA/Ex Libris Student Writing Award. © 2020.

ABSTRACT

This research compares automatic subject metadata generation when the pre-1800s Long-S character is corrected to a standard < s >. The test environment includes entries from the third edition of the Encyclopedia Britannica and the HIVE automatic subject indexing tool. A comparative study of metadata generated before and after correction of the Long-S demonstrated that an average of 26.51 percent of potentially relevant terms per entry are omitted from results if the Long-S is not corrected. Results confirm that correcting the Long-S increases the availability of terms that can be used for creating quality metadata records. A relationship is also demonstrated between shorter entries and an increase in omitted terms when the Long-S is not corrected.

INTRODUCTION

The creation of subject metadata for individual documents has long been known to support standardized resource discovery and analysis by identifying and connecting resources with similar aboutness.1 In order to address the challenges of scale, automatic or semi-automatic indexing is frequently employed for the generation of subject metadata, particularly for academic articles, where the abstract and title can be used as surrogates in place of indexing the full text.
When automatically generating subject metadata for historical humanities full texts that do not have an abstract, anachronistic typographical challenges may arise. One key challenge is presented by the historical "Long-S" < ſ >. In order to account for these idiosyncrasies, there is a need to understand the impact that they have upon automatic subject indexing output. Addressing this challenge will help librarians and information professionals determine whether they need to correct the Long-S when automatically generating subject metadata for full-text pre-1800s documents.

The problem of the Long-S in Optical Character Recognition (OCR) for digital manuscript images has been discussed for decades.2 Many scholars have researched methods for correcting the Long-S through the use of rule-based algorithms or dictionaries.3 While the problem of the Long-S is well known in the digital humanities community, automatic subject metadata generation for a large corpus of pre-1800s documents is rare, as is research about the application and evaluation of existing automatic subject metadata generation tools on 18th-century documents in real-world information environments. The impact of the Long-S upon automatic subject metadata generation results for pre-1800s texts has not been extensively explored. The research presented in this paper addresses this need. The paper reports results from basic statistical analysis and visualization of automatic subject indexing results produced with the Helping Interdisciplinary Vocabulary Engineering (HIVE) tool, before and after the correction of the historical Long-S in the 3rd edition of the Encyclopedia Britannica. Background work was conducted over the summer and fall of 2019, and the research presented was conducted during winter 2020.
The work was motivated by current work on the "Developing the Data Set of Nineteenth-Century Knowledge" project, a National Endowment for the Humanities collaborative project between Temple University's Digital Scholarship Center and Drexel University's Metadata Research Center. The grant is part of a larger project, Temple University's "19th-Century Knowledge Project," which is digitizing four historical editions of the Encyclopedia Britannica.4

The next section of this paper presents background covering the historical Encyclopedia Britannica data, the automatic subject metadata generation tool used for this project, a brief background of "the Long-S Problem," and the distribution of encyclopedia entry lengths in the 3rd edition. The background section is followed by the research objectives and the method supporting the analysis. Next, the results are presented, demonstrating the prevalence of terms omitted from the automatic subject metadata generation results if the Long-S is not corrected to a standard small < s > character, as well as the impact of encyclopedia entry length upon these results. The results are followed by a contextual discussion and a conclusion that highlights key findings and identifies future research.

BACKGROUND

Indexing for the 19th-Century Knowledge Project

The 19th-Century Knowledge Project, an NEH-funded initiative at Temple University, is fully digitizing four historical editions of the Encyclopedia Britannica (the 3rd, 7th, 9th, and 11th). The long-term goal of the project is to analyze the evolving conceptualization of knowledge across the 19th century.5 The 3rd edition of the Encyclopedia Britannica (1797) is the earliest edition being digitized for this project. The 3rd edition consists of 18 volumes, with a total of 14,579 pages, and individual entries ranging from four to over 150,000 words. For each individual entry, researchers at Temple have created individual TEI-XML files from the OCR output.
In order to enrich accessibility and analysis across this digital collection, the Knowledge Project will be adding controlled vocabulary subject headings to the TEI headers of each encyclopedia entry XML file. Considering the size of this corpus, both in terms of entry length and number of entries, automatic subject metadata generation will be required for the creation of this metadata. The Knowledge Project will employ controlled vocabularies to replace or complement naturally extracted keywords for this process. Using controlled vocabularies adheres to metadata semantic interoperability best practices, ensures representation consistency, and helps to bypass linguistic idiosyncrasies of these 18th- and 19th-century primary source materials.6

We selected two versions of the Library of Congress Subject Headings (LCSH) as the controlled vocabularies for this project. LCSH was selected due to its relational thesaurus structure, multidisciplinary nature, and continued prevalence in digital collections, owing to its expressiveness and status as the largest general indexing vocabulary.7 In addition to headings from the 2018 edition of LCSH, headings from the 1910 LCSH are also implemented in order to provide a more multifaceted representation, using temporally relevant terms that may have been removed from the contemporary LCSH.

The tool applied for this process is HIVE, a vocabulary server and automatic indexing application.8 HIVE allows the user to upload a digital text or URL and select one or more controlled vocabularies, and it performs automatic subject indexing by mapping naturally extracted keywords to the available controlled vocabulary terms. HIVE was initially launched as an IMLS linked open vocabulary and indexing demonstration project in 2009.
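In spirit, the mapping step works like a lookup from extracted keywords into a controlled vocabulary: a keyword that fails to map contributes nothing to the record. The following minimal Python sketch is illustrative only, not HIVE's actual implementation; the three-term vocabulary and the keyword lists are invented for demonstration.

```python
# Illustrative sketch of keyword-to-controlled-vocabulary mapping.
# The vocabulary here is a toy stand-in, not the real LCSH.
controlled_vocabulary = {
    "sugar": "Sugar",
    "yeast": "Yeast",
    "fermentation": "Fermentation",
}

def map_keywords(keywords):
    """Return controlled-vocabulary headings for keywords that map;
    anything without a match is silently omitted from the results."""
    return [controlled_vocabulary[k.lower()]
            for k in keywords
            if k.lower() in controlled_vocabulary]

# "fugar" (a Long-S-garbled "sugar") finds no heading and disappears:
print(map_keywords(["fugar", "yeast"]))  # -> ['Yeast']
```

The sketch shows the failure mode discussed in the next section: once OCR has garbled a keyword, the mapping step drops it without any error, so the omission is invisible unless the corrected and uncorrected results are compared.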
Since that time, HIVE has been further developed, with the addition of more controlled vocabularies, user interface options, and the RAKE keyword extraction algorithm. The RAKE keyword extraction algorithm was selected for this project after a comparison of topic relevance precision scores for three keyword extraction algorithms.9

The Long-S Problem

Early in our metadata generation efforts, we discovered that the 3rd edition of the Encyclopedia Britannica employs the historical Long-S. Originating in early Roman cursive script, the Long-S was used in typesetting up through the 18th century, both with and without a left crossbar. By the end of the 18th century, the Long-S fell out of use with printers.10 As outlined by lexicographers of the 17th and 18th centuries, the rules for using the Long-S were frequently vague, complicated, inconsistent over time, and varied according to language (English, French, Spanish, or Italian).11 These rules specified where in a word the Long-S should be used instead of a short < s >; whether it is capitalized; where it may be used in proximity to apostrophes, hyphens, and the letters < f >, < b >, < h >, and < k >; and whether it is used as part of a compound word or abbreviation.12 This is further complicated by the inclusion of the half-crossbar, which occasionally results in two consequences: (a) the Long-S may be interpreted by OCR as an < f >, and (b) the letters < b > and < f > may be interpreted by OCR as a Long-S. Figure 1 shows an example from the 3rd edition entry on Russia, in which the original text specifies "of" (line 1 in top figure), yet the OCR output has interpreted the character as a Long-S. The Long-S may also occasionally be interpreted by the OCR as a lowercase < l >, as in the "univerlity of Dublin" in the 3rd edition entry on Robinson (The most Rev Sir Richard).
These complications and inconsistencies are challenges when developing Python rules for correcting the Long-S in an automated way, and even preexisting scripts will need to be adapted for individual use with a particular corpus.

Figure 1. Example from the 3rd edition entry on Russia, comparing the original use of a letter < f > in "of" to the OCR output of the same passage, which mistakenly interprets the character as a Long-S.

Despite the transition away from the Long-S towards the end of the 18th century, the 3rd edition of the Encyclopedia Britannica (published in 1797) implements the Long-S throughout, with approximately 100,594 instances of the Long-S in the OCR output. When performing metadata generation with the HIVE tool on the OCR output for an entry, the Long-S is most often interpreted by the automatic metadata generation tool as an < f >, which can result in (a) inaccurate keyword extraction (e.g., Russians → Ruffians), and (b) essential topics becoming unidentifiable when mapping extracted keywords to controlled vocabulary terms; HIVE subsequently omits these topics from the results because they cannot be mapped. Figure 2 provides a truncated view of Long-S words in the 3rd edition entry on Rum, which are subsequently removed from the pool of automatically extracted keywords when performing the automatic subject indexing sequence in HIVE. Because keyword extraction algorithms are largely dependent upon term frequencies, automatic subject indexing for an entry on Rum may be substantially hindered when meaningful and frequently occurring words such as sugar and yeast are removed.

Figure 2. Examples of the Long-S in the 3rd edition Encyclopedia Britannica entry on Rum.
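To make the correction problem concrete, the following Python sketch shows two strategies under stated assumptions: a trivial substitution for the case where OCR preserves the < ſ > glyph itself, and a dictionary-based pass for the common case where the Long-S has already been rendered as an < f >. The toy lexicon and function names are invented for illustration; as noted above, a production script needs a full historical word list and corpus-specific rules (including the reverse < s >-for-< f > confusions and < l > substitutions).

```python
import re
from itertools import product

def normalize_long_s(text):
    """Trivial case: the OCR output preserved the Long-S glyph itself."""
    return re.sub("ſ", "s", text)

# Toy lexicon for demonstration; a real script needs a full word list.
ENGLISH_WORDS = {"russians", "sugar", "yeast", "of", "for", "offer"}

def correct_word(word):
    """Dictionary-based pass for the common case where the Long-S was
    OCR'd as an "f": if the word is unknown, try every combination of
    f->s substitutions and return the first variant in the lexicon."""
    lower = word.lower()
    if lower in ENGLISH_WORDS:
        return word
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    for flags in product([False, True], repeat=len(positions)):
        chars = list(lower)
        for pos, swap in zip(positions, flags):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate in ENGLISH_WORDS:
            return candidate
    return word  # leave unrecognized words untouched

print(normalize_long_s("ſugar"))   # -> sugar
print(correct_word("Ruffians"))    # -> russians
```

Note the inherent ambiguity the sketch sidesteps: legitimate f-words such as "offer" must not be rewritten, which is why the lexicon check runs before any substitution is attempted.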
Using this example entry, the automatic subject indexing results were compared using Python to determine which terms only appear when the Long-S has been corrected to the standard < s >. The comparison showed that 16 terms in total no longer appeared in the results when the Long-S was not corrected to a standard < s >: ten terms using the 2018 LCSH and six terms using the 1910 LCSH. These omitted results included the terms sugar and yeast. The next section discusses the encyclopedia entry word counts for this corpus and the possible impact that entry length may have upon automatic subject indexing between corrected and uncorrected Long-S instances.

Encyclopedia Entry Lengths

Consistent with other Encyclopedia Britannica editions in the 18th and 19th centuries, the encyclopedia entries in the 3rd edition vary substantially in length. A convenience sample of 3,849 3rd edition entries ranging in length from 2 to 202,848 words demonstrated an arithmetic mean of 826.60 and a median word count of 71. As shown in figure 3, this indicates a significant skew towards shorter entry lengths. For the vast majority of encyclopedia entries in this corpus, a low total word count may amplify the impact of the Long-S upon automatic subject indexing results, given the importance of term availability and frequency for keyword extraction algorithms.

Figure 3. Scatterplot of word count for a convenience sample of 3,849 3rd edition Encyclopedia Britannica entries.

Large-scale metadata generation requires time, labor, and resources, and it becomes more costly when accounting for the complications of correcting the Long-S for a particular corpus.
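The gap between the reported mean (826.60) and median (71) is itself the signal of this skew. The pattern can be reproduced with Python's statistics module on a small invented sample of word counts (illustrative values, not the actual corpus data):

```python
import statistics

# Invented word counts mimicking the corpus shape: many short entries
# and a handful of very long ones (not the actual 3,849-entry sample).
word_counts = [6, 20, 24, 38, 45, 71, 99, 112, 254, 620, 2920, 6114, 202848]

mean = statistics.mean(word_counts)
median = statistics.median(word_counts)
print(f"mean = {mean:.2f}, median = {median}")

# As in the real sample (mean 826.60 vs. median 71), a mean far above
# the median indicates a strong right skew toward short entries.
```

A few extreme values pull the arithmetic mean upward while the median stays anchored in the cluster of short entries, which is why the two statistics are reported together.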
Library and information professionals working with digital humanities resources will need to understand the impact of correcting or not correcting the Long-S in a corpus before designating resources and developing a protocol for generating automatic or semi-automatic metadata for full-text resources. This includes understanding whether or not the length of each individual document will affect the degree of Long-S impact upon the results. This challenge, and the issues reviewed above, are addressed in the research presented below.

OBJECTIVES

The overriding goal of this work is to determine the prevalence of omitted terms in automatic subject indexing results when the Long-S is not corrected in the 3rd edition entries of the Encyclopedia Britannica. Research questions:

1. What is the average number of terms that are omitted from automatic subject indexing results when the Long-S is not corrected to a standard < s >?
2. How does the encyclopedia entry length affect the number of terms that are omitted when the Long-S is not corrected to a standard < s >?

This analysis approaches these goals by performing a comparative analysis of automatic subject indexing results to determine the number of terms that are omitted from the results when the Long-S is not corrected to a standard letter < s >. Basic descriptive statistics are generated to determine central tendency. The quantities of terms omitted are then compared with encyclopedia entry word counts. These objectives were shaped by collaboration between Drexel University's Metadata Research Center and Temple University's Digital Scholarship Center. The next section of this paper reports on the methods and steps taken to address these objectives.
METHODS

We approached this research by performing a comparative analysis of subject metadata generated both before and after the correction of the historical Long-S in the 3rd edition of the Encyclopedia Britannica. The HIVE tool was used to automatically generate the subject metadata. Descriptive statistics were applied, and visualizations produced from the results were also examined to identify trends.

Figure 4. The 30 Encyclopedia Britannica 3rd edition entries randomly selected for this study, sorted in ascending order by their word counts.

The protocol for performing this research involved the following steps:

1. Compile a sample for testing:
1.1. A random sample of 30 encyclopedia entries was identified from a convenience sample of entries that comprise the letter S volumes of the 3rd edition. The entries range in length from 6 to 6,114 words. The median word count for entries in this sample is 99 words.
1.2. The sample of entries selected for this study and their respective word counts are visualized in figure 4.
1.3. For each entry, the Long-S terms in the original XML file were extracted to a list.

2. Perform the automatic subject indexing sequence upon the entries to generate lists of terms:
2.1. Using the 2018 and 1910 versions of the LCSH.
2.2. With fixed maximum subject heading results set to 40: 20 maximum terms returned with the 2018 LCSH, and 20 maximum terms returned with the 1910 LCSH.
2.3. Before Long-S correction and after Long-S correction, using the Oxygen XML Editor TEI to TXT transformation.

3. Perform an outer join on Python data frames, between terms generated when the Long-S has been corrected vs. terms generated when the Long-S has not been corrected. The resulting left outer join list displays the terms that are omitted from the automatic indexing results if the Long-S is not corrected to a standard small < s >.
The quantity of terms omitted was recorded for comparison.

4. Analysis: Descriptive statistics were generated to determine central tendency for the number and percentage of terms omitted when the Long-S is not corrected. The quantities of terms omitted were also visualized in a continuous scatterplot against the corresponding word counts, to examine whether the quantity of terms omitted when the Long-S is not corrected relates to the length of the document being automatically classified.

RESULTS

The results report the prevalence of omitted terms when the Long-S is not corrected to a standard < s >, as well as a visualization of the number of terms omitted as it relates to encyclopedia entry length. For each of the 30 sample entries automatically indexed with HIVE, a fixed maximum of 40 terms was returned: a maximum of 20 terms using the 2018 LCSH and a maximum of 20 terms using the 1910 LCSH. As seen in table 1, central tendency is measured using the arithmetic mean and median, along with the standard deviation and range. The average number of terms omitted from an entry's results is 6.73, and the average percentage of terms omitted from an entry's results is 26.51 percent, with the 2018 and 1910 editions of LCSH performing at similar rates. The full results are displayed in appendix A.

Table 1. Measures of centrality, standard deviation, range, and percentage for the quantity of terms omitted when the Long-S is not corrected to a standard < s >, rounded to the hundredth. For each entry, a maximum of 40 terms were returned: 20 using the 2018 LCSH and 20 using the 1910 LCSH. The total results returned vary according to entry length; these totals are reported in appendix B. (N = 30 entries.)

For each entry in the sample, the results in appendix A display the total words omitted when the Long-S is not corrected, the number of 2018 LCSH terms omitted, the number of 1910 LCSH terms omitted, and the encyclopedia entry word count.
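The outer-join comparison described in step 3 of the protocol can be sketched with pandas as follows. The term lists here are hypothetical stand-ins for the HIVE output of a single entry, not the study's actual data.

```python
import pandas as pd

# Hypothetical HIVE results for one entry, with and without correction.
corrected = pd.DataFrame({"term": ["Sugar", "Yeast", "Molasses", "Distillation"]})
uncorrected = pd.DataFrame({"term": ["Molasses", "Distillation"]})

# Outer join with an indicator column; rows present only on the corrected
# ("left") side are the terms omitted when the Long-S is left uncorrected.
merged = corrected.merge(uncorrected, on="term", how="outer", indicator=True)
omitted = merged.loc[merged["_merge"] == "left_only", "term"].tolist()
print(omitted)  # -> ['Sugar', 'Yeast']
```

The `indicator=True` flag adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`, so filtering on `left_only` recovers exactly the left-outer-join difference described above; `len(omitted)` then gives the per-entry count that feeds the descriptive statistics.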
                                   Both Vocabularies   2018 LCSH   1910 LCSH
Average, Terms Omitted                          6.73        3.67        3.07
Median, Terms Omitted                              5           3           2
Standard Deviation                              6.53        3.84        3.17
Range, Terms Omitted                            0-24        0-13        0-11
Average Percentage, Omitted Terms             26.51%      27.51%      24.28%
Median Percentage, Omitted Terms              22.36%      20.00%      19.09%

Figure 5 visualizes the total number of terms omitted for each entry when the Long-S is not corrected, demonstrating an increase in terms omitted for entries with lower word counts. These results are broken down by vocabulary in figure 6, demonstrating that both vocabularies used to generate these results show a significant increase in omitted terms for shorter entries.

Figure 5. Number of automatic subject indexing terms that are omitted when the Long-S is not corrected to a standard < s >, as compared by encyclopedia entry word count.

Figure 6. Number of automatic subject indexing terms that are omitted when the Long-S is not corrected to a standard < s >, as compared by encyclopedia entry word count, separated by controlled vocabulary version.

DISCUSSION

The analysis above presents measures of centrality for the quantity of terms omitted if the Long-S is not corrected to a standard < s > prior to automatic subject indexing using HIVE, as well as a visualization of the relationship between encyclopedia entry word count and number of terms omitted. Although researchers have identified challenges with the Long-S and have focused a great deal on the technologies and methods used to correct it, little work has examined the consequences of not correcting the Long-S character when performing an automatic subject indexing sequence.
This research demonstrated an average of 6.73 potentially relevant terms omitted from automatic indexing results when the Long-S is not corrected, accounting for an average of 26.51 percent of the total results, with an approximately equal distribution of omitted terms across the two controlled vocabulary versions used. When the quantity of terms omitted is visualized using a continuous scatterplot, the results also demonstrate a significant increase in omitted terms for shorter entries, with longer entries less affected. These results reflect the impact of term frequency and total word count in keyword extraction and automatic subject indexing, with longer documents having a greater pool of total terms from which to identify key terms.

Considering the complexities and similarities of the typographical characters in the original manuscript, the OCR output process for this corpus occasionally confuses the letters < s >, < f >, < r >, and < l >. As a result, an occasional Long-S word in this study did not originally contain an < s > (e.g., sor instead of for). Correction of these Long-S OCR errors requires the development of a dictionary-based script. An additional complication of this research is that the corrected OCR output for the encyclopedia entries still contains a few errors not related to the Long-S, which prevent the mapping of a term to any controlled vocabulary term (e.g., in the entry on Sepulchre, the OCR output for the term Palestine was Palestinc).

These results are specific to this particular corpus of 3rd edition Encyclopedia Britannica entries, but it is very likely that testing another set of pre-1800s documents containing the Long-S would likewise show that, for best results with any algorithm or tool, the Long-S needs to be corrected. The results are also specific to the two versions of the LCSH used, the 1910 LCSH and the 2018 LCSH, which are available in the HIVE tool.
The 1910 version is key for the time period being studied, and the 2018 version, closer to contemporary practice, has supported additional analysis of the impact of the Long-S. Both of these vocabularies are important to the larger 19th-Century Knowledge Project. It should be noted that while the LCSH is updated weekly, we were limited to what is available via the HIVE tool, and any discrepancies that may be found with the 2020 LCSH will very likely have a minimal effect upon metadata generation results. The 2020 LCSH will be incorporated into HIVE soon and can be explored in future research.

CONCLUSION AND NEXT STEPS

The objective of this research was to determine the impact of correcting the Long-S in pre-1800s documents when performing an automatic metadata generation sequence using keyword extraction and controlled vocabulary mapping. This was accomplished by performing an automatic subject indexing sequence using the HIVE tool, followed by a basic statistical analysis to determine the quantity of terms omitted from the results when the Long-S is not corrected to a standard < s >. The number of omitted terms was also compared with the encyclopedia entry word counts and visualized to demonstrate a significant increase in omitted terms for shorter encyclopedia entries. The study was conclusive in confirming that the correction of the Long-S is a critical part of our workflow. The significance of this research is that it demonstrates the necessity of correcting the Long-S prior to performing automatic subject indexing on historical documents. Beyond the correction of the Long-S, the larger next steps for this project are to continue to explore automatic metadata generation for this corpus. These next steps include the comparison of results using contemporary vs.
historical vocabularies and streamlining a protocol for bulk classification procedures and integration of terms into the TEI-XML headers. The research presented here can inform other digital humanities and even science-oriented projects, where researchers may not be aware of the impact of the Long-S on automatic metadata generation not only for subjects, but also for named entities, particularly when automatic approaches with controlled vocabularies are desired.

ACKNOWLEDGEMENTS

The author thanks Dr. Jane Greenberg and Dr. Peter Logan for their guidance. The author acknowledges the support of NEH grant #HAA-261228-18.

APPENDIX A

Entry Term                Total Words   2018 LCSH       1910 LCSH       Encyclopedia Entry
                          Omitted       Terms Omitted   Terms Omitted   Word Count
SARDIS                    24            13              11              381
SUCTION                   24            13              11              38
STYLITES, PILLAR SAINTS   19            13              6               199
SHADWELL                  14            10              4               211
SALICORNIA                13            6               7               254
SEPULCHRE                 11            3               8               348
SITTA NUTHATCH            9             5               4               620
SPRAT                     9             3               6               475
SERAPIS                   8             5               3               587
STRADA                    8             1               7               189
SHOAD                     7             4               3               463
SIGN                      7             5               2               68
SHOOTING                  6             3               3               6114
STRATA                    6             3               3               2920
STEWARTIA                 5             4               1               72
SUBCLAVIAN                5             3               2               20
SCHWEINFURT               4             2               2               84
SCROLL                    4             2               2               45
SPALATRO                  4             3               1               99
SPECIAL                   4             3               1               24
SAMOGITIA                 3             2               1               112
SHAKESPEARE               3             0               3               3855
SINAPISM                  2             1               1               25
SECT                      1             1               0               20
SEVERINO                  1             1               0               38
SHADDOCK                  1             1               0               6
SCARLET                   0             0               0               65
SHALLOP, SHALLOOP         0             0               0               42
SOLDANELLA                0             0               0               56
SPOLETTO                  0             0               0               99

APPENDIX B

*N = 30 entries

                          Average Terms Returned   Median Terms Returned
Corrected                 24.77 / 40 possible      28 / 40 possible
Uncorrected               26.47 / 40 possible      29 / 40 possible
2018 LCSH Corrected       14.10 / 20 possible      19 / 20 possible
2018 LCSH Uncorrected     13.47 / 20 possible      18.5 / 20 possible
1910 LCSH Corrected       11.27 / 20 possible      11 / 20 possible
1910 LCSH Uncorrected     10.13 / 20 possible      9 / 20 possible
ENDNOTES

1 Liz Woolcott, "Understanding Metadata: What is Metadata, and What is it For?," Routledge (November 17, 2017), https://doi.org/10.1080/01639374.2017.1358232; Koraljka Golub et al., "A framework for evaluating automatic indexing or classification in the context of retrieval," Journal of the Association for Information Science and Technology 67, no. 1 (2016), https://doi.org/10.1002/asi.23600; Lynne C. Howarth, "Metadata and Bibliographic Control: Soul-Mates or Two Solitudes?," Cataloging & Classification Quarterly 40, no. 3-4 (2005), https://doi.org/10.1300/J104v40n03_03.

2 A. Belaid et al., "Automatic indexing and reformulation of ancient dictionaries" (paper presented at the First International Workshop on Document Image Analysis for Libraries, Palo Alto, CA, 2004), https://doi.org/10.1109/DIAL.2004.1263264.

3 Beatrice Alex et al., "Digitised Historical Text: Does it have to be mediOCRe?" (paper presented at KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012); Ted Underwood, "A half-decent OCR normalizer for English texts after 1700," The Stone and the Shell, December 10, 2013, https://tedunderwood.com/2013/12/10/a-half-decent-ocr-normalizer-for-english-texts-after-1700/.

4 "Nineteenth-century knowledge project" (GitHub repository), 2020, https://tu-plogan.github.io/.

5 "Nineteenth-century Knowledge Project."

6 Marcia Lei Zeng and Lois Mai Chan, "Metadata Interoperability and Standardization - A Study of Methodology, Part II," D-Lib Magazine 12, no. 6 (2006); G. Bueno-de-la-Fuente, D. Rodríguez Mateos, and J. Greenberg, "Chapter 10 - Automatic Text Indexing with SKOS Vocabularies in HIVE" (Elsevier Ltd, 2016); Sheila Bair and Sharon Carlson, "Where Keywords Fail: Using Metadata to Facilitate Digital Humanities Scholarship," Journal of Library Metadata 8, no. 3 (2008), https://doi.org/10.1080/19386380802398503.
7 John Walsh, "The use of Library of Congress Subject Headings in digital collections," Library Review 60, no. 4 (2011), https://doi.org/10.1108/00242531111127875.

8 Jane Greenberg et al., "HIVE: Helping interdisciplinary vocabulary engineering," Bulletin of the American Society for Information Science and Technology 37, no. 4 (2011), https://doi.org/10.1002/bult.2011.1720370407.

9 Sam Grabus et al., "Representing Aboutness: Automatically Indexing 19th-Century Encyclopedia Britannica Entries," NASKO 7 (2019), pp. 138-48, https://doi.org/10.7152/nasko.v7i1.15635.

10 Karen Attar, "S and Long S," in Oxford Companion to the Book, eds. Michael Felix Suarez and H. R. Woudhuysen (Oxford: Oxford University Press, 2010); Ingrid Tieken-Boon van Ostade, "Spelling systems," in An Introduction to Late Modern English (Edinburgh University Press, 2009).

11 Andrew West, "The Rules for Long-S," TUGboat 32, no. 1 (2011).

12 Attar, "S and Long S."