Text Recognition for Nepalese Manuscripts in Pracalit Script DATA PAPER CORRESPONDING AUTHOR: Alexander James O’Neill Department of East Asian Languages and Cultures, SOAS University of London, London, UK ao34@soas.ac.uk KEYWORDS: handwritten text recognition; PyLAia; Transkribus; Sanskrit; Newar; Manuscripts TO CITE THIS ARTICLE: O’Neill, A. J., & Hill, N. (2022). Text Recognition for Nepalese Manuscripts in Pracalit Script. Journal of Open Humanities Data, 8: 26, pp. 1–6. DOI: https://doi.org/10.5334/ johd.90 Text Recognition for Nepalese Manuscripts in Pracalit Script ALEXANDER JAMES O’NEILL NATHAN HILL ABSTRACT This dataset is a model for handwritten text recognition (HTR) of Sanskrit and Newar Nepalese manuscripts in Pracalit script. This paper introduces the state of the field in Newar literature, Newar manuscripts, and HTR engines. It explains our methodology for developing the requisite ground truth consisting of manuscript images and corresponding transcriptions, training our model with a PyLAia engine, and this model’s limitations. This dataset shared on Zenodo can be used by anyone working with manuscripts in Pracalit script, which will benefit the fields of Indology and Newar studies, as well as historical and linguistic analysis. *Author affiliations can be found in the back matter of this article mailto:ao34@soas.ac.uk https://doi.org/10.5334/johd.90 https://doi.org/10.5334/johd.90 https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-6423-017X 2O’Neill and Hill Journal of Open Humanities Data DOI: 10.5334/johd.90 (1) OVERVIEW REPOSITORY LOCATION https://doi.org/10.5281/zenodo.6967421. CONTEXT Newar (also referred to as Nepāl Bhāṣā) is the indigenous language of the Kathmandu Valley. In its pre-print phase, this highly literate and creative culture produced thousands of works that have remained mainly unstudied in either western or Nepalese scholarship. Much of Newar literature is a mixture of Newar, Sanskrit, and Maithili (Malla, 1981, 6–9). While Newar literature is written in various scripts, the most common by far is the Pracalit script, which has thus also come to be known as Newar Lipi (Newar script) (Pandey, 2012). Thus, for both Indological interest in Nepalese manuscripts written in Sanskrit and for students of Newar language and culture, a means to compile a digital corpus more quickly through optical character recognition (OCR) becomes apparent. OCR engines have gradually become more effective in recent decades. Handwritten text recognition (HTR) has proven to be far more problematic. Deep learning neural networks have made it possible to build HTR models based on images of handwritten text linked with corresponding transcriptions (called “ground truth”). A character error rate (CER) under 10% allows for effective automatic transcription (Muehlberger et al., 2019). Advances in computing power and storage made by the Transkribus platform developed by READ-COOP have enabled the training of large data sets involving multiple hands, allowing for generalised HTR models for particular writing styles (Hodel et al., 2021). Transkribus hosts two HTR engines: CITIlab-HTR+ (Michael et al., 2018) and PyLaia, a PyTorch-based model (Mocholí Calvo et al., 2018). In principle, models for HTR of Indic texts can be developed similarly to those in Roman scripts. Transkribus already has two publicly available HTR+ models for printed 19th and 20th century Devanagari developed by Nicole Merkel-Hilf (2022). This project focused on expanding the abilities of HTR models to Indic texts in pre-print and non-Devanagari sources, focusing on Sanskrit and Newar (Nepāl Bhāṣā) manuscripts in Pracalit script from the 16th to 19th centuries. (2) METHOD An HTR trainer requires diplomatic transcriptions of Pracalit manuscripts to line up with text in manuscript photographs. Critically edited editions can speed up transcription and ground truth generation through de-correction. Databases like GRETIL, from which we sourced the published transcriptions, make it possible to bootstrap a non-existent HTR model by using texts from other scripts (Georg-August-Universität Göttingen, 2020). To this end, transcriptions were prepared based on the following four Nepalese manuscripts, each with different varieties of Pracalit script. For each entry in the list below, in order, the manuscript title is given in italics followed by call numbers in parentheses, deposit location, manuscript languages and date, and sources of the corresponding transcriptions: 1. Hitopadeśa (MIK I 4851) Staatsbibliothek zu Berlin Mixed Newar and Sanskrit, 1561 CE Original transcription by Alexander James O’Neill 2. Vetālapañcaviṃśati (HS. Or. 6414) Staatsbibliothek zu Berlin Newar, 1675 CE Adapted transcription based on unpublished materials by Felix Otter (Otter, n.d.a) 3. Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) Cambridge Digital Library Sanskrit, 18th century Adapted transcription based on an edition by Lokesh Chandra (Chandra, 1999) 4. Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) Royal Asiatic Society Online Collection Mixed Newar and Sanskrit, c. 1800 https://doi.org/10.5281/zenodo.6967421 3O’Neill and Hill Journal of Open Humanities Data DOI: 10.5334/johd.90 Adapted transcription based on unpublished materials by Felix Otter (Otter, n.d.b) and the published Nagarjuna Institute transcription (Shakya & Bajracharya, 2001) While the HTR+ engine appeared to have difficulty working with the lack of word division, PyLaia produced better results, and we used it for the rest of the training. We trained the model on 441 pages of manual transcriptions of the above four manuscripts, with validation performed on 242 pages that were not part of the training set. It was further tested and continues to be used on pages that were not part of the training or validation sets. We decided it would be most appropriate and culturally sensitive to transcribe into Unicode Pracilt (Unicode, Inc., 2021), see Figure 1. Using 250 epochs, Transkribus trained a model with a CER on the training set of 2.6% and 0.1% on the validation set. This discrepancy may signify little more than that the latter had fewer complex characters to recognise. Therefore, the model produces accurate results when transcribing the same or similar hands to those responsible for these four manuscripts, see Figure 2. QUALITY CONTROL The model has a higher CER when applied to irregular forms of Pracalit script, including more ornate or rougher hands (Figure 3) However, with a trained base model, new hands require Figure 2 Screenshot of the model’s learning curve on Transkribus. Figure 1 Screenshot of a completed transcription of a folio of Hitopadeśa (MIK I 4851) in Transkribus. 4O’Neill and Hill Journal of Open Humanities Data DOI: 10.5334/johd.90 significantly fewer pages, ranging from ten to thirty pages of new ground truth. We will update and refine the model with new ground truth as we encounter variant hands. The main limitation of this model’s initial and continued training is the lack of transcriptions. However, bootstrapping existing editions and transcriptions and feeding corrected machine- generated transcriptions back into the model are workable solutions. In transcription, the model encounters difficulties with damaged or soiled manuscripts, irregular spacing, punctuation, and illustrations interrupting the text. It is worth noting that while the vast majority of Pracalit manuscripts are written in a scriptio continua, occasional spacing and irregular punctuation conventions produce mixed results for the model. While mistakes in ground truth produce incorrect transcriptions, a larger mass of correct ground truth reduces the impact of any one mistake. (3) DATASET DESCRIPTION Object name – OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth Format names and versions – png and xml Creation dates – 2022-04-01 – 2022-08-04 Dataset creators – Alexander James O’Neill, SOAS University of London, Data curation, Formal Analysis, Investigation, Methodology, Validation, Visualization Language – Sanskrit and Newar License – Creative Commons Attribution 4.0 International Repository name – Zenodo Publication date – 2022-08-05 (4) REUSE POTENTIAL While it is possible to share models within Transkribus, this has limited potential for the shared creation of ground truth. As modelled by the GitHub collection “HTR united,” which combines the ground truth of French documents (Chaqué & Clérice, 2021), it is possible to make ground truth data sets available in ways that others can use within platforms such as Transkribus and elsewhere. We have therefore made our dataset publicly available on Zenodo in the form of PNG and XML files that can be used on HTR platforms (O’Neill, 2022). For the future, in collaboration with the Centre of Asian and Transcultural Studies (CATS) Bibliothek at the University of Heidelberg, we are participating in the development of a South Asian Studies- specific ground truth database in a FID4SA (Fachinformationsdienst für Südasien: Specialised Information Service for South Asia) dataverse, called “Ground truth data for HTR on South Asian Scripts,” as part of the University of Heidelberg’s research data archive heiDATA (Universität Heidelberg, 2022). As the most labour-intensive part of philological practice, the ability to quickly produce machine- readable transcriptions of various witnesses of an Indic text is of great value to Indology and other disciplines. This enables high-speed searches and comparisons of corpora, as well as linguistic analysis through machine-learning methods (Meelen et al., 2021). In disciplines such as Newar studies, where there is both a paucity of trained scholars and a profusion of manuscripts, this tool can contribute to easing the burden of compiling and editing a digital corpus, which will benefit linguistic, literary, and historical analysis of the Newar language by easing the burden of work with primary manuscript sources. Figure 3 An example of a cruder form of Pracalit, from Vetālapañcaviṃśati (HS. Or. 6414), transcribed on Transkribus. 5O’Neill and Hill Journal of Open Humanities Data DOI: 10.5334/johd.90 ACKNOWLEDGEMENTS We would like to extend our thanks to Felix Otter (Philipps-Universität Marburg) for providing us with transcriptions. FUNDING INFORMATION This work was funded by the Arts and Humanities Research Council (AHRC), UKRI, as part of the project “The Emergence of Egophoricity: a diachronic investigation into the marking of the conscious self.” Project Reference: AH/V011235/1. Principal Investigator: Nathan Hill, SOAS University of London. COMPETING INTERESTS The authors have no competing interests to declare. AUTHOR CONTRIBUTIONS Alexander James O’Neill: Data curation, Formal Analysis Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing. Nathan Hill: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing. AUTHOR AFFILIATIONS Alexander James O’Neill orcid.org/0000-0001-9982-2589 Department of East Asian Languages and Cultures, SOAS University of London, London, UK Nathan Hill orcid.org/0000-0001-6423-017X Department of East Asian Languages and Cultures, SOAS University of London, London, UK; Trinity Centre for Asian Studies, Trinity College Dublin, Dublin, Ireland REFERENCES Chandra, L. (Ed.) (1999). Guṇakāraṇḍavyūhasūtram. International Academy of Indian Culture. Chaqué, A., & Clérice, T. (2021). HTR-United. GitHub. https://github.com/HTR-United/htr-united (last accessed: 9 November 2022). Georg-August-Universität Göttingen. (2020). GRETIL: Göttingen Register of Electronic Texts in Indian Languages and related Indological materials from Central and Southeast Asia. GRETIL. Retrieved from http://gretil.sub.uni-goettingen.de/gretil.html (last accessed: 22 August 2022). Hodel, T., Schoch, D., Schneider, C., & Purcell, J. (2021). General Models for Handwritten Text Recognition: Feasibility and State-of-the-Art. German Kurrent as an Example. Journal of Open Humanities Data, 7(13), 1–10. DOI: https://doi.org/10.5334/johd.46 Malla, K. P. (1981). Classical Newari Literature. Nepal Study Centre. Meelen, M., Roux, E., & Hill, N. (2021). “Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods.” ACM Transactions on Asian and Low- Resource Language Information Processing, 20(1), 1–11. DOI: https://doi.org/10.1145/3409488 Merkel-Hilf, N. (2022). Ground Truth data for printed Devanagari [Dataset]. In FID4SA@heiDATA. DOI: https://doi.org/10.11588/data/EGOKEI Michael, J., Weidemann, M., & Labahn, R. (2018). HTR Engine Based on NNs P3: Optimizing speed and performance - HTR+ [Deliverable 7.9 for READ project funded by EU Horizon 2020 Project 674943]. READ-COOP. Retrieved from https://readcoop.eu/wp-content/uploads/2018/12/Del_D7_9.pdf (last accessed: 8 November 2022). Mocholí Calvo, C., Vidal Ruiz, E., & Puigcerver i Pérez, J. (2018). Development and experimentation of a deep learning system for convolutional and recurrent neural networks [Degree final work]. Universitat Politècnica de València. Retrieved from https://riunet.upv.es/bitstream/ handle/10251/107062/MOCHOL%C3%8D%20-%20Desarrollo%20y%20experimentaci%C3%B3n%20 de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale.... pdf?sequence=1&isAllowed=y (last accessed: 8 November 2022). Muehlberger, G., Seawrd, L., Terras, M., Ares Oliveira, S., Bosch, V., Bryan, M., Culluto, S., Déjean, H., Diem, M., Fiel, S., Gatos, B., Greinoecker, A., Grüning, T., Hackl, G., Haukkovaara, V., Heyer, G., Hirvonen, L., Hodel, T., Jokinen, M., … Zagoris, K. (2019). Transforming scholarship in the archives https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-6423-017X https://orcid.org/0000-0001-6423-017X https://github.com/HTR-United/htr-united http://gretil.sub.uni-goettingen.de/gretil.html https://doi.org/10.5334/johd.46 https://doi.org/10.1145/3409488 https://doi.org/10.11588/data/EGOKEI https://readcoop.eu/wp-content/uploads/2018/12/Del_D7_9.pdf https://riunet.upv.es/bitstream/handle/10251/107062/MOCHOL%C3%8D%20-%20Desarrollo%20y%20experimentaci%C3%B3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isAllowed=y https://riunet.upv.es/bitstream/handle/10251/107062/MOCHOL%C3%8D%20-%20Desarrollo%20y%20experimentaci%C3%B3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isAllowed=y https://riunet.upv.es/bitstream/handle/10251/107062/MOCHOL%C3%8D%20-%20Desarrollo%20y%20experimentaci%C3%B3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isAllowed=y https://riunet.upv.es/bitstream/handle/10251/107062/MOCHOL%C3%8D%20-%20Desarrollo%20y%20experimentaci%C3%B3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isAllowed=y 6O’Neill and Hill Journal of Open Humanities Data DOI: 10.5334/johd.90 TO CITE THIS ARTICLE: O’Neill, A. J., & Hill, N. (2022). Text Recognition for Nepalese Manuscripts in Pracalit Script. Journal of Open Humanities Data, 8: 26, pp. 1–6. DOI: https://doi.org/10.5334/ johd.90 Published: 30 November 2022 COPYRIGHT: © 2022 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/ licenses/by/4.0/. Journal of Open Humanities Data is a peer-reviewed open access journal published by Ubiquity Press. through handwritten text recognition: Transkribus as a case study. Journal of Documentation, 75(5), 954–976. DOI: https://doi.org/10.1108/JD-07-2018-0114 O’Neill, A. (2022). OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth [Dataset]. In Zenodo. DOI: https://doi.org/10.5281/zenodo.6967421 Otter, F. (n.d.a). Vetālapañcaviṃśati [Unpublished transcription]. Otter, F. (n.d.b). Madhyamasvayaṃbhūpurāṇa [Unpublished transcription]. Pandey, A. (2012). Proposal to Encode the Newar Script in ISO/IEC 10646 [Proposal from the Script Encoding Initiative]. eScholarship. https://escholarship.org/uc/item/50c8w93x Shakya, M. B., & Bajracharya, S. H. (Eds.) (2001). Svayambhū Purāṇa. Nagarjuna Institute of Exact Methods. Unicode, Inc. (2021). Newa Range: 11400–1147F [Excepted Character Code tables for The Unicode Standard, Version 14.0]. Unicode. Retrieved from https://www.unicode.org/charts/PDF/U11400.pdf (last accessed: 8 November 2022). Universität Heidelberg. (2022). Ground truth data for HTR on South Asian Scripts. FID4SA@heidata. Retrieved from https://heidata.uni-heidelberg.de/dataverse/FID4SA-GT (last accessed: 9 November 2022). https://doi.org/10.5334/johd.90 https://doi.org/10.5334/johd.90 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1108/JD-07-2018-0114 https://doi.org/10.5281/zenodo.6967421 https://escholarship.org/uc/item/50c8w93x https://www.unicode.org/charts/PDF/U11400.pdf https://heidata.uni-heidelberg.de/dataverse/FID4SA-GT