Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan

RESEARCH PAPER

RAFAL FELBUR, MARIEKE MEELEN, PAUL VIERTHALER
*Author affiliations can be found in the back matter of this article

CORRESPONDING AUTHOR: Marieke Meelen, University of Cambridge, GB, mm986@cam.ac.uk

KEYWORDS: Cross-linguistic STS; Information Retrieval; Buddhist Chinese; Classical Tibetan; Translation Studies

TO CITE THIS ARTICLE: Felbur, R., Meelen, M., & Vierthaler, P. (2022). Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan. Journal of Open Humanities Data, 8(1): 23, pp. 1–14. DOI: https://doi.org/10.5334/johd.86

ABSTRACT
In this paper we present the first-ever procedure for identifying highly similar sequences of text in Chinese and Tibetan translations of Buddhist sūtra literature. We initially propose this procedure as an aid to scholars engaged in the philological study of Buddhist documents. We create a cross-lingual embedding space and use the cosine similarity of averaged sequence vectors to produce unsupervised cross-linguistic parallel alignments at word, sentence, and even paragraph level. Initial results show that our method lays a solid foundation for the future development of a fully-fledged Information Retrieval tool for these (and potentially other) low-resource historical languages.

1 INTRODUCTION
Buddhist sūtra texts, which are fundamental sources for understanding the beliefs that once dominated, and largely continue to dominate, Asian societies, present formidable challenges to the modern researcher. Like oral literature, the sūtras are authorless and textually fluid, and their content is complex and often formulaic (Silk, 2020). As a result, it is often impossible to determine the ‘original’ form of a given work. The situation is complicated further by the huge volume of these documents and the linguistic diversity of their extant versions: for most, only fragments survive in the languages of their original composition (i.e. Sanskrit or other Indic languages) and all we have are their translations, mainly into Chinese and Tibetan. In this paper we present a novel method1 designed to help researchers tackle these challenges more effectively than has been possible to date. This is a method for automatic detection of cross-linguistic semantic textual similarity (STS) across historical Chinese and Tibetan Buddhist textual materials. It aims to enable philologists to take any passage in a Chinese Buddhist translation text and quickly locate Tibetan-language parallels to it anywhere in the Tibetan Buddhist canon. The novelty of our contribution is its cross-linguistic capability for historical, low-resource and under-researched languages. Although in both of the languages in question, Buddhist Chinese and Classical Tibetan, searching for parallel passages (i.e.
monolingual alignment) is possible (Klein, Dershowitz, Wolf, Almogi, & Wangchuk, 2014; Nehrdich, 2020, as well as, in a crude but effective way, through the user interfaces of CB Reader, in both its web-based and desktop versions, or the SAT Daizōkyō Text Database), cross-linguistic semantic textual similarity and Information Retrieval (i.e. cross-linguistic ‘alignments’) in Buddhist texts have long remained an unsolved task. For a limited number of edited texts in Sanskrit and Tibetan, an attempt at automatic crosslinguistic alignment has recently been made by Nehrdich (2020)2 using the YASA sentence aligner.3 However, this method depends on the availability of texts in which words and sentences have been manually pre-segmented, which is not the case for the vast majority of texts we are targeting. Furthermore, being designed for Sanskrit and Tibetan, this method is not currently applicable to our highly specific Buddhist Chinese. In short, no advanced cross-linguistic information retrieval techniques have yet been developed for any historical languages. Both the Tibetan and Buddhist Chinese texts under investigation pose particular challenges because of, for example, their different scripts, the lack of word segmentation and sentence boundaries, and the highly specific Buddhist terms and (often deliberately) obscure double meanings. In this paper we build on the extant work on these languages by Vierthaler and Gelein (2019) and Vierthaler (2020) (for alignment and segmentation of Buddhist Chinese) and Meelen and Hill (2017), Faggionato and Meelen (2019) and Meelen, Roux, and Hill (2021) (for segmentation and POS tagging of Old and Classical Tibetan) to develop the first-ever Buddhist Chinese-Tibetan cross-linguistic STS pipeline, creating unsupervised cross-linguistic alignments for words, sentences, and whole paragraphs of these Buddhist texts, and potentially of contemporaneous non-Buddhist materials as well. Our proposed procedure for these highly specific Buddhist Chinese and Tibetan texts will be an important asset for anyone working with under-researched and low-resource historical languages.

2 METHOD
In recent years, large digitisation projects have provided online access to huge Buddhist Chinese and Buddhist Tibetan corpora: digitised versions of over 70,000 traditional woodblock print pages in the Tibetan case, as well as, on the Chinese side, of some 80,000 typeset print pages of the modern Taishō canon, in addition to growing quantities of other canonical and extra-canonical materials. In this section we show how we developed our procedure step by step. Figure 1 shows the full pipeline of our proposed procedure, starting with tokenisation of the individual Chinese and Tibetan corpora and ending with the full output ranked after clustering and optimisation of cosine similarity scores of target outputs.

1 All code available on https://github.com/vierth/buddhist_chinese_classical_tibetan (last accessed: 8 August 2022).
2 https://github.com/sebastian-nehrdich/sanskrit-tibetan-etexts (last accessed: 8 August 2022).
3 http://rali.iro.umontreal.ca/rali/?q=en/yasa (last accessed: 8 August 2022).
2.1 TOKENISATION
While tokenisation and sentence segmentation are not usually significant hurdles when working with documents written in Western languages, in which words are delineated by white space, these are not trivial tasks for either premodern Chinese, including Buddhist Chinese, or Classical Tibetan. Neither language uses clear morphological markers or white space to indicate words, and in many cases it is not easy even to divide a text into sentences or utterances. Accordingly, before we can develop a model, we must first preprocess our corpora to include token and sentence boundaries.

Tokenisation is especially challenging on the Chinese side. For the Chinese, we use Chinese Buddhist translation texts from the Kanseki repository (Wittern, 2016).4 These texts are mostly provided with punctuation, which makes sentence-level segmentation relatively simple. Complications arise, however, when it comes to segmentation on the word level of these materials. While much effort is currently being invested in attempts to develop tools that will segment Chinese texts into words (some of them specifically designed to segment Buddhist materials, e.g. Wang, 2020), these tools remain unusable to us, since the underlying models themselves are often not openly released, and the training data used to create them is often not available. For this reason, we had to devise our own strategy for tokenising the Chinese Buddhist translation texts. In doing so, we used three different approaches and compared their efficiency: word-based tokenisation, character-based tokenisation, and a hybrid approach. For the first approach, we began by compiling a tokenisation dictionary on the basis of two glossaries of Buddhist terms (Inagaki, 1978; Yokoyama & Hirosawa, 1996). This allowed us to scan each sentence in our texts for Buddhist terms listed in these glossaries, prioritising longer sequences of characters. Once the Buddhist vocabulary was identified, the remaining sequences not found in the glossaries were parsed into words using a Classical Chinese tokeniser5 (see Qi, Zhang, Zhang, Bolton, & Manning, 2020). Because this word-based tokeniser introduced significant noise into our downstream tasks, we tested two other tokenisation approaches: a character-based approach that treats individual characters as tokens, and a hybrid approach that uses the word-based tokenisation described above, but which parses sequences not found in the glossaries simply as individual characters (i.e. without using the Classical Chinese tokeniser). We also enhanced the dictionaries, using more advanced glossaries by Karashima Seishi (Karashima, 1998, 2001, 2010) for our first test, which we will refer to as ‘Hybrid 1’, and an even further extended dictionary including the Da zhidu lun glossary (Li, 2011), which we will refer to as ‘Hybrid 2’.

On the Tibetan side, tokenisation was converted to a syllable-tagging and recombination task with the ACTib scripts6 developed by Meelen et al. (2021). As for sentence segmentation, we could use the technique developed by Meelen and Roux (2020) and optimised by Faggionato, Hill, and Meelen (2022) to create sentence boundaries in Tibetan, which performs well but is not 100% accurate. Since existing automatic aligners rely on sentence boundaries, this accuracy is of crucial importance.
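To make the hybrid strategy concrete, the following is a minimal sketch of a greedy longest-match tokeniser with single-character fallback, in the spirit of the approach described above. The glossary entries and the maximum term length are illustrative placeholders, not our actual dictionaries or settings.

```python
# Hedged sketch of glossary-based hybrid tokenisation: greedily match the
# longest glossary term starting at each position; anything not covered by
# the glossary falls back to single characters. The glossary below is a
# placeholder, not the Inagaki or Yokoyama-Hirosawa material.

def hybrid_tokenise(text, glossary, max_len=6):
    """Greedy longest-match tokenisation with character fallback."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in glossary:  # prioritise longer Buddhist terms
                tokens.append(candidate)
                i += length
                break
        else:  # no multi-character glossary term found at this position
            tokens.append(text[i])
            i += 1
    return tokens

glossary = {"如來", "般若波羅蜜"}  # placeholder entries
print(hybrid_tokenise("如來說般若波羅蜜", glossary))
# -> ['如來', '說', '般若波羅蜜']
```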
Another issue that arises in this context is the difference between the Chinese and Tibetan texts we focus on specifically, as there are often multiple Tibetan sentences corresponding to one sentence in Buddhist Chinese. For these reasons, our procedure is based solely on semantic textual similarity, thereby bypassing the need for sentence boundaries altogether.

4 Kanseki Repository http://web.archive.org/web/20210418080358/http://blog.kanripo.org/ (last accessed: 8 August 2022). The texts themselves are hosted on GitHub: https://github.com/kanripo (last accessed: 8 August 2022) and derive from work done by the CBETA project.
5 As distributed through the Stanza python library. https://stanfordnlp.github.io/stanza/available_models.html (last accessed: 8 August 2022).
6 https://github.com/lothelanor/actib (last accessed: 8 August 2022).

Figure 1 Pipeline for the overall procedure of cross-lingual Buddhist Chinese & Classical Tibetan alignment.

2.2 DEVELOPING EMBEDDINGS
There are many ways to acquire useful vector representations of words, known as word embeddings, which in turn can be used to aid downstream tasks like text classification, stylometric analysis, sentiment analysis and, crucially for us, information retrieval and its specific application in automatic textual alignment. These range from straightforward count vector models that simply track word frequency across a corpus, to more advanced algorithms like Google’s Word2Vec and Facebook’s FastText, which use neural networks to develop models that can predict words based on a set of context words (continuous bag of words, or CBOW), or that can predict context words when given an input term (skip-gram). State-of-the-art word representations can be attained using transformer-based algorithms like BERT (Devlin, Chang, Lee, & Toutanova, 2019) and ERNIE (Zhang et al., 2019), which learn word representations by predicting masked words. In our procedure, in order to balance sophistication against complexity, we have elected to use FastText to create the embeddings that drive our approach.7

In addition to selecting the most adequate embedding method, it is essential to choose the most appropriate textual corpus as a basis for the embeddings. Since our goal was to create an embedding model that will be useful for the specific goal of aligning Chinese and Tibetan Buddhist translation texts, we chose a corpus that contains just the type of language that is specifically used in these texts. This is essential because the idiom and style of Buddhist texts are usually markedly different from those of the broader language as a whole. Accordingly, for Chinese, we used Buddhist texts contained within the Kanseki repository, encompassing the Taishō edition of the Chinese Buddhist canon and a variety of supplementary materials, for a total of 4,137 documents containing 174m characters (20,775 unique). For Tibetan, we used the sūtra translations in the Kangyur (the electronic Derge version of the eKangyur collection), as well as electronic versions of commentarial and other texts in the entire eTengyur, to create a corpus large enough for training word embeddings.
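As a hedged illustration of this step, the sketch below trains the two monolingual FastText models with gensim. Apart from the 100 dimensions mentioned in the next paragraph, the hyperparameters and the toy corpora are illustrative assumptions, not our exact training configuration.

```python
# Hedged sketch of training the two monolingual FastText spaces with gensim.
# The toy corpora stand in for the tokenised Kanseki and eKangyur/eTengyur
# texts; hyperparameters other than the dimensionality are illustrative.
from gensim.models import FastText

def train_embeddings(sentences):
    # sentences: list of token lists (characters, words, or hybrid units)
    return FastText(sentences, vector_size=100, window=5,
                    min_count=1, sg=1, epochs=10)

chinese_model = train_embeddings([["如來", "說", "法"], ["菩薩", "行"]])
tibetan_model = train_embeddings([["de", "bzhin", "gshegs", "pa"]])
```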
The eKangyur consists of around 27m tokens and the eTengyur of around 58m tokens (see Meelen & Roux, 2020); together these represent 31k unique tokens. Because we are attempting to develop a system that is not dependent on a priori knowledge of which Chinese text ‘should’ align with which Tibetan text, we trained two separate embeddings, one on the Chinese Buddhist texts and one on the Tibetan. That is, we took each corpus independently and fed the corpora into the FastText algorithm with the same settings, creating two independent spaces of 100 dimensions each. We then projected the resulting embeddings into the same space, creating a combined embedding space, discussed in Section 2.3.

2.3 COMBINING EMBEDDINGS
For creating the combined embedding space, we adopted the approach of Glavaš, Franco-Salvador, Ponzetto, and Rosso (2018),8 which is in turn an implementation of the linear translation matrix approach suggested by Mikolov, Le, and Sutskever (2013). In effect, our method takes an embedding space for each language and then relies on a bilingual glossary to create a linear projection. This projection casts the two spaces into a shared space, one which preserves internal linguistic similarity while trying to bring the glossary terms as close together as possible.9 Using the two embedding spaces created in the previous step, we can then apply the aforementioned Yokoyama-Hirosawa and Inagaki glossaries, which provide Chinese and Tibetan translation pairs. We then identify every pair for which we have an embedding in both Chinese and Tibetan and use all these pairs together to create a projection into a shared embedding space.

7 While it might be ideal to use a transformer model, there are no available models trained on Buddhist Chinese or Classical Tibetan specifically, and existing models for modern Chinese or even Tibetan are not suitable for the task, since the classical languages differ too much from the corresponding contemporary varieties. We therefore leave transformers for future research and use FastText rather than Word2Vec, as it learns sub-word-level representations of terms, which in the end creates a slightly more flexible model.
8 Following the method they describe in Glavaš et al. (2018), we adapted their translation matrix code (https://bitbucket.org/gg42554/cl-sts/src/master/code/ [last accessed: 8 August 2022]) for this project.
9 It is possible that orthogonal constraints on the translation matrix and other normalisations could improve the resulting embedding space, as is suggested by Xing, Wang, Liu, and Lin (2015). However, this would require extensive refactoring of code and is planned for the future.

In cases where the translation glossary includes a multi-character Chinese term not found in the embedding space, but where all constituent characters are present, an embedding is derived by averaging the vectors for all the characters within the word. We can glean some insight into the quality of the new shared embedding space by looking at the cosine similarity between known translation pairs from the glossaries, as shown in Table 1. The results listed in Table 1 show that the different Chinese tokenisation approaches lead to different rates of similarity in the shared embedding space.
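A minimal sketch of this combination step is given below: a least-squares translation matrix in the spirit of Mikolov et al. (2013) (our actual implementation adapts the Glavaš et al. code), together with the character-averaging back-off just described and the cosine check summarised in Table 1. All variable names are illustrative.

```python
import numpy as np

def learn_projection(src_vecs, tgt_vecs):
    # Least-squares translation matrix W minimising ||XW - Y|| over the
    # glossary translation pairs; a simple variant of the linear projection
    # without the orthogonality constraints mentioned in footnote 9.
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W  # shape (dim, dim)

def average_term_vector(term, model):
    # Back-off for multi-character glossary terms absent from the Chinese
    # embedding space: average the constituent character vectors.
    return np.mean([model.wv[ch] for ch in term], axis=0)

def pair_similarities(src_vecs, tgt_vecs, W):
    # Cosine similarity of each projected glossary pair, as summarised
    # in Table 1.
    proj = src_vecs @ W
    num = np.sum(proj * tgt_vecs, axis=1)
    den = np.linalg.norm(proj, axis=1) * np.linalg.norm(tgt_vecs, axis=1)
    return num / den
```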
For word-based embeddings and, to a lesser extent, ‘Hybrid 2’, these results also indicate that, in general, the larger the tokenisation dictionary, the higher the similarity. Although word-based tokenisation performs slightly better at this initial step, it does not work as well as the hybrid approaches for our downstream tasks, as shown in Section 3 below.

Table 1 Summary of cosine similarity scores of Tibetan-Chinese glossary pairs within the new embedding spaces according to Chinese tokenisation method, showing the highest-scoring pair, the lowest-scoring pair, and some descriptive statistics. Higher scores with lower standard deviation indicate a more accurate embedding space.

CHINESE EMBEDDING TYPE   MOST SIMILAR   LEAST SIMILAR   MEDIAN   MEAN   STD
Character                0.90           –0.20           0.66     0.64   0.12
Hybrid1                  0.90            0.19           0.66     0.65   0.11
Hybrid2                  0.91            0.22           0.66     0.64   0.11
Word                     0.92            0.30           0.67     0.67   0.11

As a further sanity check, we visualised some embeddings to see whether similar words indeed exist in close proximity to each other. The resulting visualisation is presented in Figure 2, which demonstrates this for some sample vectors for animals, directions, numbers, and seasons.10 All these categories are nicely clustered together, as expected. The only outlier is Tibetan nya sha, which was labelled as an animal but actually means ‘fish (as) meat’, i.e. fish that will be eaten. It is therefore not entirely surprising that it would be farther away from the rest of the animal words, which are not used as food. Figure 3 is a zoomed-in view of the ‘animal’ cluster from Figure 2, with English translations for the vectors. This zoomed-in view shows that Tibetan and Chinese equivalents are placed relatively close together, as expected.

Figure 2 A sample of embeddings selected from the cross-lingual Tibetan-Chinese space, including a selection of animal, numerical, seasonal, and directional words.

10 The embeddings exist in 100-dimensional space, and we have used tSNE to reduce the dimensionality in order to visualise the relationships. Please note that this preserves local similarity but obscures global differences.

There is room for improvement in the quality of the shared embedding space, but the real test is the space’s utility for the task at hand, which is identifying textual sequences with similar semantic meaning across languages.

2.4 IDENTIFYING SIMILAR SEQUENCES
With the combined and checked word embeddings in hand, we are ready to apply our procedure to what has been the main goal all along, i.e. searching for sequences of text in both Tibetan and Chinese that carry similar meanings. In this pilot study we use as our source texts three Chinese sūtras from the Mahāratnakūṭa (MRK) collection, which have been manually divided into sections.11 We then tokenise each section into either characters, words, or Buddhist terms (as in our two hybrid embedding approaches). Then we fetch the vector for each token in the section and average the vectors together to create a vector representation of the entire section.
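The following sketch illustrates this averaging step, together with the sliding-window cosine ranking described in the next paragraphs. `embeddings` is assumed to be a dict-like map from token to vector in the shared space, and `length_factor`, `step` and `top_k` are illustrative parameters rather than our actual settings.

```python
import numpy as np

def section_vector(tokens, embeddings):
    # Average the shared-space vectors of all tokens that have an embedding.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_windows(zh_tokens, bo_tokens, embeddings,
                 length_factor=1.5, step=5, top_k=15):
    # Slide a window over the Tibetan token stream; the window length is the
    # Chinese section length scaled by an adjustable factor (Section 2.5).
    query = section_vector(zh_tokens, embeddings)
    window = int(len(zh_tokens) * length_factor)
    scored = []
    for start in range(0, max(1, len(bo_tokens) - window + 1), step):
        cand = section_vector(bo_tokens[start:start + window], embeddings)
        scored.append((cosine(query, cand), start))
    return sorted(scored, reverse=True)[:top_k]
```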
We then define Tibetan texts parallel to the Chinese sūtras as the ‘target’. We divide this target text into sections as well, using a sliding window of text from a Tibetan candidate document, the length of which is based on the length of the Chinese section, adjusted by a length factor. We then calculate the cosine similarity between the Chinese section in question and all Tibetan sections. Finally, we have the system rank the suggested results based on the highest cosine similarity of the combined embeddings, and report the results. The highest-scoring sections are likely to have similar meaning.

2.5 PARAMETER SETTINGS, CLUSTERING & OPTIMISATION
When we looked closely at the generated results, we found that we could improve their quality by optimising the test parameter settings, specifically the length of the Tibetan search window. One reason why such optimisation proved advantageous may be the fact that the Tibetan text is always more elaborate than the Chinese, meaning that for every Chinese passage of n tokens, the parallel Tibetan will include roughly 50% more tokens. In order to accommodate this difference, we extended the Tibetan search window by a fixed rate (proportional rates proved inefficient, hence we rejected them), in order to ensure the results would cover the entire Chinese input. Significantly shorter Chinese input phrases required a different rate still, since they tend to be proportionally even longer in Tibetan than are longer Chinese phrases. In Section 3.3 we discuss the parameter options for optimising results for different input lengths.

11 Please see section 3.1 of the Alignment Scoring Manual (Handy & Meelen, 2022): https://zenodo.org/record/6782150#.Yu5UIcbA5pQ (last accessed: 8 August 2022).

Figure 3 A zoomed-in detail of some of the animal words from the cross-lingual embedding space shown in Figure 2, including English translations.

2.6 SAMPLE OUTPUT
Figure 4 shows an excerpt of a sample output file with the Chinese input (shown in line 1), the Tibetan target (shown in line 2), further information on location, ranking, similarity scores, etc., as well as the clustered outputs and information on how well they fit with the target. Alignments are identified by their unique alignment codes, e.g. ‘T2.A1’ refers to ‘Alignment number 1 in Text 2’. A complete overview of all manual alignments used for evaluation (see Section 2.7) can be found in the Supplementary Files.

2.7 EVALUATION METHOD
Our alignment outputs automatically receive similarity scores, which allows them to be automatically ranked. This in turn is useful to philologists, as it allows for displaying any number of ‘top’ alignments, depending on the task at hand (e.g. top 5, 10 or 15). In order to evaluate our automatic Chinese-Tibetan alignment outputs, we compared them to a manually created gold standard.
This gold standard refers to a set of data produced by expert philologists,12 who manually aligned three of our source and target texts and provided alignment scores based on machine translation evaluation techniques. Producing these manual alignments was a non-trivial task, for two reasons. First, while nominally speaking the Chinese and Tibetan texts in question are translations of the same Indic Buddhist scripture, in no case can we assume that the two were in fact translated from the same original source in Sanskrit or another Indic source language; indeed, the two texts in each pair often differ from each other strikingly, in some cases entirely. Second, the very process of manually scoring the proposed alignments, with the aim of identifying ‘near-perfect’ pairs, is also to a considerable degree subjective, so much so that even experienced philologists with excellent knowledge of both languages can differ in judgement. In order to mitigate both of the problems listed above, we created a detailed annotation and scoring guide, with diagnostics and precise decision-making criteria, as well as examples.13 In addition, we had a random sample of alignments double-checked by multiple annotators, in order to check for consistency.14 All in all, the philologists identified 80 near-perfect alignment pairs for three Chinese input texts and their corresponding Tibetan targets (42 for Text 1; 21 for Text 2; 17 for Text 3). These 80 alignments constituted our gold standard, which we used in testing the effectiveness and accuracy of our procedure. This manually developed gold standard is available for only three Chinese texts and their Tibetan counterparts at present, which is why we focus on these three pairs of texts only in the evaluation of this pilot study.

12 These philologists were from the ERC-funded OpenPhilology Project (https://openphilology.eu/team [last accessed: 8 August 2022]).
13 This guide is available on Zenodo: https://zenodo.org/record/6782150#.Yr3FiMbRZpQ (last accessed: 8 August 2022), cf. Handy and Meelen (2022).
14 A comprehensive inter-annotator agreement study could further improve the results, but was beyond the scope of the present pilot study.

Figure 4 Sample output for Alignment T2.A1.

The three texts in question are:
1. Xulai jing 須賴經 (T329), from the late 3rd-early 4th century, and the Des pas zhus pa (D71), from ca. the late 8th century, translations, into Chinese and Tibetan respectively, of the *Sūrata-paripṛcchā (henceforth ‘Text 1’)
2. Henghe shang youpoyi hui 恒河上優婆夷會 (T310 [31]), from the early 8th century, and the Gang ga’i mchog gis zhus pa (D75), roughly a century later, translations of the *Gaṅgottarā-paripṛcchā (henceforth ‘Text 2’)
3. Shande tianzi hui 善徳天子會 (T310 [35]), from the early 8th century, and the Sangs rgyas kyi yul bsam gyis mi khyab pa bstan pa (D79), roughly a century later, translations of the *Acintyabuddhaviṣaya-nirdeśa (henceforth ‘Text 3’)

All three texts survive in their entirety only in the Chinese and Tibetan translations, with no known complete Sanskrit or other Indic-language versions. They also differ in many ways, one of which is especially consequential for our results: Text 1 is mainly narrative, and consists of stories that illustrate moral points, while the latter two are more abstract-philosophical and contain a narrower set of more technical metaphysical concepts.
We weigh the implications of this difference in Section 3.1. For this pilot study, we use the Chinese sentence as input and let the system find Tibetan equivalents that are semantically as similar as possible, ideally capturing the exact target that the philologists identified in the gold standard.

3 RESULTS
In this section we present the results of using the different methods of creating Buddhist Chinese embeddings described in Section 2.2 above. As these embeddings were not yet optimised, a comparison of the effectiveness of the different methods when applied to each of our three texts can give us further insight into which method is best suited for the task at hand. Tibetan word embeddings were already optimised (see Meelen, 2022), including the addition of specialist (Buddhist) terms. In the remainder of this section, we first present the aggregate results per text, and then zoom in on select ‘interesting’ results in order to discuss how they may have been affected by the different embedding methods used, as well as by the unique characteristics of the inputs qua vocabulary, style, and grammar.

3.1 RESULTS PER TEXT
Table 2 shows what percentage of outputs for each text was ranked first or in the top 5/10/15; a separate listing is given for each of the four Chinese embedding methods. Ideally, the system would automatically rank the exact Tibetan target ‘first’, so that philologists can instantly find the Tibetan equivalents of the Chinese inputs they are looking for. However, since this will not always, or even frequently, happen, a dedicated user interface for philologists should display the top 5/10/15 (depending on preference), which the user would then go through by hand. For this reason, we list not only the percentage of target alignments that were automatically ranked first, but also those where the target was found in the top 5/10/15, as well as the average ranking of the target result and the number of cases in which the target alignment in Tibetan was not found in the top 15 (i.e. ranked ‘zero’).15

Table 2 shows that the results for Text 2 are always better than those for Text 1 and Text 3: the average rank is better (i.e. numerically lower, ranging from 1.24 with Character embeddings to 2.48 with Word embeddings); there are no zero results with any of the embedding methods used; and it has the highest percentage of perfectly matched target results in the top ranks (with almost all targets found in the top 5 with any embedding method). In practice, this means that philologists inputting Chinese passages from Text 2 are very likely to be presented with exact Tibetan targets (i.e. semantically similar passages or target alignments as identified manually by philologists) when searching the entire text. The results for Texts 1 and 3 are not as outstanding, but are still very good, with average rankings between 3.3 and 4.6 (as against 1.2–2.4 for Text 2). Still, for both Texts 1 and 3, we came across some problematic cases in which the system found no Tibetan equivalent in the top 15 of the ranked results, as well as ones in which the Character-embedding method yielded zero results. These problematic cases are particularly interesting to us: by looking at what went wrong we may understand how to improve our system. One example of such a problematic case with a ‘zero result’ is Alignment 20 in Text 1 (T1.A20), as shown in example 1. The highest-ranked match for this input based on Character embeddings is shown in 1c.
15 Note that ‘zero’ could mean, for example, that the target was ranked 16th, which is not such a bad result. However, if a targeted interface for philologists only displays the top 15 results, then anything ranked lower cannot be considered.

(1) (a) [Chinese input of T1.A20] (b) [Tibetan target] (c) [highest-ranked output]

The system ranked 1c first, suggesting that it matches the input very closely, while even a quick look reveals that this is not at all the case. The colour coding in the examples shows, however, that the highest-ranked output contains multiple matches for a number of individual key terms present in the input, such as ‘your highness/majesty’, ‘wealth/precious jewel’, etc. Although, meaning-wise, the high-ranked output suggested by the system differs from the input, these key terms do occur multiple times in both. This latter fact may have contributed decisively to the relatively high cosine similarity score of 0.90184 (standard deviation of the similarity score: 0.02; average similarity score 0.85 for this alignment). We may add here that this problem seems to persist also with other embedding methods: for instance, for Hybrid-2 embeddings, the highest-ranked result is an altogether different passage of the text than the one discussed immediately above, but this one too contains the very same crucial key terms ‘Your Majesty’ and ‘precious jewel’ multiple times.

In addition to such individual cases, we also need to account for the differences in the quality of results between our three texts. One of the reasons for these differences may be the fact that Texts 1 and 3 are much longer than Text 2 (Text 1 has 4,463 Tibetan tokens; Text 2 has 2,484, and Text 3 has 10,930), while at the same time the individual Chinese inputs for Text 2 are much shorter, which is a reflection both of the internal features of the text and of the personal preferences of the philologist who aligned it. Generally, the longer the text, the more difficult it is to rank the target match first (especially when the input passages are short), simply because there are many more competing matches than there are in a shorter text. We will discuss this further in Section 3.3 below. Another reason might be the subjectivity of the manual alignments, which depend to some extent on the discretion of the philologist, as mentioned before. In addition, currently our only measure for evaluating the accuracy of the results in this pilot study is the ranking of the target Tibetan. This ranking is, however, not always entirely reliable, and it can easily be influenced, or even distorted, by a number of factors, e.g. text length and how repetitive or diverse the content is. One important aspect our current evaluation metric disregards is how closely non-target results with high cosine similarity scores reflect the semantic content of the input. Although we currently do not have the required data to evaluate this automatically, we will shed some light on this in Section 3.5.
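As an illustration of how the summary statistics reported in Table 2 can be derived, the sketch below assumes each gold alignment has been reduced to the 1-based rank of its Tibetan target in the system output, with None for targets outside the top 15 (the ‘zero’ cases); computing the average rank over found targets only is our assumption.

```python
def summarise_ranks(target_ranks, cutoffs=(1, 5, 10, 15)):
    # target_ranks: one entry per gold alignment; None = not in the top 15.
    found = [r for r in target_ranks if r is not None]
    summary = {f"%rank{c}": 100 * sum(r <= c for r in found) / len(target_ranks)
               for c in cutoffs}
    summary["avg_rank"] = sum(found) / len(found)  # assumption: found targets only
    summary["#zero"] = target_ranks.count(None)
    return summary

print(summarise_ranks([1, 3, None, 2, 1]))
# {'%rank1': 40.0, '%rank5': 80.0, '%rank10': 80.0, '%rank15': 80.0,
#  'avg_rank': 1.75, '#zero': 1}
```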
Table 2 Results for all texts with four embedding methods for the Chinese input.

TEXT – CHI. EMBEDDING TYPE   %RANK1   %RANK5   %RANK10   %RANK15   AV. RANK   #ZERO
Text 1 – Character           30.95    69.05    78.57     92.86     4.33       2
Text 1 – Hybrid 1            35.71    69.05    88.1      92.86     3.56       0
Text 1 – Hybrid 2            40.48    73.81    90.48     95.24     3.4        0
Text 1 – Word                38.1     61.9     76.19     85.71     3.92       2
Text 2 – Character           76.19    100      100       100       1.24       0
Text 2 – Hybrid 1            52.38    100      100       100       2          0
Text 2 – Hybrid 2            61.9     100      100       100       1.57       0
Text 2 – Word                42.86    95.24    100       100       2.48       0
Text 3 – Character           35.29    47.06    52.94     70.59     4.58       1
Text 3 – Hybrid 1            35.29    64.71    82.35     88.23     3.53       0
Text 3 – Hybrid 2            35.29    58.82    82.35     82.35     3.36       0
Text 3 – Word                11.76    52.94    70.59     70.59     3.92       2

3.2 THE EFFECT OF DIFFERENT CHINESE EMBEDDING METHODS
One variable parameter in our results consists of the different methods of creating Chinese embeddings, as described in Section 2.2 above. ‘Hybrid 2’ embeddings are essentially ‘Hybrid 1’ embeddings extended with additional Buddhist terms from the Da zhidu lun glossary. Therefore, whenever ‘Hybrid 2’ embeddings yielded better results for certain alignments than did ‘Hybrid 1’ embeddings, we expect this is because these alignments contain terminology that is only found in the Da zhidu lun glossary. One clear example of this is Alignment 21 in Text 1 (ranked first with Hybrid 2, but sixth with Hybrid 1). This alignment contains 如來 ‘Tathāgata’ which, among the glossaries we used, is only found in the Da zhidu lun glossary, and not in the Karashima lists upon which the ‘Hybrid 1’ embeddings were based. This example is shown in 2, along with its Tibetan target:

(2) (a) [Chinese input] (b) [Tibetan target]

Figure 5 shows the results (up to top-10 ranks) from Table 2 in a chart organised by type of Chinese embedding. Though this pattern of superiority of Hybrid-2 over Hybrid-1 embeddings is expected and indeed quite common in our results, we also found one counterexample to it, namely the short Alignment 11 in Text 2 (shown in 3). In this case, Hybrid 1 performed best (target ranked 5th), while Hybrid 2 embeddings had the target ranked 11th. This is unexpected, because the input contains 攀縁 ‘in accordance with conditions’, which is found in the Karashima lists, but not in the Da zhidu lun glossary. This means that this particular term was included in both Hybrid 1 and Hybrid 2 embeddings, and there must be another, as yet unidentified, reason why the Hybrid 1 embeddings yield a better result here.

(3) (a) [Chinese input] (b) [Tibetan target]

Figure 5 Top-ranked results for each Chinese embedding method by text.

Another category of results consists of those in which Character embeddings performed best. In these cases we expect to be dealing with inputs that contain few multi-character proper nouns and specialist Buddhist terms, which is indeed the usual pattern. Nonetheless, we found a number of exceptions, e.g. Alignment 12 of Text 2. The input here does contain some technical multi-character terms (世尊 ‘Bhagavān’, 能知 ‘knowable’ and 能得 ‘graspable’). This might lead one to expect that hybrid embeddings would perform best. This, however, is not the case: Character embeddings proved superior. The reason for this is not entirely clear, although it may have something to do with the fact that all the terms listed above also make sense if they are split up into single characters (‘world-honour’, ‘able-know’, ‘able-grasp’ respectively).
A similar explanation can be offered for Alignment 33 in Text 1 (ranked 14th with Character embeddings vs 24th/35th with Hybrid-1/2 embeddings), so this phenomenon does not appear to be text-specific. Other cases of better performance of Character embeddings include:
• Text 1: Alignments 27 (ranked 2nd with Character vs 4th/7th with Hybrid-1/Word) and 32 (ranked 2nd with Character vs 7th with Hybrid-1 and Word);
• Text 3: Alignments 12 and 15 (both ranked 1st with Character vs 3rd/4th with Hybrid-1/Word), and also 7 (ranked 8th with Character vs 19th/36th with Hybrid-1/Word), 13 (ranked 1st with Character vs 3rd/6th with Hybrid-1/Hybrid-2) and 14 (ranked 1st with Character vs 6th/3rd with Hybrid/Word).

Some of these cases are especially difficult to interpret. For instance, Alignments 27 and 32 of Text 1 contain multi-character proper names, like 波斯匿 ‘Prasenajit’. These are expected to pose difficulties for Character embeddings, for, while they can be read as individual characters, this would result in gibberish: 波-斯-匿 is ‘wave-this-conceal’. Similarly, Alignments 12 and 15 of Text 3 contain the long phonetic transcription of a Sanskrit name, 文殊師利 ‘Mañjuśrī’, which, if read as individual characters, would make little sense (‘literature-distinct-teacher-benefit’), and which therefore can only be ‘misleading’ for alignment purposes. As for Alignments 7, 13 and 14 of Text 3, the fact that Character embeddings performed best may be related to the fact that the inputs are extremely short, consisting of at most 7 characters (see Section 3.3). These types of unexpected examples form a minority, however, and while further analysis of such cases is a desideratum, it can only be performed at a later stage, using a larger dataset. Overall, we can conclude that in the three texts we have investigated for this pilot study, the enhanced Hybrid-2 embeddings generally perform better for alignments that contain specialist Buddhist terminology, and that in the absence of such terminology, Character embeddings perform equally well or better, which is exactly what we expected.

3.3 THE EFFECT OF INPUT LENGTH
Some texts exhibit a relatively high degree of repetition of short, generic clauses. This presents a challenge for the alignment procedure, as it is unclear which passage is the target identified by the philologists if multiple passages with very similar meanings are present in the text. This problem pertains especially to Texts 2 and 3, where aligned segments are relatively short. Especially in Text 3, we have short recurring inputs like ‘X said’, e.g. Alignment 7 with input 諸比丘言 ‘all the monks said’ (ranked 8th) or Alignment 11 with input 汝等應知 ‘you all should know’ (partial match ranked 12th, because the Tibetan target contains an additional vocative ‘friends!’). While short inputs pose challenges to our procedure, very long inputs usually lead to good results. One example of this is Alignment 10 in Text 3, which contains a very long tantric incantation. As input length clearly affects our results, we included the option of adjusting several minor parameters in order to improve the results for variable input lengths, as follows (two of these options are sketched after the list):
• the proportion by which to adjust long phrases (as they are generally longer in Tibetan than in Chinese);
• the proportion by which to adjust short phrases (as short Chinese phrases are often significantly longer in Tibetan);
• the length threshold for what constitutes a “short phrase”;
• how far apart results can be clustered together in the final analysis (results within n words of each other get reported as a single result).
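Two of these options are sketched below under stated assumptions: a window-length adjustment that applies a separate (larger) factor to short inputs, and a clustering pass that merges ranked hits (in the (score, start) form produced by the earlier retrieval sketch) whose start positions lie within `cluster_dist` tokens of each other. All parameter values are illustrative, not our tested settings.

```python
def window_length(n_zh_tokens, long_factor=1.5, short_factor=2.0,
                  short_threshold=8):
    # Short Chinese phrases tend to be proportionally even longer in
    # Tibetan, so they receive a larger adjustment factor.
    factor = short_factor if n_zh_tokens < short_threshold else long_factor
    return int(n_zh_tokens * factor)

def cluster_hits(scored_hits, cluster_dist=10):
    # scored_hits: (score, start) pairs sorted by score, best first.
    # Keep a hit only if no better-scoring hit starts within cluster_dist.
    clustered = []
    for score, start in scored_hits:
        if all(abs(start - kept) > cluster_dist for _, kept in clustered):
            clustered.append((score, start))
    return clustered
```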
Of all these minor parameters, we observed that the greatest impact on the results could be generated by adjusting the parameters for long and short phrases. This is most clearly seen in examples from Text 2. Text 2 has the longest input alignments in general (with a median length of 21 characters; Text 1 has a median of 12.5, and Text 3 a median of 10), and Alignments 4, 6 and 15 of this text demonstrate the importance of adjustments according to phrase length. With the new setting of a 50% increased adjustment length for short phrases from Chinese to Tibetan, instead of the much longer 130%/140%/160% options we tested before, the rankings of results improved significantly (ranking improvements of 14 → 3 for Alignment 4; 11 → 2 for Alignment 6; and 6 → 2 for Alignment 15). For some alignments, however, reducing the phrasal length settings resulted not in higher rankings but in lower ones, although these differences were much smaller than the gains observed for the other alignments (ranking 1 → 3 for Alignment 1; 1 → 2 for Alignments 10 and 17). Our current corpora are too small to justify any generalisations here. However, based on the results of our pilot study we can conclude that it is certainly worthwhile to allow for the adjustment of additional parameters, and that the optimal settings are a function of input length and content (i.e. how common the key terms of the input are and how often they recur in the text).

3.4 THE EFFECT OF MANUAL ANNOTATION
One limitation of the current pilot study lies in the manual annotation: the alignment scores for each of our texts were added by three different philologists. For Text 1, we asked the same annotator to provide scores for his alignments on two different occasions, at least one year apart. We observed that some alignments he had at first identified as perfect equivalents (score 5) were scored 4 in the second round of manual annotation. This illustrates the important issue of subjectivity in manual scoring. This issue can only be effectively addressed through rigorous and repeated large-scale inter-annotator agreement checks. However, at present such checks are almost impossible for logistical reasons: they require the time- and labour-intensive participation of multiple philologists who are experts in both classical languages as well as in the highly complex Buddhist content of the texts, and such participation is extremely difficult to secure. In view of this, while in future work we hope to include at least partial inter-annotator agreement scores, in the present pilot study we had to settle for the sub-optimal single-scored method.

3.5 MEASURING THE SUCCESS OF ACTUAL SEMANTIC SIMILARITY
Alignment 20 from Text 1, illustrated in example 1 above, already showed that frequently occurring key terms can have a negative impact on ranking: whenever key terms occur repeatedly, the chances of multiple outputs with high cosine similarity scores increase, and the chances of a high ranking for just one specific output (corresponding to the target) decrease. In this section we briefly demonstrate that although lower rankings may initially indicate a bad result, this does not necessarily mean that our system is performing badly: high-ranked outputs may not be the exact target (as identified by expert philologists in our gold standard), but they could still convey the same or a very similar meaning.
We can see this in particular for alignments where the average cosine similarity results are low. Consider, for example, Alignment 8 from Text 1:

(4) (a) [Chinese input] (b) [Tibetan target]

(5) [highest-ranked output]

The average cosine similarity of this alignment with the Hybrid-2 embeddings is only 0.80 (standard deviation of 0.02). The target is ranked 2nd with a cosine similarity of 0.85807, but the highest-ranked output shown in (5) scored 0.88005. The colour coding shows that this output contains two of the key terms present in the Chinese input. Since the Chinese input is relatively short, overlap in two such highly specific terms can yield relatively high similarity and thus lead to a highly ranked result.

4 CONCLUSION
In this paper we presented the first-ever procedure for identifying highly similar sequences of text in Chinese and Tibetan translations of Buddhist sūtra literature. Our pilot study is based on creating a cross-lingual embedding space and using the cosine similarity of averaged sequence vectors to produce unsupervised cross-linguistic parallel alignments at word, sentence, and even paragraph level. We evaluated the results of the pilot study against three Buddhist texts that were manually aligned by expert philologists. Initial results show that our method lays a solid foundation for the future development of a fully-fledged Information Retrieval tool for these (and potentially other) low-resource, historical languages. We will address questions of scalability and of further philological use cases in future research.

SUPPLEMENTARY FILES
Supplementary materials are deposited on Zenodo:
• Alignment Scoring Manual (Handy & Meelen, 2022): https://doi.org/10.5281/zenodo.6782150
• Buddhist Chinese embeddings (Vierthaler, 2022): https://doi.org/10.5281/zenodo.6782932
• Classical Tibetan embeddings (Meelen, 2022): https://doi.org/10.5281/zenodo.6782247

ACKNOWLEDGEMENTS
Thanks to the British Academy and to the European Research Council (ERC) for financial support, as well as to Gregory Forgues & Jonathan A. Silk for manual alignments.

FUNDING INFORMATION
This work was supported by the European Research Council (ERC) under the Horizon 2020 program (Advanced Grant agreement No 741884).

COMPETING INTERESTS
The authors have no competing interests to declare.

AUTHOR AFFILIATIONS
Rafal Felbur (orcid.org/0000-0002-0555-9992) Leiden University, NL
Marieke Meelen (orcid.org/0000-0003-0395-8372) University of Cambridge, GB
Paul Vierthaler (orcid.org/0000-0002-2135-9499) College of William and Mary, US

REFERENCES
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423 (last accessed: 8 August 2022). DOI: https://doi.org/10.18653/v1/N19-1423
Faggionato, C., Hill, N., & Meelen, M. (2022, June). NLP pipeline for annotating (endangered) Tibetan and Newar varieties. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference (pp. 1–6). Marseille, France: European Language Resources Association.
Faggionato, C., & Meelen, M. (2019).
Developing the Old Tibetan treebank. In R. Mitkov & G. Angelova (Eds.), Proceedings of Recent Advances in Natural Language Processing (pp. 304–312). Varna: Incoma. DOI: https://doi.org/10.26615/978-954-452-056-4_035
Glavaš, G., Franco-Salvador, M., Ponzetto, S. P., & Rosso, P. (2018). A resource-light method for cross-lingual semantic textual similarity. Knowledge-Based Systems, 143, 1–9. DOI: https://doi.org/10.1016/j.knosys.2017.11.041
Handy, C., & Meelen, M. (2022, June). MRK alignment scoring guidelines. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.6782150 (last accessed: 8 August 2022).
Inagaki, H. (1978). Index to the Larger Sukhāvatīvyūha-sūtra. A Tibetan glossary with Sanskrit and Tibetan equivalents. Tokyo: Nagata Bunshudo.
Karashima, S. (1998). A glossary of Dharmarakṣa’s translation of the Lotus Sutra: Zheng fahua jing ci dian. Tokyo: The International Research Institute for Advanced Buddhology, Soka University.
Karashima, S. (2001). A glossary of Kumārajīva’s translation of the Lotus Sutra: Myōhō Rengekyō shiten. Tokyo: The International Research Institute for Advanced Buddhology, Soka University.
Karashima, S. (2010). A glossary of Lokakṣema’s translation of the Aṣṭasāhasrikā Prajñāpāramitā. Tokyo: The International Research Institute for Advanced Buddhology, Soka University.
Klein, B. E., Dershowitz, N., Wolf, L., Almogi, O., & Wangchuk, D. (2014). Finding inexact quotations within a Tibetan Buddhist corpus. In 9th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2014, Lausanne, Switzerland, 8–12 July 2014, Conference Abstracts.
Li, Q. (2011). Da zhidu lun cidian 大智度論辭典. Electronic resource. Retrieved from https://www.dropbox.com/s/ocsagb529k3e70v/dzdl.bgl?dl=0 (last accessed: 1 June 2021).
Meelen, M. (2022). Tibetan language models: from distributional semantics to facilitating Tibetan NLP. Accepted submission to IATS 2022.
Meelen, M., & Hill, N. (2017). Segmenting and POS tagging Classical Tibetan using a memory-based tagger. Himalayan Linguistics, 16(2). DOI: https://doi.org/10.5070/H916234501
Meelen, M., & Roux, É. (2020).
Meta-dating the parsed corpus of Tibetan (PACTib). In Proceedings of the 19th Workshop on Treebanks and Linguistic Theories (pp. 31–42). DOI: https://doi.org/10.18653/v1/2020.tlt-1.3
Meelen, M., Roux, É., & Hill, N. (2021). Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based, and deep-learning methods. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 20(1), 1–11. DOI: https://doi.org/10.1145/3409488
Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168. Retrieved from http://arxiv.org/abs/1309.4168 (last accessed: 8 August 2022).
Nehrdich, S. (2020). A method for the calculation of parallel passages for Buddhist Chinese sources based on million-scale nearest neighbor search. Journal of the Japanese Association for Digital Humanities, 5(2), 132–153. DOI: https://doi.org/10.17928/jjadh.5.2_132
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082. DOI: https://doi.org/10.18653/v1/2020.acl-demos.14
Silk, J. A. (2020). Tekisuto sokei no nai kōtei: Bukkyō kyōten to yudayakyō rabi bunken kenkyū ni okeru honbun hihan, soshite ‘Hirakareta bunkengaku’ dejitaru hyūmanitīzu purojekuto テキスト祖型のない校訂: 佛敎經典とユダヤ敎ラビ文獻硏究における本文批評、そして「開かれた文獻學」デジタルヒューマニティーズプロジェクト [Editing without an Ur-text: Buddhist sūtras, rabbinic text criticism, and the Open Philology digital humanities project]. Tōyō no Shisō to Shūkyō 東洋の思想と宗敎, 37, 22–58.
Vierthaler, P. (2020). A simple dictionary-based tokenizer for Classical Chinese text. Retrieved from https://github.com/vierth/dictionary_parser (last accessed: 8 August 2022).
Vierthaler, P. (2022, June). Buddhist Chinese word embeddings. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.6782932 (last accessed: 8 August 2022).
Vierthaler, P., & Gelein, M. (2019, March 22). A BLAST-based, language-agnostic text reuse algorithm with a MARKUS implementation and sequence alignment optimized for large Chinese corpora. Journal of Cultural Analytics, 4(2). DOI: https://doi.org/10.22148/16.034
Wang, Y.-C. (2020). Word segmentation for Classical Chinese Buddhist literature. Journal of the Japanese Association for Digital Humanities, 5(2), 154–172. DOI: https://doi.org/10.17928/jjadh.5.2_154
Wittern, C. (2016). The Kanseki repository: A new online resource for Chinese textual studies. Digital Scholarship in History and the Humanities.
Xing, C., Wang, D., Liu, C., & Lin, Y. (2015, May–June). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006–1011). Denver, Colorado: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N15-1104 (last accessed: 8 August 2022).
Yokoyama, K., & Hirosawa, T. (1996). Index to the Yogācārabhūmi, Chinese-Sanskrit-Tibetan: 漢梵蔵対照瑜伽師地論総索引. Tokyo: Sankibō Busshorin.