The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-language Originals
DATA PAPER
MATT ERLIN
ANDREW PIPER
DOUGLAS KNOX
STEPHEN PENTECOST
ALLIE BLANK
*Author affiliations can be found in the back matter of this article
CORRESPONDING AUTHOR: Matt Erlin, Germanic Languages and Literatures, Washington University, St. Louis, US, merlin@wustl.edu
KEYWORDS: translation studies; computational literary studies; world literature; natural language processing; text corpus; text collection
TO CITE THIS ARTICLE: Erlin, M., Piper, A., Knox, D., Pentecost, S., & Blank, A. (2022). The TRANSCOMP Dataset of Literary Translations from 120 Languages and a Parallel Collection of English-language Originals. Journal of Open Humanities Data, 8: 29, pp. 1–6. DOI: https://doi.org/10.5334/johd.94
ABSTRACT
The TRANSCOMP Dataset of Literary Translations is a collection of document-level word frequencies sampled from 10,631 translations into English of global literary fiction published since 1950, together with a historically matched parallel corpus of 10,682 fictional works originally published in English. We provide CSV files with word frequency counts for 10,000-word samples taken from each text. The associated metadata is available in a separate CSV. These data will be useful to literary scholars and linguists working in translation studies, and those interested in the linguistic, stylistic, and thematic specificity of translations from particular regions.
(1) OVERVIEW
REPOSITORY LOCATION
doi.org/10.7910/DVN/ITLGQV
CONTEXT
This dataset consists of document-level word frequency samples drawn from a parallel corpus containing 10,631 translations of literary fiction into English from 120 different languages published since 1950, along with a comparable set of 10,682 works of fiction written originally in English during the same time period. All texts are contained in the HathiTrust Digital Library and are derived from the ca. 176,000-volume NovelTM collection created by Underwood et al. (2020). The dataset was compiled as part of an ongoing research project into the unique linguistic, stylistic, and thematic features of translated fiction as compared to fiction written originally in English. Following the precedent established by Toury’s (1980) and Baker’s (1993) pioneering work on translation universals, our aim has been to create two independent corpora that enable researchers to evaluate translated texts as they relate to target-language texts in general, rather than to compile a corpus of translations and their corresponding source texts. While corpora designed for comparative translation studies do exist, including a number of parallel corpora, they are often focused on single pairs of languages and/or non-literary texts; moreover, they are not constructed to facilitate the kind of historical comparisons that interest computational literary scholars. To our knowledge, no existing collection of historically matched translated and original-language fictional texts approaches the size or linguistic diversity of our corpus, and we hope that it will serve as a resource for additional research.
(2) METHOD
STEPS
On the basis of the metadata provided by Underwood et al. (2020) regarding the NovelTM dataset of English-language fiction, we first used a set of regular expressions such as “translated from the Swedish,” “from the [language],” “tr.
from,” “rendered into English,” etc. to identify an initial list of translated texts. Next, if an author was included in this initial list, we included all titles by that author. For example, if one volume by Leo Tolstoy had “translated from” in one of its metadata fields, we included all works by Leo Tolstoy in our set of translations. Original English-language works were identified by fuzzy matching against a large set of author names derived from Wikipedia and the Virtual International Authority File (VIAF), which consists of millions of author names derived from 68 library catalogues from around the world. Any names identified as English-language authors from this list were then removed from the translation data. We similarly used non-English-language author data to match with our translation data and reviewed all non-matching works by hand.
Information on a translation’s original language was taken from two primary sources: explicit references included in the titles of the works in Hathi (e.g., “translated from the Swedish”) and the HathiTrust extracted features metadata. These results were supplemented using fuzzy matching of author lists from Wikipedia and VIAF. The remaining missing data was manually retrieved using WorldCat and other internet sources. To identify date of publication, we used Underwood et al.’s “inferred date” (2020). Because the holdings of translations in Hathi are heavily skewed toward a rather small set of European authors and languages in the first part of the twentieth century, we subsetted our data down to the date range 1950–2008, which aligns with the period construct of “post-war” fiction used in literary studies (McGurl, 2009). Finally, we also removed all volumes where Underwood et al.’s predicted probability of being non-fiction was greater than 85% (2020).
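The first two steps described above, the regular-expression pass over the metadata and the expansion to all titles by a flagged author, can be sketched roughly as follows. This is a minimal illustration and not the code used to build the dataset: the pattern list contains only the phrases quoted in the text, and the record fields `author`, `title`, and `note` are hypothetical names chosen for the example.

```python
import re

# Patterns modeled on the phrases quoted in the text. The full list used for
# the dataset is not given in the paper, so this set is an assumption.
PATTERNS = [
    re.compile(r"\btranslated\s+from\b", re.IGNORECASE),
    re.compile(r"\btr\.\s+from\b", re.IGNORECASE),
    re.compile(r"\brendered\s+into\s+English\b", re.IGNORECASE),
    # "from the [language]": a capitalized word is a crude stand-in for a
    # language name and will also match phrases such as "from the Earth".
    re.compile(r"\b[Ff]rom\s+the\s+[A-Z][a-z]+\b"),
]


def looks_translated(text):
    """Return True if any translation phrase occurs in a metadata string."""
    return any(p.search(text) for p in PATTERNS)


def expand_by_author(records):
    """Keep every record whose author has at least one flagged record.

    `records` is a list of dicts with hypothetical keys 'author', 'title',
    and 'note' (a free-text metadata field).
    """
    flagged = {r["author"] for r in records if looks_translated(r["note"])}
    return [r for r in records if r["author"] in flagged]


records = [
    {"author": "Tolstoy, Leo", "title": "War and Peace",
     "note": "translated from the Russian"},
    {"author": "Tolstoy, Leo", "title": "Resurrection", "note": ""},
    {"author": "Melville, Herman", "title": "Moby-Dick", "note": ""},
]
candidates = expand_by_author(records)
```

A pass of this kind over-generates by design, which is why the procedure described above follows it with author-list matching and manual review.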
Given that the set of original-language works was larger than the set of translations, we also randomly downsampled each year of our original publications to match the number of translations.
We then processed the files to be extracted as word frequency data. Working within the HTRC capsule (Plale et al., 2019), we first downloaded individual page files, using the preloaded functions in the HTRC Workset Toolkit to remove running page headers and footers. For each volume, we concatenated individual page files into a single document. After tokenizing with regular expressions, we next represented each document as ten randomly selected samples of 1,000 continuous words each, drawn from the middle 60% of the document to avoid paratextual content in the front and back matter. This sampling enables us to control for effects that might arise from the different lengths of the source texts. To mitigate problems related to low OCR quality, foreign-language passages, or the presence of other non-standard characters, only samples in which at least 90% of words appeared in an English dictionary were kept. If a work did not have ten samples that met this criterion, it was removed. All of this work was completed in the Hathi capsule. These samples were then converted into bags of words, which we are able to make accessible to the scholarly community in the form of two CSV files, one for originals and one for translations, listing raw frequency counts by document for each of the words in each of the original document samples.
While the final corpus of translated texts remains skewed towards European languages, it does include a significant number of works originally published in East Asian and South Asian languages and a smaller number of works originally published in Middle Eastern and African languages. Figures 1–4 provide an overview of the dataset.
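The sampling procedure described above (ten 1,000-word windows from the middle 60% of a document, a 90% dictionary threshold, and conversion to a bag of words) might look roughly like the sketch below. Several details are assumptions made for illustration, since the text does not specify them: the tokenizer regex, the dictionary being a plain Python set, the retry limit `max_tries`, the fixed seed, and the fact that windows are allowed to overlap.

```python
import random
import re
from collections import Counter

# Assumed tokenizer: runs of letters and apostrophes, lowercased.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def sample_document(tokens, dictionary, n_samples=10, sample_len=1000,
                    middle=0.6, min_dict_share=0.9, max_tries=200, seed=0):
    """Return a Counter over n_samples windows, or None if the work fails.

    Windows are drawn from the middle `middle` share of the token list and
    kept only if at least `min_dict_share` of their tokens are in
    `dictionary`. Windows may overlap in this sketch.
    """
    rng = random.Random(seed)
    margin = (1 - middle) / 2
    lo = int(len(tokens) * margin)
    hi = int(len(tokens) * (1 - margin)) - sample_len  # last valid start
    if hi < lo:
        return None  # middle section is shorter than one window
    kept = []
    for _ in range(max_tries):
        start = rng.randint(lo, hi)
        window = tokens[start:start + sample_len]
        share = sum(t in dictionary for t in window) / sample_len
        if share >= min_dict_share:
            kept.append(window)
            if len(kept) == n_samples:
                break
    if len(kept) < n_samples:
        return None  # the work is dropped, as described in the text
    return Counter(t for window in kept for t in window)


# Toy usage with a two-word "dictionary".
toy_tokens = ["the", "cat"] * 10000
bag = sample_document(toy_tokens, {"the", "cat"})
```

Each retained work therefore contributes 10,000 tokens to its bag of words regardless of its original length, which is the length control the paragraph above refers to.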
Figure 1 Count of works by decade, originals and translations.
Figure 2 Count of translated works by decade, non-European and European.
Figure 3 Count of translations from the top 20 languages represented in the corpus.
Figure 4 Total translations by subregion (classical literature as separate category).
QUALITY CONTROL
To test the accuracy of our identification of translations in the NovelTM dataset, we created a random sample of 100 works identified as translations and 100 works identified as originals from our data. We then manually checked each title to see whether our classification had been correct. We found that 99 were correctly labeled for an estimated precision of .99. We did not evaluate the accuracy of recall (i.e., translations in Hathi that we missed). In addition to its impracticality, given the size of the original dataset, the results would simply have told us whether our sample was representative of translations in the Hathi corpus. For the comparative work we envision, the key question is whether we have a randomly sampled set of translations that mirrors our English original corpus, not whether it accurately represents the distribution of texts in Hathi.
LIMITATIONS
One key limitation is the date range of our data (1950–2000). Expanding this date range, however, leads to an overwhelming predominance of a few European languages, which runs counter to our goal of having a diverse set of source languages represented. As Figure 2 reveals, even after 1950, translations in the Hathi Library skew European. Whether this is true of the English-language market for fiction more generally or is an artifact of Hathi we leave for future work. We note that the period 1950–2000 is considered a distinct period within literary history and thus our data aligns with this historical construct (McGurl, 2009).
An additional potential limitation is the presence of works in the dataset that were originally published prior to 1950 but which were translated or re-translated at a later date. On the basis of the (incomplete) information that we have on author birth and death dates, we estimate that such works constitute between 15% and 20% of the total (see Python notebook in the repository). Finally, our data is limited due to intellectual property restrictions that only allow us to export word frequencies and not the full text from Hathi. We provide all Hathi IDs so that researchers can recreate our data inside the Hathi capsule system.
(3) DATASET DESCRIPTION
OBJECT NAME
The dataset consists of three CSV files: Translation_samples.csv, Original_samples.csv, and TransComp_metadata.csv. We have also included a Python notebook addressing the question of original publication dates: 1950_boundary_question.ipynb
FORMAT NAMES AND VERSIONS
CSV, ipynb
CREATION DATES
2021-04-28 to 2022-04-18
DATASET CREATORS
Allie Blank, Douglas Knox, Stephen Pentecost
LANGUAGE
English
LICENSE
CC0
REPOSITORY NAME
Dataverse
PUBLICATION DATE
2022-10-07
(4) REUSE POTENTIAL
Two primary areas of research will likely benefit from access to this data. Recent scholarship in the sociology of translation (Heilbron, 1999; Bachleitner and Wolf, 2004; Sapiro, 2016; 2020) has helped reveal the structural asymmetries in the global flow of translations, often adopting a core/semi-periphery/periphery model to clarify the dominant role played by a small subset of European languages in this regard. To date, however, there has been virtually no effort to link these asymmetries to differences in the linguistic, stylistic, or thematic features of translations (Piper and Erlin, 2022).
To what extent, in other words, do translations from “peripheral” languages or language regions exhibit common features that might reinforce or challenge existing cultural biases or reflect the pressures imposed on “peripheral” authors in what Pascale Casanova (2004) has referred to as the “world republic of letters”? We believe that our dataset will greatly facilitate the investigation of such questions. While having only bags of word frequencies places some limitations on what is possible in this regard, prior research has generated important cultural insights using such word distribution approaches (Erlin, 2017; Jockers and Mimno, 2013; Piper, 2016; Underwood, 2016). In addition, the CSV files include information on the page count for each work sampled as well as the mean sentence length for the samples, the latter of which we calculated in the Hathi capsule. Finally, we include metadata so that researchers can work on the full texts within the Hathi data capsule system.
With regard to translation studies more broadly, we believe that this historically matched collection of translations and originals can shed new light on questions of “translationese” (i.e., translation universals). Corpus and computational linguists have long been identifying ways in which translation can be thought of as a distinct linguistic practice that consists of quasi-universal behaviors conditioned by the nature of moving between languages and the cognitive demands of doing so (Volansky, Ordan, and Wintner, 2015). Only a few studies, however, have focused on the specific qualities of literary translations, and certainly not at the scale made possible by this dataset. We think the collection is particularly well suited to investigations into the question of whether translations can be understood as a literary genre (Piper and Erlin, 2022).
While the concept of genre is famously multivalent in literary studies (Cohen, 2017, 86), we use the term in the most elementary sense, as a set of works that exhibit “shared features” (Reichert, 1978, 57), translations in this case, and that can be algorithmically classified on the basis of its relational distinctiveness vis-à-vis non-translated works as well as the ways it coheres as a category over time.
ADDITIONAL FILE
The additional file for this article can be found as follows:
• Supplementary Material. About the TRANSCOMP dataset. DOI: https://doi.org/10.5334/johd.94.s1
ACKNOWLEDGEMENTS
We are grateful to HathiTrust for the permission to release this data.
COMPETING INTERESTS
The authors have no competing interests to declare.
Published: 26 December 2022
COPYRIGHT: © 2022 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.
Journal of Open Humanities Data is a peer-reviewed open access journal published by Ubiquity Press.
AUTHOR CONTRIBUTIONS
• Matt Erlin: conceptualization, methodology, writing, visualization
• Andrew Piper: conceptualization, methodology, writing, visualization
• Douglas Knox: conceptualization, methodology, writing, data curation, visualization
• Stephen Pentecost: conceptualization, methodology, data curation
• Allie Blank: conceptualization, methodology, data curation
AUTHOR AFFILIATIONS
Matt Erlin, orcid.org/0000-0002-0536-7499, Germanic Languages and Literatures, Washington University, St. Louis, US
Andrew Piper, orcid.org/0000-0001-9663-5999, Languages, Literatures, and Cultures, McGill University, Montreal, Canada
Douglas Knox, orcid.org/0000-0002-7168-7271, Humanities Digital Workshop, Washington University, St. Louis, US
Stephen Pentecost, orcid.org/0000-0002-2093-6151, Humanities Digital Workshop, Washington University, St. Louis, US
Allie Blank, Humanities Digital Workshop, Washington University, St. Louis, US
REFERENCES
Bachleitner, N., & Wolf, M. (2004). Auf dem Weg zu einer Soziologie der literarischen Übersetzung im deutschsprachigen Raum. Internationales Archiv für Sozialgeschichte der deutschen Literatur, 29(2), 1–25. DOI: https://doi.org/10.1515/IASL.2004.2.1
Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In Baker, M., Francis, G., Tognini-Bonelli, E. (Eds.), Text and technology: In honour of John Sinclair (pp. 233–250). Amsterdam/Philadelphia: Benjamins. DOI: https://doi.org/10.1075/z.64.15bak
Casanova, P. (2004). The world republic of letters. Cambridge: Harvard University Press.
Cohen, R. (2017). Genre theory and historical change: Theoretical essays of Ralph Cohen. Charlottesville: University of Virginia Press.
Erlin, M. (2017). Topic modeling, epistemology, and the English and German novel. Journal of Cultural Analytics, 2(2), 11070. DOI: https://doi.org/10.22148/16.014
Heilbron, J. (1999). Towards a sociology of translation: Book translations as a cultural world-system. European Journal of Social Theory, 2(4), 429–444. DOI: https://doi.org/10.1177/136843199002004002
Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature.
Poetics, 41(6), 750–769. DOI: https://doi.org/10.1016/j.poetic.2013.08.005
McGurl, M. (2009). The program era: Postwar fiction and the rise of creative writing. Cambridge: Harvard University Press. DOI: https://doi.org/10.2307/j.ctvjsf59f
Piper, A. (2016). Fictionality. Journal of Cultural Analytics, 2(2). DOI: https://doi.org/10.22148/16.011
Piper, A., & Erlin, M. (2022). The predictability of literary translation. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pp. 155–160.
Plale, B., Dickson, E., Kouper, I., Liyanage, S. H., Ma, Y., McDonald, R. H., Walsh, J. A., & Withana, S. (2019). Safe open science for restricted data. Data and Information Management, 3(1). DOI: https://doi.org/10.2478/dim-2019-0005
Reichert, J. (1978). More than kin and less than kind: The limits of genre theory. In J. P. Strelka (Ed.), Theories of Literary Genre, pp. 57–79. University Park: Pennsylvania State University Press.
Sapiro, G. (2016). How do literary works cross borders (or not)?: A sociological approach to world literature. Journal of World Literature, 1(1), 81–96. DOI: https://doi.org/10.1163/24056480-00101009
Sapiro, G. (2020). The transnational literary field between (inter)-nationalism and cosmopolitanism. Journal of World Literature, 5(4), 481–504. DOI: https://doi.org/10.1163/24056480-00504002
Toury, G. (1980). In search of a theory of translation. Tel Aviv: Porter Institute for Poetics and Semiotics, Tel Aviv University.
Underwood, T., Kimutis, P., & Witte, J. (2020). NovelTM datasets for English-language fiction, 1700–2009. Journal of Cultural Analytics, 5(2), 13147. DOI: https://doi.org/10.22148/001c.13147
Underwood, T. (2016). The Life Cycles of Genres. Journal of Cultural Analytics, 2(2). DOI: https://doi.org/10.22148/16.005
Volansky, V., Ordan, N., & Wintner, S. (2015). On the features of translationese. Digital Scholarship in the Humanities, 30(1), 98–118.
DOI: https://doi.org/10.1093/llc/fqt031