Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM)


DATA PAPER

KEYWORDS:

CORRESPONDING AUTHOR:
Reuben J. Pitts

Faculty of Arts, KU Leuven, 
Leuven, BE

reuben.pitts@kuleuven.be

corpus linguistics; language 
contact; linguistic area; Italic; 
epigraphy

TO CITE THIS ARTICLE: 
Pitts, R. J. (2022). Corpus of 
the Epigraphy of the Italian 
Peninsula in the 1st Millennium 
BCE (CEIPoM). Journal of Open 
Humanities Data, 8: 1, pp. 1–4. 
DOI: https://doi.org/10.5334/
johd.65

Corpus of the Epigraphy of 
the Italian Peninsula in the 
1st Millennium BCE (CEIPoM)

REUBEN J. PITTS 

ABSTRACT
The Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM) 
is a linguistic database which covers the Oscan, Umbrian, Old Sabellic, Messapic and 
Venetic languages, as well as epigraphic Latin up to 100 BCE. The database is hosted 
on GitHub and Zenodo, and provides manually annotated linguistic information on 
all levels of language structure, ranging from phonology to syntax. In providing a 
high-resolution digital dataset for language varieties that have until now been largely 
restricted to printed reference works, this corpus opens up new avenues for research 
into this unique ancient linguistic area.

mailto:reuben.pitts@kuleuven.be
https://doi.org/10.5334/johd.65
https://doi.org/10.5334/johd.65
https://orcid.org/0000-0002-3960-1490


2Pitts 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.65

(1) OVERVIEW 
REPOSITORY LOCATION

https://reubenjpitts.github.io/Corpus-of-the-Epigraphy-of-the-Italian-Peninsula-in-the-1st-Millennium-BCE/

Current version (1.2): https://zenodo.org/record/5602978#.YXkw8Z5Bw2w

DOI: https://doi.org/10.5281/zenodo.4759134

CONTEXT

This database was created in the context of a PhD project on language contact in Ancient 
Italy, entitled The interplay between language contact and language change in a fragmentary 
linguistic area: the Italic peninsula in the first millennium BCE.

https://www.kuleuven.be/onderzoek/portaal/#/projecten/3H190594?lang=en&hl=en

(2) METHODOLOGY
STEPS

Most of the data was entered manually by the author, based on standard reference works for the 
languages in question. In some cases, basic forms of automation were used to create an initial 
dataset which was then corrected. For instance, an initial morphological analysis for Venetic was 
created by linking the attested tokens to a digitised version of Lejeune’s (1974: 315–341) Venetic 
word list, and the result was then systematically checked and corrected by the author. The 
method used for any given field is described in the accompanying documentation on GitHub.

A few fields were generated automatically using Python modules. These include, for instance, 
the field “Token_clean”, which uses the unidecode package to generate a version of the token 
stripped of special characters, intended for ease of searching. Once again, the documentation 
on GitHub describes in detail which fields are automatic and how they are generated.

SAMPLING STRATEGY

The aim of the database is to include all texts in Oscan, Umbrian, Old Sabellic, Messapic and 
Venetic, as well as epigraphic Latin texts before 100 BCE. The corpus does not include Etruscan, 
due to the additional complexities of incorporating a non-Indo-European language into the 
structure of the database. Within the languages encompassed by the database, however, the 
primary aim is exhaustivity, and the corpus currently contains over 36,000 tokens.

QUALITY CONTROL

Data was entered manually and checked multiple times by the author.

(3) DATASET DESCRIPTION 
OBJECT NAME

Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM)

FORMAT NAMES AND VERSIONS

CSV

CREATION DATES

2017–2021

DATASET CREATORS

Reuben J. Pitts

LANGUAGE

Metadata are provided in English.

https://doi.org/10.5334/johd.65
https://reubenjpitts.github.io/Corpus-of-the-Epigraphy-of-the-Italian-Peninsula-in-the-1st-Millennium-BCE/ 
https://zenodo.org/record/5602978#.YXkw8Z5Bw2w 
https://doi.org/10.5281/zenodo.4759134
https://doi.org/ 10.5281/zenodo.4759134 
https://www.kuleuven.be/onderzoek/portaal/#/projecten/3H190594?lang=en&hl=en 


3Pitts 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.65

LICENSE

Creative Commons Attribution-ShareAlike 4.0 International License

REPOSITORY NAME

A continually updated version of the corpus is hosted on GitHub. Each old version of the corpus is 
permanently stored at Zenodo. In traditional publications CEIPoM should be cited as this paper, 
where relevant also specifying the version of the corpus used to achieve any given research result.

PUBLICATION DATE

2021–05–13

(4) REUSE POTENTIAL
This database has a wide range of applications in linguistic research on the languages of ancient 
Italy. Currently, such research is hampered by the absence of searchable digital information, as 
the description of these languages is mostly spread over disparate written reference works (e.g. 
Bakkum, 2009; Lejeune, 1974; Santoro, 1982; Untermann, 2000; Wachter, 1987). This database 
aims to address that research need head-on.

The salience of digital and corpus-based approaches to ancient languages has increased in 
recent years (e.g. Adamik, 2016; Eckhoff et al., 2018; Mambrini et al., 2020; Qiu et al., 2018), and 
these methods have proven their effectiveness even in relatively poorly attested languages. It 
goes without saying that a digital dataset is more easily and more efficiently queried than a 
written corpus, facilitating research results that would otherwise be difficult or impossible to 
achieve. Moreover, the use of a digital dataset means any research results thus obtained can be 
replicated by other researchers, conferring a key advantage in terms of academic transparency. 
These advantages hold true in fragmentary languages such as Venetic or Messapic as much as 
in large corpus languages such as Classical Latin or Greek. 

Since annotation is provided on multiple levels of description, this corpus can serve as a tool for 
linguistic research of various kinds, including research on the syntax, word order, morphology, 
lexicon, semantics, phonology and orthography of the ancient languages in question. To give 
an example of a simple linguistic query in CEIPoM, if one is researching the usage of syntactic 
objects in these languages, one can simply use spreadsheet software to search for instances 
of OBJ in the field Relation, and thus obtain a list of all tokens in the corpus with a syntactic 
analysis containing this value. The GitHub documentation offers considerable detail on how 
each of these features are annotated, and how the different levels of linguistic description can 
be related to one another to formulate more complex queries.

In addition to the strictly linguistic annotation, chronological and geographical information 
(including longitude and latitude) is integrated into the data throughout, allowing the evolution 
and distribution of these linguistic features to be tracked through time and space. Although the 
focus of the corpus does not lie on epigraphical metadata, the texts in the corpus are linked to 
their ID in the Trismegistos database (Depauw & Gheldof, 2014), which means they can easily 
be linked to further metadata and bibliography, as well as to other epigraphic databases (such 
as EDR or EDCS). In addition to its linguistic uses, therefore, the database also holds promise for 
related fields such as history, epigraphy and onomastics.

The corpus focuses strongly on ensuring that the information provided for the languages 
of ancient Italy is intercomparable. This makes it particularly well adapted for the study of 
convergence, language contact and other cross-linguistic typological trends in ancient 
Italy. This region has sometimes been described as a linguistic area (Zair, 2016: 311–312), a 
geographic region where prolonged language contact is responsible for grammatical similarities 
across distantly related languages (Friedman & Joseph, 2017: 55). Since the data is (with a few 
clearly signalled exceptions) annotated in the same way for all six languages currently in the 
corpus, this makes it possible to track the evolving differences and similarities between these 
languages, and to test hypotheses on contact-based change in this region.

The main current limitation of the database lies in the fact that, inevitably, its data is not 
fully complete. In particular, the emphasis until now has been on providing a single plausible 


4Pitts 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.65

TO CITE THIS ARTICLE: 
Pitts, R. J. (2022). Corpus of 
the Epigraphy of the Italian 
Peninsula in the 1st Millennium 
BCE (CEIPoM). Journal of Open 
Humanities Data, 8: 1, pp. 1–4. 
DOI: https://doi.org/10.5334/
johd.65

Published: 03 January 2022

COPYRIGHT: 
© 2022 The Author(s). This is an 
open-access article distributed 
under the terms of the Creative 
Commons Attribution 4.0 
International License (CC-BY 
4.0), which permits unrestricted 
use, distribution, and 
reproduction in any medium, 
provided the original author 
and source are credited. See 
http://creativecommons.org/
licenses/by/4.0/.

Journal of Open Humanities 
Data is a peer-reviewed open 
access journal published by 
Ubiquity Press.

linguistic analysis for each token, even when the scholarly literature offers multiple possible 
interpretations. Since this is frequently the case in disputed fragmentary texts, this may cause 
queries to miss potentially relevant and interesting forms. However, since the state of the data 
in each field is described in detail in the documentation on GitHub, researchers can take these 
limitations into account and adjust their use of this research tool in line with their research 
aims. Future updates to the corpus will continue to improve and fine-tune the quality of the 
data offered, as well as expanding the coverage of alternative analyses for individual tokens.

ACKNOWLEDGEMENTS
I would like to thank Toon Van Hal, Freek Van de Velde, Mark Depauw and Tom Gheldof for their 
help and advice in making this corpus. 

FUNDING INFORMATION 
This research was carried out with a grant from the Fonds Wetenschappelijk Onderzoek (FWO) – 
Vlaanderen (Research Foundation – Flanders) (grant no. 1150720N).

COMPETING INTERESTS
The author has no competing interests to declare.

AUTHOR AFFILIATION 
Reuben J. Pitts  orcid.org/0000-0002-3960-1490 
Faculty of Arts, KU Leuven, Leuven, BE

REFERENCES
Adamik, B. (2016). Computerized Historical Linguistic Database of the Latin Inscriptions of the Imperial 

Age: Search and Charting Modules. In Á. Szabó (Ed.), From Polites to Magos: Studia György Németh 

Sexagenario Dedicata (pp. 13–27). Budapest: Debrecen.

Bakkum, G. C. L. M. (2009). The Latin Dialect of the Ager Faliscus: 150 Years of Scholarship. Amsterdam: 
Amsterdam University Press.

Depauw, M., & Gheldof, T. (2014). Trismegistos: An Interdisciplinary Platform for Ancient World Texts and 
Related Information. In P. Goodale & N. Houssos (Eds.), Theory and Practice of Digital Libraries—TPDL 

2013 Selected Workshops (pp. 40–52). Cham: Springer. 

Eckhoff, H., Bech, K., Bouma, G., Eide, K., Haug, D., Haugen, O. E., & Jøhndal, M. (2018). The PROIEL 
Treebank Family: A Standard for Early Attestations of Indo-European Languages. Language Resources 

and Evaluation, 52(1), 29–65. DOI: https://doi.org/10.1007/s10579-017-9388-5
Friedman, V. A., & Joseph, B. D. (2017). Reassessing Sprachbunds: A View from the Balkans. In R. Hickey 

(Ed.), The Cambridge Handbook of Areal Linguistics (pp. 55–87). Cambridge, UK: Cambridge University 

Press. DOI: https://doi.org/10.1017/9781107279872.005
Lejeune, M. (1974). Manuel de la langue vénète. Heidelberg: Carl Winter Universitätsverlag.
Mambrini, F., Cecchini, F. M., Franzini, G., Litta, E., Passarotti, M. C., & Ruffolo, P. (2020). LiLa: Linking 

Latin: Risorse linguistiche per il latino nel Semantic Web. Umanistica Digitale, 8, 63–78. DOI: https://
doi.org/10.6092/issn.2532-8816/9975 

Qiu, F., Stifter, D., Bauer, B., Lash, E., & Ji, T. (2018). Chronologicon Hibernicum: A Probabilistic Chronological 
Framework for Dating Early Irish Language Developments and Literature. In M. Ioannides, E. Fink, R. 

Brumana, P. Patias, A. Doulamis, J. Martins, & M. Wallace (Eds.), Digital Heritage. Progress in Cultural 

Heritage: Documentation, Preservation, and Protection (pp. 731–740). Cham: Springer International 

Publishing. DOI: https://doi.org/10.1007/978-3-030-01762-0_65
Santoro, C. (1982). Nuovi studi messapici. Galatina: Congedo editore.
Untermann, J. (2000). Wörterbuch des Oskisch-Umbrischen. Heidelberg: Winter.
Wachter, R. (1987). Altlateinische Inschriften: Sprachliche und epigraphische Untersuchungen zu den 

Dokumenten bis etwa 150 v. Chr. Lausanne: Lang.

Zair, N. (2016). Vowel Weakening in the Sabellic Languages as Language Contact. Indogermanische 
Forschungen, 121(1), 295–315. DOI: https://doi.org/10.1515/if-2016-0016

https://doi.org/10.5334/johd.65
https://doi.org/10.5334/johd.65
https://doi.org/10.5334/johd.65
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
https://orcid.org/0000-0002-3960-1490
https://doi.org/10.5117/9789056295622 
https://doi.org/10.1007/s10579-017-9388-5 
https://doi.org/10.1017/9781107279872.005 
https://doi.org/10.6092/issn.2532-8816/9975  
https://doi.org/10.6092/issn.2532-8816/9975  
https://doi.org/10.1007/978-3-030-01762-0_65 
https://doi.org/10.1515/if-2016-0016