Bearing a Bag-of-Tales: An Open Corpus of Annotated Folktales for Reproducible Research


RESEARCH PAPER

CORRESPONDING AUTHOR:

Joshua Hagedorn

Independent researcher, 
Grand Rapids, MI, USA

josh.hagedorn@gmail.com

KEYWORDS:
annotated folktales; motifs; reproducible research; machine learning; version control

TO CITE THIS ARTICLE:
Hagedorn, J., & Darányi, S. (2022). Bearing a Bag-of-Tales: An Open Corpus of Annotated 
Folktales for Reproducible Research. Journal of Open Humanities Data, 8: 16, pp. 1–10. 
DOI: https://doi.org/10.5334/johd.78

JOSHUA HAGEDORN 

SÁNDOR DARÁNYI 

ABSTRACT
Motifs in folktales and myths have been identified and articulated by scholars, and 
the computational identification and discovery of such motifs is an area of ongoing 
research. Achieving this goal means meeting scientific requirements (that methods be 
comparable and replicable) and requirements for collaboration (that multi-disciplinary 
teams can reliably access data). To support those requirements, access to consistent 
reference datasets is needed. Unfortunately, these datasets are not openly available in 
a format that supports their use in data science. Here we report work in progress toward 
this goal, having converted the Ashliman Folktexts collection into a public dataset of 
annotated tale texts. The data can be accessed at doi.org/10.5281/zenodo.6575263.

*Author affiliations can be found in the back matter of this article



2Hagedorn and Darányi 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.78

(1) CONTEXT AND MOTIVATION
Ever since the concept of a motif was introduced some 200 years ago, the quest to identify 
elements of content above the word level has been a standard preoccupation in literary science 
(Frenzel, 1992; Seigneuret, 1988). There, a motif stands for a recurrent theme, whereas in 
musicology a motif is considered “the smallest structural unit possessing thematic identity” 
(White, 1976: 26–27). In the field of folktale research, Stith Thompson defined motifs as “the 
smallest element in a tale having a power to persist in tradition” (Thompson, 1977: 415).

The overlap between these definitions suggests that such higher-order content units exist as 
narrative building blocks, yet their automatic extraction by computational means has eluded 
folk narrative studies so far (Darányi & Lendvai, 2010). As we will argue below, folk narrative 
studies are not yet up to the task of a scalable pattern hunt. One reason for our scepticism is 
that in Thompson’s Motif Index of Folk Literature (Thompson, 1951) alone over 45,000 motifs 
are listed on a global scale, but many more regional motif indexes exist whose material would 
doubtlessly inflate that number. If we want to apply machine learning for motif identification 
or discovery, first we need suitable datasets which enable research teams to replicate each 
other’s results. Below we report work in progress in this direction, but also guard against any 
hubris in our promises regarding motif detection, with its first analytical results to be reported 
elsewhere. Ideally we would like to see an emerging motif annotation system that crowd-
sourced expert folklorists could use, similar to Prodigy.1

The structure of this paper is as follows. In Section 1, we offer our motivation in context, 
bring examples of related research with converging trends, publicly available databases and 
datasets, and introduce the Ashliman Folktexts collection. Section 2 focuses on methodology, 
progressing from our motivation to support reproducible research in computational folkloristics 
toward dataset creation and repository access, including steps of data harvesting and cleaning, 
concluding with current limitations of use. Section 3 discusses features of the result, the 
Annotated Folktales (aft) corpus, with descriptive statistics. In Section 4, we briefly outline 
directions of future collection development to support folktale research.

(1.1) RELATED RESEARCH 

As our pilot was not concerned with the structural analysis of folk narratives, this overview 
omits significant research results, such as those concerning the automatic detection of Proppian 
functions (Finlayson, 2016), or their use in ontology building (Declerck et al., 2017). Instead, our 
focus will be on precursory efforts to support motif detection using two standard tools, the 
Thompson Motif Index (TMI) (Thompson, 1951), and the Aarne-Thompson-Uther tale typology 
(ATU) (Uther, 2004). Important extensions to these, and to our current work, exist by Declerck 
and colleagues (Declerck & Schäfer, 2017; Declerck, Kostova, & Schäfer, 2017). As motifs and 
motifemes abound in myths as well, we admit the latter into our scope under the reasoning 
that “myth is a traditional tale with secondary, partial reference to something of collective 
importance” (Burkert, 1982: 23), considering the debate about the difference between myths 
and folktales as open (e.g. Kirk, 1970: 31–41; Burkert, 1982: 1–5).

(1.1.1) Converging trends

We consider finding characteristic patterns of semantic content by automatic means an open 
research problem. The relevant research question is this: if we were going to extract features 
from the descriptive text of the TMI, what kind of features could we build, and could these 
features also be identified in tale corpora?

The convergence of two major trends in computational folkloristics (Abello, Broadwell, & 
Tangherlini, 2012) will likely shape the results of the next decade. The first is a focus on the 
evolutionary aspect of motif and/or tale type distributions, either with regard to certain tale 
types (Bortolini et al., 2017; Karsdorp, 2016; Karsdorp & van den Bosch, 2013; da Silva & Tehrani, 
2016; Tehrani, 2013), or to the geographical distribution of globally occurring narrative motifs 
(Thuillard, d’Huy, Berezkin, & Le Quellec, 2018), even inferring the presence of lost narratives 
(Kestemont et al., 2022). A genetic metaphor seems to inform some approaches, perhaps inspired 

1 https://spacy.io/universe/project/prodigy (last accessed: 12 May 2022).

by the modelling capacities inherent in Dawkins’ meme theory (Dawkins, 1976); these compare 
tale types as motif sequences to ‘narrative DNA’ (Darányi, Wittek, & Forró, 2012; Meder et al., 
2016; Murphy, 2015; Ofek, Darányi, & Rokach, 2013), or look at the evolution of narrative/story 
networks as a quasi-biological process based on the mutation and recombination of narrative 
elements (Karsdorp, 2016; Karsdorp & Fonteyn, 2019), extended even to the framework of 
cultural evolution via population genetics (Ross, Greenhill & Atkinson, 2013; Ross & Atkinson, 
2015). Such methods resemble bioinformatic applications such as network motif identification 
(Qin & Gao, 2012), a problem analogous to ours. The context is that of evolving semantics, 
an emerging research area both in lexical semantic change (Armaselu et al., 2022) and digital 
preservation (Kontopoulos et al., 2016a; Kontopoulos et al., 2016b). 

The second trend is to use probabilistic and/or multivariate statistical methods for the analysis 
of binary versus non-binary matrices of events over cases, where events can be index terms, 
motifs, motif sequences, and so on, and cases as an umbrella term stand for documents in 
general, such as abstracts describing narratives (Berezkin, 2015), or tale types (Uther, 2004), 
ultimately constituting text corpora or databases. On such collections, one can then experiment 
for instance with sub-corpus topic modelling (STM) by Latent Dirichlet Allocation (LDA) as a 
means of supervised passage exploration in partly unknown corpora (Tangherlini & Leonard, 
2013).

The little one can say about the plethora of methods listed is that, regardless of the corpora, 
their regionality, and the analytical units whose distributions characterise the body of texts in 
question, they express similarity between items in terms of distance, with more similar items 
forming dense groups as the outcome of mass comparison. Cluster analysis (Thuillard et al., 
2018), Principal Component Analysis (PCA) (Berezkin, 2015), Labelled Latent Dirichlet Allocation 
(L-LDA) (Karsdorp & van den Bosch, 2013), Support Vector Machines (SVM) (Nguyen et al., 2012; 
Meder et al., 2016), or deep learning by Recurrent Neural Networks (RNN) (Lô, de Boer, & van Aart, 
2020), however, share the same nature of being static snapshots of collections. Hence there 
is a contradiction in principle in addressing text evolution, a dynamic phenomenon, through 
tools tailored to static measurements: the notion asks for vector fields instead of vector spaces 
(Darányi et al., 2016). The most promising recent direction seems to be the combination of 
word embeddings – increasingly condensed and geometrically located types of word meaning 
(Le & Mikolov, 2014; Mikolov et al., 2013; Reimers & Gurevych, 2019) – with deep learning: 
Pompeu (2019) reports successful application of a Hierarchical Attention Network (HAN) for the 
prediction of ATU categories on a multilingual database of folk texts.

As computing results for both trends discussed above requires datasets, the next section 
briefly addresses their availability.

(1.1.2) Databases and datasets

Progress in computational folkloristics requires that results be replicable. To this end we 
sought open access datasets of ATU-annotated tales in English, but could not identify suitable 
candidates on GitHub,2 Kaggle3 or Google,4 although websites with separate tale collections 
are available.5 Neither could we find the big folklore data anticipated by Tangherlini & Leonard 
(2013) and Tangherlini (2016). Based on Meder (2010) and Ilyefalvi (2018), the largest databases 
seem to be the Dutch Folktale Database of the Meertens Institute, and the Danish Folklore 
Archive’s Tang Kristensen Collection, the former on the order of 50,000 texts, the 
latter at around 34,000 texts (Tangherlini & Leonard, 2013). Other important databases exist 
(Berezkin, 2017), but are either beyond public access, or in their original languages only, or 
both. The notable exception is the Meertens Institute whose texts are in Dutch and Frisian plus 
a number of local dialects, but can be read in English translation as well. 

Other researchers who have shared their data as supporting material for their articles include 
for instance Bortolini et al. (2017), da Silva & Tehrani (2016), Tehrani (2013), and Tehrani, 

2 https://github.com/awesomedata/awesome-public-datasets (last accessed: 4 May 2022).

3 https://www.kaggle.com/datasets (last accessed: 4 May 2022). 

4 https://datasetsearch.research.google.com/ (last accessed: 4 May 2022).

5 https://fairytalez.com/authors-and-collections/ (last accessed: 1 June 2022).

Nguyen, & Roos (2016). Declerck et al. (2017) also report that a large amount of ATU data 
has recently been made available online by the Multilingual Folk Tale Database (MFTD),6 which 
also offers annotation facilities for tales in multilingual versions. We found only a single recent 
study (Lô, de Boer, & van Aart, 2020) which published a corresponding tale corpus to promote 
reproducibility, albeit without ATU type labels.7

Among the ATU-annotated tale collections publicly available on the internet, the most 
promising candidate was Ashliman’s Folktexts collection. The process of the conversion of this 
collection to the desired format will be described below. 

(1.1.3) The Ashliman Folktexts collection

The Folktexts site8 has been populated and maintained since 1996 by D.L. Ashliman, who kindly 
agreed to donate his collection to the interested research communities.9 While other sites may 
sport a more lavish design, this one is the largest and most extensively annotated. It serves as 
a respected scholarly resource for folklorists, with a large and curated set of tale texts. Whereas 
our dataset contains only tales from pages with clear ATU annotations (214 pages), the total 
content of the website is much larger (370 pages), containing various creation myths, stories 
of changelings, Faust legends, and Christiansen’s tale types (Christiansen, 1992). However, it is 
the ATU annotation that makes this corpus particularly valuable as a potential training dataset 
for classification methods.

Despite the richness of this resource, it has not frequently been used in folklore research as a 
larger corpus. Some previous studies reference it, yet these often only include a smaller portion 
of the entire set of texts (Reiter, Frank, & Hellwig, 2014). To the best of our knowledge, none of 
the published studies provided open access to the data. 

(2) METHOD
(2.1) SUPPORT FOR REPRODUCIBILITY IN FOLKLORE STUDIES

Reproducibility is a defining characteristic of science, yet a wide gamut of scientific fields has 
been plagued by a ‘replicability crisis’: a situation where trusted research findings have been 
impossible to reproduce (Goodman, Fanelli, & Ioannidis, 2016; Pasquier et al., 2017). While the 
problem has come to the fore in the health and social sciences, it has been acknowledged in 
disciplines as diverse as archaeology (Marwick, 2017), public health (Harris et al., 2018), biology 
(Kühne & Liehr, 2009), and economics (McCullough, 2009).

Reproducible research entails that study results be accompanied by (Gandrud, 2015):

1. a detailed description of the methods used to obtain and operate on the data;
2. the full dataset(s) used in the study;
3. the full code used to transform the data and compute the results.

(2.1.1) Guiding principles

The following features guided our selection of tools and format for the code and data:

•	 Open data: In order to use tale data consistently, it must be made freely and openly 
available to anyone. The dataset is therefore distributed under a Creative Commons 
Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).10

•	 Extensible data: The dataset can be added to or modified, in order to develop a more 
complete repository of tales. This can be done by submitting pull requests to the project’s 
GitHub repository.

•	 Open code: Any user is allowed to view and run the code that produces the dataset, 
as well as downstream analyses which use the dataset. This allows for inspection, 

6 https://www.mftd.org (last accessed: 4 May 2022).

7 https://github.com/GossaLo/afr-neural-folktales/ (last accessed: 4 May 2022).

8 http://www.pitt.edu/~dash/folktexts.html (last accessed: 1 June 2022).

9 Personal communication per email (11 February 2021).

10 https://creativecommons.org/licenses/by-sa/4.0/ (last accessed: 4 May 2022).

refinement, and reasoning about the effects of transformation and statistical modelling 
on the data.

•	 Common form: We have chosen to use the “tidy” dataframe as the structure of the 
dataset, in which (a) each variable forms a column, (b) each observation forms a row, and 
(c) a single type of observational unit forms the dataframe (Wickham, 2014).

•	 Common tools: The data must also be structured in a way that allows for ease of use with 
the standard tools of the trade of data science, such as R or Python. 

•	 Modifiable form: The structure must allow for reshaping the data into sparse matrices, 
nested structures, and graph-based structures as dictated by the needs of a given text 
analysis, while starting from a common source dataset (that is, the aft).
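
The "modifiable form" principle can be illustrated with a minimal Python sketch that reshapes a tidy table into a sparse term-count structure, a nested structure, and a graph-style edge list. The three sample rows, their titles and texts are hypothetical placeholders, and the helper names (`to_sparse_counts`, `to_nested`, `to_edges`) are introduced here for illustration only; they are not part of the project's published code.

```python
from collections import Counter, defaultdict

# Hypothetical miniature of the tidy table: one row per tale, one column per variable.
tales = [
    {"atu_id": "910B", "tale_title": "Tale A", "text": "a man once left home"},
    {"atu_id": "910B", "tale_title": "Tale B", "text": "there was once a raja"},
    {"atu_id": "1430", "tale_title": "Tale C", "text": "oh mother my buttermilk"},
]


def to_sparse_counts(rows):
    """Sparse tale-by-term matrix stored as {tale_title: {term: count}}."""
    return {
        row["tale_title"]: dict(Counter(row["text"].lower().split()))
        for row in rows
    }


def to_nested(rows):
    """Nested structure: tale titles grouped under their ATU type."""
    nested = defaultdict(list)
    for row in rows:
        nested[row["atu_id"]].append(row["tale_title"])
    return dict(nested)


def to_edges(rows):
    """Graph-style edge list linking each ATU type to its tales."""
    return [(row["atu_id"], row["tale_title"]) for row in rows]


sparse = to_sparse_counts(tales)
nested = to_nested(tales)
edges = to_edges(tales)
print(nested)
```

Because every reshaping starts from the same list of rows, analyses that need different structures can still be traced back to one common source table.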

(2.1.2) Accessing and growing the corpus

Snapshot versions of the aft corpus will be cached on Zenodo with development and 
collaboration ongoing in the trilogy GitHub repository, where a vignette provides information 
on how to access, use, and augment the dataset.11 Although sustaining the curation of the 
result over the long term will require academic resources, a logical next step would be to create 
temporary merger options with other multilingual tale collections, such as the MFTD, or the ones 
analyzed by Tehrani (2013) or Karsdorp and Fonteyn (2019), for analytical studies using for instance 
multilingual word embeddings.

The open-source Git functionality allows motifs, tale types and annotated tales to be added 
over time, and for the corpus to serve as a communal resource. We welcome inquiries and 
suggestions about how best to manage this resource as a “commons” (Vollan & Ostrom, 2010).

(2.2) DATA HARVESTING AND CLEANING

(2.2.1) Steps 

Web-scraping of the Folktexts site12 was completed using the R statistical programming 
language. The following high-level summary is provided to allow for an understanding of the 
methods used and their limitations:

1. Obtain URLs and associated label text for all ‘child’ pages of the main website to create a 
dataframe of page names and URLs, removing links to external websites.

2. Retain all URLs with the pattern “type…”, which denote pages containing tales which 
belong to an ATU type, and recode links which do not follow this form, such as the page 
for Animal Brides and Animal Bridegrooms which was recoded as belonging to ATU type 
402.

3. Extract the ATU type ID from the URL for each page, resulting in a dataframe listing 214 
webpages, each associated with a tale type and containing the page name, page URL, 
and associated ATU ID. 

4. Loop through each webpage identified in the dataframe above and extract the text, using 
the following steps: (a) extract HTML nodes from the page, creating a dataframe using 
the text, name and attribute elements of the nodes; (b) remove superfluous text other 
than tale texts, titles, and other associated metadata (e.g. source documents, notes); (c) 
use a fuzzy-joining method to align missing body text with the well-formatted HTML.13 

5. Take the resulting dataframe and apply the following steps: (a) select the longest text, 
choosing between the tagged HTML version and the version extracted from the body; 
(b) select available metadata; (c) remove irrelevant entries using regular expressions; (d) 
create unique tale titles where these were duplicated across multiple variants of tales; (e) 
clean tale text data (e.g. removing remnant HTML tags, replacing internal double quotes 
with single quotes).

6. Add manually extracted tales into a consistent format for web pages which generated 
errors during web scraping (ATU IDs 1696, 2, 545B, 57, 675, 75, 779J*, 676). Other than 
this final step, all steps were fully automatic.
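
Parts of step 5 can be sketched in a few lines. The example below is an illustrative Python approximation, not the R code used to build the dataset: the function names, the numbered-suffix scheme for duplicated titles, and the whitespace normalisation are all assumptions made for this sketch, and only straight double quotes are handled.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")


def pick_longest(tagged_text, body_text):
    """Step 5(a): keep whichever extracted version of the text is longer."""
    return max(tagged_text or "", body_text or "", key=len)


def clean_text(text):
    """Step 5(e): drop remnant HTML tags and replace double quotes with single quotes."""
    text = TAG_RE.sub(" ", text)
    text = text.replace('"', "'")
    # Collapsing whitespace is an extra tidy-up assumed here, not a documented step.
    return re.sub(r"\s+", " ", text).strip()


def make_titles_unique(titles):
    """Step 5(d): number the titles that occur more than once (hypothetical scheme)."""
    totals = Counter(titles)
    seen = Counter()
    unique = []
    for title in titles:
        if totals[title] > 1:
            seen[title] += 1
            unique.append(f"{title} ({seen[title]})")
        else:
            unique.append(title)
    return unique


print(clean_text('<p>He said, "Go home."</p>'))
print(make_titles_unique(["The Three Advices", "Cinderella", "The Three Advices"]))
```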

11 https://github.com/j-hagedorn/trilogy/blob/master/docs/vignettes/getting_started.md (last accessed: 4 May 
2022).

12 The site was harvested on 10 March 2021.

13 Using the Jaro-Winkler method, with maximum distance for a match set to 1 (Winkler, 1990; van der Loo, 
2014).
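
For readers unfamiliar with the measure named in footnote 13, the following self-contained Python sketch computes a Jaro-Winkler distance and uses it to pick the closest candidate string. It is a generic textbook formulation rather than the project's R implementation. The distance is taken as one minus the similarity, and the prefix scale of 0.1 with a four-character prefix cap is the conventional choice; both are assumptions here, since the source does not state them. The helper `best_match` is hypothetical.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i, ch in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_distance(s1, s2, prefix_scale=0.1):
    """Distance = 1 - similarity, with a bonus for a shared prefix (max 4 chars)."""
    sim = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    sim += prefix * prefix_scale * (1 - sim)
    return 1.0 - sim


def best_match(query, candidates, max_distance=1.0):
    """Return the closest candidate within max_distance, or None."""
    scored = [(jaro_winkler_distance(query, c), c) for c in candidates]
    if not scored:
        return None
    distance, candidate = min(scored)
    return candidate if distance <= max_distance else None


print(best_match("The Three Advices", ["The Three Advice", "Cinderella"]))
```

In this sketch a threshold of 1 admits every candidate, so the closest string is always returned; a lower threshold would reject poor matches.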

(2.2.2) Limitations

Web-scraping is an inherently messy exercise, as the data contained in web pages are often not 
formatted with the intent of being analysed. While the output has been reviewed at a cursory 
level, we anticipate that greater use of the dataset will result in the need for additional cleaning 
and processing.

The provenance field does not meet the definition of ‘tidy’ outlined above, since multiple 
types of descriptors (e.g. country, region, tale collection) are stored in a single column. While 
additional cleaning may be able to distinguish some of these, we have chosen to leave it as 
entered in the original to avoid losing potentially valuable detail.

The final limitation is purposefully adopted for the sake of downstream analyses. We have 
included only tales which were annotated with a single tale type, despite the existence of some 
tales which can be characterized by multiple types. This decision was made in order to avoid 
repeating texts or using data structures which are tool specific.14

(3) RESULTS AND DISCUSSION
(3.1) FEATURES OF THE ANNOTATED FOLKTALES (aft) DATASET

(3.1.1) Data dictionary

The aft (henceforth standing for Annotated Folktales to allow for the future inclusion of 
other resources) dataframe contains 1518 rows, each corresponding to a single tale. Its eight 
columns are described briefly below:

•	 atu_id: The ATU tale type identifier which classifies the tale.

•	 tale_title: The title of the tale.

•	 provenance: The person, place or tradition from which the tale came. In Ashliman’s 
collection, this refers variously to the person recording the tales (e.g. Giambattista Basile), 
the country or region from which the version of the tale came (e.g. North Africa), or the 
larger collection of tales in which the tale is found (e.g. the Kathasaritsagara).

•	 notes: Additional notes related to the tale.

•	 source: The bibliographic citation for the original published source of the tale.

•	 text: The full text of the tale identified in tale_title.

•	 data_source: The source of the annotated tales. At the time of this writing, the source of 
all tales is Ashliman’s Folktexts, but this is intended to change as the dataset grows.

•	 date_obtained: The date on which the dataset identified as a data_source was last 
downloaded and compiled.
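
The data dictionary above can be expressed as a simple record type. The following Python sketch is a hedged illustration: it assumes the data are available as a CSV file with a header row (the distribution format is not specified in this section), and the single sample row, the `Tale` class and the `read_tales` helper are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class Tale:
    """One row of the aft dataframe, using the eight documented column names."""
    atu_id: str
    tale_title: str
    provenance: str
    notes: str
    source: str
    text: str
    data_source: str
    date_obtained: str


EXPECTED_COLUMNS = [f.name for f in fields(Tale)]


def read_tales(handle):
    """Read rows from a CSV handle, failing early if a documented column is absent."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [Tale(**{name: row[name] for name in EXPECTED_COLUMNS}) for row in reader]


# Hypothetical one-row sample; the values are placeholders, not corpus content.
sample = io.StringIO(
    "atu_id,tale_title,provenance,notes,source,text,data_source,date_obtained\n"
    "910B,Example title,Example place,,Example citation,Once upon a time,"
    "Example source,2021-03-10\n"
)
tales = read_tales(sample)
print(tales[0].atu_id, tales[0].tale_title)
```

Keeping `atu_id` as a string preserves identifiers such as 910B or 779J* that are not purely numeric.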

Table 1 below shows the initial characters of fields from the first six rows of the dataset, in order 
to illustrate its appearance:

(3.1.2) Descriptive statistics

The 1518 tales in the dataset average 979.1 tokens in length, though the individual texts vary 
with a minimum of 10 tokens and a maximum of 12,406 (Table 2). 
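
Summary figures of this kind can be computed along the following lines. This Python sketch assumes a naive whitespace tokeniser, which is not necessarily the tokeniser behind the reported figures, so its counts should not be expected to reproduce Table 2; the sample texts and the `summarise` helper are hypothetical.

```python
import statistics


def token_count(text):
    """Count whitespace-separated tokens (a simplifying assumption)."""
    return len(text.split())


def summarise(texts):
    """Return basic length statistics for a collection of tale texts."""
    counts = [token_count(t) for t in texts]
    return {
        "n_tales": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
    }


# Hypothetical stand-ins for tale texts.
texts = ["a b c", "a b c d e", "a"]
summary = summarise(texts)
print(summary)
```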

The histogram below (see Figure 1) shows the distribution of tale lengths for all tales in the 
corpus15:

14 The list structure is specific to R, and different in Python. Texts with multiple IDs would result in a nested list 
so we would stop being data-tool agnostic.

15 Excludes six tales with greater than 6,000 tokens, to increase visibility. 

atu_id | tale_title               | provenance | source                 | text
910B   | The Highlander Takes…    | Scotland   | Cuthbert Bede          | In one of the glens of…
910B   | The Prince Who Acquired  | India      | Cecil Henry Bompas     | There was once a raja …
910B   | The Three Admonitions    | Italy      | Thomas Frederick Crane | A man once left his co…
910B   | The Three Advices        | Ireland    | T. Crofton Croker      | The stories current am…
910B   | The Three Advices Which… | Ireland    | Patrick Kennedy        | The name of the young …
1430   | Buttermilk Jack          |            | Thomas Hughes          | Oh mother, my buttermilk

Table 1 Example output of the dataset.


The tales compiled in the aft data are annotated by ATU tale type, and represent 182 
distinct types. There are on average 8.3 tales in each tale type, with a range of one to 31. 
The tale types with the largest representative group of tales in the corpus are shown in 
Table 3 below:

(4) IMPLICATIONS/APPLICATIONS
Under a Creative Commons license, we published on Zenodo and GitHub an open-access, ATU-
annotated dataset of 1518 tales for motif detection by machine learning. This dataset resulted 
from the conversion of the Ashliman Folktexts collection, and we hope it will become the core 
of an expanding corpus to support reproducible research in computational folkloristics. As a 
next step we plan to integrate information from the TMI and the ATU, to be applied in trawling 
(Tangherlini & Leonard, 2013) for motifs by deep learning. 

MEASURE                   | VALUE
Number of tales           | 1518
Number of tale types      | 182
Mean tokens per tale      | 979.1
Median tokens per tale    | 642
Minimum tokens per tale   | 10
Maximum tokens per tale   | 12,406
Mean sentences per tale   | 45.7
Median sentences per tale | 31

Table 2 Summary statistics of the AFT dataset.

ATU ID | TALE NAME | N OF TALES
275    | The Race between Two Animals (previously The Race of the Fox and the Crayfish) | 31
777    | The Wandering Jew | 30
1645   | The Treasure at Home | 26
510B   | Peau d’Asne (previously The Dress of Gold, of Silver, and of Stars [Cap o Rushes]) | 26
500    | The Name of the Supernatural Helper | 23
510A   | Cinderella | 21
700    | Thumbling (previously Tom Thumb) | 21
155    | The Ungrateful Snake Returned to Captivity | 20
545B   | Puss in Boots | 20
980    | The Ungrateful Son (previously Ungrateful Son Reproved by Naive Actions of Own Son) | 20

Table 3 Ten tale types with the largest number of representative tales.

Figure 1 Distribution of tale lengths.


ACKNOWLEDGEMENTS
The authors are grateful to Professor D.L. Ashliman (University of Pittsburgh) for his permission to 
turn his annotated collection into a publicly available dataset. We also thank three anonymous 
reviewers for helpful comments on the manuscript.

COMPETING INTERESTS
The authors have no competing interests to declare.

AUTHOR CONTRIBUTIONS
Joshua Hagedorn: conceptualisation, methodology, data curation, software, validation, writing 
– original draft, writing – review & editing.

Sándor Darányi: conceptualisation, methodology, resources, supervision, project administration, 
writing – original draft, writing – review & editing.

AUTHOR AFFILIATIONS
Joshua Hagedorn  orcid.org/0000-0001-8026-7562 
Independent researcher, Grand Rapids, MI, US

Sándor Darányi  orcid.org/0000-0002-1542-934X 
Swedish School of Library and Information Science, University of Borås, Borås, SE

REFERENCES
Abello, J., Broadwell, P., & Tangherlini, T. R. (2012). Computational folkloristics. Communications of the 
ACM, 55(7), 60–70. DOI: https://doi.org/10.1145/2209249.2209267

Armaselu, F., Apostol, E.-S., Khan, F., Liebeskind, C., McGillivray, B., Truica, C.-O., Utka, A., Oleškevičienė, 
G. V., & van Erp, M. (2022). LL(O)D and NLP perspectives on semantic change for Humanities 
research. Semantic Web Journal, accepted for publication. 
http://www.semantic-web-journal.net/system/files/swj2848.pdf. DOI: https://doi.org/10.3233/SW-222848

Berezkin, Y. (2015). Spread of folklore motifs as a proxy for information exchange: Contact zones and 
borderlines in Eurasia. Trames, 19(1), 3–14. DOI: https://doi.org/10.3176/tr.2015.1.01

Berezkin, Y. (2017). Peopling of the New World from data on distributions of folklore motifs. In R. Kenna, 
M. MacCarron, & P. MacCarron (Eds.), Maths meets myths: Quantitative approaches to ancient 
narratives (pp. 71–89). Cham, Switzerland: Springer International Publishing. 
DOI: https://doi.org/10.1007/978-3-319-39445-9_5

Bortolini, E., Pagani, L., Crema, E. R., Sarno, S., Barbieri, C., Boattini, A., Sazzini, M., da Silva, S. G., 
Martini, G., Metspalu, M., Pettener, D., Luiselli, D., & Tehrani, J. J. (2017). Inferring patterns of 
folktale diffusion using genomic data. Proceedings of the National Academy of Sciences, 114(34), 
9140–9145. DOI: https://doi.org/10.1073/pnas.1614395114

Burkert, W. (1982). Structure and history in Greek mythology and ritual. Berkeley, CA: University of 
California Press.

Christiansen, R. (1992). The migratory legends: A proposed list of types with a systematic catalogue of the 
Norwegian variants. Helsinki: Suomalainen Tiedeakatemia, Academia Scientiarum Fennica.

Darányi, S., & Lendvai, P. (Eds.). (2010). Proceedings of the first AMICUS workshop, October 21, 2010, 
Vienna, Austria. Szeged, Hungary: University of Szeged, Department of Library and Human 
Information Science.

Darányi, S., Wittek, P., & Forró, L. (2012). Toward sequencing ‘Narrative DNA’: Tale types, motif strings 
and memetic pathways. Proceedings of CMN-12, 3rd workshop on Computational models of narrative 
in conjunction with the 8th Language resources and evaluation conference, 2–10. 
http://narrative.csail.mit.edu/cmn12/proceedings.pdf 

Darányi, S., Wittek, P., Konstantinidis, K., Papadopoulos, S., & Kontopoulos, E. (2016). A physical 
metaphor to study semantic drift. To appear in Proceedings of SuCCESS-16, 1st international workshop 
on Semantic change & evolving semantics. DOI: https://doi.org/10.48550/arXiv.1608.01298 

da Silva, S. G., & Tehrani, J. J. (2016). Comparative phylogenetic analyses uncover the ancient roots 
of Indo-European folktales. Royal Society Open Science, 3(1), 1–11. 
DOI: https://doi.org/10.1098/rsos.150645

Dawkins, R. (1976). The selfish gene. Oxford, UK: Oxford University Press.

Declerck, T., Aman, A., Banzer, M., Macháček, D., Schäfer, L., & Skachkova, N. (2017). Multilingual 
ontologies for the representation and processing of folktales. Proceedings of the First workshop 
on Language technology for Digital Humanities in Central and (South-)Eastern Europe, 20–23. 
DOI: https://doi.org/10.26615/978-954-452-046-5_003

Declerck, T., Kostova, A., & Schäfer, L. (2017). Towards a linked data access to folktales classified by 
Thompson’s motifs and Aarne-Thompson-Uther’s types. Proceedings of Digital Humanities 2017. 
https://www.dfki.de/fileadmin/user_upload/import/9028_Dh2017_LOD_TMI-ATU_final.pdf 

Declerck, T., & Schäfer, L. (2017). Porting past classification schemes for narratives to a linked data 
framework. Proceedings of DATeCH2017. DOI: https://doi.org/10.1145/3078081.3078105

Finlayson, M. A. (2016). Inferring Propp’s functions from semantically annotated text. Journal of American 
Folklore, 129(511), 55–77. DOI: https://doi.org/10.5406/jamerfolk.129.511.0055

Frenzel, E. (1992). Stoffe der Weltliteratur: Ein Lexikon dichtungsgeschichtlicher Längsschnitte (8. überarb. 
u. erweit. Aufl.). Stuttgart: Kröner.

Gandrud, C. (2015). Reproducible research with R and RStudio (2nd ed.). Chapman and Hall/CRC. DOI: 
https://doi.org/10.1201/9781315382548

Goodman, S. N., Fanelli, D., & Ioannidis, J. P. A. (2016). What does research reproducibility mean? Science 
Translational Medicine, 8(341), 341ps12. DOI: https://doi.org/10.1126/scitranslmed.aaf5027

Harris, J. K., Johnson, K. J., Carothers, B. J., Combs, T. B., Luke, D. A., & Wang, X. (2018). Use of 
reproducible research practices in public health: A survey of public health analysts. PLoS ONE, 13(9). 
DOI: https://doi.org/10.1371/journal.pone.0202447

Ilyefalvi, E. (2018). The theoretical, methodological and technical issues of digital folklore databases 
and computational folkloristics. Acta Ethnographica Hungarica, 63(1), 209–258. 
DOI: https://doi.org/10.1556/022.2018.63.1.11

Karsdorp, F. (2016). Retelling stories: A computational-evolutionary perspective [PhD thesis]. Nijmegen: 
Radboud Universiteit. 

Karsdorp, F., & Fonteyn, L. (2019). Cultural entrenchment of folktales is encoded in language. Palgrave 
Communications, 5, 25. DOI: https://doi.org/10.1057/s41599-019-0234-9

Karsdorp, F., & van den Bosch, A. (2013). Identifying motifs in folktales using topic models. In 
Proceedings of the 22nd Annual Belgian-Dutch Conference on Machine Learning (pp. 41–49). 

Kestemont, M., Karsdorp, F., De Bruijn, E., Driscoll, M., Kapitan, K. A., Ó Macháin, P., Sawyer, D., 
Sleiderink, R., & Chao, A. (2022). Forgotten books: The application of unseen species models to the 
survival of culture. Science, 375(6582), 765–769. DOI: https://doi.org/10.1126/science.abl7655

Kirk, G. S. (1970). Myth: Its meaning and functions in ancient and other cultures. Berkeley, CA: University of 
California Press. DOI: https://doi.org/10.1525/9780520342378

Kontopoulos, E., Darányi, S., Wittek, P., Konstantinidis, K., Riga, M., Mitzias, P., Stavropoulos, T., 
Andreadis, S., Maronidis, A., Karakostas, A., & others. (2016a). Deliverable 4.5: Context-aware 
content interpretation. PERICLES project.

Kontopoulos, E., Riga, M., Mitzias, P., Andreadis, S., Stavropoulos, T., Konstantinidis, K., Maronidis, 
A., Karakostas, A., Tachos, S., Kaltsa, V., & others. (2016b). Pericles deliverable 4.4: Modelling 
contextualised semantics. PERICLES project.

Kühne, M., & Liehr, A. (2009). Improving the traditional information management in natural sciences. 
Data Science Journal, 8, 18–26. DOI: https://doi.org/10.2481/dsj.8.18

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. Proceedings of 
the 31st International Conference on Machine Learning, PMLR 32(2), 1188–1196. DOI: https://doi.org/10.48550/arXiv.1405.4053

Lô, G., de Boer, V., & van Aart, C. J. (2020). Exploring West African folk narrative texts using Machine 
Learning. Information, 11(5), 236. DOI: https://doi.org/10.3390/info11050236

Marwick, B. (2017). Computational reproducibility in archaeological research: Basic principles and a case 
study of their implementation. Journal of Archaeological Method and Theory, 24, 424–450. DOI: https://doi.org/10.1007/s10816-015-9272-9

McCullough, B. D. (2009). Open Access Economics journals and the market for reproducible economic 
research. Economic Analysis and Policy, 39(1), 117–126. DOI: https://doi.org/10.1016/S0313-5926(09)50047-1

Meder, T. (2010). From a Dutch folktale database towards an international folktale database. Fabula, 
51(1-2), 6–22. 

Meder, T., Karsdorp, F., Nguyen, D.-P., Theune, M., Trieschnigg, R. B., & Muiser, I. (2016). Automatic 
enrichment and classification of folktales in the Dutch Folktale Database. The Journal of American 
Folklore, 129(511), 78–96. DOI: https://doi.org/10.5406/jamerfolk.129.511.0078

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector 
space. DOI: https://doi.org/10.48550/arXiv.1301.3781 

Murphy, T. P. (2015). From fairy tale to film screenplay: Working with plot genotypes. Houndmills, 
Basingstoke, Hampshire: Palgrave Macmillan UK. DOI: https://doi.org/10.1057/9781137552037





Published: 24 June 2022

COPYRIGHT:
© 2022 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

Journal of Open Humanities Data is a peer-reviewed open access journal published by Ubiquity Press.

Nguyen, D., Trieschnigg, D., Meder, T., & Theune, M. (2012). Automatic classification of folk narrative 
genres. In J. Jancsary (Ed.), Proceedings of KONVENS 2012 (pp. 378–382).

Ofek, N., Darányi, S., & Rokach, L. (2013). Linking motif sequences with tale types by Machine Learning. 
In M. A. Finlayson, B. Fisseni, B. Löwe, & J. C. Meister (Eds.), 2013 Workshop on Computational Models 
of Narrative, 32, 166–182. DOI: https://doi.org/10.4230/OASIcs.CMN.2013.166

Pasquier, T., Lau, M. K., Trisovic, A., Boose, E. R., Couturier, B., Crosas, M., Ellison, A. M., Gibson, V., 
Jones, C. R., & Seltzer, M. (2017). If these data could talk. Scientific Data, 4, 170114. DOI: https://doi.org/10.1038/sdata.2017.114

Pompeu, D. (2019). Interpretable Deep Learning methods for classifying folktales according to the 
Aarne-Thompson-Uther scheme. Master’s thesis. Lisboa: Instituto Superior Técnico, Universidade de Lisboa.

Qin, G., & Gao, L. (2012). An algorithm for network motif discovery in biological networks. International Journal of Data Mining and Bioinformatics, 1–16. 
DOI: https://doi.org/10.1504/IJDMB.2012.045533

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. 
DOI: https://doi.org/10.18653/v1/D19-1410

Reiter, N., Frank, A., & Hellwig, O. (2014). An NLP-based cross-document approach to narrative structure 
discovery. Literary and Linguistic Computing, 29(4), 583–605. DOI: https://doi.org/10.1093/llc/fqu055

Ross, R. M., Greenhill, S. J., & Atkinson, Q. D. (2013). Population structure and cultural geography of a 
folktale in Europe. Proceedings of the Royal Society B, 280, 20123065. DOI: https://doi.org/10.1098/rspb.2012.3065

Ross, R. M., & Atkinson, Q. D. (2015). Folktale transmission in the Arctic provides evidence for high 
bandwidth social learning among hunter-gatherer groups. Evolution and Human Behavior, 37(1), 47–53. DOI: https://doi.org/10.1016/j.evolhumbehav.2015.08.001

Seigneuret, J. C. (1988). Dictionary of literary themes and motifs. New York, NY: Greenwood Press.

Tangherlini, T. R. (2016). Big folklore: A special issue on computational folkloristics. The Journal of 
American Folklore, 129(511), 5–13. DOI: https://doi.org/10.5406/jamerfolk.129.511.0005

Tangherlini, T. R., & Leonard, P. (2013). Trawling in the sea of the Great Unread: Sub-corpus topic 
modeling and Humanities research. Poetics, 41(6), 725–749. DOI: https://doi.org/10.1016/j.poetic.2013.08.002

Tehrani, J. J. (2013). The phylogeny of Little Red Riding Hood. PLoS ONE, 8(11), e78871. DOI: https://doi.org/10.1371/journal.pone.0078871

Tehrani, J. J., Nguyen, Q., & Roos, T. (2016). Oral fairy tale or literary fake? Investigating the origins of 
Little Red Riding Hood using phylogenetic network analysis. Digital Scholarship in the Humanities, 
31(3), 611–636. DOI: https://doi.org/10.1093/llc/fqv016

Thompson, S. (1951). Motif-index of folk-literature: A classification of narrative elements in folktales, 
ballads, myths, fables, mediaeval romances, exempla, fabliaux, jest-books and local legends (rev. 
[2nd] ed.). Copenhagen: Rosenkilde.

Thompson, S. (1977). The Folktale. Berkeley, CA: University of California Press.

Thuillard, M., d’Huy, J., Berezkin, Y., & Le Quellec, J.-L. (2018). A large-scale study of world myths. 
Trames, 22(4), 407–424. DOI: https://doi.org/10.3176/tr.2018.4.05

Uther, H.-J. (2004). The types of international folktales: A classification and bibliography, based on 
the system of Antti Aarne and Stith Thompson. Helsinki: Suomalainen Tiedeakatemia, Academia 
Scientiarum Fennica.

van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. The R Journal, 6(1), 
111–122. DOI: https://doi.org/10.32614/RJ-2014-011

Vollan, B., & Ostrom, E. (2010). Cooperation and the Commons. Science, 330(6006), 923–924. DOI: 
https://doi.org/10.1126/science.1198349

White, J. D. (1976). The analysis of music. New York, NY: Prentice-Hall.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. DOI: https://doi.org/10.18637/jss.v059.i10

Winkler, W. E. (1990). String comparator metrics and enhanced decision rules in the Fellegi-Sunter model 
of record linkage. Proceedings of the Section on Survey Research Methods, 354–359. https://eric.ed.gov/?id=ED325505
