Accessibility, Discoverability, and Functionality: An Audit of and Recommendations for Digital Language Archives


RESEARCH PAPER

CORRESPONDING AUTHOR:

Irene Yi

Linguistics Department, Yale 
University, New Haven, CT, US

irene.yi@yale.edu

KEYWORDS:
language archives; 
documentation; accessibility; 
discoverability; functionality; 
linguistics; endangered 
languages; metadata

TO CITE THIS ARTICLE:
Yi, I., Lake, A., Kim, J., 
Haakman, K., Jewell, J., 
Babinski, S., & Bowern, 
C. (2022). Accessibility, 
Discoverability, and 
Functionality: An Audit of and 
Recommendations for Digital 
Language Archives. Journal 
of Open Humanities Data, 8: 
10, pp. 1–19. DOI: https://doi.
org/10.5334/johd.59

Accessibility, Discoverability, 
and Functionality: An Audit 
of and Recommendations 
for Digital Language 
Archives

IRENE YI 

AMELIA LAKE

JUHYAE KIM

KASSANDRA HAAKMAN

JEREMIAH JEWELL

SARAH BABINSKI 

CLAIRE BOWERN 

*Author affiliations can be found in the back matter of this article

ABSTRACT
While digital archiving has long been standard for linguistics, archives themselves are 
heterogeneous (Aznar & Seifart 2020), and archived linguistic material is important 
for researchers and communities, particularly for language reclamation (cf. Baldwin 
& Olds 2007; Whalen et al. 2016; Hinton 2003, 2018; Kung et al. 2020). The format 
and usability of scholarly archival collections is shaped by the functions of the 
management practices at the stewarding institution, making an appreciation of the 
range of access services provided by such institutions relevant to the evaluation of 
individual collections.

Here we report on a review of 41 digital language archives. Three factors 
are examined: 1) accessibility, including metadata and site navigation; 2) 
discoverability, or searchability and internal navigation; and 3) functionality, the 
overall ease of data retrieval and use. We recognize that the decisions made by both 
stewards and depositors can greatly impact the accessibility of archived materials; 
to that end, we present recommendations for how archives might increase the 
utility of their holdings for their users. We emphasize that our intention is not to 
dissuade linguists from using archives because of these issues, and we recognize the 
tremendous amount of work that goes into the upkeep of digital infrastructure, often 
with very limited institutional support. Implementing such recommendations at an 
institutional level can establish a fairer peer-review process of archival collections. By 
delineating precisely what standards fall under the archive management level and 
what procedures individual depositors are responsible for, the roles of “archivist” and 
“depositor” become clearer. 

mailto:irene.yi@yale.edu
https://doi.org/10.5334/johd.59
https://doi.org/10.5334/johd.59
https://orcid.org/0000-0001-9255-4235
https://orcid.org/0000-0001-7764-5876
https://orcid.org/0000-0002-9512-4393


2Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

1 INTRODUCTION
It is estimated that 32% of living languages are currently in some state of loss (Simons & Lewis 
2013:10); some estimates place the figure at closer to 50% (Campbell et al. 2013; Campbell 
& Belew 2018). Documentation of endangered languages is vital for preserving them (Berez 
2013), whether for study, language reclamation, preventing the irreversible loss of intangible 
cultural heritage, or any other reason that language is used.

Digital archiving has been standard for linguistics for at least 15 years, but the extent to 
which this material can be accessed and used for research, education, and activism varies 
(cf. Evans & Sasse 2004; Kaplan & Lemov 2019; Paterson 2021). Language archives utilize a 
number of different content management systems and do not provide uniform functionality 
(Aznar & Seifart 2020). Some language reclamation projects have found success working 
exclusively with archival sources (Hinton 2003; ANA 2006; Baldwin & Olds 2007; Whalen et 
al. 2016, 2018);1 such work is a partnership-at-a-distance between the institutions that store 
and curate the materials, the researchers who deposit them, and the users of the materials. 
Archival materials are decontextualized (Schwartz & Dobrin 2016; Gaby & Woods 2020:e273; 
Dobrin & Schwartz 2021) from their original utterance, and depositors and archives both can do 
much to ensure that language collections are as robust and as useful as possible. In this paper, 
we report on the results of a review of language archives, with a concentration on sites and 
organizations with substantial holdings of digital data in and about endangered languages. 
We discuss the accessibility, discoverability, and functionality of archival resources, focusing 
on features of web portals and the special needs of linguistic collections. That is, the focus 
is around the needs that depositors and language resource users have, and how such needs 
are or are not met by current practices at the stewardship institutions that manage archives. 
Finally, we provide recommendations and suggest changes to the access services provided by 
stewardship institutions. It is our hope that these recommendations will serve as foundation 
for future guidelines in the creation, curation, and maintenance of web portals, the gateways 
to language resources at “language archives”.

1.1 DIGITAL LANGUAGE ARCHIVES

A “language archive” is defined here as a repository of language data (broadly construed), such 
as audio and/or video recordings, transcriptions, and translations, whether in physical or digital 
format, created with the purpose of preserving and disseminating those materials (Kung et 
al. 2020; Burke et al. 2021; Austin 2021; Paterson 2021). There is substantial variation among 
repositories that contain linguistic data (cf. Vann 2006)—in scope, functionality, infrastructure, 
the number of languages or regions covered, and the extent to which they function as research 
tools or simply data repositories, to name just a few. We follow Austin (2021) in considering 
the role of archives to be appraising materials (that is, collecting selectively based on a stated 
goal), preserving those materials, “mak[ing] known their existence”, and facilitating their 
appropriate distribution. For our purposes, we include sites that appear to have these aims. 
We treat the process of archiving as one in which someone places language resources in one 
of these repositories, as opposed to interacting with an archive or linguistic data in other ways. 
For this reason, we exclude from our definition of “archives” sites such as OLAC, which do not 
collect materials themselves but rather act as a directory for other archives. Throughout this 
paper, we refer to “items”, “collections”, and “archives”, where items are the linguistic materials 
that are deposited; they are grouped into “collections”, and those collections are housed by 
archives. Archives are repositories that are owned and managed by people, who are employed 
by institutions. Thus to talk about access to archives we need to think about the web portals, 
the choices of individuals, their employee obligations at their institution(s), the infrastructure 
that underlies the repository and its data services, among other topics.

While the advancement of technology has allowed linguists to digitize corpora that were once 
only available in physical media, digitization and online archiving have problems of their own. 
The long-term accessibility of digital material is dependent on the continuing availability of 

1 We recognize that language revitalization and reclamation are complex topics, far beyond the scope of what 
we can cover in this paper. Archived language data is inevitably an incomplete portrayal of languages and their 
communities.


3Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

compatible hardware and software. Necessary equipment may become obsolete and/or fall 
out of production (e.g., computers are no longer produced with built-in optical drives, making it 
difficult to access information stored on CDs). Storage media has a limited lifespan. Software, 
too, can rapidly become obsolete. While linguists have mostly heeded Bird and Simon’s (2003) 
call to use open source software wherever possible, documentation projects can become tied 
to a specific software platform and version (cf. Bird & Simons 2003). Such issues affect both 
depositors and archives: while depositors should ensure they archive materials in the most 
endurable formats possible, digital archives are also subject to these constraints, such as the 
lifespan of servers and backups.

While the Internet has greatly improved the availability of research materials to far-flung 
audiences, it is far from an equalizer. Access to a reliable Internet connection is not universal, 
particularly in remote communities where broadband has yet to be fully implemented. For 
example, Wasson et al. (2016) note that Dinjii Zhuh K’yaa (Gwich’in Language Archive and 
Language Revitalization Center), a community archive of Gwich’in people in Fort Yukon, Alaska, 
is mostly accessed and available only by physically accessing the center where the archive is 
located because internet access is uneven within the language community. By some estimates, 
roughly 40% of the global population are not Internet users. Even in the United States, 21 million 
people lack access to broadband Internet; the FCC believes this figure “radically overstates” the 
number of people who have reliable connections (Sonnemaker 2020). 

Despite the decades-long prevalence of digital archiving in the field, no two archives are alike, 
some having features that are tailored to the languages of focus. The Digital Archive of Scottish 
Gaelic,2 for example, offers a search feature to filter for lenited words, an option especially useful 
for researchers working with Goidelic languages. But because archives are so decentralized, 
there is currently no set of protocols or standards for digital language preservation (Aznar & 
Seifart 2020).3

1.2 RATIONALE FOR THE CURRENT PROJECT

Previous research (among others, Bird & Simons 2003; Vann 2006; Berez 2013; Sullivant 2020; 
Burke et al. 2021) has discussed aspects of linguistic archiving, including the importance of 
metadata, consistent approaches to creating language materials, and the current state 
of language archiving. The current paper covers a wider scope of contemporary language 
collections, as well as contributing to the discussion of how to improve the archival practice in 
order to help communities and researchers more easily use and reuse these archives. A review 
of digital archival practices, such as the decisions made in designing websites and displaying 
content, will provide insight into how archives are and are not meeting the needs of end users—
and the steps they can take to rectify these issues. 

This paper reports on the results of a review conducted of online digital archives in June-August, 
2021. The “audit” was conducted with the aim of investigating the utility of online archives and 
their accessibility for retrieval of materials. Concomitantly, we investigated a sample of individual 
collections in a subset of archives for ease of completing certain standard investigations, such 
as testing whether or not materials could be easily aligned using the Montreal Forced Aligner 
(a process widely used in the creation of corpus materials for phonological research; McAuliffe 
et al. 2017). This paper reports on findings that relate specifically to archives; a companion 
paper (Babinski et al. 2022) details the phonological/typological findings. The remainder of 
this paper describes the methods (Section 2), results of the archive audit (Section 3), and 
conclusions (Section 4), focusing on topics ranging from accessibility and discoverability to 
actual functionality (i.e., use and reuse of archival materials). At the end of each subsection 
in Section 3, we present suggestions for changes in practice. In engaging with these questions 
and making suggestions for changes in practice, we do not wish to downplay the efforts and 
skills of professional archivists, or dissuade researchers from depositing their materials in 
these archives. We recognize that there are innumerable tradeoffs in all aspects of language 

2 https://dasg.ac.uk/en.

3 The challenges around digital archives are not unique, as issues such as the longevity of software and 
hardware, internet accessibility, and the like, are common across many digital media repositories. However, 
because of the complexity of language archive collections, their many filetypes, heterogeneity of construction 
(and resulting metadata), to name just a few, they are probably a good illustration of a very broad array of 
challenges.

https://dasg.ac.uk/en


4Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

documentation and archiving, and that any safeguarding is preferable to none. However, we 
also consider it appropriate to evaluate the extent to which archival practices—that is, those 
practices that are primarily controlled by archives and their management—are serving the aims 
of those using archives. To this end, we are not yet at the stage where we can present a full 
set of recommendations for archival practice. Rather, we raise the issues we have found across 
archives so that those in the field, including archivists, can consider them in future archive 
development and management.

Region-focused archives, such as the Alaska Native Language Archive (ANLA)4 and the Survey 
of California and Other Indian Languages/California Language Archive (CLA),5 draw an audience 
of language communities who access materials for the purposes of cultural, historical, and 
language learning. It is believed that the usage of language archives by Indigenous communities 
is underestimated (cf. Austin 2011; Holton 2012; Woodbury 2014), as a single representative 
may bring resources back to a community that are then more widely disseminated and used by 
many more individuals. In discussing issues with language archives, we wish to emphasize that 
roadblocks created by archives will also greatly affect language communities, and, to best suit 
the needs of their audiences, it may be critical for archives to be accessible and interpretable 
to users without specialized linguistic training or extensive technical knowledge. Holton (2012) 
and Woodbury (2014) discuss the different audiences and users of language archives, drawing 
particular attention to the fact that non-Indigenous linguists are not the only audiences 
of archives, and that both linguist and non-linguist members of Indigenous communities 
are using archives (e.g., the DOBES6 Archive, the Archive of Indigenous Languages of Latin 
America [AILLA],7 and ANLA)8 for community-oriented purposes like language revitalization. In 
discussing the role of archived collections in promotion or hiring, therefore, it is also important 
to recognize that academics are not the only users of this material. 

Additionally, implementing such recommendations at an archive level (i.e., having inter-archive 
standards maintained by those who manage archives) can help establish a fairer peer-review 
process of archival collections. By delineating precisely what standards fall under the archive 
management level and what procedures individual depositors are responsible for, the roles of 
“archivist” and “depositor” become clearer. Thus, in reviewing depositors’ archival collections, 
we avoid evaluating the individual for aspects of archiving which are outside their control. 
Having standardization on the side of archives will create more equitable standards by which 
individuals are reviewed. 

2 METHODS
An archive review was conducted between June and August, 2021, by the authors of this paper. 
Our audit focused on archival usability as a whole, as well as two aspects of collections: files 
suitable for phonetic and phonological analysis, and textual archives/archives not exclusively 
maintained for linguistic research. The general archive audit included 41 archives (as listed in 
the Supplementary Materials).9 The archive list was compiled from OLAC’s list of participating 
archives10 as well as Digital Endangered Languages and Musics Archives Network (DELAMAN) 

4 https://www.uaf.edu/anla/.

5 https://cla.berkeley.edu/.

6 https://archive.mpi.nl/tla/.

7 http://ailla.utexas.org.

8 https://www.uaf.edu/anla/.

9 The supplement is available from https://osf.io/daksh/. Anonymous reviewers of this submission had 
differing opinions on the extent to which this choice of archives was appropriate. One reviewer suggested that 
the sample should be expanded, while another felt that it was too broad, including too many archives of different 
types (and that it was inappropriate to generalize across archives with very different levels of institutional 
support and access to funding). It was unclear from our survey how many of the archives in the OLAC and 
DELAMAN lists are actively maintained, what their support is, and how they backup and preserve their holdings. 
This is itself an important issue which should be investigated further. For our purposes, rather than restrict the 
focus to archives that are clearly actively maintained, we preferred to cast a wider net and examine as many 
digital archives as possible (with caveats further discussed in Section 3.1.3 below).

10 http://www.language-archives.org/archives.

https://www.uaf.edu/anla/
https://cla.berkeley.edu/
https://archive.mpi.nl/tla/
http://ailla.utexas.org
https://www.uaf.edu/anla/
https://osf.io/daksh/
http://www.language-archives.org/archives


5Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

members and associate members.11 For this reason, the archives examined are heavily skewed 
towards English-language based collections, though (as discussed in Section 3.1.2 below) we 
actively attempted to address this bias (unfortunately without much success).12

We compiled general information on metalanguages, search and retrieval functions, corpus 
structure, access condition options, and types of materials archived. Prior to the audit, we 
created a questionnaire that probed various aspects of archives and collections that could prove 
problematic in linguistic research. This questionnaire was used to systematically document 
information regarding the archives content, accessibility restrictions, search functions, 
metadata, download options, and file manipulation necessary for analysis (see Babinski et al. 
in prep for a larger summary of findings). Members of the team examined archives individually; 
the results were discussed as a group and CB & IY spot-checked data coded by other authors. 
We found a very high degree of inter-rater consistency, with the exception of problems arising 
from web browsers and access to sites which were blocked from Yale’s campus internet.13 We 
focus on the following points in this paper:

•	 Accessibility

° Which language(s) must a user know in order to navigate sites and collections?

° Is the site accessible to users of screen readers?

° Are there aspects of the site design that impede or promote accessibility?
•	 Restrictions

° How available is material in collections?

° If restrictions are placed on access, what is needed to access collections?

° What types of controls are in place, and for what reason?
•	 Finding information

° How easy is it to find information on the site?
•	 File manipulation

°  How usable are the collection materials? Are there aspects of the site and archive 
design that promote or impede the use of materials?

Another set of possible metrics are the FAIR principles.14 FAIR data is findable, accessible, 
interoperable, and reusable. Our points overlap with FAIR in a number of respects, but the FAIR 
framework was unsuitable for our evaluation for two reasons. Firstly, the findability criterion 
focuses exclusively on metadata structure, whereas we consider issues of “findability” to 
be much broader, as discussed further below. Secondly, the FAIR criteria mostly apply to 
collections, rather than to the overall structure of the archive qua repository.

There are a range of reasons why an archive might have a particular property, ranging from 
constraints introduced by the Content Management System (CMS), to decisions made in light 
of the amount of funding or staffing, to philosophical decisions about the appropriate structure 
of an archive. Therefore, rather than focus on the particular properties of individual archives, we 
instead focus on implications of current design for what end users can accomplish. We do list 
selected examples to illustrate and explain findings, however. While our findings are therefore 
perhaps not fully reproducible (cf. Berez-Kroeker et al. 2018), we have endeavored to make the 
findings replicable by including information in the supplementary materials. This represents a 
snapshot of archival issues as of August, 2021, which will no doubt evolve as sites are updated.

11 https://www.delaman.org/members/.

12 While there are other archives (such as Kielipankki, the Language Bank of Finland; https://www.
kielipankki.fi), restricting the sample to OLAC/DELAMAN archives provided some principle for inclusion in 
the survey. We acknowledge that it is unclear at this point how representative or comprehensive this list is. 
Organizations differ considerably in the extent to which they focus on preservation or access to files, or serving as 
research resources or content delivery platforms, making a clear definition of “language archive” difficult. There 
is, to our knowledge, no global list of language archives. The closest are the DELAMAN and OLAC compilations. 

13 We were unable to diagnose why some sites loaded and others did not, based on IP addresses. We noted 
issues when they arose, since, if they arose during this sampling process, they will likely arise for other users too. 
A reviewer asked why we do not exhaustively list, enumerate, and quantify all points made in this paper. We 
argue that doing so would give rise to misleading precision. As discussed in Footnotes 9 and 12, it is impossible 
to know how representative this sample is. It would therefore be misleading to draw detailed conclusions about 
small differences in prevalence. Instead, we concentrate on reporting common trends in this set of data. This 
allows us to evaluate recurring issues among commonly used language archives without being unduly focused 
on small differences.

14 https://www.go-fair.org/fair-principles/.

https://www.delaman.org/members/
https://www.kielipankki.fi
https://www.kielipankki.fi
https://www.go-fair.org/fair-principles/


6Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

There were some points which we wished to investigate but were unable to include. The 
extent of institutional support may be a critical component of an archive’s longevity, but such 
information was typically unavailable. Other points relating to archival infrastructure, such as 
long-term plans, backup procedures, storage procedures, type of content management system, 
and staffing will also have a major impact on what the archive can deliver. Because we are 
evaluating archives in terms of their usefulness to end users and not in terms of institutional 
and financial barriers they must overcome, we do not consider these points in our analysis, 
though we recognize that archives vary greatly in this dimension.

3 RESULTS
3.1 ACCESSIBILITY

The contents of an archive are only useful as far as they are findable and accessible. Accessibility 
can be impacted by a number of factors, both on the user end and through archive design 
choices. “Web accessibility” is generally understood to refer to compatibility with assistive 
technology. We discuss accessibility in this narrow sense in Section 3.1.3. However, we also 
discuss registration and account-creation requirements and procedures, display language, and 
site navigation. These are also points which may facilitate or impede a user’s access to the 
archive contents.

3.1.1 Accounts and registration

The majority of language archives we surveyed have materials that are available for download 
for free and with minimal registration requirements. Many archives appropriately have access 
restrictions for collections to respect the wishes of language communities and researchers 
(Nathan 2010). Five archives, including the Repository and Workspace for Austroasiatic Intangible 
Heritage (RWAAI),15 required registration to access any materials at all, including the catalog; 
other archives had a public-facing catalog, even if registration was required for download. Four 
archives, including ELAR and the DOBES Archive, had multiple tiers of access, where some tiers 
required registration and/or permission of the depositor for listening or download, while other 
tiers were unrestricted. One archive, the CLA, has materials that are closed-access, in that they 
are not available online and must be accessed in person. Two archives, Kaipuleohone16 and LIA 
Sápmi (Sami Speech Corpus),17 restrict all or most of their contents specifically to academic 
institutions and institution-affiliated researchers, a limitation that may exclude members of 
language communities. Others restrict only parts of their materials to those affiliated with 
academic institutions. The CHILDES Data Repository18 includes password-protected collections 
available only to faculty members, and the CLARIN Slovenian Repository19 requires that users 
access certain materials restricted for “academic use” through their institutional emails. While 
there are good reasons why collections may be not freely available, some of the convoluted, 
unclear, or heavily outdated procedures for requesting permission could be fixed. Archives 
that do not streamline the permission forms or provide unclear contact information could be 
updated. For example, account registrations requiring manual approvals, or emailing specific 
individuals, should be automated.

While the majority of archives are entirely free for use, and we did not encounter any archives20 
requiring payment of fees to access collections during our audit, we acknowledge that for 
some researchers, particularly those who do not have institutional membership with archives 
or sufficient funding, the cost of accessing an archive may be prohibitive. The Linguistic Data 

15 https://projekt.ht.lu.se/rwaai.

16 http://ling.hawaii.edu/kaipuleohone-language-archive/.

17 https://tekstlab.uio.no/glossa2/saami.

18 https://childes.talkbank.org/access/.

19 http://www.clarin.si/info/about.

20 We did not include predominantly physical archives that also have digital materials. This excluded archives 
such as AIATSIS (https://mura.aiatsis.gov.au), which requires the purchase of physical media for accessing 
digital collection items.

https://projekt.ht.lu.se/rwaai
http://ling.hawaii.edu/kaipuleohone-language-archive/
https://tekstlab.uio.no/glossa2/saami
https://childes.talkbank.org/access/
http://www.clarin.si/info/about
https://mura.aiatsis.gov.au


7Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

Consortium (LDC)21 and the European Language Resources Association (ELRA)22 are two 
examples of archives requiring payment in return for access to materials; these fees can range 
into tens of thousands of dollars (Vann 2006). Endangered language archives have tended 
towards a model where the archive is supported through institutional or grant funds, with costs 
supplemented by fees from depositors (similar to “gold” open access models for academic 
publication). Some archives have recently requested that depositors include archiving charges 
in research grant applications. Clearly the funding model for ongoing support for endangered 
language archives needs to be investigated in more detail.

3.1.2 Display language

The language(s) used to display metadata and to navigate the site may also limit accessibility 
of the materials. Bias towards English-language users and lack of built-in site translations 
disadvantages researchers whose primary language is not English and may prevent community 
members from accessing documentation of their own languages or other languages which 
they regularly use.23 While the arrival of digital media devices and technologies can facilitate 
the creation of a “social network of digital exchange” of cultural heritage for Indigenous 
communities (Mansfield 2014:66), unavailability of these resources in endangered languages 
further entrenches generational and educational divides in language communities where 
acquisition of literacy, particularly in English, is not widespread. A number of linguists and 
Indigenous community members have expressed concern that “the majority of digital 
resources available to Indigenous users are in English, even though English is not a first 
language for many” (Carew et al. 2015:310). Only 14 of the archives we examined provide more 
than one language interface, and not all of these had fully functional language options. We 
point to PARADISEC24 (see Figure 1) as an example of an archive providing information through 
7 regional languages (though unfortunately not on the mobile site).

Archives that focused on languages of a particular region often provided interfaces relevant 
to their users. For example, AILLA has interface options in Spanish, and PanGloss is fully 
implemented in both English and French.25 ELAR and CLA have interfaces only in English, 

21 https://www.ldc.upenn.edu.

22 http://www.elra.info/en.

23 The finding that there is a lack of archives with a primary interface in other languages may be, in part, due 
to our own biases as all English-dominant researchers in the USA. However, we made a substantial effort to 
search out archives written in other languages (e.g., Spanish, Russian, French), but they were largely difficult to 
search for because of Internet search engine rankings, which return results based on language and geographic 
region. This should be noted as an issue for linguistics that leads to a substantial reduction in findability of 
materials, though one beyond the control of individuals.

24 https://www.paradisec.org.au.

25 However, the translations caused issues with file matching. Where .mp3 files were labeled in French but the 
transcripts were auto-generated and downloaded by the site, and given English filenames.

Figure 1 PARADISEC’s 
informational language 
options (top right corner of 
banner).

https://www.ldc.upenn.edu
http://www.elra.info/en
https://www.paradisec.org.au


8Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

though at the collection level, ELAR allows filenames and metadata to be in other languages 
and scripts, which helps users if they know of the collection. 

We applied Google Translate to the exclusively English archives (testing languages such as 
Korean, Uzbek, Kyrgyz, and French). Translations were inconsistent, incomplete, and sometimes 
misleading. Some localizations translated only parts of the site text, leaving others, such as 
an embedded map and filenames, in English (see Figures 2, 3, and 4). Therefore using Google 
Translate as a workaround for untranslated sites is not a straightforward alternative.

Figure 4 PARADISEC items 
where the term “Elicitation” is 
translated into Kyrgyz once, 
but not other times. Further, 
green “Open” buttons are not 
translatable as the text is part 
of the icon.

Figure 2 ELAR interface in 
Uzbek with Google Translate 
overlay. Names on the side 
are not consistently translated 
or transliterated: “Hilda 
Lopez” is not transliterated, 
but “James Woodward” 
becomes “Jeyms Vudvord”. 
Selective transliteration 
breaks links elsewhere in the 
collection.

Figure 3 ELAR interface in 
Kyrgyz (Google Translate 
overlay). The “View 16 more” 
button on the menu (in 
orange text) no longer works 
with Google Translate as an 
overlay.


9Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

It is also worth noting that when site translations are available, options are predominantly 
languages of European origin. This is especially striking given the scope of archived languages, 
most of which are indigenous to Africa, Asia, and the Americas. Lack of translations into major 
regional languages limits the abilities of scholars to use these archives, creating a bias in the 
demographics of researchers and restricting potential scholarly innovation. For non-Indo-
European languages whose structures differ greatly from that of languages like English, French, 
and Spanish, automatic translation programs like Google Translate and Yandex are especially 
prone to offering confusing and poor-quality translations. We recognize that this is a much 
bigger problem than what individual archives can solve. For example, search engines filter out 
search queries in languages other than the query language (which made it almost impossible 
for us to search for archives outside the anglosphere internet).26 However, at the collection 
level, depositors should be encouraged to provide materials in languages that will be most 
usable for community members, and the substantial additional time costs for doing so should 
be recognized explicitly.

3.1.3 Disability accommodations

We acknowledge that disability accommodation remains a critical, and often-overlooked 
element of archive accessibility, and indeed, the accessibility of any digital material. In regards 
to the structure of websites and storage of data such as text files, it is essential that Web 
content – including archives – is presented in a way that is accessible for visually-impaired 
researchers.27

It is generally agreed as a principle of accessible Web design not to make different elements 
of a site distinguishable only by their color (Campbell 2018). In order to assess color blindness 
accessibility, we put each archive through a filter28 mimicking how each site would look to users 
with 3 of the most common kinds of color blindness. Sites were subjectively reviewed by a 
member of the team who is colorblind. The archives we surveyed, largely, performed well in this 
regard. The main issue raised by our survey was the low contrast between font and background 
colors, which may compromise readability for users with certain kinds of color blindness and 
other visual impairments; it may also inconvenience users with certain color and brightness 
settings on their computers and browsers. The websites for SIL’s International Language & 
Culture Archives and the Rosetta Project revealed such problems.

3.1.4 Recommendations

The restrictions archives place on access to language data are there for a reason; however, 
it is important that these restrictions do not place too much of a burden on researchers 
and language communities looking to access their contents. Therefore, we suggest archives 
streamline the process of requesting access permission. More specifically, we recommend 
that request forms be built into the site itself, with additional capacities for automated 
password retrieval. This is especially critical for those archives (such as ELAR) whose code 
is not built for long-term accessibility, as passwords cannot be reset and the application 
permission form is built on Google Surveys. For archives not already implemented in multiple 
languages, we strongly suggest expanding the scope of display languages offered, especially 
those languages which may be relevant to language communities and local researchers. 
Furthermore, we recommend against applications that must be physically posted to 
the archive, given their inefficiency and potential to disadvantage researchers in areas 
underserved by the postal system.29 Following the principles of accessible Web design will 
make great strides in overcoming barriers for researchers who require assistive technology. 
Even in the absence of laws like the American Disabilities Act (1990) (or varying legal 

26 https://developers.google.com/search/blog/2010/03/working-with-multilingual-websites 
provides some information about how Google determines relevance for multilingual sites; this includes the 
domain name suffix and IP address of the server, as well as language identification for monolingual web pages. 
It does not include HTML language attribute tags of georeferencing in HTML.

27 For further information see https://www.w3.org/WAI/standards-guidelines/wcag/ (WCAG 2018).

28 https://www.toptal.com/designers/colorfilter/.

29 One assumes that for a digital archive, users who will access the materials are also able to access a digital 
registration form.

https://developers.google.com/search/blog/2010/03/working-with-multilingual-websites
https://www.w3.org/WAI/standards-guidelines/wcag/
https://www.toptal.com/designers/colorfilter/


10Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

requirements across different countries), it is important and not too difficult to improve what 
is already there.

3.2 DISCOVERABILITY

Collections need to be discoverable; that is, users should be able to navigate the site to find 
what they need. Discoverability encompasses both the abilities of users to find archives through 
search engines or aggregation portals (such as OLAC) and to perform searches within those 
archives. The former point is essential for the use and reuse of an archive in general, while the 
latter point sheds important light on the internal organization/description of material.

3.2.1 Search functions and mislabeling

Search functions are vital for navigating large collections, but they can be made frustratingly 
slow and even useless depending on their available options. Six archives offer a map search 
function, which allows users to browse collections by location. This function is especially useful 
for more casual users or for those who are not searching for specific data, but it presents 
its own challenges. Archives like ELAR and the California Language Archive use Google Earth, 
and many others use similar platforms. The California Language Archive does not outwardly 
indicate whether each collection on the map is available online. This makes it initially seem 
as if there are more resources readily available to users than there actually are. While these 
issues are a cause for frustration, they are not necessarily debilitating, and the map function 
tends to be a useful visual aid for users. We point to PanGloss as an example of the map 
function at its most useful; its map function is easy to navigate, contains information on the 
title, researchers, and types of resources available for each collection (as well as a link to each 
collection), and can be filtered by the criteria “with annotation” and “with video”. In contrast, 
AILLA’s map function is non-interactive. That said, some location information within PanGloss 
was inaccurate.

Lack of transparency about the contents of collections was observed in numerous archives. 
Users may have little information about what a deposit contains before accessing its contents. 
Researchers often have specific criteria in their search for language materials—for example, 
linguists looking to perform certain kinds of phonological analysis may have a preference for 
collections whose video and audio recordings are all fully transcribed, and rule out collections 
with too few hours of recorded material or those that consist only of written materials. Other 
criteria may include the specific dialect(s) documented, date of creation, file type, number, age 
and location of speakers, or specific individuals. Some of these categories can be aggregated 
automatically, while others require manual labeling. While the lack of some of this information 
is due to incomplete metadata provided by depositors, we encourage archives to make such 
information easy to find. ELAR, for example, includes a collection landing page consisting of 
sections for “summary of the deposit”, “groups represented”, “language information”, “special 
characteristics”, and “deposit content”, though the quality and specificity of the information 
in these descriptions varied greatly between collections. This could be a point of evaluation 
for individual collections. In most collections, the metadata about the holdings is a file within 
the general collection. It is not consistently named and where collections have many files 
it is difficult to find. Archives could assist the retrieval of such information by flagging such 
metadata files directly or including an explicit link to the metadata file(s) within the collection 
overview.

Most archive portals include search bars, but these have varying degrees of usability. One 
important feature is a filter function, especially for larger archives. All but seven of the archives 
we investigated have some kind of search filter function. Some filtering options include language, 
speaker, depositor, file type, topic, and country, among others. However, the availability and 
usability of the filter function was inconsistent. ELAR’s search filter options vary by collection, 
while The African Language Materials Archive,30 Digital Himalaya,31 and AILLA all lack a search 

30 http://alma.matrix.msu.edu.

31 http://www.digitalhimalaya.com.

http://alma.matrix.msu.edu
http://www.digitalhimalaya.com


11Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

filter function entirely, making large collections more difficult to search. PARADISEC had a 
flexible search and filtering interface, at the item or collection level.

A useful search feature available in some archives is the ability to search within collections. 
This feature is especially useful, almost necessary for archives that contain large collections. 
However, despite its importance, we only found the feature in four of the archives we examined. 
Such free-text search increases finding options for collections extensively, allowing more 
refined searches than filters alone. For example, a filter may exist to restrict files to .xml, but a 
test search makes it easier to distinguish between Flextext transcripts, ELAN transcripts, and 
.xml-format metadata. These are all .xml format files but have very different functions. At the 
collection level, searches were hampered by missing metadata, incorrect tags, case sensitive 
searches and inconsistent metadata (e.g., searches returning either audio or Audio as filetype), 
empty folders, and broken URLs within collections. Correcting these small but time-consuming 
errors would improve intra-archive searches.

Two of the most useful search filter categories are media type and file type (see Figures 5 
and 6). Many researchers using these digital archives can only use files of a specific media 
type (e.g., videos or sound recordings), or, in cases where they plan to use certain software in 
their research, certain file types (e.g., .pdf or .wav files). File type and media type filters greatly 
reduce the time a researcher must spend browsing files to find what they need. Despite this 
importance, only five of the archives we looked at offer the option to filter by file type, and one 
of these archives, The Language Commons,32 returns files that aren’t bundled (with related files 
of different file types) when the file type filter is employed, causing users to miss potentially 
useful materials. Similarly, only four of the archives we looked at offer the option to filter by 
media type. Even fewer allowed users to filter by specific file extensions (such as .mp3 or .wav), 
and, when offered, the archive often displayed results with mislabeled extensions (.xml for .eaf, 
for example).

Mislabeled file types are another issue we encountered. ELAR and AILLA, for example, rename 
ELAN33 .eaf files and FLEx34 .flextext files as .xml (see Figure 7). While these are underlyingly XML 
files and alternate extensions are visible upon downloading the files, one needs to know how 

32 https://archive.org/details/LanguageCommons?tab=about.

33 https://archive.mpi.nl/tla/elan.

34 https://software.sil.org/fieldworks.

Figure 6 ANLA materials 
searchable/filterable by media 
type.

Figure 5 CLA materials with 
media type specified next to 
item name.

https://archive.org/details/LanguageCommons?tab=about
https://archive.mpi.nl/tla/elan
https://software.sil.org/fieldworks


12Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

to change the file extensions in order to open the files with the appropriate applications. It is 
also difficult to differentiate ELAN audio transcripts from FLEx dictionary or interlinearized texts, 
which are both listed as .xml files but have different underlying data structures.

3.2.2 Metadata

We noted considerable inconsistency in what type of metadata was available, across archives 
and collections. It is easy for relevant files to be lost in a search because they do not have the 
type of metadata used in the search.

Another issue we discovered was the use of different layers of metadata. In many cases, 
important metadata was hidden inside the folders of a collection, making it difficult for a user 
to find the specific information they need. AILLA, for example, has three layers of metadata: 
one layer is for the whole collection, another layer is found within each individual folder within 
each collection, and the final layer is attached to the individual files themselves. Such layering, 
combined with the frequent gaps in available metadata, makes it extremely difficult to find 
desired information and reduces the accuracy of the search function.

Sullivant (2020) provides detailed recommendations for collection metadata, breaking down 
these recommendations into categories based on importance. We point to The California 
Language Archive and PARADISEC as two archives that do a good job of including “first priority 
collection metadata”. Finally, it is important to note that, while many of the archives we 
examined do include the most important information in their collection metadata, almost 
none include the information in Sullivant’s next two tiers of recommended metadata. While 
archives are reliant on the metadata provided by depositors, this only reinforces the points 
made by Sullivant (2020) and Burke and Zavalina (2020) that metadata is crucial to the usability 
of a collection. The DACS35 standards may also be useful for both depositors and archives to 
introduce and maintain consistency. 

3.2.3 Site maintenance

Other issues impeded discoverability, with archives being incompatible with specific 
browsers, requiring defunct software, or failing to load entirely. This occasionally varied 
depending on the individual user in ways we were unable to solve. For example, three of the 
team members found that the APS Digital Library36 would not open for them unless they 
accessed it via Yale University’s VPN, while the remaining team members could access the 
site with no difficulties from off campus, all using recent versions of Chrome on MacOS 11.6 
or Windows 10.

Six of the 41 archives gave web access errors or were unreachable.37 Some, such as ALORA,38 
could only be accessed with the Wayback Machine.39 While these workarounds do allow users 
to access materials, users who are unfamiliar with the Wayback Machine would be deterred 
from retrieving relevant information. Moreover, the Wayback machine may provide access to 
the catalog, but not the files in the collection itself. Links provided within archives often faced 

35 https://github.com/saa-ts-dacs/dacs.

36 https://diglib.amphilsoc.org.

37 Academia Sinica English corpora (http://www.ling.sinica.edu.tw/en/announcements/Resources); 
ALORA (https://alora.cerdotola.com); Multimodal Learning Corpus Exchange (http://mulce.org); Standing 
Rock Sioux Tribe Language and Culture Institute (http://wooyake.org); American Philosophical Society Digital 
Library (https://diglib.amphilsoc.org); World Oral Literature Project (http://www.oralliterature.org).

38 https://alora.cerdotola.com.

39 https://web.archive.org/web/20190208220853/https://alora.cerdotola.com.

Figure 7 On the left is a 
screenshot of a sample AILLA 
document, in .eaf format 
(displayed as such). However, 
when the file is downloaded 
(at right), the file receives a 
.eaf.xml extension, which must 
be manually removed before it 
is readable by ELAN.

https://github.com/saa-ts-dacs/dacs
https://diglib.amphilsoc.org
http://www.ling.sinica.edu.tw/en/announcements/Resources
https://alora.cerdotola.com
http://mulce.org
http://wooyake.org
https://diglib.amphilsoc.org
http://www.oralliterature.org
https://alora.cerdotola.com
https://web.archive.org/web/20190208220853/https://alora.cerdotola.com


13Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

the same issues, defeating the purpose of being an archive that safeguards data.40 Furthermore, 
at least two archives41 still required some use of Adobe Flash Player (see Figure 8), which was 
phased out by December 2020.

Many archives contained broken links, though this differed in extent and severity. The problems 
related to both site-internal and -external links, and problems arose due to internal site 
reconfigurations (such as those of the British Library’s Endangered Archives Collections)42 as 
well as the removal of individual pages. It would be ideal for archives to not rely on external 
links, but when necessary, regularly checking for outdated links is crucial. Broken links not 
only hinder the usability of archival materials from an end-user perspective, they also hinder 
the discoverability of such webpages. Search engines penalize broken links43 in search results, 
thus making archive sites with such links less findable. As language resources are entrusted in 
archive sites’ stewardship, it is important that they remain discoverable by those who wish to 
access these language materials.

Corrado and Sandy (2017) draw attention to the lifecycle of a project, as defined by the Life 
Cycle Information for E-Literature.44 They argue that “institutional commitment…ensuring 
that enough financial resources are available to sustain the initiative” is necessary for digital 
preservation to be successful (Corrado & Sandy 2017:11). In order for stewardship organizations 
to faithfully fulfill their fiduciary duties as language resource stewards, website maintenance 
must receive ongoing support to keep up with rapidly-changing software and security 
compatibility requirements.

3.2.4 Recommendations

Offering more detailed descriptions of a collection’s contents, specifically, media types 
(video, audio, text, etc.), completion state of any transcriptions or translations, and number 
of hours of recorded material, would help researchers evaluate the utility of a collection for a 
particular purpose, and give community members a sense of what is in the collection. Allowing 
searches by file type would allow researchers to further refine their queries and determine the 
usability of a given collection for their research purposes; we also recommend that archives 
correctly label file types and remove filetype capitalization dependencies on searching.45 We 
also suggest that archives make it clear to depositors what types of information are indexed 
for searching, and how researchers can structure their collections to make them usable. To 
make archives more easily discoverable, we recommend archive managers use the Sitemaps46 
protocol set to provide site-internal content information to search engines. Finally, we suggest 
that depositors consider how they use links to external sites in their deposits, archiving copies 
where appropriate (or pointing links to the Internet Archive). We suggest that archives regularly 

40 Collections within the Endangered Archives Programme, British Museum (https://eap.bl.uk/project/
EAP347); Online Database of Interlinear Text (https://odin.linguistlist.org); ELAR (https://www.elararchive.
org/dk0611).

41 The Repository and Workspace for Austroasiatic Intangible Heritage (https://projekt.ht.lu.se/rwaai); Yami 
Corpus (http://yamiproject.cs.pu.edu.tw/yami/en_index_flash.htm).

42 https://eap.bl.uk. For approximately 8 months, every collection-level link from the main site catalog was 
broken. However, as of December 15, 2021, this has been fixed.

43 See https://devrix.com/tutorial/crucial-google-penalties/ for more about search engine penalties.

44 http://www.life.ac.uk/glossary.

45 To be clear, the issue we are discussing here is where a search returns both Audio and audio filetypes (for 
example) and treats them as distinct filetypes. This is a claim about variable capitalization in standardized 
vocabularies, not a point about case sensitivity in searches more generally.

46 https://www.sitemaps.org. We thank the anonymous reviewer who pointed us to Sitemaps.

Figure 8 Adobe Flash Player 
required to access materials 
on The Repository and 
Workspace for Austroasiatic 
Intangible Heritage.

https://eap.bl.uk/project/EAP347
https://eap.bl.uk/project/EAP347
https://odin.linguistlist.org
https://www.elararchive.org/dk0611
https://www.elararchive.org/dk0611
http://yamiproject.cs.pu.edu.tw/yami/en_index_flash.htm
https://eap.bl.uk
https://devrix.com/tutorial/crucial-google-penalties/
http://www.life.ac.uk/glossary
https://www.sitemaps.org


14Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

check for link breaks (e.g., by using automated checking tools that generate reports, such as the 
Broken Link Checker plugin),47 particularly to archive-internal pages.

3.3 FUNCTIONALITY

The primary function of an archive is to store and safeguard materials, so it is essential for 
both the process of depositing and retrieving data to be straightforward; after all, material 
is safeguarded for a purpose, not simply to have an unused record of languages. This section 
discusses the functionality of data retrieval and use. Section 3.3.1 focuses on the structure and 
content of various archives, as well as issues surrounding downloads. Section 3.3.2 lists our 
recommendations for archive functionality.

3.3.1 Site content, structure, and downloads

The available content and structures of archive sites posed the first issue with functionality. 
We note that some of the following concerns are affected by the choice of CMS of individual 
archives. We attempted to track CMS use across archives, such as whether the archive used a 
common CMS such as Mukurtu,48 DSpace,49 or a bespoke platform. However, information about 
the CMSs underlying the archives in our audit was not easily accessible; an overwhelming 
majority of archives had no publicly available information at all about their CMS. Half of the 
archives mentioned the institutions that supported the development of the archive, or external 
servers where related language corpora were hosted, but the information about infrastructure 
was not available for enough archives for us to track it. We acknowledge, however, that site 
structure and content capabilities are closely linked to choice of CMS. 

The sites examined here vary extensively in their holdings and scope. Some sites labeled as 
archives only hosted one or two resources (Magoria Books Carib & Romani Archive),50 which 
sometimes required purchase, while others hosted none at all (Multimodal Learning Corpus 
Exchange).51 Others, such as the SIL International Language & Culture Archives,52 appeared 
to function more as directories with both links to external resources and hosted materials. 
They were not “archives” in the sense of storing and safeguarding materials. This is in contrast 
to archives such as the ELAR archive, which has full hosting and offers (per R3data.org) more 
than 462,048 results.53 The most prevalent issue impeding archives’ functionality was the lack 
of a bulk download option. The vast majority (34/41) had no bulk download option for either 
text or audio/video. Two54 had bulk download options for text files only, and five55 provided 
download links for zip files containing all or a selection of the files in the corpus. Requiring users 
to download files individually not only results in loss of time, but also renders some collections 
(e.g., those with 15,000 audio files) virtually inaccessible because of the sheer number of clicks, 
ranging from 1 to 7 per file, required to download their contents. Further, when individual 
downloads are the only option, users would benefit from knowing exactly how many files are 
in each collection, allowing them to assess their own storage capacity before attempting to 
download a corpus.

Another concern that results from downloading files individually is the loss of arrangement of 
items within a collection. For example, nested files lose their relationships to each other and 
must be manually re-sorted when downloaded onto a drive. This is assuming that the archive 

47 https://www.outlookstudios.com/tools-to-find-broken-links-on-your-website/#Broken-Link-
Checker.

48 https://mukurtu.org.

49 https://duraspace.org/dspace.

50 http://archive.magoriabooks.com.

51 http://lrl-diffusion.univ-bpclermont.fr/mulce2/accesCorpus/accesCorpusMulce.php.

52 https://www.sil.org/resources/language-culture-archives.

53 https://www.re3data.org/repository/r3d100013583.

54 LIA Sápmi - Sami Speech Corpus (http://tekstlab.uio.no/LIA/samisk/index.html); CHILDES Data 
Repository (https://childes.talkbank.org/access).

55 DOBES The Language Archive (https://archive.mpi.nl/tla); The Language Commons (https://archive.org/
details/LanguageCommons?tab=about); Slovenian language resource repository (http://www.clarin.si/
info/about/); Eurac Research CLARIN Centre (https://clarin.eurac.edu/index.html); Open Resources and Tools 
for Language (ORTOLANG) (https://www.ortolang.fr).

http://R3data.org
https://www.outlookstudios.com/tools-to-find-broken-links-on-your-website/#Broken-Link-Checker
https://www.outlookstudios.com/tools-to-find-broken-links-on-your-website/#Broken-Link-Checker
https://mukurtu.org
https://duraspace.org/dspace
http://archive.magoriabooks.com
http://lrl-diffusion.univ-bpclermont.fr/mulce2/accesCorpus/accesCorpusMulce.php
https://www.sil.org/resources/language-culture-archives
https://www.re3data.org/repository/r3d100013583
http://tekstlab.uio.no/LIA/samisk/index.html
https://childes.talkbank.org/access
https://archive.mpi.nl/tla
https://archive.org/details/LanguageCommons?tab=about
https://archive.org/details/LanguageCommons?tab=about
http://www.clarin.si/info/about/
http://www.clarin.si/info/about/
https://clarin.eurac.edu/index.html
https://www.ortolang.fr


15Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

site has not already collapsed structures that existed when researchers originally deposited 
their files. When this happens, crucial information can be lost for collections that depend on file 
structure to match transcripts and metadata files to audio and video files (for further discussion 
of arrangement, see Patterson 2021: §6.3.1).

We do recognize that there are non-trivial issues concerning bandwidth, web server traffic, and 
validation of large files that limit download capabilities and may require additional funding to 
resolve. Still, since these issues directly affect archives’ functionality, they should be addressed 
sooner rather than later. Even if downloads must be done individually, solutions such as putting 
all of a collection’s download links on a single page (as opposed to requiring users to enter into 
individual folders to download) exist. We draw attention to the DOBES Archive for providing an 
effortless method of downloading files in bulk. Their “basket” system allows users to select and 
bundle individual files or entire collections, then after an amount of time proportional to the 
number of files they have requested, a link to a zip file is emailed directly to them.

Other issues surrounding downloads included non-functioning download buttons or downloads 
that resulted in unreadable data. AILLA’s download links are blocked by Chrome and Firefox 
browsers due to security settings, and could only be accessed by changing web browsers. The 
Hindu-Kush Areal Typology,56 while not strictly an archive, had a bulk download option for 
wordlists. However, users had to ensure that they were properly opening the UTF-8 encoded 
CSV file in order to read the data without broken text. While workarounds like these exist, they 
may deter users with less familiarity with technology from using such archives effectively.

3.3.2 Recommendations

Firstly, and perhaps most critically, we suggest adding the option to download files in bulk, 
including an option for the entire corpus and for each folder in it, while preserving the original 
arrangement configuration. We recognize that this may be a complex request, given how file 
storage may work for the archive, but it is a necessary part of making files accessible. A 15,000-
item collection with no bulk download option is neither accessible nor realistically usable. 
Furthermore, we suggest archives either allow depositors to preserve the original file structure 
of their collections upon deposit, or develop tools to help them better structure collections 
once archived in-site, for example through tags. It is vital that archives provide layout guides 
and naming conventions for depositors, so that users may quickly locate corresponding files 
and recreate file structures in the event that they are lost, and care should be taken when 
depositing collections to make sure that vital information about metadata and collection 
structure is not lost.

4 CONCLUSIONS
Digital archives, even when poorly maintained, may offer protection to language data that 
may otherwise have been lost, forgotten, or destroyed. We recognize that decisions made 
by both archives and depositors can greatly impact the accessibility of archived materials. 
We further recognize that there are tradeoffs in the creation of archives and some decisions 
that were made long ago continue to affect our methods, procedures, and choices. The 
power that both archivists and depositors have over these materials conveys a responsibility 
to ensure that materials will be able to be used and reused into the future. To that end, 
these findings and recommendations can help set procedural standards that greatly help 
those who access archives. We recognize that additional resources are necessary for this to 
succeed.

One incentive for depositors to increase the usability of their collection is for that work to be 
included in evaluations for promotion. By setting out how archives vary, and how that variation 
can affect the utility of collections and the user experience, we provide clarification to the 
scope of possible review. Individuals should not be evaluated for aspects of archiving which 
are outside their control; and if archives are to feature in hiring and/or promotion reviews, they 
may need to provide more explicit information about the scope and limitations of their services.

56 https://hindukush.clld.org/.

https://hindukush.clld.org/


16Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

APPENDIX
Information about the archive review:

•	 Archive name: the name of the archive
•	 Site link: the url of the web portal for the archive
•	 Metalanguage(s): the primary language which is used to deliver the records and to 

navigate the site
•	 Broken links: a qualitative assessment of the number of broken links encountered
•	 Types of materials available: a broad description of the filetypes available for download 

from the web portal
•	 Access restrictions: the types of access restrictions found across the site (or as described 

in the archive meta-information). 
•	 Search function: information about how searches can be conducted on the site and the 

types of materials returned
•	 Filter by: discussion of how search results may be filtered.
•	 Bulk download: whether collection items must be downloaded individually (e.g. using 

the “save as” command through a web browser) or whether there are options for 
downloading multiple files at once.

•	 Number of clicks to download: how many steps does it take between a collection item’s 
information and being able to download it.

•	 Metadata location: where metadata for a collection is accessed

ADDITIONAL FILE
The additional file for this article can be found as follows:

•	 Supplementary Files 1. Archive Audit Spreadsheet. Summarizes findings and comments. 
DOI: https://doi.org/10.5334/johd.59.s1

ACKNOWLEDGEMENTS
We are grateful to the Fieldwork Reading Group at Yale University for their valuable feedback and 
discussion throughout this project that have led to considerable improvements. We would also 
like to thank the audiences of The 7th International Conference on Language Documentation 
& Conservation (ICLDC2021), PARADISEC@100, and the UC Berkeley Language Revitalization 
Working Group for their insightful and useful feedback. 

COMPETING INTERESTS
The authors have no competing interests to declare.

AUTHOR CONTRIBUTIONS
IY: Data curation, formal analysis, investigation, validation, writing—original draft, writing—
review & editing

AL: Data curation, formal analysis, investigation, validation, writing—original draft, writing—
review & editing

JK: Conceptualization, data curation, formal analysis, investigation, methodology, validation, 
writing—review & editing

KH: Data curation, formal analysis, investigation, validation, writing—original draft, writing—
review & editing

JJ: Data curation, formal analysis, investigation, validation, writing—review & editing

SB: Conceptualization, data curation, formal analysis, investigation, methodology, validation, 
writing—review & editing

CB: Conceptualization, data curation, formal analysis, investigation, methodology, validation, 
writing—review & editing

https://doi.org/10.5334/johd.59.s1


17Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

AUTHOR AFFILIATIONS
Irene Yi   orcid.org/0000-0001-9255-4235 
Linguistics Department, Yale University, New Haven, CT, US

Amelia Lake 
Linguistics Department, Yale University, New Haven, CT, US

Juhyae Kim 
Linguistics Department, Cornell University, Ithaca, NY, US

Kassandra Haakman 
Linguistics Department, Yale University, New Haven, CT, US

Jeremiah Jewell 
Linguistics Department, Yale University, New Haven, CT, US

Sarah Babinski   orcid.org/0000-0001-7764-5876 
Linguistics Department, Yale University, New Haven, CT, US

Claire Bowern   orcid.org/0000-0002-9512-4393 
Linguistics Department, Yale University, New Haven, CT, US

REFERENCES
Administration for Native Americans (ANA). (2006). Native language preservation: A reference 

guide for establishing archives and repositories. http://www.aihec.org/our-stories/docs/
NativeLanguagePreservationReferenceGuide.pdf

Americans With Disabilities Act of 1990, Pub. L. No. 101–336, 104 Stat. 328 (1990).
Austin, P. (2011). “Who uses digital language archives?” PARADISEC blog. https://www.paradisec.org.au/

blog/2011/04/who-uses-digital-language-archives/ (last accessed 27 September 2021).
Austin, P. (2021). “Corpora and archiving in language documentation, description, and revitalization.” 

Presented at FieldLing Seminar 2021. Paris. http://www.peterkaustin.com/docs/teaching/2021-09-09_
FieldLing.pdf 

Aznar, J., & Seifart, F. (2020). RefCo: An initiative to develop a set of quality criteria for fieldwork corpora. 
2èmes journées scientifiques du Groupement de Recherche Linguistique Informatique Formelle et 

de Terrain (LIFT), 95–101. https://hal.archives-ouvertes.fr/hal-03066031/file/lift.pdf#page=100 (last 
accessed 27 January 2022).

Babinski, S., Jewell, J., Kim, J., Haakman, K., Lake, A., Yi, I., & Bowern, C. (2022). “How usable are digital 
collections for endangered languages? A review.” Proceedings of the Linguistic Society of America 

(PLSA) 7(1). 5219.

Baldwin, D., & Olds, J. (2007). Miami Indian language and cultural research at Miami University. In D. 
Cobb & L. Fowler (Eds.), Beyond red power: American Indian politics and activism since 1900, 280–90. 

Santa Fe: SAR Press.

Berez, A. L. (2013). The Digital Archiving of Endangered Language Oral Traditions: Kaipuleohone at the 
University of Hawai‘i and C’ek’aedi Hwnax in Alaska. Oral Tradition, 28(2), 261–270. DOI: https://doi.
org/10.1353/ort.2013.0010

Berez-Kroeker, A., Gawne, L., Kung, S., Kelly, B., Heston, T., Holton, G., Pulsifer, P., Beaver, D., Chelliah, 
S., Dubinsky, S., Meier, R., Thieberger, N., Rice, K., & Woodbury, A. (2018). Reproducible research in 
linguistics: A position statement on data citation and attribution in our field. Linguistics, 56(1), 1–18. 

DOI: https://doi.org/10.1515/ling-2017-0032
Bird, S., & Simons, G. (2003). Seven Dimensions of Portability for Language Documentation and 

Description. Language, 79(3), 557–582. DOI: https://doi.org/10.1353/lan.2003.0149
Burke, M., & Zavalina, O. L. (2020). Descriptive richness of free‐text metadata: A comparative analysis 

of three language archives. Proceedings of the Association for Information Science and Technology, 

57(1), e429. DOI: https://doi.org/10.1002/pra2.429
Burke, M., Zavalina, O. L., Phillips, M. E., & Chelliah, S. (2021). Organization of Knowledge and 

Information in Digital Archives of Language Materials. Journal of Library Metadata, 20(4), 185–217. 

DOI: https://doi.org/10.1080/19386389.2020.1908651
Campbell, L., & Belew, A. (2018). Introduction: Why catalogue endangered languages? In L. Campbell 

& A. Belew (Eds.), Cataloguing the World’s Endangered Languages, 1–14. London: Routledge. DOI: 

https://doi.org/10.4324/9781315686028
Campbell, L., Lee, N. H., Okura, E., Simpson, S., & Ueki, K. (2013). New Knowledge: Findings 

from the Catalogue of Endangered Languages (“ELCat”). 3rd International Conference 

on Language Documentation & Conservation. https://scholarspace.manoa.hawaii.edu/
bitstream/10125/26145/2/26145.pdf

Campbell, M. H. (2018). Accessibility of Archives’ Digital Resources for Users with Hearing and Visual 
Impairments. Master’s Thesis, University of North Carolina at Chapel Hill. https://doi.org/10.17615/
c11t-gs09

https://orcid.org/0000-0001-9255-4235
https://orcid.org/0000-0001-7764-5876
https://orcid.org/0000-0002-9512-4393
http://www.aihec.org/our-stories/docs/NativeLanguagePreservationReferenceGuide.pdf
http://www.aihec.org/our-stories/docs/NativeLanguagePreservationReferenceGuide.pdf
https://www.paradisec.org.au/blog/2011/04/who-uses-digital-language-archives/
https://www.paradisec.org.au/blog/2011/04/who-uses-digital-language-archives/
http://www.peterkaustin.com/docs/teaching/2021-09-09_FieldLing.pdf
http://www.peterkaustin.com/docs/teaching/2021-09-09_FieldLing.pdf
https://hal.archives-ouvertes.fr/hal-03066031/file/lift.pdf#page=100
https://doi.org/10.1353/ort.2013.0010
https://doi.org/10.1353/ort.2013.0010
https://doi.org/10.1515/ling-2017-0032
https://doi.org/10.1353/lan.2003.0149
https://doi.org/10.1002/pra2.429
https://doi.org/10.1080/19386389.2020.1908651
https://doi.org/10.4324/9781315686028
https://scholarspace.manoa.hawaii.edu/bitstream/10125/26145/2/26145.pdf
https://scholarspace.manoa.hawaii.edu/bitstream/10125/26145/2/26145.pdf
https://doi.org/10.17615/c11t-gs09
https://doi.org/10.17615/c11t-gs09


18Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

Carew, M., Green, J., Kral, I., Nordlinger, R., & Singer, R. (2015). Getting in touch: Language and digital 
inclusion in Australian Indigenous communities. Language Documentation & Conservation, 9, 

307–323. http://hdl.handle.net/11343/57354
Corrado, E. M., & Sandy, H. M. (2017). Digital Preservation for Libraries, Archives, and Museums. Lanham, 

MD: Rowman & Littlefield.

Dobrin, L., & Schwartz, S. (2021). The social lives of linguistic field materials. Language Documentation 
and Description, 21. http://www.elpublishing.org/docs/1/21/ldd21_01.pdf

Evans, N., & Sasse, H.-J. (2004). Searching for meaning in the Library of Babel: field semantics and 
problems of digital archiving. In L. Barwick, A. Marett, J. Simpson & A. Harris (Eds.), Researchers, 

Communities, Institutions, Sound Recordings, 1–42. Sydney: University of Sydney. http://hdl.handle.
net/2123/1509

Gaby, A., & Woods, L. (2020). Towards linguistic justice for Indigenous people: A response to Charity 
Hudley, Mallinson, and Bucholtz. Language, 96(4), e268–e280. DOI: https://doi.org/10.1353/
lan.2020.0078

Hinton, L. (2003). How to teach when the teacher isn’t fluent. In J. Reyhner, O. V. Trujillo, R. L. Carrasco, 
& L. Lockard (Eds.), Nurturing Native Languages, 79–92. Flagstaff, AZ: Northern Arizona University. 

https://jan.ucc.nau.edu/~jar/NNL/NNL_6.pdf
Hinton, L. (2018). Approaches to and Strategies for Language Revitalization. In K. L. Rehg & L. Campbell 

(Eds.), The Oxford Handbook of Endangered Languages, 442–465. Oxford University Press. DOI: https://
doi.org/10.1093/oxfordhb/9780190610029.013.22

Holton, G. (2012). Language archives: They’re not just for linguists any more. In F. Seifart, G. Haig, N. P. 
Himmelmann, D. Jung, A. Margetts, & P. Trilsbeek (Eds.), Potentials of Language Documentation: 

Methods, Analyses, and Utilization, 111–117. Honolulu: University of Hawai’i Press. http://hdl.handle.
net/10125/4523

Kaplan, J., & Lemov, R. (2019). Archiving Endangerment, Endangered Archives: Journeys through the 
Sound Archives of Americanist Anthropology and Linguistics, 1911–2016. Technology and Culture 

60(2), S161-S187. DOI: https://doi.org/10.1353/tech.2019.0067
Kung, S. S., Sullivant, R., Pojman, E., & Niwagaba, A. (2020). Archiving for the Future: Simple Steps for 

Archiving Language Documentation Collections. New York, NY: Teach Online with Teachable. https://
archivingforthefuture.teachable.com

Mansfield, J. (2014). Polysynthetic sociolinguistics: The language and culture of Murrinh Patha youth. PhD 
dissertation, Australian National University. https://doi.org/10.25911/5d723cd88582b

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: 
trainable text-speech alignment using Kaldi. Proceedings of the 18th Conference of the International 

Speech Communication Association. https://montrealcorpustools.github.io/Montreal-Forced-Aligner/. 
DOI: https://doi.org/10.21437/Interspeech.2017-1386

Nathan, D. (2010). Archives 2.0 for endangered languages: From disk space to MySpace. International 
Journal of Humanities and Arts Computing, 4(1–2), 111–124. https://doi.org/10/c7ct5f. DOI: https://
doi.org/10.3366/ijhac.2011.0011

Paterson, H. J., III (2021) “Where Have All the Collections Gone?” Poster presented at the 15th Annual 
Society of American Archivists Research Forum. https://hughandbecky.us/Hugh-CV/publication/2021-
where-have-all-the-collections-gone/Where-have-all-the-Collections-Gone.pdf 

Schwartz, S., & Dobrin, L. (2016). The cultures of Native North American language documentation and 
revitalization. Reviews in Anthropology, 45, 88–123. DOI: https://doi.org/10.1080/00938157.2016.117
9522

Simons, G. F., & Lewis, M. P. (2013). The world’s languages in crisis: A 20-year update. In E. Mihas, B. 
Perley, G. Rei-Doval, & K. Wheatley (Eds.), Responses to language endangerment. In honor of Mickey 

Noonan, 3–19. Amsterdam: John Benjamins. DOI: https://doi.org/10.1075/slcs.142.01sim
Sonnemaker, T. (2020). “The Number of Americans without Reliable Internet Access May Be Way 

Higher than the Government’s Estimate—and That Could Cause Major Problems in 2020.” 

https://www.businessinsider.com/americans-lack-of-internet-access-likely-underestimated-by-
government-2020-3 (last accessed 20 September 2021).

Sullivant, R. (2020). Archival description for language documentation collections. Language 
Documentation & Conservation, 14, 520–578. http://hdl.handle.net/10125/24949

Vann, R. E. (2006). Frustrations of the Documentary Linguist: The State of the Art in Digital Language 
Archiving and the Archive that Wasn’t. Department of Spanish Research, 1. Western Michigan 

University. https://scholarworks.wmich.edu/spanish_research/1
Wasson, C., Holton, G., & Roth, H. S. (2016). Bringing User-Centered Design to the Field of 

Language Archives. Language Documentation & Conservation, 10, 641–681. http://hdl.handle.
net/10125/24721

Web Content Accessibility Guidelines (WCAG). (2018). Web Accessibility Initiative. WCAG 2.1. https://
www.w3.org/WAI/standards-guidelines/wcag/

http://hdl.handle.net/11343/57354
http://www.elpublishing.org/docs/1/21/ldd21_01.pdf
http://hdl.handle.net/2123/1509
http://hdl.handle.net/2123/1509
https://doi.org/10.1353/lan.2020.0078
https://doi.org/10.1353/lan.2020.0078
https://jan.ucc.nau.edu/~jar/NNL/NNL_6.pdf
https://doi.org/10.1093/oxfordhb/9780190610029.013.22
https://doi.org/10.1093/oxfordhb/9780190610029.013.22
http://hdl.handle.net/10125/4523
http://hdl.handle.net/10125/4523
https://doi.org/10.1353/tech.2019.0067
https://archivingforthefuture.teachable.com
https://archivingforthefuture.teachable.com
https://doi.org/10.25911/5d723cd88582b
https://montrealcorpustools.github.io/Montreal-Forced-Aligner
https://doi.org/10.21437/Interspeech.2017-1386
https://doi.org/10/c7ct5f
https://doi.org/10.3366/ijhac.2011.0011
https://doi.org/10.3366/ijhac.2011.0011
https://hughandbecky.us/Hugh-CV/publication/2021-where-have-all-the-collections-gone/Where-have-all-the-Collections-Gone.pdf
https://hughandbecky.us/Hugh-CV/publication/2021-where-have-all-the-collections-gone/Where-have-all-the-Collections-Gone.pdf
https://doi.org/10.1080/00938157.2016.1179522
https://doi.org/10.1080/00938157.2016.1179522
https://doi.org/10.1075/slcs.142.01sim
https://www.businessinsider.com/americans-lack-of-internet-access-likely-underestimated-by-government-2020-3
https://www.businessinsider.com/americans-lack-of-internet-access-likely-underestimated-by-government-2020-3
http://hdl.handle.net/10125/24949
https://scholarworks.wmich.edu/spanish_research/1
http://hdl.handle.net/10125/24721
http://hdl.handle.net/10125/24721
https://www.w3.org/WAI/standards-guidelines/wcag/
https://www.w3.org/WAI/standards-guidelines/wcag/


19Yi et al. 
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.59

TO CITE THIS ARTICLE:
Yi, I., Lake, A., Kim, J., 
Haakman, K., Jewell, J., 
Babinski, S., & Bowern, 
C. (2022). Accessibility, 
Discoverability, and 
Functionality: An Audit of and 
Recommendations for Digital 
Language Archives. Journal 
of Open Humanities Data, 8: 
10, pp. 1–19. DOI: https://doi.
org/10.5334/johd.59

Published: 24 March 2022

COPYRIGHT:
© 2022 The Author(s). This is an 
open-access article distributed 
under the terms of the Creative 
Commons Attribution 4.0 
International License (CC-BY 
4.0), which permits unrestricted 
use, distribution, and 
reproduction in any medium, 
provided the original author 
and source are credited. See 
http://creativecommons.org/
licenses/by/4.0/.

Journal of Open Humanities 
Data is a peer-reviewed open 
access journal published by 
Ubiquity Press.

Whalen, D. H., DiCanio, C., & Dockum, R. (2018). Phonetic documentation in the literature: Coverage rates 
for topics and languages. The Journal of the Acoustical Society of America, 144(3), 1936–1936. DOI: 

https://doi.org/10.1121/1.5068471
Whalen, D. H., Moss, M., & Baldwin, D. (2016). Healing through language: Positive physical health effects 

of indigenous language use. F1000Research, 5. DOI: https://doi.org/10.12688/f1000research.8656.1
Woodbury, A. C. (2014). Archives and audiences: Toward making endangered language documentations 

people can read, use, understand, and admire. In D. Nathan & P. K. Austin (Eds.), Language 

Documentation and Description: Special Issue on Language Documentation and Archiving, 12, 19–36. 

London: SOAS. http://www.elpublishing.org/PID/135

https://doi.org/10.5334/johd.59
https://doi.org/10.5334/johd.59
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
https://doi.org/10.1121/1.5068471
https://doi.org/10.12688/f1000research.8656.1
http://www.elpublishing.org/PID/135