So what are you going to do with that?: The promises and pitfalls of massive data sets

Sigrid Anderson Cordell (Hatcher Graduate Library, University of Michigan, Ann Arbor, Michigan, USA) and Melissa Gomis (Perkins Library, Doane University, Crete, Nebraska, USA)

College & Undergraduate Libraries, published online 20 Jul 2017. DOI: 10.1080/10691316.2017.1338979

KEYWORDS: Data mining; library services; supporting DH across the institution; teaching DH

ABSTRACT: This article takes as its case study the challenge of data sets for text mining, sources that offer tremendous promise for DH methodology but present specific challenges for humanities scholars. These text sets raise a range of issues: What skills do you train humanists to have? What is the library's role in enabling and supporting use of those materials? How do you allocate staff? Who oversees sustainability and data management? By addressing these questions through a specific use case scenario, this article shows how these questions are central to mapping out future directions for a range of library services.

Introduction

When the first set of texts from the Early English Books Online Text Creation Partnership (EEBO-TCP) was released on January 1, 2015 (Text Creation Partnership [TCP] 2014), there was understandable excitement about the release of 25,000 openly available texts from the Early Modern period (Levelt n.d.). In addition to making these texts available to read, this release also opened up possibilities for text mining the EEBO-TCP data set.
However, while there is clear potential for digital humanities research in making a relatively clean data set of texts from the early modern period available, the structure of the data set itself poses considerable challenges for scholars without a background in programming. Most humanities scholars cannot take advantage of a data set like this one—or similar data sets, such as the historical newspapers that ProQuest has recently made available to institutions that have purchased perpetual access—without considerable training and support. The question becomes, who is best positioned to provide that support? For many, the obvious answer to this question is the library because of its position as provider of resources and expertise in navigating them. If the library is to provide this support, however, how can it do so most effectively? The gap between the promise and usability of massive humanities data sets like the EEBO-TCP project presents an opportunity to consider a host of questions facing libraries today as they develop service models and expertise to support traditional and emerging forms of scholarship. This article takes as its case study the challenge of massive data sets for text mining, sources that have been lauded as offering tremendous promise for DH methodology but present very specific challenges for humanities scholars with minimal programming skills. The data management and use issues with which we are concerned in this article engage the question of whether humanists should learn to code; however, they go beyond that in scale and scope. The text sets under discussion in this article raise a broad range of issues if they are to be used by researchers: What skills do you train humanists to have? While the library in most cases helped to create and provides access to these data sets, what is the library's evolving role in enabling and supporting use of those materials? How do you allocate staff in this situation? Who's going to oversee sustainability and data management? By addressing these questions through the lens of a specific use case scenario, this article shows how these questions are central to mapping out future directions for a range of library services.

Background

New digital methodologies and sources for humanistic scholarship raise new questions for training humanities scholars, as well as for the roles that libraries can play in supporting emerging scholarly approaches. As many have noted, emerging digital methodologies in humanities scholarship have opened up new ways to analyze texts at scale. As Heuser, Le-Khac, and Moretti (2011) observe, digital methodologies open up the possibility of asking broader questions of larger corpora to understand texts and underlying social and cultural phenomena at scale. Traditional scholarly methods, in particular the close reading of texts, necessarily limit the scale of analysis, leaving open the question of how authoritative any analysis based on reading a necessarily limited corpus can be.
As Heuser, Le-Khac, and Moretti point out, machine reading methods hold promise for allowing us to answer new questions based on a larger, more inclusive corpus: “These emerging methods promise ways to pursue big questions we have always wanted to ask with evidence not from a selection of texts, but from something approaching the entire literary or cultural record. Moreover, the answers produced could have the authoritative backing of empirical data” (79). Alongside the “authoritative backing” that “empirical data” promises, these approaches raise concerns among humanists, especially for disciplines that have long defined themselves in opposition to the sciences. As Heuser, Le-Khac, and Moretti (2011) observe,

By offering an entirely different model of humanities scholarship, the digital humanities raise many questions …. Can we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? … Under the flag of interdisciplinarity, are the digital humanities no more than the colonization of the humanities by the sciences? (79)

In conjunction with this lively debate over whether the core values of the humanities are lost by drawing on computational approaches is the question of how best to train humanists to undertake these approaches, as well as a necessary discussion about what might get lost in the process. Some of the resistance to computational training by humanists, Kirschenbaum argues, stems from a misunderstanding of what computer science is about, as well as its relevance to critical thinking:

Many of us in the humanities think our colleagues across the campus in the computer-science department spend most of their time debugging software. This is no more true than the notion that English professors spend most of their time correcting people's grammar and spelling. More significantly, many of us in the humanities miss the extent to which programming is a creative and generative activity. (2009, B10)

Scholars like Kirschenbaum (2009) have argued forcefully for rethinking humanities training so as to incorporate programming skills. One way to make space, Kirschenbaum suggests, is to replace the foreign language requirement in PhD programs with programming. These skills are crucial, he argues, because

Computers should not be black boxes but rather understood as engines for creating powerful and persuasive models of the world around us. The world around us (and inside us) is something we in the humanities have been interested in for a very long time. I believe that, increasingly, an appreciation of how complex ideas can be imagined and expressed as a set of formal procedures—rules, models, algorithms—in the virtual space of a computer will be an essential element of a humanities education.

As Kirschenbaum argues, humanities scholars cannot explore the “complex ideas” that humanities computing generates without an understanding of the underlying computational systems. Likewise, scholars connected to the Humanities, Arts, Science, and Technology Alliance and Collaboratory (HASTAC) have devoted considerable energy to advocating for humanists to learn coding. Hunter (2016) describes an anecdote that her advisor told her when she wanted to do DH work but resisted taking a programming class: “‘I'll never forget this young scholar who put himself forward as an expert on Chekhov,’ he mused. ‘I asked if he spoke Russian, and he proudly said he'd never even taken a class.
He lost all credibility in that moment. Don't be the Chekhov scholar who didn't take Russian 101.'” As Hunter suggests, scholars need to understand code to design digital projects. While there is some consensus in the scholarship that it is valuable for humanists to learn programming skills, there has been less detailed attention paid to what the best process is for teaching those skills. Antonijevic's (2015) ethnographic study of digital humanists reveals an informal, unstructured mode of learning that is focused on point-of-need, where

learning is linked to immediate scholars' needs, arising from specific research problems, which generally makes this way of learning preferred over organized efforts, such as library workshops, where learning is decontextualized from scholarly practice. This method also successfully makes use of one of the scholars' most scarce resources: their time. (80–81)

As Antonijevic (2015) points out, this method has the disadvantage of “depend[ing] on a scholar's social network and its knowledge capacity” (81). The idea of a “social network” as the basis for acquiring programming skills is linked to another solution to the training dilemma offered by the literature on digital scholarship: collaboration. Gibson, Ladd, and Presnell (2015) argue that, “Unlike traditional humanities research, digital humanities scholarship is not a solitary affair. Generally, no single person has all the skills, materials, and knowledge to create a research project. By nature, the digital humanities project, big or small, requires a collaborative team approach with roles for scholars, ‘technologists,’ and librarians” (4). Liu echoes this sentiment, arguing that DH work requires

a full team of researchers with diverse skills in programming, database design, visualization, text analysis and encoding, statistics, discourse analysis, website design, ethics (including complex ‘human subjects’ research rules), and so on, to pursue ambitious digital projects at a grant competitive level premised on making a difference in today's world. (2009, 27)

Collaboration, however, requires considerable support and advocacy in a disciplinary landscape where it is not the norm. Reid points out that,

Unlike a laboratory, which requires a team of people to operate, the default mode for humanities academic labor has been for a professor to work independently …. It is unusual for humanities scholarship to appear with more than two authors, let alone the long list of authors that will accompany work in the sciences …. While there are certainly examples of notable, long-standing collaborations in the humanities, they are exceptions to the rule. (2012, 356)

Although collaboration can be fruitful for scholars in the humanities, it requires both a cultural shift and a rethinking of the workflow for scholarly projects. At this point, collaboration has not been fully embraced by scholars across the disciplines. In addition to differing disciplinary attitudes that engender resistance to collaboration in the humanities, collaboration can have its own drawbacks, especially when the collaboration is not seen as fully equitable. As Edmond points out, “In the worst cases, teamwork based on an ethos of knowledge sharing can degenerate into the negotiation of uncomfortable tacit hierarchies, where some contributors (regardless of their expertise or seniority) feel like service providers working in the shadow of otherwise autonomous project leaders” (2015, 57).
Further, Edmond observes that collaboration doesn't just require bringing people together but also reimagining projects so that all people involved have an intellectual stake. According to Edmond, successful digital humanities collaborations “ensure from the outset that the project objectives propose interesting research questions or otherwise substantive contributions for each discipline or specialty involved” (56). As Reid (2012) explains, “Given that the assemblage operates effectively with a single author, one essentially has to invent new roles for additional participants” (356). Because of their well-established role supporting research, librarians have taken up the question of how to enable fruitful collaborations and how best they can train humanists seeking to create DH projects or learn programming skills. Green asks how libraries can facilitate “scholars' initial skills acquisition in text encoding” (2014, 222). Green recommends a workshop model that does “not simply inculcate scholars with the latest software; rather librarians and scholars work together to facilitate scholars' entry into the communities of practice that make up digital humanities” (222). Pointing to the TEI (Text Encoding Initiative) consortium as a model, she argues that it “presents a strong case study of the role of librarians in building learning environments that enable scholars to become members of its community of practice” (223). One key question is whether it is the role of libraries to offer technical support for digital projects, train researchers in attaining new skills (through workshops, for example), or enable collaboration. Lewis et al. assert that “Organizations most successful at building expertise among faculty, students, and staff tended to share characteristics such as an open and collaborative interdisciplinary culture in which each team member contributes expertise and is respected for it” (2015, 2). Discussions of the library's role in supporting scholars in emerging digital scholarship skills necessarily invite a conversation about staffing in libraries. Should the library provide support staff for digital projects, or should that support staff come from the ranks of graduate students? If graduate students are used as labor for these projects, how can that labor be organically integrated into graduate training? Lewis et al. (2015) point to both the advantages and disadvantages of this model for graduate students:

Often, digital scholarship projects rely on graduate student assistants. The experience gives students opportunities to build their knowledge and provides inexpensive labor. But such projects must contend with frequent turnover; as one faculty member put it, “I get these MA students, I train them, they graduate.” One university that offers degree programs in digital scholarship tries to recruit its own students as staff, but there aren't necessarily enough students to meet the demand, especially with competition from other organizations. Most of their graduates go to industry, since “they can offer more money. The only people we have are here because of idealism.” (2015, 27)

Likewise, sustainability can be an issue when the support model is based on labor by students who necessarily stay only a short period of time.
In describing the community of practice support model that has been used by various projects such as TEI, Documenting the American South, and the Victorian Women Writers Project, Green points out, “The labor and craft taught for encoding texts generates a ‘shared repertoire’ of skills that is continually disseminated and refined through the training of new and established scholars. This shared repertoire is a critical element to the ability of a community of practice to sustain and expand itself” (2014, 228). The community of practice model constantly requires new participants, especially because many graduate students in library and information science programs or schools of information are only pursuing master's degrees and graduate after two years. At the center of the question of library staffing, training, and support for digital scholarship is the debate over whether libraries should establish digital humanities centers. Ithaka's report on supporting DH outlines three “campus models for support”: the service model, the lab model, and the network model. In the network model, “there are multiple units whose services have developed over time, in the library and IT departments, but also visualization labs, centers in museums, and instructional technology groups, each of which was formed to meet a specific need” (Maron and Pickle 2014, 34). Maron follows up on the Ithaka report on DH centers by arguing that the service model has been controversial in libraries because of the debate over “the degree to which librarians should envision themselves in a ‘service role’” (2015, 33). Nevertheless, this is the most common model, and it is driven by the fact that it

meet[s] faculty and students where they are—to offer courses, training, and some programming support for members of the campus community. This often takes the form of developing a full range of programming, from workshops to courses, and bringing in guest speakers. (33)

The library or center following this model seeks to identify and respond to faculty needs rather than “independently identifying a path of innovation” (33); Maron identifies the “path of innovation” model as closer to the lab model. Likewise, digital humanities centers can create a central space for networking and collaboration. As Fraistat explains,

Digital humanities centers are key sites for bridging the daunting gap between new technology and humanities scholars, serving as the crosswalks between cyberinfrastructure and users, where scholars learn how to introduce into their research computational methods, encoding practices, and tools and where users of digital resources can be transformed into producers. (2012, 281)

While there is much support for the development of digital humanities centers, there are also detractors. Schaffner and Erway argue that “There are many ways to respond to the needs of digital humanists, and a digital humanities (DH) center is appropriate in relatively few circumstances” (2014, 5). Instead, libraries can draw on a host of other approaches to support DH on their campuses. In this case, Schaffner and Erway assert, “[i]n most settings, the best decision is to observe what the DH academics are already doing and then set out to address gaps” (5). Whether or not libraries build digital humanities centers, there is widespread consensus that libraries are natural partners in supporting digital scholarship.
At the same time, there has been much less discussion of the specific challenges raised by complex data sets that are not inherently user-friendly. Libraries offer varying models of support, and there is a robust conversation in the scholarly literature about whether training, direct technical support, or enabling collaboration—or a combination of all three—is the best approach to supporting digital scholarship. As we argue in the next section, the potential and challenges of large data sets provide an opportunity to think through approaches to training, as well as the library's role in supporting teaching and research using these data sets.

Case study: The EEBO-TCP data set

As new digital methodologies emerge, along with new data sets that enable textual analysis at scale, many scholars have sought help from librarians, other researchers (both in and beyond their disciplines), and technology experts as they begin navigating resources and methodologies far outside their traditional training. While there are expected challenges to learning the basic methods of digital scholarship and analysis, a significant additional barrier exists in formatting and preparing the data sets themselves, even beyond the programming skills that are necessary for analysis. For example, while many researchers can operate basic web-based text visualization tools such as Voyant with relative ease, finding and then preparing a corpus for analysis with these tools is often far more daunting. The challenge in this case comes from the complex nature of raw data sets, as well as other factors that work against usability. Creating data sets for analysis often involves individual downloads of plain text files (in the relatively limited cases in which platforms allow that functionality), using R or Python to isolate subsets of larger corpora, or being limited to corpora that are larger than the researcher may need. While it would be unrealistic to suggest that it is possible to eliminate all challenges to creating corpora, putting resources toward facilitating the creation of corpora from raw data sets would offer significant advances in scholars' involvement with digital scholarship. Even data sets that have been produced by libraries pose challenges in usability for researchers. Without a significant infusion of resources aimed at increasing the usability of these data sets by researchers at all levels of technical ability, the question becomes, who is best positioned to offer researchers and instructors support in using these data sets? Likewise, who is best positioned to communicate the research possibilities, as well as how to determine a fruitful research question, for using these data sets? Preparing a corpus takes time, and there is no guarantee that text analysis will yield usable results. This article takes the EEBO-TCP data set as a case study to discuss the challenges and potential approaches for libraries to support digital humanities work using these corpora. We draw on the EEBO-TCP data set both because its potential and challenges are representative of other data sets being made available for humanities research and because it is openly available. EEBO-TCP offers considerable potential because it makes transcriptions of early modern texts available for scholars, as well as because it is a clean data set.
EEBO-TCP is based on the Early English Books microfilm collection that includes over 130,000 titles from Pollard and Redgrave's Short Title Catalogue (1475–1640), Wing's Short-Title Catalogue (1641–1700), and the Thomason Tracts (1640–1661) (Early English Books Online [EEBO] n.d.). When the microfilm set was originally digitized, the scans appeared as images, and only the metadata was searchable. To make the texts themselves searchable, and because optical character recognition (OCR) software has not yet advanced to handle early modern fonts with any degree of accuracy, the Text Creation Partnership made the ambitious decision to re-key (i.e., transcribe) the texts, as well as to mark them up using XML/SGML encoding. Although the original goal was to make the texts full-text searchable, emerging text mining methodologies have made the existence of clean data sets particularly desirable for researchers. Because the texts have been re-keyed, there are fewer errors in the texts than in those that have been OCR'd. As part of its agreement with ProQuest, which makes the EEBO database commercially available, Phase I of the EEBO-TCP texts, which includes the first 25,000 re-keyed texts, was made publicly available in December 2014. While the data set offers considerable potential for researchers and also makes the texts themselves available, the data set itself is not easy for researchers to use for a variety of reasons. The texts are available either as a full data set on Box and GitHub, or as individual HTML, ePUB, and TEI P5 XML files through the Oxford Text Archive. The files on Box and GitHub are referenced by TCP number, a number that is not available on the ProQuest platform, meaning that researchers who are not interested in working with the corpus as a whole—who, for example, are interested only in texts from a specific time frame or author—have to do considerable extra work to identify the relevant files before they can begin downloading and formatting them for analysis. While researchers who are fluent in programming languages such as R or Python have little trouble accessing these texts, in our experience many researchers in the humanities are understandably daunted when faced with zip files containing 25,000 files, each of which contains XML or SGML markup that they must decide whether (and how) to scrub or retain. There is little documentation on strategies for accessing and cleaning up the text in preparation for mining, or on analysis tools to use once you have the data. Likewise, ProQuest has recently made its historical newspaper collections available (for a fee) to libraries that have already purchased perpetual access to specific titles. When libraries license the full-text data sets of historical papers, they are given access to the marked-up files. The Los Angeles Times, for example, is a collection of 4.5 million files, presented in no particular order and with no metadata in the file names. As in the case of the EEBO-TCP data set, to make use of these files, researchers must begin by pulling down slices of the corpus (such as by year or article type) using R or Python. Unlike the EEBO-TCP files, most LA Times articles are not available one by one as plain text files on a platform for researchers to cobble together a corpus through the search interface (and license agreements generally limit bulk downloads in any case).
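To make that slicing step concrete, the following is a minimal Python sketch of the kind of pass a researcher would have to write. The folder names and the pubdate element are hypothetical stand-ins, not part of any vendor's documented schema: because the delivered files carry no metadata in their names, the date must be read out of each file's markup, and the actual element name would have to be checked against the data set in hand.

import re
import shutil
from pathlib import Path

SOURCE = Path("la_times_raw")    # hypothetical folder of marked-up files
TARGET = Path("la_times_1968")   # destination for the one-year slice

# Hypothetical date element; a real data set's markup must be inspected first.
PUBDATE = re.compile(r"<pubdate>(\d{4})-\d{2}-\d{2}</pubdate>")

TARGET.mkdir(exist_ok=True)
for path in SOURCE.glob("*.xml"):
    # File names carry no metadata, so the date has to come from inside the file.
    match = PUBDATE.search(path.read_text(encoding="utf-8", errors="replace"))
    if match and match.group(1) == "1968":
        shutil.copy(path, TARGET / path.name)

Even a pass this small presumes comfort with file paths, regular expressions, and character encodings, which is precisely the support gap at issue here.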
Once researchers have pulled down a subset of the corpus, they must decide how much of the markup to keep or strip out before they can run it through a text visualization tool (unless they decide to use the text mining package in R or a similar programming language). Leaving aside the technical skills needed to do this, researchers must also decide how to approach the dirty OCR problem, because the texts themselves are riddled with errors due to the conversion process from microfilm. While data sets like this offer tremendous potential, it is not feasible for humanities scholars to make use of them without considerable support. Another example outside of the humanities is the United States Census Bureau, which provides access to data sets through a variety of different websites and formats. Determining the type of data that is needed and locating that data can be challenging for researchers new to working with census data. The Census Bureau offers a list of recommended software and provides workshops, webinars, and classroom trainings to help people get what they need. It also provides phone and e-mail support for researchers and people using census data in their work. Libraries are just beginning to offer a range of data sets to their users, either through their subscription databases or through their own digital projects. Usually this type of information is provided without creating a service model: faculty and students often have to figure out how to use these data sets themselves, and once users have the data set, the library doesn't play a strong role in helping them use it. The U.S. Census Bureau's approach could serve as a service model for supporting text mining in the digital humanities. When an institution or a company provides access to a data set, does it have a responsibility to assist researchers in using the data set? The following section presents different support models that allow us to examine the ways libraries are supporting digital scholarship projects with large data sets for research and learning. Gaining access to the texts and analysis tools is not always the barrier to digital scholarship, especially for content out of copyright. Researchers often need help locating resources, including money for staff, storage space, software, and the technological expertise to execute their projects.

Potential support models for digital scholarship using unwieldy data sets

Although there are certainly scholars out there who are capable of making use of raw data sets, the majority are not. We as librarians and scholars need to advocate for the ways in which our scholars want to use these materials. At the moment, we are operating in a bifurcated context: on the one hand, there are graphical interface tools, such as the Google N-Gram tool, that meet the needs of some researchers but give them little flexibility or control to manipulate or build the corpus they are analyzing; on the other hand, publishers are moving to release the raw data. As in the case of the ProQuest Historical Newspapers data sets, publishers have responded to requests from researchers by making data sets available; these data sets are usually delivered as large raw text file dumps that are not manageable for the average humanist scholar.
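As a concrete illustration of what "not manageable" means in practice, the following Python sketch reduces a folder of marked-up files to the bare plain text that tools like Voyant or AntConc expect. It is a deliberately naive pass under stated assumptions: the folder names are hypothetical, the regular expression drops every tag wholesale, and real EEBO-TCP or newspaper XML would call for a proper parser and per-element decisions about what to keep.

import re
from pathlib import Path

TAG = re.compile(r"<[^>]+>")     # naive: removes every tag indiscriminately
WS = re.compile(r"\s+")          # collapse the whitespace left behind

SOURCE = Path("la_times_1968")          # hypothetical slice from the previous step
TARGET = Path("la_times_1968_plain")    # plain text ready for Voyant or AntConc
TARGET.mkdir(exist_ok=True)

for path in SOURCE.glob("*.xml"):
    text = path.read_text(encoding="utf-8", errors="replace")
    text = WS.sub(" ", TAG.sub(" ", text)).strip()
    # Dirty-OCR repair (microfilm noise, early modern spelling) would be a
    # further, corpus-specific pass at this point.
    (TARGET / (path.stem + ".txt")).write_text(text, encoding="utf-8")

The point of the sketch is less the code than the decisions buried in it: which tags to drop, which text to keep, and what to do about transcription or OCR errors, none of which the raw dump documents for the researcher.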
Advocacy

As a first step in enabling research with these data sets, libraries, as the purchasers and as the supporters of researchers, need to advocate for tools that create bridges between easy-to-use digital tools (like Voyant and AntConc) and the data sets. For example, rather than having either the entire raw data set for EEBO-TCP or the Oxford cut-and-paste formatted version, why not create tools that make it easy to use the platform to designate a corpus (i.e., by doing a search using the parameters on the platform) and then extract plain text files from the search results? In the case of the ProQuest Historical Newspapers example mentioned, it is not consistently possible across the PQHN platform to download plain text files of individual files, although this would make text mining custom corpora much more manageable for researchers without a background in programming or the resources to hire an assistant to manage the technical aspects.

Creating new tools

Leonard (2014) recommends that libraries create tools, or adopt open source ones, to make analysis easier. The Yale University Library, for example, adopted the HathiTrust Bookworm tool to analyze a small digital corpus, the Vogue collection. By creating tools that researchers can use to search text in new ways, libraries also help patrons analyze their large digital collections. To facilitate work on the EEBO-TCP data set, Washington University in St. Louis created the Early Modern Print (n.d.) project, which is supported by the Humanities Digital Workshop at Washington University. The Early Modern Print project provides exploration tools tailored to the EEBO-TCP data. The developers describe the tools as providing

an aggregate view of the corpus that enables us to probe English lexical and orthographic history in ways that usefully complement the search capabilities of EEBO-TCP and the Oxford English Dictionary; they also help us to see early modern book culture in a new way, as a structured flow of words. (Early Modern Print n.d.)

The developers have created graphical interface tools, such as an EEBO N-gram Browser, to facilitate use of the collection by researchers, but users necessarily have less ability to manipulate the corpus when they are using this tool. Until there are more robust tools available to make working with a broad range of data sets easier for scholars, libraries can play a role in supporting emerging research by teaching scholars basic skills.

The workshop model: Creating stages for learning

In designing workshops to teach skills in digital scholarship, librarians need to be attentive to felt needs in their community and to carefully stage those workshops to make sure that instructors are not spending too much time on technical minutiae, such as constructing a corpus or troubleshooting tool setup. To do this, workshop facilitators need to draw on the principles of backward design by asking, what is the intellectual outcome that they want to have in the session? Wiggins and McTighe explain backward design as a methodology that conceives of curricular design by thinking at the outset in terms of outcomes rather than lessons: “Given a task to be accomplished, how do we get there? … What kinds of lessons and practices are needed to master key performances?” (1998, 8).
In just the same way that you might design a classroom exercise to focus narrowly on imparting a specific skill or research strategy, it is useful to isolate the specific technical skill, as well as the possibilities for further exploration, that you hope to impart. This is likely to require more setup in advance by the workshop leaders—for example, creating a specific corpus to work with or downloading example files to practice on—but it will allow the session to focus on that specific skill rather than the frustrations of getting ready to learn that skill. A scenario to avoid is when workshop participants try to download software and wind up spending most of the time troubleshooting the download and relatively little time on using the tool. Designing workshops in ways that focus narrowly on outcomes may also require participants to use the same operating system and computers that have all been set up identically in advance. Creating an equal computing environment is a big challenge, especially when people have different skill levels and different technology vocabularies. As the scholarship on how researchers learn technical skills suggests, if you can give an opening to the possibilities, and offer a framework for follow-up support, interested researchers will take the time to teach themselves or request consultations on how to do the technical minutiae. A key goal for a workshop can often be illustrating the possibilities. How can you illustrate the possibilities in the approach so that scholars are motivated to learn the details of downloading and constructing their own corpus? Can you create a session that focuses on a piece of the process—i.e., looking at a predetermined corpus in AntConc? One approach is to make the entry easy so that scholars can decide if they want to do more, then offer resources for them to take the next steps. A significant goal for workshops can be illustrating why researchers would want to learn these approaches. Workshops can also be augmented by working sessions, such as the Hackfest sponsored by the Bodleian Libraries in 2015 (Oxford University n.d.). This full-day session included researchers as well as robust technical support, as participants had a chance to “pitch ideas and find collaborators, firm up projects and groups, and request (or indeed recruit) technical help as necessary” (Willcox 2015). Key to the success of this model, practiced also by Software Carpentry, whose goal is “teaching basic lab skills for research computing” (Software Carpentry n.d.), is the availability of support from multiple people, rather than one or two workshop leaders trying to troubleshoot and lead the session.

Classroom approach

In addition to workshops aimed at researchers at all levels, librarians can offer considerable support for digital scholarship through course-integrated instruction at the undergraduate or graduate level. If integrated thoughtfully into a course's learning goals and assignments, course-integrated instruction can be, arguably, at least as effective as workshops because the individual skills to be taught are bound up with the questions raised by a specific course theme. By working with the faculty member leading the course, and by being attentive to the specific learning goals and questions for the course, librarians can design exercises that are targeted toward specific research questions.
Just as in workshops, it is essential that librarians front-load the planning for these instruction sessions to isolate the specific learning goal for the course. While it is not possible, nor realistic (or, really, desirable), to eliminate all frustration in working with complex data sets, librarians can anticipate and minimize potential pain points so that the session can focus on the learning goals. For example, in one undergraduate class session at the University of Michigan, the librarian and technology specialist worked closely with the faculty member to design an instruction session that drew on the EEBO-TCP data set in a 300-level course. Because the point of the assignment was not necessarily to teach students how to compile corpora for analysis but rather to allow students to perform text analysis on a set of relevant texts, they set the session up so that students were creating a limited corpus of only ten texts, based on search criteria that students determined (and determining the search words was part of the goal for the exercise). To minimize frustration with the data set as a whole, they first showed students how to use the EEBO platform to explore texts related to their topics and identify ten potential texts. Once they had identified the ten texts, it was relatively easy for students to find those texts on the Oxford platform and cut and paste the text into plain text files. Although this approach may have glossed over some of the intricacies of the data set and corpus creation, it allowed students to create a minicorpus relatively easily to import into Voyant, where the bulk of the learning was meant to happen.

The lab approach: ScholarSpace at the University of Michigan Library

ScholarSpace at the Graduate Library at the University of Michigan provides access to technologies for small-scale experimentation and technologies for formal project support with the understanding that anyone can access them. ScholarSpace supports humanists working on text mining projects by providing access and expertise for digitization, storage, text cleanup, and analysis. We have purchased text mining software that is not available elsewhere on campus, thereby providing access to anyone affiliated with the university. This approach relies on humanists being willing to experiment with librarians and to train each other. Text mining varies greatly by discipline; by creating a community of scholars, we can build a network of experts and draw on experiences and expertise related to text mining in Chinese studies, economics, history, English language and literature, and more.

Staffing models

Across these different models, the question remains as to how best to apportion staffing to support digital scholarship. In a distributed model, where librarians are leading workshops for the campus community and for classes, subject specialists, technology librarians, and undergraduate learning librarians can provide considerable support, especially if they are provided training and if the workshops are a natural extension of their expertise and outreach areas. Depending on the demand on campus, this model can, however, lead to librarians being stretched too thin; thus, creative staffing, such as training students to lead or support workshops, is necessary. Likewise, students can be brought into a project to work on a specific slice—such as OCR-ing PDF files and cleaning up the resulting OCR.
In this case, however, it is important to bring the students into the conversation about the project at some level so that they understand how their work fits into the larger intellectual work of the project. Otherwise, libraries miss out on the opportunity to mentor students in emerging questions and methodologies of digital scholarship. The bulk of preparing texts for mining and analysis can also be tedious, and it requires careful attention to detail. Librarians or others overseeing students working on DH projects need to be vigilant in keeping the work moving forward and in checking the quality and consistency of the work. Sustainability and scalability are challenges across all staffing models. Projects that have dedicated funding may not have enough funding to cover the entire project. Students cycle off projects either because they graduate or because they receive other opportunities such as internships or jobs.

Conclusion

As the preceding discussion of staffing illustrates, challenges remain in thinking through collaborative work in digital scholarship, especially in terms of the necessary—but not as obviously exciting—work of data preparation and cleanup. The need to develop and create digital scholarship projects will continue to grow in the humanities, and at some institutions it will be embedded into the curriculum. Project management, digitization, and analysis are skills humanists will need in the future, and they will learn them through the channels available. These skills can translate easily to a number of positions postgraduation and will be desired by employers. Having graduate students work on digital projects can provide them with excellent opportunities to obtain new skills. Considering that resources are not currently in place to make data sets easier to use in the near future, librarians can advance digital scholarship by helping scholars in incremental ways targeted at the specific challenges and frustrations that data sets pose. Librarians can set the expectation that they will work with students and faculty to explore these new areas together and work to scaffold the learning experience so that humanists beginning text mining see the possibilities and not just the minutiae. Some challenges that still persist include developing relationships across campus, continually building skills, and finding partners for collaboration.

ORCID

Sigrid Anderson Cordell http://orcid.org/0000-0003-3956-0606
Melissa Gomis http://orcid.org/0000-0002-5622-8560

References

Antonijevic, Smiljana. 2015. Amongst Digital Humanists: An Ethnographic Study of Digital Knowledge Production. New York: Palgrave Macmillan.

Early English Books Online (EEBO). n.d. “What Is Early English Books Online?” http://eebo.chadwyck.com/about/about.htm#top

“Early Modern Print: Text Mining Early Printed English.” n.d. http://earlyprint.wustl.edu

Edmond, Jennifer. 2015. “Collaboration and Infrastructure.” In A New Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 54–65. Chichester, UK: John Wiley & Sons.

Fraistat, Neil. 2012. “The Function of Digital Humanities Centers at the Present Time.” In Debates in the Digital Humanities, edited by Matthew Gold, 281–91. Minneapolis: University of Minnesota Press.
Gibson, Katie, Marcus Ladd, and Jenny Presnell. 2015. “Traversing the Gap: Subject Specialists Connecting Humanities Researchers and Digital Scholarship Centers.” In Digital Humanities in the Library: Challenges and Opportunities for Subject Specialists, edited by Arianne Hartsell-Gundy, Laura Braunstein, and Liorah Golomb, 3–17. Chicago: Association of College and Research Libraries.

Green, Harriett E. 2014. “Facilitating Communities of Practice in Digital Humanities: Librarian Collaborations for Research and Training in Text Encoding.” The Library Quarterly 84(2): 219–34.

Heuser, Ryan, Long Le-Khac, and Franco Moretti. 2011. “Learning to Read Data: Bringing out the Humanistic in the Digital Humanities.” Victorian Studies: An Interdisciplinary Journal of Social, Political, and Cultural Studies 54(1): 79–86.

Hunter, Elizabeth. 2016. “Must Humanists Learn to Code? Or: Should I Replace My Own Carburetor?” HASTAC (blog), December 7. https://www.hastac.org/blogs/shakespeare-games/2016/12/07/must-humanists-learn-code-or-should-i-replace-my-own-carburetor

Kirschenbaum, Matthew. 2009. “Hello Worlds: Why Humanities Students Should Learn to Program.” The Chronicle Review 55(20): B10.

Leonard, Peter. 2014. “Mining Large Datasets for the Humanities.” IFLA Library. http://library.ifla.org/930/1/119-leonard-en.pdf

Levelt, Sjoerd. n.d. “#EEBOLiberationDay.” https://storify.com/SjoerdLevelt/eeboliberationday

Lewis, Vivian, Lisa Spiro, Xuemao Wang, and Jon E. Cawthorne. 2015. Building Expertise to Support Digital Scholarship: A Global Perspective. Washington, DC: Council on Library and Information Resources.

Liu, Alan. 2009. “Digital Humanities and Academic Change.” English Language Notes 47(1): 17–35.

Maron, Nancy. 2015. “The Digital Humanities Are Alive and Well and Blooming: Now What?” Educause Review. http://er.educause.edu/~/media/files/articles/2015/8/erm1552.pdf

Maron, Nancy, and Sarah Pickle. 2014. “Sustaining the Digital Humanities: Host Institution Support beyond the Start-Up Phase.” Ithaka S+R. http://www.sr.ithaka.org/wp-content/mig/SR_Supporting_Digital_Humanities_20140618f.pdf

Oxford University. n.d. “Text Creation Partnership: EEBO, ECCO and Evans Texts.” http://ota.ox.ac.uk/tcp/

Reid, Alexander. 2012. “Graduate Education and the Ethics of the Digital Humanities.” In Debates in the Digital Humanities, edited by Matthew Gold, 350–67. Minneapolis: University of Minnesota Press.

Schaffner, J., and R. Erway. 2014. “Does Every Research Library Need a Digital Humanities Center?” OCLC Research Report. http://www.oclc.org/content/dam/research/publications/library/2014/oclcresearch-digital-humanities-center-2014.pdf

Software Carpentry. n.d. “Software Carpentry: Teaching Basic Lab Skills for Research Computing.” https://software-carpentry.org

Text Creation Partnership (TCP). 2014. “EEBO-TCP Phase I Public Release: What to Expect on January 1.” http://www.textcreationpartnership.org/2014/12/24/eebo-tcp-phase-i-public-release-what-to-expect-on-january-1/

Wiggins, Grant P., and Jay McTighe. 1998. Understanding by Design. Alexandria, VA: Association for Supervision and Curriculum Development.
Willcox, Pip. 2015. “Early English Books Hackfest.” Bodleian Libraries (blog), April 22. http://blogs.bodleian.ox.ac.uk/digital/2015/04/22/early-english-books-hackfest/