Modeling Digital Humanities Collections as Research Objects

Katrina Fenlon
kfenlon@umd.edu
College of Information Studies, University of Maryland, College Park
College Park, Maryland

ABSTRACT
Advancing digital libraries to increase the sustainability and usefulness of digital scholarship depends on identifying and developing data models capable of representing increasingly complex scholarly products. This paper considers the potential for an emergent model of scientific communication, the research objects data model, to accommodate the complexities of digital humanities collections. Digital humanities collections aggregate and enrich diverse sources of evidence and context, serving simultaneously as "publications" and dynamic, interactive platforms for research. The research objects model is an alternative to traditional formats of publication, facilitating aggregation and description of all of the inputs and outputs of a research process, ranging from datasets to papers to executable code. This model increasingly underpins research infrastructures in some scientific domains, yet its efficacy for representing humanities scholarship, and for undergirding humanities cyberinfrastructure, remains largely untested. This study offers a qualitative content analysis of digital humanities collections relying on a content/context analytical framework for characterizing collection components and their interrelationships. This study then maps those components and relationships into a research objects model to identify the model’s strengths and limitations for representing diverse digital humanities scholarship.

CCS CONCEPTS
• Information systems → Data structures.

KEYWORDS
data models, digital humanities, digital libraries, research objects

ACM Reference Format:
Katrina Fenlon. 2019. Modeling Digital Humanities Collections as Research Objects. In Proceedings of ACM Conference (JCDL ’19).
ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
JCDL ’19, June 2019, Urbana-Champaign, IL, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00

1 INTRODUCTION
Across disciplines, the growth and evolution of digital scholarship has overwhelmed traditional systems for the representation and communication of research. Digital scholarship in the humanities produces resources that range widely beyond our traditional concept of publication, resources that incorporate not only narratives and rich media, but also datasets and linked data, interactive and functional components, and objects and processes that are physically and logically dispersed as well as dynamic and evolving over time. Despite the rise of digital scholarship, most existing research infrastructures lack support for the creation, management, sharing, maintenance, and preservation of complex, networked digital objects.

This paper considers the potential for emergent models of scientific communication and publication to accommodate the complexities of digital humanities scholarship, and therefore to underpin shared research infrastructure in the humanities.
In particular, this study analyzes the suitability of the research objects model,1 one among several emergent models for representing and describing complex digital objects that interweave data, workflows, and supplementary and contextual information. Such models logically bundle the diverse inputs and outputs of research processes [3, 4]. Research objects comprise metadata frameworks with associated packaging standards. The model has gained uptake in some disciplines and witnessed concomitant growth in related tools, management systems, and supportive communities [2, 11, 26], which indicate its usefulness and contribute to its sustainability.

This study offers a starting point for answering the question: To what extent may existing (scientific) data models for representing research objects accommodate DH research products and processes? This paper focuses on a common form in DH scholarship: digital collections (often called digital archives and thematic research collections), which are scholar-built aggregations of digital sources of evidence about a topic [12, 15, 27]. This study provides selected results of a qualitative content analysis of DH collections, and offers a content/context analytical framework to characterize collection components and their interrelationships. This study then retrospectively maps those components and their relationships into the research objects model in order to identify the strengths and limitations of that model for representing DH scholarship.

1.1 Digital scholarship and sustainability
In the past few decades, research and scholarship have witnessed sweeping efforts to rethink existing formats for knowledge transfer and scholarly publication, and to develop technologies that support the publication and interlinking of data, software, workflows, and narratives, all as first-class research objects [8].
In the humanities, scholarship takes an increasing variety of forms, ranging from digital scholarly editions (e.g., the Walt Whitman Archive2) to curated collections of content (e.g., Colored Conventions3), from layered visualizations (e.g., the Torn Apart/Separados project4) to models and simulations (e.g., the MayaArch3D Project5). The outputs of DH research are increasingly media-rich, data-centric, interactive, dynamic, interlinked, and subject to indefinite evolution.

1 http://www.researchobject.org/
2 https://whitmanarchive.org/
3 http://coloredconventions.org/

As infrastructures for sustaining digital research struggle to keep pace with the advance of scholarly communication technologies, DH confronts sustainability challenges [19, 22, 23]. Digital libraries (including data repositories, aggregations of cultural records and artifacts, and certain publication platforms) are important components of research infrastructure in the humanities. While the capacity of digital libraries for representing complex digital objects and workflows continues to advance [6, 20, 29], there remains an urgent need for data models and standards to represent and describe increasingly complex scholarly products [13, 19].

Digital humanities (DH) collections, including those analyzed in this paper, often resemble cultural heritage digital libraries, broadly conceived. But DH collections are differentiated in several ways that make sustainability uniquely problematic. DH collections are often developed and maintained outside of the walls and purview of dedicated memory institutions. They tend to be centered in scholarly communities; scholars create them and maintain them for their own purposes, with fluctuating resources and support.
Because they function simultaneously as scholarly "publications" and as platforms and hubs for ongoing research and communication among scholarly communities, and because they tend to be funded on short cycles, they often rely on bespoke infrastructures and take unique forms to serve specific research purposes. These factors combine to make DH collections uniquely difficult to sustain over time, and suggest the urgent need for shared infrastructure that does not limit the diversity of digital scholarship.

1.2 Research objects in the humanities
The basic concept of the research object is simple. Conceptually, research objects are composed of two main parts: aggregated resources (listed in a manifest with minimal metadata, and packaged into the research object using one of several packaging formats), and annotations (used to express metadata about, provenance of, and relationships among aggregated and external resources). The standard model specifies how relationships are declared, relying on extant linked data standards, primarily on OAI-ORE,6 and W3C standards including the Annotation Data Model7 and Prov8. The research object may be packaged and serialized in different ways, but always contains a manifest of metadata about the research object and its contents represented in JSON-LD. There are other models closely related to the research objects model, including models for enhanced publications [2], executable papers, and scientific publication packages. Research objects have seen growing application in several domains, in various commercial and open-source implementations [5, 7, 10, 11, 18].
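As a rough illustration of the two-part structure described above (aggregated resources plus annotations, expressed in a JSON-LD manifest), the following Python sketch assembles a minimal manifest as a plain dictionary. It is a sketch under stated assumptions, not a reference implementation: the field names ("aggregates", "annotations", "about", "content"), the context URI, and every file path and URL are assumptions chosen for illustration, loosely following Research Object Bundle conventions, and none of them is taken from the collections studied here.

```python
import json


def build_manifest(aggregated, annotations):
    """Assemble a minimal research-object-style manifest as a plain dict.

    aggregated:  list of (uri, mediatype) pairs, local paths or external URIs
    annotations: list of (about, content) pairs, where `content` points to a
                 resource that describes the resource named by `about`
    """
    return {
        # Assumed JSON-LD context URI; check the packaging spec you target.
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [
            {"uri": uri, "mediatype": mediatype} for uri, mediatype in aggregated
        ],
        "annotations": [
            {"about": about, "content": content} for about, content in annotations
        ],
    }


# Hypothetical resources: one local file and one externally hosted image.
manifest = build_manifest(
    aggregated=[
        ("data/page-0001.xml", "application/xml"),
        ("https://example.org/images/page-0001.jpg", "image/jpeg"),
    ],
    annotations=[
        ("data/page-0001.xml", "annotations/page-0001-provenance.jsonld"),
    ],
)

manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Keeping aggregation (what is in the object) separate from annotation (what is said about it) mirrors the conceptual split in the model: an aggregated resource may be external to the package, while annotations carry the metadata, provenance, and relationships.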
In the humanities, research objects and closely related models have been applied to repository and data-sharing architectures [1, 6], digital preservation and archive serialization [21, 30], semantic publishing [24], and digital libraries for musicology [25]. These applications are compelling, and suggest the need for and timeliness of a systematic investigation of whether or to what extent the model could serve to represent a range of DH collections as whole, cohesive objects, and therefore have potential to underpin a widely adoptable, sustainable DH infrastructure with cross-disciplinary investment and impact.

4 http://xpmethod.plaintext.in/torn-apart/volume/2/index
5 http://www.mayaarch3d.org/language/en/sample-page/
6 https://www.openarchives.org/ore/
7 https://www.w3.org/TR/annotation-model/
8 https://www.w3.org/TR/prov-overview/

Data modeling is a pervasive scholarly practice in DH [16]. Like research objects, DH collections may be conceptualized and modeled as assemblages of resources with semantic interconnections, designed to support research objectives [4, 13]. This study considers to what extent that resemblance bears out in the application of the research objects data model to the complete representation of collections.

2 METHODS
The analysis presented in this paper builds upon an ongoing, multimodal study of digital collections [12, 13]. The study seeks to thoroughly characterize DH collections as a scholarly genre using three approaches: (1) a survey and typological analysis of DH collections (n=150 to date); (2) a qualitative content analysis of exemplary collections; and (3) interviews with researchers and practitioners who build digital collections, to identify challenges for libraries and other institutions in supporting and sustaining DH scholarship.
The typological analysis identified three primary types, useful for describing DH collections in terms of their purposes and the completeness toward which they are developed; those types are briefly described in Table 1. Complete results of the first phase of the study and a detailed account of the interrelated methods are given in [13].

2.1 Qualitative content analysis
The current paper extends the qualitative content analysis to address the question: What components of these collections must be modeled in order to logically represent DH collections as research objects? In other words, what are the main products of the collection (its discrete, publishable outcomes) and how are they related to one another and to other resources? The initial phase of content analysis identified close to forty distinct aspects of the content, design, and contexts of digital collections. Table 2 gives an overview of the whole content analysis protocol and each aspect of the sample collections that has been subject to analysis and characterization.

The two most immediately relevant aspects of this protocol to the analysis at hand are items and interrelatedness. These aspects concern (1) what are the items in the collection, and (2) how are they interrelated with one another, with contextual information, with external resources, etc.? A closer analysis of items and interrelatedness in each of our sample collections identified discrete components of collections along with the relationships, both technical and abstract, that obtain between components. This study uses the terms "item" and "component" loosely, not only to indicate a collection’s main conceptual units of gathering (such as books or artifacts), but also other parts of collections that substantially contribute to a collection’s intended contribution to the scholarly and cultural records.
The analysis focuses on discrete logical pieces that may be understood to have some kind of mereological, membership, or isGatheredInto relationship to the collection as a whole [31], and which contribute to its scholarly purpose according to the collection’s self-described objectives.

Table 1: Collection types
Type | Purpose
Definitive-source | Provide access to high-quality, authoritative, or otherwise definitive primary sources, (re-)assembling and shaping the affordances of the cultural record on the Web
Exemplar-context | Interrelate and (re-)contextualize diverse primary sources, building rich context and connection within and around exemplary sources
Evidential platform | Aggregate, deconstruct, and remodel sources for new uses, leveraging evidence into more flexible platforms for analysis and interpretation

Table 2: Content analysis protocol overview
Cluster | Categories of analysis
Context | Theme; Purposes; Impact; Creators; Audience; Documentation; Provenance; Related collections; Related projects and publications; Review; Funding; Developmental stage; Host; Rights; Sustainability and preservation plans; Method of collection
Content | Items; Interrelatedness; Diversity; Size; Narrativity; Quality; Language; Completeness; Density; Spatial coverage; Temporal coverage
Design | Data models; Navigation; Infrastructural components; Interface design; Interactivity; Interoperability; Openness; Identification and citation; Modes of access and acquisition; Accessibility; Flexibility
2.2 Content/context component framework
To refine the analysis of collections, this study developed and applied an analytical framework for characterizing components of collections more precisely. This characterization leverages a few different properties of components (including whether they are primary or secondary sources, and whether they are original to a collection) with the goal of identifying different ways in which components contribute to collections as wholes and, in turn, to the wider scholarly record. Figure 1 illustrates the "content/context" analytical framework used to focus the content analysis of collections in anticipation of applying the research objects model.

The framework is intended to refine analysis of how collections are constituted, and how their constitution determines the ways in which they contribute to scholarship. Using this framework, each component is first categorized as either content or context. "Content" includes components that are discrete, independent sources of evidence for scholarship. "Context" includes components that play a supportive, interpretive, representational, or functional role that is essential or utilitarian for the use and understanding of content. The reason for differentiating these categories conceptually, despite the difficulty of teasing them apart in practice, is to refine our understanding of collection contributions. The next question put to components identified as content is: Are they primary or secondary sources, or would it be more accurate to say they fall somewhere in between? For both content and context components, a third question is: Is the component original, or has it been previously published or published externally to the collection? The final question is, how are both context and content components interrelated? These questions are intended to challenge our intuitions about aspects of collections that are commonly understood to be peripheral to collections.
Figure 1: Content/context analytical framework

Content components in these collections include primary sources, secondary sources, and derived sources. Primary sources are well understood to be representations of original documents or first-hand evidence, while secondary sources offer substantial interpretation of primary sources. However, some resources seem to fall between these two categories, such as datasets extracted from primary sources. This study considers such sources to be derived. Derived sources are generated "directly" from primary sources through some interpretive intervention, where interpretation is manifested in the mode or method of derivation, such as an algorithm or encoding scheme designed to foreground or extract specific pieces of data from the sources. I posit that derived sources are more closely related to primary sources than other secondary sources because they are intended as alternative (usually computational) representations of primary sources.

Content components further divide into categories of original versus previously published/external. "Original" implies that a source is the first (digital) source of its kind, or has no available counterpart. "Previously published" implies that a source or comparable version has been published or digitized elsewhere, or is a reference component that exists externally to a collection.

Contextual components in these collections include elements that are essential or important to the interpretation, use, management, curation, and preservation of collections, but which do not constitute the main content. For example, contextual components include documentation and data models such as markup schemas or ontologies. Finally, many contextual components are functional, dynamic, and interactive features or affordances. Context components may also be original, previously published or external, or somewhere in between.
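The questions that the framework puts to each component can be read as a small classification scheme. The Python sketch below is one hypothetical way to record the answers; the class names, enum values, and the two example components are assumptions made for illustration and do not reproduce the coding actually applied in this study. In particular, treating each property as a discrete category is a simplification, since the framework allows components to fall between categories.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    CONTENT = "content"    # discrete, independent source of evidence
    CONTEXT = "context"    # supportive, interpretive, or functional


class SourceType(Enum):
    PRIMARY = "primary"
    DERIVED = "derived"    # generated directly from primary sources
    SECONDARY = "secondary"


class Originality(Enum):
    ORIGINAL = "original"
    PREVIOUSLY_PUBLISHED_OR_EXTERNAL = "previously published/external"


@dataclass(frozen=True)
class Component:
    name: str
    role: Role
    originality: Originality
    # The primary/derived/secondary question is only put to content components.
    source_type: Optional[SourceType] = None

    def __post_init__(self):
        if self.role is Role.CONTENT and self.source_type is None:
            raise ValueError("content components need a source type")
        if self.role is Role.CONTEXT and self.source_type is not None:
            raise ValueError("source type applies only to content components")


def grid_cell(component):
    """Return the (source type, originality) cell for a content component,
    or None for a context component, which the two-axis grid does not place."""
    if component.role is not Role.CONTENT:
        return None
    return (component.source_type.value, component.originality.value)


# Hypothetical examples, classified by assumption for illustration only.
dataset = Component(
    name="dataset extracted from transcriptions",
    role=Role.CONTENT,
    originality=Originality.ORIGINAL,
    source_type=SourceType.DERIVED,
)
schema = Component(
    name="custom markup schema",
    role=Role.CONTEXT,
    originality=Originality.ORIGINAL,
)

print(grid_cell(dataset))
print(grid_cell(schema))
```

Relationships among components, the framework's final question, are deliberately left out of this sketch; they would need their own representation, for example typed links between component names.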
2.3 Collections
The following three collections were selected for close qualitative content analysis: the Shelley-Godwin Archive,9 the Vault at Pfaff’s,10 and O Say Can You See: Early Washington D.C. Law & Family.11 These collections were selected to represent three distinctive types of collection, summarized in Table 1 [13], which were identified in prior typological analysis.

The Shelley-Godwin Archive (Shelley-Godwin) represents a definitive source collection, a digital library focused on the representation of definitive primary sources, such as scholarly editions and authoritative archival sources intended for close study by scholars in a domain. Shelley-Godwin provides digitized, transcribed manuscripts from the Shelley-Godwin family of 18th- and 19th-century writers, including Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft. The collection aims to be a definitive digital source for close study of the Shelley-Godwin manuscripts, including major literary works such as Frankenstein (M. W. Shelley) and Prometheus Unbound (P. B. Shelley). Manuscripts are supplemented with biographical, bibliographical, and other secondary sources.

The Vault at Pfaff’s (Vault) represents an exemplar-context collection, which aims to present exemplary (rather than definitive) sources on a subject, and to interrelate them with interpretive, contextual materials. Vault gathers primary and secondary sources about the historically significant bohemians of antebellum New York, U.S.A., particularly the social network revolving around the historical bar Pfaff’s, which became an epicenter for a literary movement. The site provides a searchable annotated bibliography of more than 8,000 texts, linking to full-text internal and external sources.
Critically, while some of the primary sources are hosted by Vault, many are instead references (with some linked to external sites), because the main content of this collection is the records of primary sources and the rich, interwoven contextual information with which records are augmented. The site also provides a map, timelines, biographies, and historical essays. Unlike Shelley-Godwin, Vault does not aim to provide an original or definitive set of primary sources for close study, but rather a massive set of interrelated sources, social entities, and contextual information to support the discovery of new connections.

O Say Can You See: Early Washington, D.C., Law and Family (O Say) represents an evidential platform, a digital library focused on gathering sources to provide evidence for a specific interpretive or analytical goal [13]. O Say gathered, digitized, and analyzed freedom suits filed in Washington, D.C., and surrounding areas between 1800 and 1862, in order to explore family, legal, and social networks. Like Shelley-Godwin, O Say provides carefully transcribed and encoded primary sources, but with a central goal of deconstructing and remodeling those sources for use as data (e.g., for computational social-network analysis).

9 http://shelleygodwinarchive.org/
10 https://pfaffs.web.lehigh.edu/
11 http://earlywashingtondc.org/

3 COMPONENTS OF COLLECTIONS
In this section I consider what components of our sample collections must be modeled in order to logically represent them as research objects, to lay the groundwork for attempting a retrospective mapping to the research objects data model. For each collection, content analysis and the application of the content/context framework serve to identify the main products of the collection (its discrete, publishable outcomes) and how they are related to one another and to other resources.
The remainder of this section characterizes the items and interrelatedness of the current instantiation, identified through content analysis of each collection.

3.1 Shelley-Godwin Archive components
Shelley-Godwin aims to provide a definitive collection of manuscripts, digitized as high-quality page images with corresponding TEI-encoded transcriptions. These manuscripts are augmented by innovative modes of access and participation for users, including features for multimodal and comparative reading, and features for facilitating future participation in the archive through user annotation and curation of manuscripts. What are the original contributions and important contextual components, and how are they related? Content analysis of the collection identified the following components:

• Manuscripts: Manuscripts are abstract objects, with multiple possible orderings, of sequential transcriptions and corresponding page images, currently instantiated through TEI-XML files that reference and order the separate TEI-XML files representing transcribed pages (see below).
  – Page images: Digitized manuscript page images. The image files are hosted remotely and appear on the site through a call to the Bodleian digital library’s IIIF API; but images were digitized under the auspices of the Shelley-Godwin Archive project and thus constitute a contribution of the project.
  – Encoded transcriptions: Transcriptions of page images, encoded in a TEI-XML schema for representation of primary sources. Multiple representations of the page images and transcribed text stem from Shared Canvas manifests that are generated based on these TEI files; these transcriptions are the foundation of this project’s contribution.
• Narrative components:
  – Original texts: The project offers manuscript descriptions, currently instantiated as HTML files.
  – Excerpted texts: The project includes excerpts of previously published texts, including manuscript descriptions and a chronology, currently instantiated as HTML files.
• Browse and search functionalities: Browse and search of Shelley-Godwin operate across manuscripts as wholes, and across components of manuscripts. These functionalities are customized to offer multiple reading orders, taking advantage of the highly rich encodings.
• Reading viewer: The custom implementation of the reading viewer takes advantage of Shared Canvas/IIIF representations of the manuscript images in addition to the encodings, to allow readers to compare the original handwritten text with its transcriptions, and to limit views by authorial hands.
• Schemata and utilities: Shelley-Godwin relies on multiple custom data models and utilities for constituting the manuscripts from numerous components.

Table 3: Collection objectives
Shelley-Godwin | Vault | O Say
Provide access to a complete set of encoded manuscripts | Aggregate access to distributed, related sources | Digitize, transcribe, encode archival documents to extract data for analysis
Facilitate multi-modal, comparative reading and user participation | Illuminate a network of works (sources), people, places | Reconstruct and expose hidden relationships and personal histories

The components of Vault and O Say are described in less detail, below, to facilitate comparison with Shelley-Godwin.
3.2 Vault at Pfaff’s components
Vault, which aims to help users discover connections among a large set of related sources and people, decomposes into the following main components: annotated bibliographic metadata records, which include annotated internal hyperlinks to related people entities (whether authors or mentions) and internal/external hyperlinks to electronic sources when available; annotated biographical records (people entities); a dedicated relationships browser, along with other browsing and searching facilities; original narrative components including historical essays and full biographies; an extended timeline and interactive map; and transcriptions and page images of the subset of primary sources hosted by Vault (most primary sources in this aggregation are externally linked).

3.3 O Say Can You See components
O Say provides encoded primary sources and extracted data. Its main contributions may be decomposed into the following components: page images of archival documents; encoded transcriptions of archival documents in TEI-XML; extracted and augmented person data (represented as RDF data documenting relationship and personal information, derived from a central CSV file, all extracted from case documents); family guides (family trees that interrelate "people" entities, derived from the same central data source); cases (abstract entities, a mechanism for aggregating extracted data and documents, such as person entities and case document references); annotated cases (which are the same as cases, but including long annotations with hyperlinks); a relationships ontology (OWL) and other customized data models; a special browse and search functionality, including relationship browse and search with multiple serialization options and a simple relationship API; stories (original long-form narratives heavily linked both to internal entities/resources and external resources); and a bibliography with links to related projects, and primary and
secondary sources.

3.4 Content and context components
Applying the content/context framework to the components identified through content analysis exposes a few important characteristics of DH collections, which any data model intended to represent and describe collections must take into account. As an example of how this analytical framework applies to collections, Figure 2 shows selected content and context components of all three collections mapped to a two-dimensional grid, to demonstrate how components fall along two spectra of (1) Primary/Derived/Secondary sources and (2) Previously published (or external) versus Original sources. The grid differentiates six boxes or categories for the sake of making the framework more legible, but in reality the category boundaries are fuzzy and each axis should be understood as a spectrum. Components of the three collections fall into almost every category. (The only category into which no components fall, in this analysis, is the category of components that are both derived from primary sources and previously/externally published; but it is easy to imagine components that would fall into such a category, such as datasets hosted in an external repository.)

Figure 2: "Content" components mapped to framework

Mapping components identified above to this framework, as in Figure 2, exposes the following essential and interesting characteristics of DH collections:

Components contribute to scholarship in diverse ways. The mapping illustrates the great variety among the components of even just a few collections: variety not only in type and form, but also in less predictable dimensions, including their originality and how they participate in the scholarly record, whether as primary, secondary, or derived sources.
The contributions of a collection are often framed in terms of concrete, novel additions to the scholarly and cultural records, but such additions are more various, and sometimes more abstract, than usually imagined. The multidimensional diversity of the components that constitute our collections may complicate our judgments about which pieces are priorities for sustainability and preservation.

Not all essential content is original or internal to the collection. For example, many of the primary sources that make Vault a valuable resource for discovery were previously published and constitute external references. In a different case, the manuscript page images that constitute a major part of Shelley-Godwin’s contribution to scholarship are original but externally referenced, which will pose fundamentally different challenges to the sustainability and preservation of the collection as a whole than if they were co-located with the rest of the collection components.

Content is not the only essential contribution of a digital collection. The contribution may be partly or even centrally manifested in the interrelationships among components, or in the context surrounding the content. These relationships and context have been called the "connective tissue" of a collection [13]. For example, the customized schemas and utilities used to constitute the archive and its contents may represent a technical contribution to DH as a field of practice. The custom relationships browsers of Vault and O Say serve to enact scholarly interpretations; the ability to search and browse fine-grained relationships within and among components in bespoke ways is essential to the purposes of those collections.
Flanders (2014) invites us to "consider what happens to our understanding of a ’collection’ when its constituent items are no longer the primary unit of meaning" [15]; at the least, this idea suggests that standard repository models for representing "items + metadata" as constituting a collection are insufficient to represent and describe DH collections. The next section breaks some of the connective tissue down to have a closer look, prior to the application of the research objects model.

3.5 Relationships among components
Components of collections are interrelated both conceptually and technically, and these relationships are essential to representing and describing collections as complex and cohesive wholes. In the case of Shelley-Godwin, relationships are implemented in various ways. The collection leverages identifiers, schemata, utilities (scripts or processes), and data files to construct the archive’s representation of each manuscript.

Figure 3 offers a reductive illustration of the components of Shelley-Godwin and the relationships among them. In Figure 3, items included in the collection are enclosed in (blue) squares. Note that page images appear in a separate square; while they are logically part of Shelley-Godwin, they are maintained and hosted by a different institution in a separate digital library (Digital Bodleian12) and called via API. In Figure 3, arrows represent relationships. Solid arrows represent referential relationships that are formalized and actionable (if not semantically encoded), such as relationships performed by hyperlinked URIs. These include the following (broadly described):

(a) Custom data models refer to (and extend) standard, external data models, for purposes of validation and documentation.
For example, the Shelley-Godwin TEI-ODD file references 12https://digital.bodleian.ox.ac.uk/ Figure 3: Conceptual and technical relationships among components the TEI standard, in addition to the standard that defines the ODD. (b) Scripts and utilities refer to all components in order con- struct or enact the functional website. For example, the site relies on the Unbind utility, a Python utility to create Shared Canvas manifests (which underlie the interactive reading viewer) from Shelley-Godwin TEI files. Dashed (yellow) arrows represent conceptual or abstract relation- ships, which are implemented indirectly through various means. These are conceptual relationships, made visible to users by the design of the site, but technically performed by completely separate components of the collection. These include the following: (c) Relationships between page images and corresponding en- coded transcriptions. For users this relationship is experi- enced via the juxtaposition of both in the reading viewer. Behind the scenes, this juxtaposition is created by the utili- ties described above. (d) Relationships between each manuscript and its components. Each manuscript is an abstract entity with a proxy in the form of XML documents, one for each volume, which list the URIs for the individual pieces, or pages, that constitute the volume and manuscript. The identifiers for pieces of https://digital.bodleian.ox.ac.uk/ Modeling Digital Humanities Collections as Research Objects JCDL ’19, June 2019, Urbana-Champaign, IL, USA the manuscript serve to identify both page images and cor- responding XML files, because scripts and utilities expand the identifiers into URIs. The dashed circle in Figure 3 en- compasses the abstract object of the manuscript, an abstract entity that is evident and interactive to users through brows- ing mechanisms and the comparative reading viewer, but which is constructed behind the scenes through a complex, distributed process. 
(e) Relationships between narrative components and manuscripts. References to manuscripts within the narrative components of the site are implemented as hyperlinks between textual references and the landing pages for corresponding manuscripts.

Through this analysis of the final aspect of our "content/context" framework, the relationships among components, we arrive at another crucial observation about DH collections: not all relationships among components are equal. Some are implemented directly using mechanisms such as URI addresses, which would readily translate to alternate representations, such as semantic relationships in a linked data or research objects model. Other relationships are implemented indirectly via processes that may prove more difficult to translate or migrate. Dwelling on relationships within Vault and O Say is out of scope for this paper, but those collections, even more than Shelley-Godwin, realize their purposes and contributions through their connective tissue, and demand a deeper analysis in future work.

4 RESEARCH OBJECTS AND COLLECTIONS

So far this analysis has broken collections down into sets of logical components and relationships, with the goal of applying the research objects model to describing and representing them. By way of reminder, research objects comprise two main kinds of things: aggregated items and annotations. In this model, a research object may be serialized as a bundle, which is a zip archive of a file structure and all constituent data files, along with a JSON-LD manifest of metadata about the aggregation contents.

How well can this model capture the logic and meaning of digital collections? This section suggests a basic mapping of components and relationships of one collection, Shelley-Godwin, to the research objects model, in order to begin to identify challenges and implications of this model for representing DH scholarship.
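The bundle serialization introduced at the start of this section (a zip archive of data files plus a JSON-LD manifest) can be illustrated with a minimal Python sketch. This is not code from the Shelley-Godwin project or from any research objects tool. The manifest location (.ro/manifest.json), the mimetype string, and the context URL are assumptions based on the RO Bundle specification as commonly described and should be checked against it; the file names and contents are hypothetical placeholders.

```python
import io
import json
import zipfile

# Assumed conventions (to be verified against the RO Bundle specification).
MANIFEST_PATH = ".ro/manifest.json"
BUNDLE_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(files, aggregates):
    """Return the bytes of a zip archive holding data files plus a JSON-LD manifest.

    files: mapping of archive path -> file content (str or bytes)
    aggregates: list of manifest entries describing the aggregated items
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": aggregates,
        "annotations": [],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The media type marker is written first and left uncompressed.
        bundle.writestr("mimetype", BUNDLE_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path, content in files.items():
            bundle.writestr(path, content)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
    return buffer.getvalue()


def read_manifest(bundle_bytes):
    """Parse the manifest back out of a serialized bundle."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        return json.loads(bundle.read(MANIFEST_PATH))


# Hypothetical file names and placeholder content, for illustration only.
example_files = {"tei/volume_i.xml": "<TEI/>", "about/introduction.html": "<html/>"}
example_aggregates = [
    {"uri": "/tei/volume_i.xml", "mediatype": "application/xml"},
    {"uri": "/about/introduction.html", "mediatype": "text/html"},
]
example_bundle = build_bundle(example_files, example_aggregates)
```

Keeping the manifest as an ordinary file inside the archive means that a consuming application can inspect what a bundle aggregates without unpacking or interpreting the data files themselves.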
The following examples assume the goal of migrating Shelley-Godwin (the complete collection, as data) into a research object bundle. The collection could then be migrated into a research objects management system, so that other digital humanists could access and use the data alongside (presumably) many other collections, or so that third-party applications could draw on the data to support custom interactions. The details of access and use are not imagined here, but some potential implications for varieties of access and use are considered in section 5.

First, adopting the model means capturing components that fall into the content category of the content/context framework articulated above. For Shelley-Godwin, these components are (at least): (1) page images, (2) encoded transcriptions corresponding to page images, and (3) narrative components that serve to describe manuscripts. Manuscripts, in turn, are abstract entities that are manifested by relationships among page images and encoded transcriptions. In a research object, each component would be referenced in the manifest as an aggregated item. The example record in Figure 4 shows a portion of a research object manifest, which lists aggregated items including (1) an XML file (ending in "volume_i.xml") that represents Volume 1 of Mary Shelley’s Frankenstein manuscript and references the individual pages in order; (2) a single digital page image (in JPEG2000 format); (3) an XML file (ending in "c56-0001.xml") representing a single page of the Frankenstein manuscript; (4) an HTML file representing a narrative introduction to the manuscript; and (5) the TEI-ODD schema that governs the Shelley-Godwin implementation of TEI-XML.

Figure 4: Snippet of partial manifest for Shelley-Godwin research object aggregation

Note that the aggregates field already captures several important relationships among the components of Shelley-Godwin, even prior to the addition of explicit relationship annotations.
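As a rough illustration of the kind of aggregation listed above, the following Python sketch assembles five aggregated items and serializes them as a manifest fragment. It is not a reproduction of Figure 4. The key names (uri, mediatype, conformsTo) are assumed to follow the RO Bundle manifest vocabulary, and every path and URL is a hypothetical placeholder rather than an actual Shelley-Godwin or Digital Bodleian location.

```python
import json

# All paths and URLs below are hypothetical placeholders.
LOCAL_SCHEMA = "/schema/shelley-godwin.odd"
EXTERNAL_TEI = "https://example.org/standards/tei"
EXTERNAL_IMAGE = "https://example.org/digital-library/frankenstein/c56-0001.jp2"


def aggregate(uri, mediatype, conforms_to=None):
    """Build one aggregated-item entry; conformsTo is added only when given."""
    entry = {"uri": uri, "mediatype": mediatype}
    if conforms_to is not None:
        entry["conformsTo"] = conforms_to
    return entry


def is_external(entry):
    """True when the item is referenced remotely rather than stored in the bundle."""
    return entry["uri"].startswith(("http://", "https://"))


aggregates = [
    # (1) Volume-level XML file, standing as a proxy for the manuscript.
    aggregate("/tei/volume_i.xml", "application/xml", LOCAL_SCHEMA),
    # (2) Page image hosted outside the collection.
    aggregate(EXTERNAL_IMAGE, "image/jp2"),
    # (3) Encoded transcription of a single page.
    aggregate("/tei/c56-0001.xml", "application/xml", LOCAL_SCHEMA),
    # (4) Narrative introduction to the manuscript.
    aggregate("/about/frankenstein-introduction.html", "text/html"),
    # (5) Custom schema, which itself conforms to an external standard.
    aggregate(LOCAL_SCHEMA, "application/xml", EXTERNAL_TEI),
]

external_uris = [entry["uri"] for entry in aggregates if is_external(entry)]
manifest_fragment = json.dumps({"aggregates": aggregates}, indent=2)
```

In this sketch the distinction between local and remote items falls out of the URI form alone, which is one way an independent application could discover which parts of a collection depend on another institution's infrastructure.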
First, the research object manifest represents and makes explicit the relationships between "tangible" or self-contained components (such as files or documents) and abstract components of the collection. In this example, the volume-level XML file stands as a proxy for a manuscript, which, as discussed above, is an abstract object in Shelley-Godwin’s architecture. It would also be possible to represent the manuscript as an abstract entity more explicitly in this model, perhaps relying on the OAI-ORE proxy mechanism.

In addition, URIs for aggregated objects may reference both local files contained within a research object and remote resources. In Figure 4, relationships to external resources are highlighted. The conformsTo field allows a research object creator to indicate schemas or standards to which a given aggregated resource conforms; in this case conformsTo references schemas both internal and external to the collection. Relationships between the encoded transcriptions and relevant schemas and standards, embedded in the TEI-XML file headers, can also be described in the research object manifest, where they are exposed for consumption by independent applications. Figure 4 gives an example of how a Shelley-Godwin research object might reference page images hosted externally to the collection, in Digital Bodleian. Digital Bodleian is in fact the source of page images displayed in the Shelley-Godwin website. But in the current Shelley-Godwin site, this referential relationship is only made explicit within the code used to generate pages. The research objects model makes this relationship explicit, semantic, and discoverable in the outward-facing manifest.

Annotations, constituting the second major piece of a research object manifest, are used to express descriptive metadata about aggregated resources, including relationships among resources (internal or external) and detailed provenance information.
Annotations rely on domain-specific ontologies and vocabularies. Figure 5 exemplifies annotations that make explicit several relationships among aggregated components of Shelley-Godwin, including relationships (c), (d), and (e), identified in section 3.5, above:

(c) Relationships between page images and corresponding encoded transcriptions: In this example research object, these relationships are made explicit in annotations that link each XML file representing a single transcribed page to its corresponding page image, via prov:wasDerivedFrom. There are, of course, other ways to express this relationship.
(d) Relationships between the various components that constitute a manuscript: In this example research object, the relationships are made explicit in annotations that link each XML file representing a single transcribed page to its corresponding TEI-XML file representing a single volume, via dct:hasPart. There are other ways this relationship could be represented.
(e) Relationships between narrative components and manuscripts: Hyperlinks forge relationships between textual references and manuscripts; therefore these relationships are best modeled not at the document level but at a lower level within the text. These relationships could simply remain as embedded hyperlinks, relying on unique identifiers for manuscripts (assuming the URLs continue to function in the new context of a research object). Alternatively, the fact that a narrative component refers to a manuscript could be made explicit in the manifest, via an annotation such as crm:refersTo (from CIDOC-CRM). But it is not immediately clear how a document-level annotation indicating references would be useful.
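Relationships (c) and (d) can be sketched as RDF statements plus manifest annotation entries that point to them. The Python sketch below is an illustration under stated assumptions and is not the content of Figure 5. The annotation keys (about, content) are assumed to follow the RO Bundle manifest vocabulary, the base URI and file paths are hypothetical, and the statements are written as plain N-Triples strings rather than with an RDF library. Note that dcterms:hasPart points from the whole to the part, so the statement runs from the volume file to the page file.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
DCT_HAS_PART = "http://purl.org/dc/terms/hasPart"

# Hypothetical base URI used to make bundle-relative paths absolute.
BASE = "https://example.org/shelley-godwin-ro"


def absolute(uri):
    """Leave remote URIs alone; resolve bundle-relative paths against BASE."""
    return uri if "://" in uri else BASE + uri


def relationship_triples(volume, pages):
    """Return N-Triples lines for relationships (c) and (d).

    volume: path of the volume-level XML file
    pages: mapping of page transcription path -> page image URI
    """
    lines = []
    for page_xml, image in pages.items():
        # (c) The transcription is derived from its page image.
        lines.append(f"<{absolute(page_xml)}> <{PROV_DERIVED}> <{absolute(image)}> .")
        # (d) The volume has the page transcription as a part.
        lines.append(f"<{absolute(volume)}> <{DCT_HAS_PART}> <{absolute(page_xml)}> .")
    return lines


def annotation_entries(volume, pages, content="annotations/relationships.nt"):
    """Return manifest entries stating which resources the RDF file describes."""
    return [{"about": uri, "content": content} for uri in [volume, *pages]]


# Hypothetical example data.
example_pages = {"/tei/c56-0001.xml": "https://example.org/img/c56-0001.jp2"}
example_triples = relationship_triples("/tei/volume_i.xml", example_pages)
example_annotations = annotation_entries("/tei/volume_i.xml", example_pages)
```

Keeping the statements in a separate RDF file and referencing that file from the manifest is only one possible arrangement; as noted above, there are other ways to express the same relationships.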
Figure 6 offers an alternative view of these relationships, expressed as an RDF snippet derived from an ROHub (http://www.rohub.org/) research object and visualized (using http://ontodia.org/).

The research objects model supports the use of domain ontologies (such as CIDOC-CRM, http://www.cidoc-crm.org/cidoc-crm/, and bibliographic ontologies) for rich descriptions of the interrelationships among collection components and external sources. There are numerous alternative ontological approaches to modeling the relationships given in the examples above. Current research object management systems (such as ROHub) offer a limited set of terms for adding annotations to objects, mainly oriented toward description of computational and scientific research workflows. For example, ROHub’s "RO Basic Requirements" require research objects to include hypotheses or research questions, along with conclusions. For expressing relationships among the aggregated research object resources, ROHub relies on terms from the Prov and Wf4Ever Research Object ontologies, which are both focused on scientific workflows. Such ontologies will prove inadequate to fully describe the processes or workflows of digital scholarship in the humanities.

Figure 5: Snippet of partial manifest for Shelley-Godwin research object annotations

This example application of the research objects model has not accounted for the components of collections that are interactive, dynamic, and functional, such as Shelley-Godwin’s custom search and browse options, and its comparative reading viewer. These are essential aspects of the project’s contributions to scholarship. Not only do they represent technological contributions to the DH landscape, but they were built for symbiosis with Shelley-Godwin data, which was modeled to support the use of these advanced tools.
As flat code, of course, these pieces readily fit into the research objects model, which has been shown to be useful for aggregating data and code for migration and preservation purposes. But as performative, interactive components that function to enable new kinds of exploration and encounter with collection contents, these components challenge the research objects model. While the model has been applied to software preservation [5], and while workflow-oriented research objects usefully represent certain kinds of dynamic and executable content, the functional and interactive components of DH collections are really about enabling specific, purposeful kinds of real-time, end-user interaction. The duties of the functional, contextual components of collections (to enable exploration, discovery, connection-making, learning, etc.) would be assumed not by a data model but by the interactive components of a research objects management system or other applications built on top of a research objects management system. The potential for such systems and applications to enact the diverse methodological and functional goals of DH scholarship is a topic for future investigation.

Figure 6: Partial visualization of RO

5 DISCUSSION AND FUTURE WORK

This study has analyzed three DH collections using qualitative content analysis, employing a novel content/context analytical framework in order to characterize collection components and their interrelationships. Applying the framework highlighted a few important characteristics of DH collections that complicate our understanding of how collections are constituted, and which therefore carry implications for the data models that represent collections along with approaches to sustaining and preserving them.
These characteristics are: (1) Components of collections contribute to scholarship in diverse ways. (2) Not all of the essential content of a collection is necessarily original or internal to the collection. (3) Contextual components and interrelationships among components may be as essential as the main content of a collection.

Research objects have the potential to represent and describe a wide range of scholarly products, more fully and more sustainably than models that currently dominate content management and publication systems. In this paper, the components and interrelationships of the sample DH collections were retrospectively mapped into a research objects model in order to identify strengths and limitations of that model for representing DH scholarship. The following three central strengths emerged.

(1) Research objects readily perform the most essential function of a collection: to aggregate related resources in order to support scholarly objectives. (For this reason, research objects have been leveraged to support digital preservation and big-data transfer [9].)

(2) Research objects have the capacity to accommodate rich semantic descriptions of interrelationships among components, using domain ontologies. These interrelationships may obtain between components with identifiable and addressable representations, such as documents or files, and components that are more abstract. In DH collections, such interrelationships are often inexplicit or "hidden", enacted by or encoded in the layers of scripts and processes that operate to assemble collections for presentation on the Web. When these relationships are hidden, they may be more vulnerable to dissolution in the course of data manipulation, preservation, and migration processes. Formalizing these relationships not only makes them more sustainable; it also opens them to linked data representation and computational uses.
(3) The research objects model accommodates aggregations of linked data, offering researchers the opportunity to create and annotate virtual, fully referential collections in any context and at scale. In addition, structured descriptions of aggregations in research objects are amenable to third-party annotation, and can be leveraged by external applications.

These advantages of the research objects model for representing DH collections suggest new possibilities for collaboration, communication, and data reuse within scholarly communities.

The most immediate limitation of the model for DH collections is that functional components designed for end-user interaction are not usefully captured in a basic research objects model. Instead, these components raise questions about the capacities of research objects management systems to serve the distributed development of a diversity of applications. How can management systems serve to underpin experimental, interactive, and dynamic platforms? Different kinds of DH scholarship aim to facilitate different kinds of interactions between users, evidence, and context; the diversity of DH scholarship and the compulsion toward experimentation and innovation have hindered large, sustainable, cross-disciplinary infrastructures. Realizing the advantages of research objects and related efforts for DH will depend on implementations that establish dynamic platforms for experimentation, participation, and co-creation.

This study has treated collections in terms of their logical components and relationships, setting aside for now several other important characteristics and properties, such as collections’ look and feel, their digital materiality, and the detailed contours of their implementations.
These aspects are essential to the experience and preservation of some collections; it is hard to see how the research objects model could benefit such projects after their development, in retrospective sustainability or preservation efforts, but it is clear that the model could underpin systems going forward that support a wide variety of implementations.

DH research objects would necessarily represent extensions of the basic research objects model, based on the representational and user requirements in different domains and scholarly communities. The work of ontologizing the humanities is well underway. A research objects profile (http://www.researchobject.org/scopes/) specific to representing collections such as Shelley-Godwin, Vault, and O Say will depend on cobbling together ontologies and vocabularies to express a diversity of relationships among primary, derived, and secondary sources, in addition to workflows, people, and contextual entities. Prior research has emphasized the necessity of highly granular systems of identification, addressability, and reference for supporting DH research and collection practices within digital libraries [14]. Indeed, implementing the research objects model at scale within a linked data paradigm would demand more pervasive use of persistent identifiers for DH objects at varying levels of granularity, including ideally addressable identifiers for each component of a collection, the pieces that make up a component, and so on.

In terms of architecture, DH collections bear significant resemblance to other kinds of digital libraries. The benefits, constraints, and practical challenges of applying the research objects model for DH collections seem, for the most part, likely to hold for cultural heritage digital libraries generally. Emerging linked data collections of cultural heritage institutions stand to support the rise of research objects and similar publication models across disciplines. Future work will investigate the potential intersections between research objects and linked data representations of cultural collections in libraries, archives, and museums.

There are numerous emergent models for representing digital publications and digital objects, including models for publishing media-rich and interactive digital monographs along with supplementary materials, and experiments with alternative scientific publishing models such as nanopublications [17]. Future work will investigate the intersections between the research objects model and various alternatives for representing the breadth of DH scholarship, collections, and data, including forerunning applications of research objects to humanities collections [1, 6, 21, 24, 25, 30], and ongoing studies of other approaches to containerization in DH (see http://digits.pub/).

The research objects data model evaluated in this paper is "data-centric"; workflow-oriented research objects, as a closely related alternative, extend the basic model to capture holistic, executable research workflows. While workflows have received growing attention in the humanities from both technical and strategic perspectives [20, 28], the implications of workflow-oriented data models for capturing the idiosyncrasies of humanities research processes need further investigation. Future work will extend this analysis to a more complete study of DH scholarship, scholars, and workflows, in order to advance data models that may help us realize the benefits of standard infrastructure while minimally attenuating the irrepressible diversity of digital humanities scholarship.

REFERENCES

[1] Bridget Almas. 2017. Perseids: Experimenting with Infrastructure for Creating and Sharing Research Data in the Digital Humanities. Data Science Journal 16, 0 (2017).
[2] Alessia Bardi and Paolo Manghi. 2014.
Enhanced Publications: Data Models and Information Systems. LIBER Quarterly 23, 4 (2014).
[3] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, and Carole Goble. 2013. Why linked data is not enough for scientists. Future Generation Computer Systems 29, 2 (2013).
[4] Sean Bechhofer, David De Roure, Matthew Gamble, Carole Goble, and Iain Buchan. 2010. Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nature Proceedings (2010).
[5] Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, and Carole Goble. 2015. Using a suite of ontologies for preserving workflow-centric research objects. Journal of Web Semantics 32 (2015), 16–42.
[6] Tobias Blanke and Mark Hedges. 2013. Scholarly primitives: Building institutional infrastructure for humanities e-Science. Future Generation Computer Systems 29, 2 (2013).
[7] Joshua Borycz and Bonnie Carroll. 2017. Managing Digital Research Objects in an Expanding Science Ecosystem: 2017 Conference Summary. Data Science Journal (2017).
[8] Phil E. Bourne, Tim Clark, Robert Dale, Anita de Waard, Ivan Herman, Eduard Hovy, and David Shotton. 2011. FORCE11 Manifesto: Improving Future Research Communication and e-Scholarship. Technical Report. FORCE11.
[9] Kyle Chard, Mike D’Arcy, Ben Heavner, Ian Foster, Carl Kesselman, Ravi Madduri, Alexis Rodriguez, Stian Soiland-Reyes, Carole Goble, Kristi Clark, Eric W. Deutsch, Ivo Dinov, Nathan Price, and Arthur Toga. 2016. I’ll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets. 2016 IEEE Conference on Big Data (2016).
[10] Tracey Clarke and Andy Bussey. 2018. Research Information Systems – fit for the future?
A report on the situation and plans of the University of Sheffield Library. o-bib. Das offene Bibliotheksjournal / Herausgeber VDB (2018).
[11] Anita de Waard. 2018. FAIR4CURES: A research object authoring tool for the data commons. CNI Fall Membership Meeting (2018).
[12] Katrina Fenlon. 2017. Thematic research collections: Libraries and the evolution of alternative digital publishing in the humanities. Library Trends 65, 4 (2017), 523–539.
[13] Katrina Fenlon. 2017. Thematic research collections: Libraries and the evolution of alternative scholarly publishing in the humanities. Ph.D. Dissertation. University of Illinois at Urbana-Champaign, http://hdl.handle.net/2142/99380.
[14] Katrina Fenlon, Megan Senseney, Harriett Green, Sayan Bhattacharyya, Craig Willis, and J. Stephen Downie. 2014. Scholar-built collections: A study of user requirements for research in large-scale digital libraries. In Proceedings of the American Society for Information Science and Technology.
[15] Julia Flanders. 2014. Advancing Digital Humanities. Palgrave Macmillan UK, Chapter Rethinking Collections.
[16] Julia Flanders and Fotis Jannidis. 2012. Knowledge Organization and Data Modeling in the Humanities. Technical Report. Brown University.
[17] Richard Freedman, Raffaele Viglianti, and Adam Crandell. 2017. The collaborative musical text. Music Reference Services Quarterly (2017).
[18] Andres Garcia-Silva, José Manuel Gómez-Pérez, Raul Palma, and Ikay ...Altintas. 2018. Enabling FAIR Research in Earth Science through Research Objects. arXiv:1809.10617 [cs] (2018).
[19] David Hansen, Liz Milewicz, Paolo Mangiafico, Will Shaw, Mattia Begali, and Veronica McGurrin. 2018. A framework for library support of expansive digital publishing. Technical Report. Duke University.
[20] Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster, and Malcolm Illingworth. 2013.
TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship. Journal of the Text Encoding Initiative 5 (2013).
[21] Inna Kouper, Beth Plale, Dharma Akmon, and Margaret Hedstrom. 2014. Practical and conceptual considerations of research object preservation. Digital Preservation 2014 (2014).
[22] Christine Madsen and Megan Hurst. 2018. Are digital humanities projects sustainable? A proposed service model for a DH infrastructure. In CNI Membership Meeting. https://www.slideshare.net/mccarthymadsen/are-digital-humanities-projects-sustainable-a-proposed-service-model-for-a-dh-infrastructure.
[23] Nancy L. Maron and Sarah Pickle. 2014. Sustaining the digital humanities: Host institutional support beyond the startup phase. Technical Report. ITHAKA S+R, https://sr.ithaka.org/publications/sustaining-the-digital-humanities/.
[24] Dominic Oldman and Diana Tanase. 2018. Reshaping the Knowledge Graph by Connecting Researchers, Data and Practices in ResearchSpace. The Semantic Web - IWSC (2018).
[25] Kevin Page, David Lewis, and David Weigl. 2017. Contextual interpretation of music notation. Digital Humanities (2017).
[26] Raúl Palma, Piotr Hołubowicz, Oscar Corcho, José Manuel Gómez-Pérez, and Cezary Mazurek. 2014. ROHub - A Digital Library of Research Objects Supporting Scientists Towards Reproducible Science. Springer International Publishing, 77–82.
[27] Carole L. Palmer. 2004. A Companion to Digital Humanities. Blackwell Publishing, Chapter Thematic research collections.
[28] Roger C. Schonfeld and Donald J. Waters. 2018. The Turn to Research Workflow and the Strategic Implications for the Academy. CNI Spring Membership Meeting (2018).
[29] Sarah J. Sweeney, Julia Flanders, and Abbie Levesque. 2017. Community-Enhanced Repository for Engaged Scholarship: A case study on supporting digital humanities research. College and Undergraduate Libraries 24, 2-4 (2017), 322–336.
[30] David Tarrant, Ben O’Steen, Tim Brody, Steve Hitchcock, Neil Jefferies, and Leslie Carr. 2009. Using OAI-ORE to Transform Digital Repositories into Interoperable Storage and Services Applications. The Code4Lib Journal (2009).
[31] Karen M. Wickett, Allen H. Renear, and Jonathan Furner. 2011. Are collections sets? In Proceedings of the American Society for Information Science and Technology, Vol. 48.