Interactivity, Distributed Workflows, and Thick Provenance: A Review of Challenges confronting Digital Humanities Research Objects

Katrina Fenlon (kfenlon@umd.edu; https://orcid.org/0000-0003-1483-5335)

Introduction

While Research Objects (ROs) are primarily oriented toward scientific research workflows, the RO model and parallel approaches have gained some uptake in the humanities, enough to suggest their potential to undergird sustainable, networked humanities research infrastructures. Digital scholarship in the humanities takes a great variety of forms that range widely beyond traditional publications and that incorporate narratives, media, datasets, and interactive components, any of which may be physically dispersed as well as dynamic and evolving over time. Despite the rapid growth of digital scholarship in the humanities, most existing research infrastructures lack support for the creation, management, sharing, maintenance, and preservation of complex, networked digital objects. ROs, and the community and tools growing around them, offer a potential, partial solution. While the concept of the RO has seen significantly more uptake in the humanities than has the formal data model (Bechhofer et al., 2013; Belhajjame et al., 2015), several compelling applications of the concept suggest that the time is ripe for considering broader integration of the model into distributed infrastructures. These applications include platforms for data sharing and collaborative scholarship, platforms for digital and semantic publishing, and digital repositories in several domains. This paper reviews existing applications of the RO model to identify challenges confronting the application of ROs to humanities digital scholarship.
This paper builds on Fenlon (2019), which investigated the application of the RO model to digital humanities collections and identified three promising strengths of the model for the digital humanities: (1) ROs readily perform the most essential function of a collection: to aggregate related resources in order to support scholarly objectives; (2) ROs have the capacity for explicit, semantic descriptions of interrelationships among components that are often hidden in digital humanities collections (and therefore vulnerable to dissolution); and (3) the RO model accommodates aggregations of linked data, offering researchers the opportunity to create and annotate virtual, fully referential collections. That earlier work identified some strengths and limitations of the RO model for digital humanities collections through one experimental application of the model; this paper extends that analysis by reviewing the literature on ROs in the humanities and examining a range of applications of the RO and similar models within humanities and cultural heritage domains.

This paper frames the review around three main challenges and their implications for future implementations of ROs to support digital research in the humanities. First, digital humanities scholarship requires specialized interactive use, so realizing the advantages of ROs for the humanities will depend on implementations that create platforms for experimentation and development by communities. Second, the idiosyncratic workflows employed in the construction of networked humanities scholarship mean that workflow-oriented ROs will not gain significant uptake in the humanities unless they can capture distributed, sociotechnical workflows in meaningful ways.
Third, humanities ROs will require capturing provenance in ways and at a level of detail that may be unfamiliar given the RO model’s scientific origins; humanities scholarship requires “thick,” multilayered, context-rich provenance descriptions that can accommodate conflicting assertions and formalize uncertainty.

Challenge 1. Essential interactivity for specialized use

Much of humanities digital scholarship is essentially interactive. New modes of production and publication in the humanities are intended for user interaction or participation, and for dynamic and responsive representation based on research context. Digital collections and archives, digital editions, maps, models, simulations, and other modes of digital scholarship all rely on interactive components to express their interpretive contributions or to enact their scholarly purposes. These interactive and dynamic components include customized browsing and searching facilities that take advantage of extensive, rich scholarly encodings and annotations; platforms for collaborative annotation; and dynamic maps and visualizations. Such components are intended to do multiple things at once: to make arguments, to manifest interpretive stances, to enable knowledge transfer, and simultaneously to serve as active platforms for ongoing interpretation and research (Palmer et al., 2009; Fenlon, 2017; and others).

Prior empirical work on applying the RO model to digital humanities collections found that the model’s main limitation in this setting is that functional components, designed for ongoing end-user interaction, are not usefully captured in a basic RO model and instead fall to the implementations built on top of research-object management systems (Fenlon, 2019). ROs can, of course, accommodate objects that are intended to be interactive as flat code, and ROs have been employed for this purpose to support data migration and archiving (e.g., the RO BagIt profile).
But the purpose of digital humanities scholarship is to be alive and functional, and making ROs useful in this domain will require implementations that support platforms for flexible, participatory development.

In a conceptual sense, the RO model has demonstrated value for this kind of platform approach in the humanities. The Perseids project offers a platform for sharing and peer review of the transcriptions, annotations, and analyses that constitute research data in the Classics. The Perseids architecture is built around the concept of data publications, which are modeled as collections of related data objects. The Perseids team explicitly relates the data publication model to the RO model (Almas, 2017). Like ROs, Perseids data publications weave in several domain standards (including the TEI EpiDoc schema, W3C Web Annotation, and others) to undergird an infrastructure that supports scholarly requirements specific to the Classics domain: transcription, fine-grained annotation, collaborative editing (with versioning), a research environment that facilitates data-type-specific extensions, and tailored workflows for peer review (Almas, 2017).

Similarly, the CERES (Community Enhanced Repository for Engaged Scholarship) toolkit, created by the Northeastern University Libraries Digital Scholarship Group, explicitly draws on the concept of the RO in its system for supporting networked humanities scholarship and publishing. CERES allows digital humanities creators to build custom publications that pull objects from different repositories using APIs (including the Northeastern University Libraries’ Digital Repository Service and the Digital Public Library of America) (Sweeney, Flanders & Levesque, 2017). It remains unclear how the RO model may fit into the broader, more diversified landscape of linked data and the Semantic Web in cultural institutions and in the humanities, but the conceptual fit within digital scholarship is established.
ROs and similar models have substantial potential to underpin systems that support a variety of implementations. Realizing the advantages of ROs for the humanities will depend on implementations that create platforms for experimentation and collaborative development by distributed communities (Fenlon, 2019). Such platforms must accommodate dynamic interface-building, to allow scholarly communities with distinctive interests and needs to mobilize ROs in different ways. They must also accommodate participation and co-creation through contributions of linked-data annotations and enrichments, including linking among ROs and among the concepts and entities within ROs.

Challenge 2. Distributed and idiosyncratic workflows of networked humanities scholarship

Humanities digital scholarship is increasingly networked: heavily interconnected with and dependent on external resources for functionality and meaning. Many digital humanities publications in various forms (monographs, multimedia productions, exhibits, collections) draw on, reference, embed, and patch together distributed resources called from other collections, often via API. For example, a collection may center on a set of high-resolution images of primary sources, which are called from another digital library’s IIIF image server. Some of the longest-running, large-scale cultural heritage digital libraries (including Europeana and the Digital Public Library of America) are aggregations of descriptive surrogates, which link to original content hosted externally. Externally maintained schemas, authorities, and utilities undergird digital editions. Visualization and mapping projects generate content using external services. And with the growth of linked data in cultural collections, projects increasingly leverage external data sources as primary content, to which scholars then add layers of interpretive narrative, annotations, context, and interconnection.
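The dependence on an external image server mentioned above can be made concrete with a minimal sketch. It assembles a request URL following the IIIF Image API pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The server address and the image identifier are hypothetical, and the default size value "max" assumes version 3 of the Image API; older servers may expect "full" instead.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier is a single path segment, so reserved characters
    # such as "/" and spaces must be percent-encoded.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, used only for illustration.
url = iiif_image_url(
    "https://images.example.org/iiif/3",
    "ms-42/folio 1r",
    size="600,",  # scale to 600 pixels wide, keeping the aspect ratio
)
print(url)
```

Every such URL embedded in a publication is a point where the work depends on infrastructure its authors do not control, which is part of why networked scholarship is hard to capture as a self-contained object.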
Humanities workflows rarely happen in self-contained or end-to-end research infrastructures, thwarting the possibility of sufficiently rich, automatic workflow capture. Indeed, efforts to build a workflow-oriented, unified cyberinfrastructure for supporting humanities scholarship tend to founder (e.g., Dombrowski, 2014). However, niche, task- or domain-specific infrastructures can capture constrained workflows. For example, in the domain of musicology, Page et al. (2017) observe how digital editions and annotations of encoded works are “manifestations of workflows deployed in musicological scholarship,” and offer a compelling framework for representing musical ROs, which include images, text, audio, and encoded music (Page et al., 2017; De Roure et al., 2018).

Computational workflows are readily captured within humanities research environments, and ROs have come into play for this purpose. For example, the HathiTrust Research Center Data Capsule environment is moving toward systematic provenance-capture for computational text analysis workflows. These workflows take as inputs worksets (Jett et al., 2017), which are conceptually and technically akin to ROs: aggregate digital objects that implement addressability for and relational expressivity among components using domain ontologies. Unlike ROs, worksets are envisioned as the inputs of workflows in the current model of the HathiTrust Data Capsule environment, rather than encompassing whole research workflows (Murdock et al., 2017).

But workflow-oriented ROs will not gain significant uptake in humanities contexts unless they can also capture and make useful more complex, distributed, sociotechnical workflows in meaningful ways. With their capacity for linked data using domain vocabularies, ROs readily accommodate many of the artifacts of networked digital scholarship in the humanities, along with their interrelationships (Fenlon, 2019). But can ROs accommodate humanities workflows in useful ways?
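The aggregate structure described above, in which components are individually addressable and related to one another through vocabulary terms, can be sketched in a few lines. This is an illustrative toy and not the RO or workset data model: the Aggregation class, its method names, and the example.org URIs are all assumptions made for the example. The two predicates are borrowed from Dublin Core Terms.

```python
from dataclasses import dataclass, field


@dataclass
class Aggregation:
    """A compound object: addressable components plus typed links among them."""
    uri: str
    components: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def aggregate(self, component_uri: str) -> None:
        # Addressability: each component is identified by its own URI.
        self.components.add(component_uri)

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Relational expressivity: only aggregated components may be
        # the subject of a relation recorded in this aggregation.
        if subject not in self.components:
            raise ValueError(f"{subject} is not aggregated by {self.uri}")
        self.relations.append((subject, predicate, obj))

    def about(self, component_uri: str) -> list:
        """Return the (predicate, object) pairs recorded for one component."""
        return [(p, o) for s, p, o in self.relations if s == component_uri]


# Hypothetical digital edition with three components.
BASE = "https://example.org/edition"
edition = Aggregation(f"{BASE}/ro")
for part in ("page-001.jpg", "transcription.xml", "commentary.html"):
    edition.aggregate(f"{BASE}/{part}")

edition.relate(f"{BASE}/transcription.xml", "dcterms:source", f"{BASE}/page-001.jpg")
edition.relate(f"{BASE}/commentary.html", "dcterms:references", f"{BASE}/transcription.xml")

print(edition.about(f"{BASE}/commentary.html"))
```

A structure like this records what the artifacts are and how they relate. It says nothing about the sequence of scholarly actions that produced them, which is the gap the question above points to.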
In their effort to undergird DARIAH (the pan-European infrastructure for digital arts and humanities research) through the systematic production of humanities ROs, Blanke and Hedges (2013) observed that humanities scholars employ sequential workflows, but “except in relatively specialised cases we rarely encountered workflows that could be automated, shared with and used by others, such as occur in many scientific disciplines.” While auto-generated and computer-usable workflows may not apply to most humanities research processes, formally characterized, (semi-)manually captured workflows would be highly useful for review, validation, archiving, reproducibility, reuse, and other purposes. While the RO model has the capacity and flexibility for complex workflow representation, more research is needed to characterize humanities workflows; to identify how such characterizations can be made useful; and to identify the model extensions and implementation strategies that workflows might require in different domains.

Challenge 3. Thick provenance

Drilling down on the problem of workflow capture, digital humanities scholarship places special demands on data provenance: not only on the provenance of digital resources (such as files, compound objects, and datasets) or components thereof (such as passages of music, paragraphs of a text, or lines of a poem), but also on the provenance of attached, contextual information. Archival artifacts, the evidence of the humanities, often possess simultaneous, multiple, and parallel provenances (Gilliland, 2014; Hurley, 2005). Documenting the provenance of the evidence itself can be complicated, but beyond that, the provenance of the provenance must also be documented. Any assertion made about any artifact (in the form of metadata or annotation), or any contextual and secondary information attached to artifacts in the context of digital scholarship, requires provenance.
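The idea that every assertion about an artifact carries its own provenance, and that assertions may disagree, can be illustrated with a small sketch. The record layout (asserted_by, based_on, certainty), the example letter, and both scholars are invented for illustration; they are not drawn from any existing provenance standard or from the RO model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str      # the artifact, or part of an artifact, being described
    prop: str         # the property being asserted
    value: str        # the asserted value
    asserted_by: str  # who makes the claim
    based_on: str     # the evidence the claim rests on
    certainty: str    # free-text qualifier, e.g. "probable"


def conflicting(assertions):
    """Group assertions by (subject, prop); keep groups whose values disagree."""
    groups = defaultdict(list)
    for a in assertions:
        groups[(a.subject, a.prop)].append(a)
    return {key: group for key, group in groups.items()
            if len({a.value for a in group}) > 1}


# Invented example: two scholars date the same letter differently.
LETTER = "https://example.org/letters/17"
claims = [
    Assertion(LETTER, "date", "1843", "Scholar A", "watermark analysis", "probable"),
    Assertion(LETTER, "date", "1846", "Scholar B", "reference to a dated event", "possible"),
    Assertion(LETTER, "language", "French", "Scholar A", "transcription", "certain"),
]

for (subject, prop), group in conflicting(claims).items():
    for a in group:
        print(f"{prop}={a.value} ({a.certainty}; {a.asserted_by}, from {a.based_on})")
```

The point of the sketch is that neither date overwrites the other: both claims are kept, each with its author, evidence, and degree of certainty attached.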
Annotations and metadata are often, in the humanities, products of scholarly, interpretive work. Therefore, each annotation or metadata proposition is itself subject to claims of authorship, competing perspectives, expressions of uncertainty, and further annotation, all of which require provenance information. Because provenance is multilayered in humanities scholarship, different humanities disciplines and subdisciplines may require domain-specific provenance schemas and standards, which specialize existing standards for expressing the provenance of different kinds of resources, ranging from digital media files to annotations. Humanities ROs will require thick, multilayered, context-rich provenance descriptions, which can accommodate conflicting assertions and formalize uncertainty. It is unclear whether existing implementations of the RO model can accommodate this level of description, though the model itself has the capacity.

The ResearchSpace environment (Oldman and Tanase, 2018) offers exemplary support for documentation of thick, multifaceted provenance of humanities ROs. ResearchSpace is an open-source platform created by the British Museum to facilitate scholarly data sharing, formal argumentation, and semantic publishing within communities of researchers. ResearchSpace does not directly employ the RO model, though its architecture does rely on aggregates of linked data, taking advantage of related standards including W3C Web Annotation and Linked Data Platform containers. In this environment, provenance and argumentation are expressed using the CIDOC-CRM specialization CRMInf (The Argumentation Model). Scholars can use this vocabulary to build narratives and thick descriptions around digital ROs through annotation and data-linking.
These narratives of provenance allow and formalize the expression of uncertainty and competing perspectives, and the environment also serves to document the scholarly work that goes into building these narratives (ResearchSpace Team, 2018).

The reasons for highlighting the ResearchSpace approach to provenance in this review of humanities ROs are (1) to exemplify the unique demands of formalizing humanities provenance, and (2) to exemplify the highly distinctive, domain-specific implementation requirements that confront the RO and other domain-independent data models. Describing humanities provenance will require vocabularies to express argument and belief, as Oldman et al. (2015) observe. Beyond the RO model’s use of PROV and Web Annotation, humanities provenance will demand domain-specific argumentation extensions such as CRMInf. It is clear that ROs can theoretically accommodate thick provenance description, just as they can theoretically accommodate the representation of highly complex workflows, but can they usefully undergird implementations that are centered in humanities research needs? The ResearchSpace interface is tailored toward knowledge work, toward the collaborative construction of multifaceted provenance descriptions, without requiring users to code or gain expert-level knowledge of domain ontologies. Tools for the authorship of humanities ROs, or tools that implement ROs behind the scenes, may benefit from taking the same approach.

Conclusion

ROs make a great deal of sense for modeling cultural information; skeletons of a similar shape (the simple and powerful combination of aggregation and annotation to represent compound digital objects) already structure large-scale cultural data aggregations, e.g., through the Europeana Data Model and the Digital Public Library of America Metadata Application Profile, which are both founded on ore:aggregations plus oa:annotations.
But the challenges confronting widespread application of the RO model to humanities digital scholarship are significant. This review of existing applications has identified three central challenges:

1. Digital humanities scholarship requires specialized interactive use, so realizing the advantages of ROs for the humanities will depend on implementations that create platforms for experimentation and development by communities.
2. The idiosyncratic workflows of networked humanities scholarship mean that workflow-oriented ROs will not gain significant uptake in the humanities unless they can capture distributed, sociotechnical workflows in meaningful ways.
3. Humanities ROs will require thick, multilayered, context-rich provenance descriptions that can accommodate conflicting assertions and formalize uncertainty, along with implementations that support the documentation of such provenance.

In particular, the challenge of characterizing and formally expressing diverse humanities workflows, along with the provenance of data and contextual information within those workflows, presents the most urgent challenge, and the most exciting opportunity, for the future of humanities cyberinfrastructure. To many stakeholders in humanities cyberinfrastructure, “workflows are the new content” (Dempsey, 2016; Baynes et al., 2016; Schonfeld and Waters, 2018). While research on workflows is underway on multiple fronts (including Liu et al., 2017), it is clear already that there will be significant semantic differences between conceptual and technical elements in scientific workflows (and provenance) and those in the humanities, and these differences will affect the implementation of ROs for humanities research. Historically, attempts to implement scientific research infrastructures (including data models like the RO model) to support humanities scholarship have hit an obstacle in the form of semantic gulfs.
For example, in the Linking and Querying Ancient Texts (LaQuAT) project, an effort to transfer eScience infrastructure in support of a humanities virtual research environment, Anderson and Blanke observed a fundamental challenge in integrating humanities data from different databases. They located the solution to that problem in humanities research communities: “integrating humanities research material...will require researchers to make the connections themselves, including decisions on how they are expressed and how to understand and explore the data more effectively” (Anderson and Blanke, 2012). Oldman et al. (2015), reviewing the state of linked data in the humanities, observed that basic linked data publication for many kinds of humanities sources can be counterproductive, “unless adapted to reflect specific methods and practices, and integrated into the epistemological processes they genuinely belong to.” This caution resonates with the challenges identified for the adoption of the RO model, or indeed for the importation of any data model, even a domain-independent one, into the humanities. The main challenges to implementing ROs for humanities research also present exciting opportunities for a more sustainable cross-disciplinary infrastructure (Fenlon, 2019), but implementation strategies must be centered in scholarly communities, and grow out from the practices, needs, and epistemologies of specific areas of study in the humanities and cultural institutions.

References

Almas, B. (2017). Perseids: Experimenting with Infrastructure for Creating and Sharing Research Data in the Digital Humanities. Data Science Journal, 16(0). https://doi.org/10.5334/dsj-2017-019
Anderson, S., & Blanke, T. (2012). Taking the Long View: From e-Science Humanities to Humanities Digital Ecosystems. Historical Social Research / Historische Sozialforschung, 37(3 (141)), 147–164.
Baynes, M. A., Sommer, D., Melley, D., & Lickiss, T. (2016, April). Workflow is the new content: Expanding the scope of interaction between publishers and researchers. Panel presentation presented at the Society for Scholarly Publishing. Retrieved from https://www.sspnet.org/events/past-events/workflow-is-the-new-content-expanding-the-scope-of-interaction-between-publishers-and-researchers/
Bechhofer, S., Buchan, I., De Roure, D., Missier, P., Ainsworth, J., Bhagat, J., … Goble, C. (2013). Why linked data is not enough for scientists. Future Generation Computer Systems, 29(2), 599–611. https://doi.org/10.1016/j.future.2011.08.004
Belhajjame, K., Zhao, J., Garijo, D., Gamble, M., Hettne, K., Palma, R., … Goble, C. (2015). Using a suite of ontologies for preserving workflow-centric research objects. Journal of Web Semantics, 32, 16–42. https://doi.org/10.1016/j.websem.2015.01.003
Blanke, T., & Hedges, M. (2013). Scholarly primitives: Building institutional infrastructure for humanities e-Science. Future Generation Computer Systems, 29(2), 654–661. https://doi.org/10.1016/j.future.2011.06.006
De Roure, D., Klyne, G., Page, K., Pybus, J., Weigl, D. M., & Willcox, P. (2018, July). Digital Music Objects: Research Objects for Music. Presented at the Research Object workshop (RO2018) at IEEE eScience Conference 2018. Retrieved from https://zenodo.org/record/1442453#.XB6Chc9KhhE
Dempsey, L. (2016, October). The Library in the Life of the User: Two Collection Directions. Education. Retrieved from https://www.slideshare.net/lisld/the-library-in-the-life-of-the-user-two-collection-directions
Dombrowski, Q. (2014). What Ever Happened to Project Bamboo? Literary and Linguistic Computing, 29(3), 326–339. https://doi.org/10.1093/llc/fqu026
Fenlon, K. (2017). Thematic research collections: Libraries and the evolution of alternative scholarly publishing in the humanities (Doctoral dissertation, University of Illinois at Urbana-Champaign). Retrieved from http://hdl.handle.net/2142/99380
Fenlon, K. (2019). Modeling Digital Humanities Collections as Research Objects. Presented at the ACM/IEEE Joint Conference on Digital Libraries 2019. Retrieved from https://hcommons.org/deposits/item/hc:24889/
Gilliland, A. J. (2014). Conceptualizing 21st-Century Archives. ALA Editions.
Hurley, C. (2005). Parallel provenance [Series of parts]: Part 1: What, if anything, is archival description? [An earlier version of this article was presented at the Archives and Collective Memory: Challenges and Issues in a Pluralised Archival Role seminar (2004: Melbourne).] Archives and Manuscripts, 33(1), 110.
Jett, J., Cole, T. W., & Downie, J. S. (2017). Exploiting graph-based data to realize new functionalities for scholar-built worksets. Proceedings of the Association for Information Science and Technology, 54(1), 716–717. https://doi.org/10.1002/pra2.2017.14505401128
Liu, A., Kleinman, S., Douglass, J., Thomas, L., Champagne, A., & Russell, J. (2017). Open, Shareable, Reproducible Workflows for the Digital Humanities: The Case of the 4Humanities.org “WhatEvery1Says” Project. Presented at the Digital Humanities (DH2017). Retrieved from https://dh2017.adho.org/abstracts/034/034.pdf
Murdock, J., Jett, J., Cole, T., Ma, Y., Downie, J. S., & Plale, B. (2017). Towards Publishing Secure Capsule-based Analysis. Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries, 261–264. Retrieved from http://dl.acm.org/citation.cfm?id=3200334.3200367
Oldman, D., Doerr, M., & Gradmann, S. (2015). Zen and the Art of Linked Data. In A New Companion to Digital Humanities (pp. 251–273). https://doi.org/10.1002/9781118680605.ch18
Oldman, D., & Tanase, D. (2018). Reshaping the Knowledge Graph by Connecting Researchers, Data and Practices in ResearchSpace. In D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, … E. Simperl (Eds.), The Semantic Web – ISWC 2018 (pp. 325–340). Retrieved from https://link.springer.com/chapter/10.1007%2F978-3-030-00668-6_20
Page, K., Lewis, D., & Weigl, D. (2017). Contextual interpretation of digital music notation. Presented at the Digital Humanities (DH2017), Montréal, Canada.
Palmer, C. L., Teffeau, L. C., & Pirmann, C. M. (2009). Scholarly Information Practices in the Online Environment: Themes from the Literature and Implications for Library Service Development. Retrieved from OCLC Research and Programs website: http://www.oclc.org/content/dam/research/publications/library/2009/2009-02.pdf
ResearchSpace Team, British Museum. (2018, December). Moving from Documentation to Knowledge Building: ResearchSpace Principles and Practices. Presented at the Stiftung Preußischer Kulturbesitz (Prussian Cultural Heritage Foundation) Berlin. Retrieved from https://www.researchspace.org/docs/Berlin.pdf
Schonfeld, R. C., & Waters, D. (2018, April). The turn to research workflow and the strategic implications for the academy. Presented at the Coalition for Networked Information (CNI) Spring Membership Meeting, San Diego, CA. Retrieved from https://vimeo.com/271130388
Sweeney, S. J., Flanders, J., & Levesque, A. (2017). Community-Enhanced Repository for Engaged Scholarship: A case study on supporting digital humanities research. College & Undergraduate Libraries, 24(2–4), 322–336. https://doi.org/10.1080/10691316.2017.1336144