Small, thick, and slow
Thinking about data and research publication in the Humanities in the age of Open and FAIR
Daniel Paul O’Donnell, University of Lethbridge
Curtin University, November 25, 2019
DOI (this version): 10.5281/zenodo.3551791
DOI (latest version): 10.5281/zenodo.3551790

About this paper
● Going to be speaking of how data are used in the humanities and the implications for infrastructure design
○ How infrastructure currently interacts with typical humanities research practices
○ Why humanities researchers have been slow to adopt such infrastructure
○ How this infrastructure can be adapted to support (and improve) humanities research without requiring it to abandon its primary features/strengths
■ “Small” — focussed on a very small number of data points or sets
■ “Thick” — involves intense curation and analysis of these few data
■ “Slow” — the same data points can be subject to years (generations) of subsequent, alternate, and supplementary analysis

About this paper
● Important to recognise that I’m dealing in generalities
○ Not all humanities data are small or “representational” in focus
○ Not all humanities work is about thick description
○ Not all humanities work is about reworking old material
● But much is, and these are the kinds of work that are least well catered to in current infrastructure

About me
● Traditionally trained medieval philologist and textual critic
● This means a history of both “big” and small data techniques
○ Thesis (1996) was an analysis of an (unpublished) database of textual variation in the Old English poetic canon
■ Letter-by-letter differences in about 20 poems surviving in more than one copy from the pre-conquest period
○ Later (2005) did a 100,000-word edition of the 9-line Cædmon’s Hymn (s. viii)
○ Now working on a 5-object “edition” of the cross in pre-conquest England
● But
○ Coming from a textual/linguistic/literary approach
○ Focus on “editing” (i.e. the development and publication of “Primary Source” material — mediated representational data)

Part 1
The problem of humanities data

Traditionally, humanists resist speaking of data
● “Primary sources” = Texts, artifacts, objects of study
○ Can be originals (i.e. the artifact itself)
○ More often mediated and contextualised in some way (i.e. an edition, transcription, or similar)
● “Secondary sources” = Works of other scholars (often based on “Primary sources”)
● “Readings” (1) = Passages, extracts, quotations for interpretation or support
● “Readings” (2) = Interpretation, the end product of research (literary study)

Traditionally, humanists resist speaking of data
● These definitions are highly contingent
○ A “Primary source” in one context can be a “secondary source” in another (and vice versa)
○ Or simultaneously “Primary” and “Secondary” (e.g. a critical edition)
● Also hard to constrain: “[a]lmost any document, physical artifact, or record of human activity can be used to study culture,” and arguments proposing previously unrecognised sources (“high school yearbooks, cookbooks, or wear patterns in the floors of public places”) are valued acts of scholarship (Borgman 2007)

How does data work in other fields?
● Resistance makes sense, because Humanities data is different from other forms of data
● In other domains, “data” (“given things”) is more properly “capta” (“taken”): generated through experiment, observation, and measurement
● Think about Darwin and his work in the Galapagos Islands
○ What is his data? The finches? The notes about the finches?
How does data work in other fields?
● In fact, in the sciences, it is the notes.
● “Data” = “represent[ation of] information in a formalized manner suitable for communication, interpretation, or processing” (NASA 2012); “the facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors” (NRC 1999)

But in the humanities?
● Can be both “data” and “capta”, but very often “data”
● Very specific and often provisional: small
● Depends on interpretation and argument (we argue about whether something is data): thick
● Frequently revisit the same datasets to see them differently, provide new contexts, reuse: slow
Usually the Finch. Sometimes the notes. And sometimes what Darwin thought he was doing in his notes about the Finch.

In Humanities, “Data” is arguably mostly “Finch”
● Interesting proof: Humanities “data,” unlike science “data,” is almost all practically and theoretically non-rivalrous.
● Humanities researchers rarely have an incentive (or the capability) to prevent others from accessing their raw material.
● 200 years of Jane Austen studies are based on five main pieces of data.

The “Digital Humanities” don’t change this
● DH adds to this basic fact, but doesn’t change it:
○ We can now have “capta” (intermediate “observations” extracted algorithmically to form large data sets that then require interpretation)
○ We can now work across complete historical or geographic corpora: all known nineteenth-century English periodicals; every surviving tract from the U.S. Civil War
○ Introduces the possibility of deductive work
○ Makes method questions more important than when you worked inductively from the collections you could access
The “Digital Humanities” don’t change this
● But DH is not the perfection of the Humanities
○ A lot of research continues with “data” rather than “capta”
○ This “traditional” work remains sound and important
○ The distinction between “capta” and “data” is not teleological
■ “Big data” (“big capta”) DH is not better than “small data” (traditional) Humanities
■ Not all DH is “big capta” (you can do traditional work with computers)
■ “Big capta” approaches to Humanities questions can miss the point
● Intensive curation and analysis of small data sets remains a major function of humanities research

Why does this matter?
● Although much humanities research is (appropriately) “small, thick, and slow,” it is also, in theory, useful for “big capta” work
○ Collectively, traditional humanists produce a lot of very high quality data
■ Intensely curated datasets and data points
■ Broadly compatible with each other (i.e. each generation reedits and reconsiders the canon)
● If we could find a way to capture the value of this traditional data in a way that would allow it to be reused,
○ We’d have extremely useful material to repurpose
○ We’d be maximising the benefit of the traditional work that has been done on it

Why does this matter?
● But FAIR small data is by-and-large uneconomical for small data researchers
○ Their goal is to publish contextualised small-data datasets to
■ Serve as primary sources for others
● e.g. an edition of Jane Austen’s Pride and Prejudice is intended to support secondary work on that novel
■ Support very specific arguments about the specific instance
● e.g. that there are three versions of Hamlet
○ The features that are required for reuse require (in essence) a separate, standalone publication
■ Deposit in a repository
■ Standardised metadata
■ Loss of key interpretative context and information

The case of manuscript photography
● Since the mid-1990s, there have been hundreds if not thousands of digital editions of medieval and renaissance texts published.
● Almost all of these contain high quality digital photographs of the original artifacts, often with very detailed, research-based expert commentary and analysis (transcriptions, bibliographic and other descriptions, etc.)
● This represents, in theory, a potentially huge, extremely rich dataset for new cross-project work
○ Automatic scribe identification
○ Dating training sets
○ History of the Book

The case of manuscript photography
● Because the purpose of these photographs has been to support the contextual analysis and/or supply users with representations of the individual objects in question, very few are easily recovered or used by machines:
○ Few/no standards for metadata, APIs, etc.
○ Very few explicitly connected to expert description
○ Relationship to other images and publication status not machine readable
● The result is a lost opportunity to create a “big capta” dataset of thickly described data from hundreds of individual “small data” projects

So what to do?
● The solution to this is to accept the traditional nature and use-case involved in the production and consumption of Humanities research data
○ I.e. recognise that FAIR must accommodate the small, thick, and slow as easily as it does the big stand-alone examples from STEM
● That means that we have to either
○ Work within the traditional Humanities research workflow, or
○ Encourage traditional Humanities researchers to work within ours
● As long as FAIR data publication means, in essence, publishing small, thick, and slow data twice (once in context and once without), we will never fully reap the benefit of these important and potentially huge cultural datasets

We’ve been here before
● The New English Dictionary (later the Oxford English Dictionary) provides a non-digital model for this
○ The NED was based on “historical principles” (i.e. definitions derived from and supported by historical quotations)
○ A massive crowd-sourced big-data collection effort, involving thousands of readers collecting 1.8 million quotation slips from thousands of books prepared by generations of authors, scholars, and publishers (i.e. small data datasets)
○ In essence, an analogue version of what we want to do digitally

We’ve been here before
● They had the same problem
○ Discovered almost immediately after setting up the reading programme that the texts they were planning to use were unsuitable
■ Not available in modern editions
■ Poor or difficult-to-determine quality
○ In other words, they discovered that they needed to improve and standardise the small datasets from which they were going to draw their big data records.

We’ve been here before
● The solution was to create a demand and a platform for new editions of medieval and renaissance works
○ Established text societies and publishers to publish new editions that met the NED’s requirements but also supported traditional humanities goals
○ Encouraged leading scholars to edit (and later reedit) the texts they needed according to the format they required
○ A very symbiotic relationship between what was going on in historical textual research at the time and the needs of this big-data dictionary
● The result was an increase in high quality (from both a big and a small data perspective) editions, providing the NED with the material it needed for its own big-data work

What to do
● What we need is something similar for the digital age
○ A workflow that encourages small-data researchers to prepare their datasets in a way that
■ Respects their traditional requirements for the intensive curation and analysis of individual data points or small datasets
■ Opens these small, thick, and slow datasets up to big data analysis
■ Does not increase (and preferably reduces) the cost of production, publication, and maintenance
● In other words: work with the traditional workflows and do it within our systems.

What to do?
● What we need, therefore, is a similar approach for the digital age, one that is comfortable dealing with the small, thick, and slow nature of the work
○ Has to accept that most Humanities research is (properly) about a small number of objects (small)
○ That the purpose of most Humanities research is to analyse this small number of data points intensely (thick)
○ That researchers are going to want to rework these individual data points as part of the natural progress of their research (slow)
● A workflow in which suitability for “big capta” research is inherent in the “small data” publication workflow rather than a separate step.

Part 2
Being FAIR to the small, thick, and slow

Introduction
● In the rest of this talk, I’m going to describe the “Data-First” approach we are developing for the Visionary Cross Project:
1. The project and some of our parameters
2. Background issues and models
3. The implementation
4. Further work
About the Visionary Cross Project
● A 9-year-old SSHRC-funded project to produce an “edition” and “archive” of the “Visionary Cross cultural matrix” in Anglo-Saxon England
○ “Edition” means “scholarly mediated reproduction”
○ “Archive” means “dataset of facsimiles and transcriptions”
○ “Visionary Cross cultural matrix” means “collection of individual objects that also belong together for cultural reasons”

About the Visionary Cross Project
● The objects include some of the best known objects and texts from pre-conquest England and Scotland:
○ Vercelli Book Dream of the Rood and Elene poems (s. x/xi, South)
○ Ruthwell Cross (s. viii, North)
○ Bewcastle Cross (s. viii, North)
○ Brussels Cross (s. x/xi, South)

About the Visionary Cross Project
● Interesting as individual objects and as a group:
○ Span the period temporally, geographically, linguistically
○ (possibly) Earliest attested poetry
○ Complete runic poem
○ Include one of only 2–3 examples of poetic quotation
○ “Multiply attested” poetic text (>3% of the corpus)
○ Related to each other thematically (cult of the cross) and textually and/or artistically

About the Visionary Cross Project
● In other words, we anticipate use as both
○ A traditional small-data project (as well as a not-so-traditional small-data project):
■ Individuals coming to us for limited amounts of data in the context of our thick description because they want to use our material as the primary source for subsequent work
○ A contribution to potential big-data purposes:
■ Data that can be used, reused, supplemented, and aggregated by others without negotiation

Project Requirements
A. Flexible:
○ Choose to view individual/group in appropriate format
B. Extensible:
○ Add, rearrange, or reuse material without negotiation
C. Authoritative:
○ Preserve credit/responsibility for all contributions
D. Durable:
○ Permanently discoverable and available
○ Low/no maintenance

Different approaches over the years
● Wiki?
○ Flexible (e.g. categories/entries) (A)
○ Add and (re)connect material without negotiation (B)
○ But
■ Doesn’t preserve authority (C)
■ Requires ongoing maintenance (D)
■ Only one kind of presentation (A)

Different approaches over the years
● Game engine?
○ Provided different ways of organising material and good at object/collection (A)
○ Preserved authority (C)
○ Some engines allowed some external contributions (B)
○ But
■ Requires others to use our system (B)
■ None strong on external contributions (B)
■ Requires ongoing maintenance (D)

OPenn (http://openn.library.upenn.edu/)
● Repository for MS information, images, transcriptions
● Built to replace a previous “turning the pages” type interface for MS collections
○ Open the collection up to machine access (i.e. via rsync, ssh, ftp, etc.; see the sketch below)
○ Maintain human readability
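To give a sense of what this design enables, here is a minimal sketch of machine access to OPenn over plain HTTP, using only the Python standard library. The manuscript path in the last line is hypothetical, included only to suggest the shape of the directory tree; real shelfmarks and paths should be taken from the live listing.

```python
# A minimal sketch of what OPenn's web-served directory tree makes possible:
# because everything is plain files over HTTP, harvesting needs nothing more
# than the standard library. The manuscript path below is illustrative only.
from urllib.request import urlopen

BASE = "http://openn.library.upenn.edu/Data/"

# Fetch the top-level listing -- the same HTML page a human reader browses.
with urlopen(BASE) as response:
    listing = response.read().decode("utf-8")
print(listing[:500])  # first few hundred characters of the listing

# Any file in the tree can be fetched the same way (or via rsync/wget), e.g.
# a TEI description at a path shaped like this (hypothetical, not a real MS):
tei_url = BASE + "0001/mscodex0000/data/mscodex0000_TEI.xml"
```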
OPenn (http://openn.library.upenn.edu/)
● Essentially a lightly-skinned directory structure (i.e. a RESTful-like API)
○ Human-readable HTML pages

OPenn (http://openn.library.upenn.edu/)
● We love this approach because it touches on all parts of our vision:
○ Flexible (i.e. A): can skin different groupings, focus on individuals or collections
○ Extensible (i.e. B): can extract from the system
○ Authoritative (i.e. C): preserves authority
○ Durable (i.e. D): requires no software maintenance

OPenn (http://openn.library.upenn.edu/)
● But not perfect
○ Inflexible (i.e. A): hierarchical data structure (can’t have machine readable virtual collections)
○ Not extensible (i.e. B):
■ Additions/reorganisations require server access
■ Collections are “official” (entire libraries/fonds)
○ Not durable (i.e. D):
■ Publisher responsible for maintaining the server
■ No persistent identifiers

Requirements (further points)
E. Externally registered persistent identifiers
F. Users need to be able to present alternatives/additions to our material inside or outside the same system
G. Has to be “Publish-and-Forget”: once we are finished with it, it needs to be maintained by others.

Our solution
● Use Zenodo and GitHub to create an OPenn-like data repository, while answering its lacunae
● A “Data-first” approach to publication that
1. Is human and machine readable
2. Preserves attribution
3. Is open to non-negotiated addition, reorganisation, and reuse
4. Uses standard, third-party-maintained, persistent IDs
5. Is maintained for free by others (requires no post-publication maintenance by the project)

Zenodo
● EU-funded OpenAIRE data repository
○ Hosted at CERN
○ Guaranteed by the EU
○ Accepts “all research outputs from all fields of science”
○ Assigns DOIs to all submissions (“conceptual” and “record”)
○ Based on the Invenio digital repository engine
■ Excellent metadata and LOD capabilities

GitHub
● Code repository, version control, distribution system
● Used by millions for developing code-based projects
● Recently added the ability to publish web pages using Jekyll-based “GitHub Pages”
● Based on the open source Git
● But
○ Recently bought by Microsoft (it has always been a private company)
○ Not archival (conditions of use allow for suspension of service for any reason at any time)

Interaction of Zenodo and GitHub
● GitHub repositories can be archived in Zenodo
○ Snapshots are deposited in Zenodo as zipped directories
○ Given a Zenodo DOI and treated like any other record (see the sketch below)
● This means:
1. We replace GitHub’s non-guarantee with Zenodo’s permanent guarantee
2. Presentations (versions) are also citable research objects (FAIR data AND FAIR code)
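As a concrete illustration of the two kinds of DOIs, the sketch below resolves this talk’s own identifiers (from the title slide), using only the Python standard library. The behaviour shown is standard DOI resolution: doi.org redirects the conceptual DOI to the landing page of whatever version is newest, while the version DOI stays pinned to one specific deposit.

```python
# A minimal sketch: resolve this talk's own DOIs (from the title slide)
# to show Zenodo's "conceptual" vs "version" DOIs in action.
from urllib.request import urlopen

CONCEPT_DOI = "10.5281/zenodo.3551790"  # always resolves to the latest version
VERSION_DOI = "10.5281/zenodo.3551791"  # pinned to this specific deposit

for doi in (CONCEPT_DOI, VERSION_DOI):
    # doi.org answers with an HTTP redirect to the current Zenodo landing page.
    with urlopen(f"https://doi.org/{doi}") as response:
        print(doi, "->", response.url)  # final URL after following redirects
```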
An example: Cædmon’s Hymn
● Originally a CD-ROM (2005)
● Now online (2018)
● Code published using GitHub Pages
○ https://caedmon.seenet.org/
○ https://seenet-medieval.github.io/caedmonshymn
● Code base preserved as a Zenodo object (in all versions)

Visionary Cross as Data
● Combining the two systems allows us to publish a data-centric edition that meets all our requirements:
○ Flexible
○ Extensible
○ Authoritative
○ Durable
○ Externally registered persistent IDs
○ Maintained by others

The heart is the Zenodo record
● Basic unit of the edition (1 record = 1 datum)
● Provides machine readability, extensibility, persistence, and archiving
● *Also acts as a document server for the rest of the edition

Zenodo record
● Human and machine readable metadata record + file(s)
● *Typed “additional identifiers”
● *Two kinds of DOIs:
○ “Conceptual” (latest)
○ “Version” (current)
● *RESTful file URLs
○ No link rot (see the sketch below)
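To make “machine readable” concrete: the same record that renders as a human-readable landing page is also available as JSON from Zenodo’s REST API, so the metadata and file links can be harvested programmatically. A minimal sketch, using record 3551791 (this talk, taken from its version DOI); the JSON field names are those the API has returned in our experience and should be treated as assumptions to verify against the live service.

```python
# A minimal sketch of machine access to a Zenodo record via the REST API.
# Record 3551791 is this talk itself (from its version DOI); the JSON field
# names below are assumptions based on observed API responses and may change.
import json
from urllib.request import urlopen

RECORD_ID = "3551791"

with urlopen(f"https://zenodo.org/api/records/{RECORD_ID}") as response:
    record = json.load(response)

# The two kinds of DOIs: "conceptual" (latest) vs "version" (this deposit).
print("concept DOI:", record.get("conceptdoi"))
print("version DOI:", record.get("doi"))

# RESTful file URLs: each attached file has a stable download link.
for f in record.get("files", []):
    print(f.get("key"), "->", f.get("links", {}).get("self"))
```

Because the file links are served from the record itself, citing the version DOI is enough to recover both the metadata and the data files: this is what “no link rot” means in practice.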
Edition is built around records

Advantages to this system
● Like OPenn
○ Human and machine readable
● Improves on OPenn
○ Persistent IDs (can be used RESTfully)
○ FAIR
○ Not restricted to hierarchical arrangement or read-only
○ Can be exported to a variety of standards
○ Can be added to or rearranged by others
○ Maintained by archival specialists (i.e. a commitment to preservation)
● Supports small, thick, and slow publication in a FAIR format

Disadvantages
● What is interesting about this approach is that it is accidental
○ While most features are supported,
■ Not all are (e.g. arbitrary ontologies)
■ Those that are are inconsistent across repositories (e.g. streaming; typed other identifiers)
■ Support is often tentative or inadvertent (e.g. conceptual vs record DOIs; the RESTful DOI-based API)
● While the ability to support Humanities data is there, the systems have not been designed with Humanities data in mind
● Supporting small, thick, and slow data is something that could be accommodated with relatively little work

Next steps
● Next steps are to formalise this use case and feature-set
○ Build a prototype publication system within Zenodo/GitHub
○ Formalise and commit to the required features where they are tentative
○ Develop the few features not found specifically in Zenodo
○ Test the system out on existing publications and data
○ Disseminate the model in order to encourage other systems to adopt it
● Just put together a partnership for a SSHRC Partnership Development Grant
○ CERN/OpenAIRE
○ Toolmakers
○ Data projects
● The goal is to start prototyping this next year.

Questions