Small, thick, and slow: Towards an Open and FAIR data culture in the Humanities

Daniel Paul O’Donnell
University of Lethbridge
IIT Gandhinagar, December 18, 2019
DOI (this version): 10.5281/zenodo.3581520
DOI (latest version): 10.5281/zenodo.3581519

About this paper
● Going to be speaking about how data are used in the humanities and the implications for infrastructure design
○ How infrastructure currently interacts with typical humanities research practices
○ Why humanities researchers have been slow to adopt such infrastructure
○ How this infrastructure can be adapted to support (and improve) humanities research without requiring it to abandon its primary features/strengths
■ “Small” — focussed on a very small number of data points or sets
■ “Thick” — involves intense curation and analysis of these few data
■ “Slow” — the same data points can be subject to years (generations) of subsequent, alternate, and supplementary analysis

About this paper
● Important to recognise that I’m dealing in generalities
○ Not all humanities data are small or “representational” in focus
○ Not all humanities work is about thick description
○ Not all humanities work is about reworking old material
● But much of it is, and these are the practices that are least well catered to in current infrastructure

About me
● Traditionally trained medieval philologist and textual critic
● That means a history with both “big” and small data techniques
○ Thesis (1996) was an analysis of an (unpublished) database of textual variation in the Old English poetic canon
■ Letter-by-letter differences in about 20 poems surviving in more than one copy from the pre-conquest period
○ Later (2005) did a 100,000-word edition of the 9-line Cædmon’s Hymn (s. viii)
○ Now working on a five-object “edition” of the cross in pre-conquest England
● But
○ Coming from a textual/linguistic/literary approach
○ Focus on “editing” (i.e. the development and publication of “Primary Source” material — mediated representational data)

Traditionally, humanists resist speaking of data
● “Primary sources” = Texts, artifacts, objects of study
○ Can be originals (i.e. the artifact itself)
○ More often mediated and contextualised in some way (i.e. an edition, transcription, or similar)
● “Secondary sources” = Works of other scholars (often based on “Primary sources”)
● “Readings” (1) = Passages, extracts, quotations for interpretation or support
● “Readings” (2) = Interpretation, the end product of research (literary study)

Traditionally, humanists resist speaking of data
● These definitions are highly contingent
○ A “primary source” in one context can be a “secondary source” in another (and vice versa)
○ Or simultaneously “primary” and “secondary” (e.g. a critical edition)
● Also hard to constrain: “[a]lmost any document, physical artifact, or record of human activity can be used to study culture”, and arguments proposing previously unrecognised sources (“high school yearbooks, cookbooks, or wear patterns in the floors of public places”) are valued acts of scholarship (Borgman 2007)

How does data work in other fields?
● Resistance makes sense, because humanities data are different from other forms of data
● In other domains, “data” (“given things”) are often more properly “capta” (“taken”): generated through experiment, observation, and measurement, then analysed
● Think about Darwin and his work in the Galápagos Islands
○ What are his data? The finches? The notes about the finches?
How does data work in other fields?
● “represent[ation of] information in a formalized manner suitable for communication, interpretation, or processing” (NASA 2012); “the facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors” (NRC 1999)
● By these definitions: the finches.

But in the humanities? Usually the finch. Sometimes the notes. And sometimes Darwin.
● Humanities research objects can be both “data” and “capta”, but very often “data”:
○ Very specific and often provisional (i.e. small);
○ Dialogic in nature — defined and changed by the interpretative context (i.e. thick);
○ Frequently revisited to reanalyse or recontextualise (i.e. slow).
● Not produced so much as found.

But in the humanities?
● Some evidence:
1. Humanities “data,” unlike science “capta,” are almost always practically and theoretically non-rivalrous.
■ Humanities researchers rarely have an incentive (or the capability) to prevent others from accessing their raw material.
■ 200 years of Jane Austen studies have been based on five main pieces of data.
2. There are no traditional methods around the preservation of those “facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.”
■ Nobody cares whether your notes are well preserved, unlike lab books.

The “Digital Humanities” don’t change this
● DH adds to this basic distinction, but doesn’t change it:
○ We can now have “capta” (intermediate “observations” extracted algorithmically to form large data sets that then require interpretation; see the sketch after this slide)
○ We can now work across complete historical or geographic corpora:
■ E.g. all known nineteenth-century English periodicals; every surviving tract from the U.S. Civil War
○ We can do deductive, hypothesis-driven work
○ This introduces the issue of reproducibility:
■ Since deductive work based on algorithmically derived datasets can be checked (for completeness, opposing examples, etc.), it can be more important to publish method.
■ People might care about your notes.
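To make “extracted algorithmically” concrete, here is a minimal sketch (mine, not the talk’s): deriving a word-frequency table, the intermediate “capta”, from a directory of plain-text files standing in for a corpus. The directory name and the crude tokenisation rule are illustrative assumptions.

```python
# A minimal sketch of algorithmically extracted "capta": a word-frequency
# table derived from a directory of plain-text files standing in for a
# corpus. The path and the tokenisation rule are illustrative assumptions.
import re
from collections import Counter
from pathlib import Path

def extract_capta(corpus_dir: str) -> Counter:
    """Count word forms across every .txt file in corpus_dir."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8").lower()
        counts.update(re.findall(r"[a-z]+", text))  # crude tokeniser
    return counts

if __name__ == "__main__":
    capta = extract_capta("periodicals/")  # hypothetical corpus directory
    for word, n in capta.most_common(10):
        print(f"{word}\t{n}")
```

The point of the sketch is the division of labour: the counting is mechanical, but the resulting table is only an intermediate object; deciding what the frequencies mean remains interpretative work.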
The “Digital Humanities” don’t change this
● But the distinction between “capta” and “data” is not teleological: DH is not the perfection of the Humanities (and “big data” DH especially is not)
○ “Big data” (“big capta”) DH is not better than “small data” Humanities — it allows different kinds of questions and different approaches to questions
○ Not all DH is “big capta” — you can answer traditional humanities questions with the assistance of computation
● Sometimes (many times) “big capta” simply misses the point
○ Intensive curation and analysis of small datasets remains a major function of humanities research (small)
○ Dialogic definition of data remains a major method (thick)
○ Revisiting data to reflect current concerns remains a core purpose (slow)

Why does this matter?
● Although much humanities research is (appropriately) “small, thick, and slow,” it is also, in theory, useful for “big capta” work
○ Collectively, traditional humanists produce a lot of very high-quality data
■ Intensely curated datasets and data points;
■ Broadly compatible with each other (i.e. each generation re-edits and reconsiders the canon, reconsiders historical events, etc.)
● If we could find a way to capture the value of this traditional data in a way that allowed it to be reused,
○ We’d have extremely useful material to repurpose
○ We’d be maximising the benefit of the traditional work that has been done on it

FAIR data
● In many sciences, there is a technically similar opportunity (though it involves a very different purpose):
○ Traditionally, scientists have not been good at publishing their data — they’ve published the analysis and conclusions (i.e. the relevant bits)
○ Reasons have included
■ Lack of fora: no means to distribute data;
■ Lack of will: publishing data means exposing real-world messiness;
■ Lack of reward: data are not “first-class research objects” (credit is meaningless)
○ This hinders both reproducibility (others can’t easily check your method) and the development of new science (others can’t build on your results)

FAIR data
● FAIR (Findable, Accessible, Interoperable, Reusable) data and data citation principles are an attempt to address this by establishing data as “first-class research objects” — i.e. objects for which scientists can get credit
● The goal is to standardise and formalise the way in which data are published in order to ensure their entry into the scientific record and their reuse.
● They encourage data publication, solving the forum problem
● FAIR data can be
○ Found and accessed by others without negotiation
○ Used in new use-cases or to reexamine old ones

FAIR data
F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.
A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
A2. metadata are accessible, even when the data are no longer available.
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.
FORCE11 (2014): http://bit.ly/IITGN-FAIR
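What rich, formal, machine-readable metadata (F1, F2, I1, I3, R1.1, R1.2) can look like in practice: the sketch below, mine rather than FORCE11’s, emits a small JSON-LD record using the schema.org vocabulary. Every identifier and value is invented for illustration.

```python
# A hedged sketch of FAIR-style metadata: a JSON-LD record using the
# schema.org vocabulary. All identifiers and values below are invented.
import json

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://doi.org/10.5281/zenodo.0000000",   # persistent identifier (F1)
    "name": "Collations of a short Old English poem",  # rich metadata (F2)
    "description": "Letter-by-letter variants across the surviving copies.",
    "license": "https://creativecommons.org/licenses/by/4.0/",  # usage license (R1.1)
    "creator": {"@type": "Person", "name": "A. Scholar"},       # provenance (R1.2)
    "isBasedOn": "https://doi.org/10.5281/zenodo.1111111",      # qualified reference (I3)
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```

Because the record uses a shared vocabulary rather than project-specific field names, an aggregator that has never seen the project can still index it (F3) and follow its references.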
FAIR Data
[Figure: a modular scholarly-communication workflow, from Bosman and Kramer 2016]

FAIR Humanities?
● You might think this would work well for Humanists
○ Publication of data as a “first-class research object” is inherent in several traditional strands of humanities research
■ Editing
■ GLAM
■ Concordances and corpora

FAIR Humanities?
● But FAIR small data is by and large uneconomical (and uninteresting) for small-data researchers — whether digital or traditional
○ Reproducibility is still generally not a priority
■ A disagreement about the role of women in 19th-century factories is more likely a difference of emphasis or interpretation than a falsification
■ You don’t need my precise dataset to critique my argument
○ Even data publishers focus on small, contextualised datasets
■ e.g. an edition of Jane Austen’s Pride and Prejudice is intended to support secondary work on that novel — not work on novels generally

FAIR Humanities?
● The features that are required for reuse demand (in essence) a separate, standalone publication
○ Deposit in a repository
○ Standardised metadata
○ (Potentially) the loss of key interpretative context and information
● None of this is (necessarily) required for the success of the original publication
● Much of it (may) require additional work and costs that detract from, without improving, the original research
○ Unlike in STEM, reproducibility is simply not, in general, an issue of importance

FAIR Humanities
● Instead of a modular workflow as in Bosman and Kramer 2016, (data) publication (and processing) in the Humanities still tends to be done on a project-by-project basis
○ Individual websites
○ Larger “clubs” that you have to join (Europeana)
● The purpose in each case is what you might call “local” publication:
○ Making the data accessible for project-specific purposes, without concern for Interoperability or Reusability
● This has longer-term implications as well:
○ Projects die (URLs rot)
○ The data are not machine readable

The case of manuscript photography
● Since the mid-1990s, hundreds if not thousands of digital editions of medieval and renaissance texts have been published.
● Almost all of these contain high-quality digital photographs of the original artifacts, often with very detailed, research-based expert commentary and analysis (transcriptions, bibliographic and other descriptions, etc.)
● This represents, in theory, a potentially huge, extremely rich dataset for new cross-project work
○ Automatic scribe identification
○ Training sets for dating
○ History of the book

The case of manuscript photography
● Because the purpose of these photographs has been to support contextual analysis and/or to supply users with representations of the individual objects in question, very few are easily recovered or used by machines (see the sketch after this slide):
○ Few or no standards for metadata, APIs, etc.
○ Very few are explicitly connected to their expert descriptions
○ The relationship to other images and the publication status are not machine readable
● The result is a lost opportunity to create a “big capta” dataset of thickly described data from hundreds of individual “small data” projects
● The reason is that it was in nobody’s interest to contribute to the Commons
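The talk doesn’t name it, but one existing standard aimed at exactly this gap is IIIF (the International Image Interoperability Framework), which publishes images and their descriptive metadata in a machine-readable, cross-project form. As a hedged sketch of mine: roughly what a minimal IIIF Presentation 3.0 manifest for a single manuscript page looks like, with every identifier and value invented and the image-service blocks omitted for brevity.

```python
# A hedged sketch of machine-readable image publication using the IIIF
# Presentation API 3.0. All identifiers and values are invented; a real
# manifest also paints image resources onto the Canvas via an
# AnnotationPage, omitted here for brevity.
import json

manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/iiif/ms-1/manifest",  # hypothetical URL
    "type": "Manifest",
    "label": {"en": ["Example manuscript, fol. 1r"]},
    "metadata": [
        {"label": {"en": ["Date"]}, "value": {"en": ["s. viii (invented)"]}},
        {"label": {"en": ["Transcription"]},
         "value": {"en": ["https://example.org/editions/ms-1/1r"]}},
    ],
    "items": [
        {
            "id": "https://example.org/iiif/ms-1/canvas/1",
            "type": "Canvas",
            "height": 4000,
            "width": 3000,
            "items": [],  # AnnotationPages painting the photograph go here
        }
    ],
}

print(json.dumps(manifest, indent=2))
```

Because the manifest’s structure and vocabulary are shared across projects, a scribe-identification crawler could harvest images together with their expert descriptions from every edition that publishes one, which is precisely the cross-project reuse described above.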
So what to do?
● The solution is to accept the traditional nature and use-case involved in the production and consumption of Humanities research data
○ I.e. recognise that FAIR must accommodate the small, thick, and slow as easily as it does the big, stand-alone examples from STEM
● That means that we have to either
○ Work within the traditional Humanities research workflow, or
○ Encourage traditional Humanities researchers to work within ours
● As long as FAIR data publication means, in essence, publishing small, thick, and slow data twice (once in context and once without), we will never fully reap the benefit of these important and potentially huge cultural datasets

We’ve been here before
● The New English Dictionary provides a non-digital model for this
○ Based on “historical principles” (i.e. definitions drawn from and supported by historical quotations)
○ A massive crowd-sourced big-data collection: thousands of volunteers gathering 1.8 million quotation slips from thousands of books prepared by generations of authors, scholars, and publishers (i.e. small-data datasets)
○ In essence, an analogue version of what we want to do digitally

We’ve been here before
● They had the same problem
○ They discovered almost immediately after setting up the reading programme that the texts they were planning to use were unsuitable
■ Not available in modern editions
■ Of poor or difficult-to-determine quality
○ In other words, they discovered that they needed to improve and standardise the small datasets from which they were going to draw their big-data records.

We’ve been here before
● The solution? Create a platform for new editions
○ Text societies and publishers published editions that met the NED’s requirements
○ Leading scholars were encouraged to edit (and later re-edit) the texts the dictionary needed
○ A very symbiotic relationship developed between what was going on in historical textual research at the time and the needs of this big-data dictionary
● The result was an increase in high-quality small-data editions and a better big-data dataset for the NED

We’ve been here before
● What we need is something similar for the digital age
○ A workflow that encourages small-data researchers to prepare their datasets in a way that
■ Respects their traditional requirements for the intensive curation and analysis of individual data points or small datasets
■ Opens these small, thick, and slow datasets up to big-data analysis
■ Does not increase (and preferably reduces) the cost of production, publication, and maintenance
● A workflow in which suitability for “big capta” research is inherent in the “small data” publication workflow rather than a separate step.

What can I do?
● Not going to solve this problem in this paper
○ FAIR data was the result of a cross-disciplinary team working over several years through many rounds of consultation:
■ Largely focussed on fields in which data are abstractable/extractable
■ In which practitioners were in theory committed to data-sharing
■ In which “data” was a concept they were comfortable with
○ The NED/OED was a major, society-sponsored, national effort focussed on a single dataset.
● We also need to get away from the idea that things would improve if only people would use my solution
○ A major issue in Humanities data is ring-fencing: my project, our club

What can I do?
● In the next paper I’m going to show a conceptual prototype of one possible approach to FAIR data publication in the Humanities
○ Not the solution but a solution to something we really haven’t thought about that much.
● But real change is going to require a system-wide cultural shift
○ The development of tools, systems, and practices that are themselves FAIR
○ The development of a culture that considers non-FAIR data to be infra dig
○ But a shift that does both of these while understanding and respecting
■ The value and purpose of Humanities research
■ The nature and purpose of Humanities data collection (including the reluctance to call data “data”)

Questions to ask yourself
● Are my data findable?
F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.
● Are my data accessible?
A1. (meta)data are retrievable by their identifier using a standardized communications protocol [that]
A1.1. is open, free, and universally implementable.
A1.2. allows for an authentication and authorization procedure, where necessary.
A2. metadata are accessible, even when the data are no longer available.
● Are my data interoperable?
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
● Are my data reusable?
R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.

Questions
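A practical coda to the findability and accessibility questions (an illustration of mine, not part of the talk): because DOIs resolve over plain HTTP and doi.org supports content negotiation, a few lines of code can confirm that machine-readable metadata actually come back for an identifier. The sketch uses this talk’s own DOI from the title slide; the CSL-JSON media type is a DataCite/Crossref convention rather than anything the FAIR principles themselves mandate.

```python
# A minimal check of F1/A1 in practice: resolve a DOI over HTTP with
# content negotiation and see whether machine-readable metadata return.
# The media type is a DataCite/Crossref convention, not part of FAIR itself.
import json
import urllib.request

DOI = "10.5281/zenodo.3581520"  # this talk's DOI, from the title slide

req = urllib.request.Request(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(req) as resp:
    meta = json.load(resp)

# A title and author list coming back means the record is findable by
# its identifier and accessible over a standard protocol.
print(meta.get("title"))
print([author.get("family") for author in meta.get("author", [])])
```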