DATA PAPER

By the People Crowdsourcing Datasets from the Library of Congress

VICTORIA VAN HYNING, LAUREN ALGEE, MASON JONES, CARLYN OSBORN, TREVOR OWENS, LAUREN SEROKA, ABBY SHELTON*

*Author affiliations can be found in the back matter of this article.

CORRESPONDING AUTHOR: Victoria Van Hyning, College of Information Studies, University of Maryland, College Park, USA. vvh@umd.edu

KEYWORDS: crowdsourcing; text; transcription; cultural heritage; accessibility; Handwritten Text Recognition

TO CITE THIS ARTICLE: Van Hyning, V., Algee, L., Jones, M., Osborn, C., Owens, T., Seroka, L., & Shelton, A. (2022). By the People Crowdsourcing Datasets from the Library of Congress. Journal of Open Humanities Data, 8: 5, pp. 1–10. DOI: https://doi.org/10.5334/johd.67

ABSTRACT

The By the People (BTP) datasets comprise text of selected collections of the Library of Congress (LOC) created by volunteers in the By the People crowdsourced transcription program, which invites public transcription of historical documents. All transcriptions are created and reviewed by volunteers in a consensus-based model in which two or more volunteers must agree on a transcription for it to be considered complete. Resulting transcriptions are added to the digital collections alongside the images to enable search and accessibility of the collections. Additionally, completed transcription “campaigns” are published as freely downloadable datasets of .CSV files containing all campaign transcriptions, as well as minimal metadata. The datasets can support a multitude of purposes including computational research in fields such as history, linguistics, economics, and political science.
(1) OVERVIEW

REPOSITORY LOCATION

Library of Congress, Washington, DC, USA. All By the People campaign datasets are available at this URL as they are published: https://www.loc.gov/search/?fa=contributor:by+the+people+%28program%29. The seven datasets available at time of publication are detailed below.

CONTEXT

The By the People (BTP) datasets comprise text of selected collections of the Library of Congress (LOC) created by volunteers in the crowdsourced transcription program, which invites public transcription of historical documents. BTP has two indivisible goals:

1. Engage virtual volunteers in meaningful opportunities to deeply explore and learn from the LOC’s collections.
2. Enhance discoverability and usability of LOC collections through volunteer-created text (Shelton, 2021).

As of 1 November 2021, By the People contributors had completed transcriptions for 392,000 images, and transcribed an additional 102,000 images which await peer review. Monthly updates are available at https://crowd.loc.gov/about/. Completed transcriptions are published alongside the images in loc.gov to enable image-level word search and improve readability and accessibility of the collections. These image-level transcriptions are also individually downloadable as .TXT files. Additionally, text from completed transcription “campaigns” is published in bulk as downloadable .CSV datasets. To date, seven completed campaigns, including a total of 23,316 transcriptions, have been published as datasets for bulk download, and these are the primary subject of this paper. These seven campaigns can be found in Table 1.

Table 1 Dataset Description.1

| DATASET | SUMMARY OF CAMPAIGN | DATASET URL | LCCN | OBJECT NAME | NUM. OF TRANSCRIPTIONS |
|---|---|---|---|---|---|
| Wm. Oland Bourne Papers | Selection from the papers of reformer, poet, editor, and clergyman William Oland Bourne (1819–1901). Includes narratives submitted by disabled Union veterans in a Left-hand Penmanship contest sponsored by Bourne as well as Civil War reminiscences by soldiers and sailors in Central Park Hospital, New York, N.Y. | https://www.loc.gov/item/2019667237 | https://lccn.loc.gov/2019667237 | civil-war-soldiers-disabled-but-not-disheartened_2020-12-10.csv | 5,159 |
| Branch Rickey Papers | Selections from the papers of Branch Rickey, major league baseball manager and executive, consisting of scouting reports from the 1950s and 1960s. They are mostly concentrated in the years 1951–1956 and 1962–1963, while Rickey was associated, respectively, with the Pittsburgh Pirates and St. Louis Cardinals. | https://www.loc.gov/item/2019667234/ | https://lccn.loc.gov/2019667234 | branch-rickey-scouting-reports | 1,926 |
| Samuel J. Gibson Diary and Correspondence | The papers of Union soldier Samuel J. Gibson (1833–1878) consist of a letter and diary written by Gibson in 1864 while serving with Company B, 103rd Pennsylvania Infantry Regiment, and as a prisoner at Camp Sumter in Georgia, the Confederate prisoner of war camp commonly known as Andersonville Prison. | https://www.loc.gov/item/2019667238/ | https://lccn.loc.gov/2019667238 | hell-upon-earth-gibson_2020-12-10.csv | 90 |
| Carrie Chapman Catt Papers | The papers of suffragist, political strategist, and pacifist Carrie Lane Chapman Catt (1859–1947) span the years 1848–1950, with the bulk of the material dating from 1890 to 1920. The collection consists of approximately 9,500 items (11,851 images), most of which were digitized from 18 microfilm reels. | https://www.loc.gov/item/2019667239/ | https://lccn.loc.gov/2019667239 | carrie-chapman-catt-papers-2020-12-07.csv | 5,760 |
| Elizabeth Cady Stanton Papers | The papers of suffragist, reformer, and feminist theorist Elizabeth Cady Stanton (1815–1902) cover the years 1814 to 1946, with most of the material concentrated between 1840 and 1902. | https://www.loc.gov/item/2020445592 | https://lccn.loc.gov/2020445592 | elizabeth-cady-stanton-papers-2021-04-19.csv | 3,456 |
| Rosa Parks Papers | Selections from the papers of Rosa Parks (1913–2005), including personal and family correspondence, personal writings and reflections, and ephemera from her speaking engagements and honors. | https://www.loc.gov/item/2020445590 | https://lccn.loc.gov/2020445590 | rosa-parks-in-her-own-words-2021-04-19.csv | 1,769 |
| Susan B. Anthony Papers | The papers of reformer and suffragist Susan B. Anthony (1820–1906) span the period 1846–1934 with the bulk of the material dating from 1846 to 1906. | https://www.loc.gov/item/2020445591 | https://lccn.loc.gov/2020445591 | susan-b-anthony-papers-2021-04-19.csv | 5,156 |

1 This table includes all datasets available at the time of publication.

From October 2018 to October 2021 the BTP team launched 24 thematic campaigns of content for the public to transcribe. Three campaigns for staff contribution have also launched since March 2020. These comprise over half a million images from LOC collections across four curatorial Divisions including Manuscript, American Folklife Center, Law Library of Congress, and Rare Book and Special Collections. Materials reflect the breadth and depth of the Library’s collections and include selections from the personal papers of American presidents; archives of women’s suffrage, civil rights, and abolition activists; writings of military leaders and veterans; papers of ethnomusicologist Alan Lomax; baseball scouting reports of Branch Rickey; legal documents; literary drafts of Walt Whitman; and records of occult experimentation from Harry Houdini’s collection. Many of the materials are handwritten, but many campaigns also include typed text, most of which is not amenable to extraction via optical character recognition (OCR).

By the People launched on 24 October 2018 as a pilot of LC Labs, the LOC digital innovation unit, and in January 2020 became a permanent program of the Library’s Digital Content Management Section. A team of three By the People community managers runs the program and supports and encourages volunteers through newsletters, a discussion forum,2 Twitter,3 email, and virtual and in-person events including challenges and transcribe-a-thons.4 Extensive “How-to” instructions5 for transcription, review and tagging are available in English and Spanish. BTP and the underlying open source codebase Concordia6 are developed iteratively, and integrate user-centered changes in response to formal and informal user research and feedback and the requirements of different collection materials (Ferriter et al., 2019). The transcription conventions have not changed substantially, but additional guidance has been added to assist transcription of specific types of text, such as cross-writing.

(2) METHOD

STEPS

All transcriptions are created and reviewed by volunteers using a consensus model, in which at least two volunteers must agree on a transcription for it to be marked as complete. All activity takes place online at crowd.loc.gov. Image 1 depicts a page that has been transcribed and reviewed on the BTP interface.

Anyone can transcribe without an account. Registered users can review other volunteers’ transcriptions, add tags, and access their contribution history. Volunteers browse to select assets to transcribe or review. Participants can see all pages in sequential order and their current status: “Not started”, “In Progress”, “Needs Review”, or “Completed.” Volunteers are asked to preserve all original spelling, punctuation, and line breaks, except in cases where words break over lines or pages. These conventions create transcriptions that are word searchable, amenable to screen reader technology, and a good starting point for those wishing to use the data for handwritten text recognition training systems.
To make the data most useful for Handwritten Text Recognition (HTR), a user would have to go through each transcription and edit words breaking over lines or pages to reflect the original layout of the page.

Images are imported into BTP via the loc.gov Application Programming Interface (API). A loc.gov item, consisting of one or more images, is also called an “item” in By the People, while the individual images are called “assets.” Assets may show more than one page of a document, though in BTP outreach assets are often referred to colloquially as “pages.”

Image 1 Screenshot of a completed transcription from the Rosa Parks Papers in By the People.7

2 https://historyhub.history.gov/community/crowd-loc Accessed November 22, 2021.
3 LOC Crowdsourcing, @Crowd_LOC. https://twitter.com/Crowd_LOC Accessed November 22, 2021.
4 https://crowd.loc.gov/resources/ Accessed November 22, 2021.
5 https://crowd.loc.gov/help-center/welcome-guide/ Accessed November 22, 2021.
6 https://github.com/LibraryOfCongress/concordia Accessed November 22, 2021.
7 https://crowd.loc.gov/campaigns/rosa-parks-in-her-own-words/writings-notes-and-statements/mss859430227/mss859430227-3/ [Webpage]. Accessed November 22, 2021.

Campaigns consist of chronological or thematic buckets of items called “projects”.
The campaign content may come from a single collection or unite materials across collections, as in the case of Walt Whitman.8 Projects can also be linked to Topics, connecting related content across campaigns. Current topics include the Civil War, presidential papers, and women’s suffrage.

BTP is a pass-through application that does not automatically sync with loc.gov. Changes made to images or text in one will not automatically appear in the other. Once transcriptions go through quality control (see below) they are manually exported from BTP, ingested into long-term preservation storage,9 published as part of the items on loc.gov, and packaged as datasets. An attribution is included at the end of each transcription’s .TXT file: “Transcribed and reviewed by contributors participating in the By the People project at crowd.loc.gov” (Van Hyning, 2020).

8 https://crowd.loc.gov/campaigns/walt-whitman/ [Webpage]. Accessed November 22, 2021.

QUALITY CONTROL

LOC subject specialists spot-check completed campaigns before publication of the transcriptions for quality control, as is exemplified in the completed campaigns included in Table 1. Often their review begins when the campaign does, so that early interventions can be made in the instructions or campaign context if a repeated error is spotted. The number of assets checked varies campaign-to-campaign and is determined by the specialist. During their review before the data is exported, the specialists edit significant errors they encounter, such as mistranscribed words that change the meaning of the text, and supply missing words, phrases or (in rare cases) whole pages. Very few instances of vandalism have occurred to date. At least five percent of all currently available datasets were reviewed by LOC staff; Gibson and Parks were reviewed in their entirety. Two additional quality control assessments have been undertaken and are summarized below.

The Branch Rickey papers contain 1,926 pages of baseball scouting reports; some are memos, others tabular. The materials are relatively human-legible, but OCR did not meet curatorial needs due to thin paper, often faint type, and other issues. Algee et al. (2019) sampled 240 characters from the midpoint of all narrative Rickey transcriptions and found 98% accuracy. Typical errors included volunteers correcting spelling (against the BTP conventions), introducing typos, or not conforming to BTP formatting conventions.

In 2020, Manuscript Division subject specialist Michelle Krowl analyzed all 90 images of Samuel J. Gibson’s diary transcribed in the “This Hell-upon-earth of a prison” campaign and logged all of her changes in a spreadsheet. The handwriting and spelling are representative of other nineteenth-century materials included in BTP. Krowl identified 703 character-level errors in total. Most were minor: expansions, such as changing “&” to “and”; typos or misspellings by volunteers; and correction of original spellings. Parts of some pages are so damaged or ambiguous that a definitive reading cannot be provided by the volunteers or specialists, and these passages were not counted in the error rate. One major source of error was an untranscribed page (281 characters) marked as “Complete.” The overall character error rate for Gibson before Krowl made edits was 703/152,017, or approximately 0.46% (0.0046 expressed as a proportion).

Table 2 presents an excerpt of the Rosa Parks data and abridged transcriptions.

We provide the Rosa Parks10 dataset description here as a representative example of the seven currently available datasets, and template for future releases:

This dataset includes: .ZIP file containing a .CSV file and a README file.
- rosa-parks-in-her-own-words-2021-04-19.csv: a .CSV containing campaign, project, item, itemID, asset, and asset status metadata, as well as an image link, and the volunteer-generated transcription. This .CSV is the direct export of the “Rosa Parks: In Her Own Words” campaign.
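As a minimal sketch of how a campaign .CSV of the kind described above might be read, the Python below uses only the standard library. The column names follow the headers printed in Table 2, but their exact spelling and capitalization in the published files is an assumption, so the helper lowercases them. The function name `load_transcriptions`, the placeholder image link, and the abbreviated in-memory sample row are illustrative and are not part of the dataset.

```python
import csv
import io


def load_transcriptions(source):
    """Return one dict per asset, keyed by lowercased column name.

    `source` is any open text file or file-like object containing a
    campaign export. Missing values are normalised to empty strings.
    """
    rows = []
    for record in csv.DictReader(source):
        rows.append({
            key.strip().lower(): (value or "")
            for key, value in record.items()
            if key is not None  # ignore overflow cells on malformed rows
        })
    return rows


# Abbreviated stand-in for one row of the Rosa Parks export.
# Header spelling is assumed from Table 2; the URL is a placeholder.
SAMPLE = io.StringIO(
    "CAMPAIGN,PROJECT,ITEM,ITEMID,ASSET,AssetStatus,DownloadURL,TRANSCRIPTION\n"
    '"Rosa Parks: In Her Own Words","Writings, Notes, and Statements",'
    '"Notebooks",mss859430232,mss859430232-44,completed,'
    'http://example.org/placeholder.jpg,"1/26/89\nStar Makers"\n'
)

rows = load_transcriptions(SAMPLE)
for row in rows:
    # Transcriptions keep the original line breaks, so they may span lines.
    print(row["asset"], row["assetstatus"], len(row["transcription"].splitlines()))
```

For a downloaded file, the same function could be called inside `with open(path, newline="", encoding="utf-8") as handle:`; the UTF-8 encoding is itself an assumption that should be checked against the README.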
9 For more information on Library of Congress digital collection management practices see https://www.loc.gov/programs/digital-collections-management/about-this-program/ [Webpage]. Accessed November 22, 2021.
10 https://www.loc.gov/item/2020445590 [Webpage]. Accessed November 22, 2021.

Table 2 Rosa Parks .CSV File Excerpt. This table demonstrates the file structure and content.

| CAMPAIGN | PROJECT | ITEM | ITEMID | ASSET | AssetStatus | DownloadURL | TRANSCRIPTION |
|---|---|---|---|---|---|---|---|
| Rosa Parks: In Her Own Words | Writings, Notes, and Statements | Rosa Parks Papers: Writings, Notes, and Statements, 1956–1998; Notebooks; 1961–1962, 1985–1990, undated | mss859430232 | mss859430232-45 | completed | http://tile.loc.gov/image-services/iiif/service:mss:mss85943:0018:16:0045/full/pct:50/0/default.jpg | Wednesday 1/25/89, 1 PM Vivian’s death Funeral service Sat. 10 AM 1/28/89 Union Baptist Church Newfield Ave. Home Address, 53 Bonner Street Stamford |
| Rosa Parks: In Her Own Words | Writings, Notes, and Statements | Rosa Parks Papers: Writings, Notes, and Statements, 1956–1998; Notebooks; 1961–1962, | mss859430232 | mss859430232-44 | completed | http://tile.loc.gov/image-services/iiif/service:mss:mss85943:0018:16:0044/full/pct:50/0/default.jpg | 1/26/89 Star Makers Kelly & Co Agents Cheryl Kagan Mich Bank |
| Rosa Parks: In Her Own Words | Writings, Notes, and Statements | Rosa Parks Papers: Writings, Notes, and Statements, 1956–1998; Notebooks; 1961–1962. | mss859430232 | mss859430232-43 | completed | http://tile.loc.gov/image-services/iiif/service:mss:mss85943:0018:16:0043/full/pct:100/0/default.jpg | Dr. Anderson 24535 N. Carolina Southfield MI 48075 met her at Farmer |
| Rosa Parks: In Her Own Words | Writings, Notes, and Statements | Rosa Parks Papers: Writings, Notes, and Statements, 1956–1998; Notebooks; 1961–1962, 1985–1990, undated | mss859430232 | mss859430232-42 | completed | http://tile.loc.gov/image-services/iiif/service:mss:mss85943:0018:16:0042/full/pct:50/0/default.jpg | New Bride + Groom Mr. & Mrs. Anderson Bowles Rev. Bynum’s granddaughter Kelly 745–4795 have be there at a. ar |
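The two quality assessments described under Quality Control both operate on transcription text of the kind shown in the TRANSCRIPTION column, and their arithmetic can be sketched briefly. The Python below is an illustration under stated assumptions, not the procedure Algee et al. or Krowl actually used: the paper says only that 240 characters were sampled “from the midpoint”, so centring the window on the midpoint is a guess, and both function names are hypothetical. The error and character counts are the Gibson figures reported in the text.

```python
def midpoint_sample(text, length=240):
    """Return up to `length` characters centred on the midpoint of `text`.

    Centring the window is an assumption; a window starting at the
    midpoint would be an equally plausible reading of the paper.
    """
    if len(text) <= length:
        return text
    start = (len(text) - length) // 2
    return text[start:start + length]


def character_error_rate(errors, total_characters):
    """Errors divided by total characters, as a proportion between 0 and 1."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return errors / total_characters


# Gibson diary figures as reported: 703 errors in 152,017 characters.
gibson_rate = character_error_rate(703, 152_017)
print(f"{gibson_rate:.4f} as a proportion, {gibson_rate:.2%} as a percentage")
```

Keeping the rate as a proportion and formatting it only at display time avoids mixing up 0.0046 (the proportion) with 0.46% (the percentage).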
The README file provides more detailed information about each data field. These descriptions appear in the “Summary” field for each dataset. CSV files are all structured in the same manner, and named according to this formula: Campaign-name-YYYY-MM-DD.csv (e.g. carrie-chapman-catt-papers-2020-11-16.csv). The Branch Rickey datasets include an additional version of the data (V1) with two additional .CSVs in which the data are sorted by document format based on the categories established for the quality analysis described above. The workflow for transcription and dataset publication is outlined in Figure 1.

FORMAT NAMES AND VERSIONS

README and .CSV.
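The file-naming formula Campaign-name-YYYY-MM-DD.csv can be turned into a small parser, sketched here in Python. The pattern is an assumption built from the formula and from the object names in Table 1: it accepts either a hyphen or an underscore before the date, because some listed names (for example hell-upon-earth-gibson_2020-12-10.csv) use an underscore, and it returns `None` for names that carry no date at all. The function name `parse_dataset_name` is hypothetical.

```python
import re
from datetime import date

# Campaign slug, then "-" or "_", then an ISO-style date and ".csv".
DATASET_NAME = re.compile(
    r"^(?P<campaign>.+?)[-_](?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv$"
)


def parse_dataset_name(filename):
    """Split a dataset file name into (campaign slug, export date).

    Returns None when the name does not follow the documented formula.
    """
    match = DATASET_NAME.match(filename)
    if match is None:
        return None
    export_date = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return match["campaign"], export_date


for name in (
    "rosa-parks-in-her-own-words-2021-04-19.csv",
    "hell-upon-earth-gibson_2020-12-10.csv",
    "branch-rickey-scouting-reports",
):
    print(name, "->", parse_dataset_name(name))
```

Returning `None` rather than raising keeps the helper usable for sorting a mixed folder of downloads, at the cost of requiring the caller to check the result.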
CREATION DATES

2018-10-24 – 2021-04-19

Figure 1 Transcription workflow diagram: The flow of transcription data from creation to publication and the different constituents who make this work possible.

DATASET CREATORS

The content of the datasets is the result of transcription and review by anonymous and registered By the People volunteers, with minor intervention by Library of Congress staff, including the By the People team.

LANGUAGE

English is the primary language for each of the current datasets, though they contain small amounts of other languages.

LICENSE

The Rights and Access information on the dataset page refers generally to Library of Congress-published datasets overall. The README provides the following information specific to the By the People datasets: “All contributions to the By the People application are released into the public domain as they are created. Anyone is free to use and re-use the datasets.”

REPOSITORY NAME

Library of Congress

PUBLICATION DATE

• Rickey V1, 2019-03-22; Rickey V2, 2020-06-16
• Bourne, 2020-12-10
• Gibson, 2020-12-10
• Catt, 2020-11-16
• Parks, 2021-04-19
• Anthony, 2021-04-19
• Stanton, 2021-04-19

(4) REUSE POTENTIAL

In addition to powering search and accessibility on loc.gov, BTP transcriptions have many potential research uses. Algee et al. (2019) modeled possibilities using Voyant tools on the Rickey dataset, including word clouds and semantic analysis. The authors found that while Rickey’s most frequently used word was “good”, the usage was most often critical of a player’s abilities, as in “I think [Bob Wakefield] is a good man to get rid of” (30). Computational linguistics approaches, including semantic, sentiment, and word frequency analysis, are well supported by the data, as are traditional close-reading practices in the humanities (Seroka and Shelton 2021).
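A word frequency count of the kind mentioned above can be approximated in a few lines of standard-library Python. This is a rough sketch and not the Voyant workflow the authors used: the tokenizer, the stopword set, and the function name `word_frequencies` are assumptions, and the two sample strings are made-up placeholders (the first adapted from the quotation above), not rows from the Rickey dataset.

```python
import re
from collections import Counter

# Very simple tokenizer: runs of ASCII letters and straight apostrophes.
WORD = re.compile(r"[a-z']+")


def word_frequencies(transcriptions, stopwords=frozenset()):
    """Count lowercased word tokens across an iterable of transcription strings."""
    counts = Counter()
    for text in transcriptions:
        counts.update(
            token for token in WORD.findall(text.lower())
            if token not in stopwords
        )
    return counts


# Placeholder texts for illustration only.
sample = [
    "I think he is a good man to get rid of",
    "Good arm. Good speed.",
]
counts = word_frequencies(sample, stopwords={"i", "a", "is", "he", "to", "of"})
print(counts.most_common(3))
```

As the Rickey example shows, raw counts say nothing about how a word is used, so frequency lists are best read alongside the surrounding text.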
Speeches, diaries, and other personal writings in the suffragist and civil rights papers are ripe for deeper study, while the extensive body of letters in most collections, particularly among the interrelated suffragists, would lend itself to new network analyses. These datasets offer significant opportunities to study the accuracy and quality of crowdsourced transcriptions, and could be used in combination with the BTP transcription conventions and scaffolding, BTP discussion forum analysis, user surveys, and other qualitative and quantitative approaches to probe the efficacy of platform design and community engagement work in helping people learn about history, primary source use, paleography and more. Finally, these data have clear potential as training sets for improving machine learning, HTR of manuscripts, and OCR of various materials.

ACKNOWLEDGEMENTS

Support for the collections, engagement, and technology of BTP is the result of extensive collaboration between the BTP team, LC Labs, IT Design and Development, the Digital Collections Management and Services Division, collection curators, and many others working across the Library of Congress.

FUNDING INFORMATION

Staffing for BTP and the development of the Concordia platform is supported in part by the National Digital Library Trust Fund.

COMPETING INTERESTS

Victoria Van Hyning serves on the editorial board of JOHD. Mason Jones serves as a copy editor for JOHD.

AUTHOR CONTRIBUTIONS

• Victoria Van Hyning (corresponding author): Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Writing – original draft.
• Lauren Algee: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Software, Supervision, Visualization, Writing – original draft.
• Mason Jones: Writing – review & editing
• Carlyn Osborn: Conceptualization, Data Curation, Investigation, Methodology, Project Administration, Writing – review & editing
• Trevor Owens: Project administration, Resources, Supervision
• Lauren Seroka: Conceptualization, Software, Data Curation, Investigation, Methodology, Project Administration, Validation, Writing – review & editing
• Abigail Shelton: Conceptualization, Data Curation, Investigation, Methodology, Project Administration, Writing – review & editing

AUTHOR AFFILIATIONS

Victoria Van Hyning, orcid.org/0000-0001-6775-2870, College of Information Studies, University of Maryland, College Park, USA
Lauren Algee, Digital Content Management Section, Library of Congress, Washington, DC, USA
Mason Jones, orcid.org/0000-0002-8777-2315, College of Information Studies, University of Maryland, College Park, USA
Carlyn Osborn, Digital Content Management Section, Library of Congress, Washington, DC, USA
Trevor Owens, orcid.org/0000-0001-8857-388X, Digital Content Management Section, Library of Congress, Washington, DC, USA
Lauren Seroka, Digital Content Management Section, Library of Congress, Washington, DC, USA
Abby Shelton, Digital Content Management Section, Library of Congress, Washington, DC, USA

REFERENCES

Algee, L., Ferriter, M., & Van Hyning, V. (2019). “‘And the Crowd Goes Wild!’: Crowdsourcing Baseball History at the Library of Congress.” Archival Outlook, 4–5, 30. https://mydigitalpublication.com/publication/?i=623810&article_id=3494347&view=articleBrowser

Ferriter, M., Zwaard, K., Kamlley, E., Storey, R., Adams, C., Algee, L., Van Hyning, V., Bresner, J., Potter, A., Jakeway, E., & Brunton, D. (2019). “With One Heart”: Agile approaches for developing Concordia and crowdsourcing at the Library of Congress. The Code4Lib Journal, 46. https://journal.code4lib.org/articles/14901

Seroka, L., & Shelton, A. (2021, June 10). “Diving into Branch Rickey: Using a dataset of crowdsourced transcriptions as a tool for open research”. The Signal [Webpage]. Accessed November 22, 2021. https://blogs.loc.gov/thesignal/2021/06/diving-into-branch-rickey/

Shelton, A. (2021, October 21). “Using Crowdsourced Lincoln Transcriptions: An Interview with Jon White”. The Signal [Webpage]. Accessed November 22, 2021. https://blogs.loc.gov/thesignal/2021/10/jon-white/

Van Hyning, V. (2020, July 9). “Finding By the People Transcriptions in the Library’s Digital Collections”. The Signal [Webpage]. Accessed November 22, 2021. https://blogs.loc.gov/thesignal/2020/07/finding-by-the-people-transcriptions-in-the-librarys-digital-collections/

Published: 04 February 2022

COPYRIGHT: © 2022 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

Journal of Open Humanities Data is a peer-reviewed open access journal published by Ubiquity Press.