Mining oral history collections using music information retrieval methods

Webb, Sharon, Kiefer, Chris, Jackson, Ben, Baker, James and Eldridge, Alice (2017) Mining oral history collections using music information retrieval methods. Music Reference Services Quarterly, 20 (3-4). pp. 168-183. ISSN 1058-8167

This version is available from Sussex Research Online: http://sro.sussex.ac.uk/id/eprint/71250/

The Version of Record of this manuscript has been published and is available in Music Reference Services Quarterly, November 2017, http://dx.doi.org/10.1080/10588167.2017.1404307

Abstract

Recent work at the Sussex Humanities Lab, a digital humanities research program at the University of Sussex, has sought to address an identified gap in the provision and use of audio feature analysis for spoken word collections. Traditionally, oral history methodologies and practices have placed emphasis on working with transcribed textual surrogates rather than the digital audio files created during the interview process. This provides pragmatic access to the basic semantic content, but precludes access to other potentially meaningful aural information; our work addresses the potential for methods to explore this extra-semantic information by working with the audio directly. Audio analysis tools, such as those developed within the established field of Music Information Retrieval (MIR), provide this opportunity. This paper describes the application of audio analysis techniques and methods to spoken word collections. We demonstrate an approach using freely available audio and data analysis tools, which have been explored and evaluated in two workshops. We hope to inspire new forms of content analysis which complement semantic analysis with investigation into the more nuanced properties carried in audio signals.

Mining oral history collections using music information retrieval methods

Webb, S., Kiefer, C., Jackson, B., Baker, J. & Eldridge, A.
Music Reference Services Quarterly (Taylor and Francis)

1 Introduction

The Sussex Humanities Lab is a multidisciplinary research program tasked with embedding digital humanities into research and teaching practices across the University of Sussex. As a multidisciplinary team we have unique access to varied expertise and skills that enable us to carry out experimental work in an agile and proficient manner. One experimental project problematized the predominant approach within digital humanities – a largely text-based domain – to treat digital audio files as text.1 We applied Music Information Retrieval (hereafter MIR) techniques to oral history interviews in order to develop new approaches that complement text-based methods of extracting semantic information from spoken word collections. As an established field with established methods, the MIR community provides open source tools, code and libraries that allowed us to test our hypothesis – to treat audio as audio – and to establish its practical application to spoken word collections. Having established the potential utility of MIR techniques for problems in both oral history and the digital humanities, we developed a workshop framework that aimed to explore the utility of this approach for a variety of humanities scholars.

1 Notable exceptions include Tanya Clement and Stephen McLaughlin, “Measured Applause: Toward a Cultural Analysis of Audio Collections,” Cultural Analytics 1, no. 1 (2016), http://culturalanalytics.org/2016/05/measured-applause-toward-a-cultural-analysis-of-audio-collections/; Tanya Clement, Kari Kraus, Jentery Sayers and Whitney Trettien, “: The Intersections of Sound and Method,” Proceedings of Digital Humanities 2014, Lausanne, Switzerland, https://terpconnect.umd.edu/~oard/pdf/dh14.pdf

Taking oral history collections from the University of Sussex ‘Archive of Resistance’ as a test case, we led two distinct groups, at two separate workshops, through the process of using MIR approaches to categorize, sort and discover audio collections. This process enabled us to:

- Build a set of Python workbooks that provide a conceptual and practical introduction to the application of MIR techniques (e.g. feature extraction and clustering) to spoken word collections.
- Work through, develop and amend use cases.
- Learn lessons, from two distinct communities and perspectives, about the potential benefits – or otherwise – of our approach.

Both workshops, the first at Digital Humanities 2016 (Krakow, July 2016) and the second at London College of Communication (March 2017),2 provided points of clarification and discussion that enabled us to identify areas that require further work. This article is therefore not a final report on our findings; instead, it is an attempt to capture the hypothesis and problem statement, the experimentation and methodology used, and our preliminary findings.
It also describes a method for workshop facilitation that utilizes a) virtual environments, to reduce setup time for participants and facilitators, and b) Jupyter Notebooks, to enable participants to run sophisticated and complex code in a supported learning environment.3

2 ‘Data-Mining the Audio of Oral History: A Workshop in Music Information Retrieval’ at London College of Communication (March 2017), https://web.archive.org/web/20171003144121/http://www.techne.ac.uk/for-students/techne-events/apr-2015/data-mining-the-audio-of-oral-history-a-workshop-in-music-information-retrieval (accessed 3 Oct. 2017)

3 Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, Jupyter Development Team, “Jupyter Notebooks – a publishing format for reproducible computational workflows,” Positioning and Power in Academic Publishing: Players, Agents and Agendas (2016): 87-90. doi: 10.3233/978-1-61499-649-1-87

This article proceeds in five parts. First, in order to give some context to this work, we provide background information on the Sussex Humanities Lab. Parts two, three, and four consider our hypothesis and motivations, the workshops we developed and the technologies we used. The fifth and final part outlines our preliminary findings, both from mining oral history collections using audio feature analysis and from delivering workshops on MIR in a digital humanities context.

1.2 Background

The authors are current or former members of the Sussex Humanities Lab (hereafter SHL): a four-year university program, launched in 2015 at the University of Sussex, which seeks to intervene in the digital humanities. It is a team of 31 faculty, researchers, PhD students and technical and management staff, working in a state-of-the-art space – the Digital Humanities Lab. SHL collaborates with a network of associates across and beyond the university, nationally and internationally, and is radically cross-disciplinary in its approach. The aim of SHL is to engage with the myriad new and developing technologies, to explore the benefits these offer to humanities research, and to ask what technology will do to the arts and humanities. To achieve this, SHL is divided into four named strands of activity: digital history and digital archiving; digital media and computational culture; digital technologies and digital performance; digital lives and digital memory. However, the intention is to make sure that our research crosses and links these strands, to develop fruitful methodological and conceptual intersections. The work described here grows from this multidisciplinary ethos, since the project combines the diverse interests and expertise of the authors.
It stems from the inherently collaborative environment facilitated by SHL and is influenced by two strands in particular: digital history and digital archiving, and digital technologies and digital performance.

2 Hypothesis: problem statement and motivation

Oral history best practice publications and resources often focus on the application and use of digital methods and tools to create, store and manage audio, audio-visual, and subsequent text files. They recommend, for example, standards for file formats, metadata and text encoding, software for audio-to-text conversion, and database and content management systems. And whilst the privileged position of text has been challenged,4 the majority of oral history projects still rely on the creation of transcripts to carry out analysis using digital tools and methods. This focus on textual surrogates rather than audio sources denies – according to Alessandro Portelli – the ‘orality of the oral source’.5 It also denies – or at least underplays – the inherently interpretative nature of transcription. Of course, textual encodings or transcripts of oral history interviews do have advantages: they are easier to anonymize, distribute, store and retrieve than digital audio files, and there are established techniques for analyzing them as text and/or data. But as a consequence of this privileging of the “text”, a significant proportion of oral history collections and the tools provided to navigate and analyze them do not support navigation or analysis of the digital audio files captured during interviews. Instead, they focus on how to record oral history interviews, the management of digital files and the creation of transcripts using both semi-automated audio-to-text tools and manual transcription.6

4 For example, Doug Boyd, "OHMS: Enhancing Access to Oral History for Free," Oral History Review 40, no. 1 (Winter-Spring 2013). doi:10.1093/ohr/oht031

5 Ronald Grele, “Oral History as Evidence,” in History of Oral History: Foundations and Methodology, edited by Thomas L. Carlton, Lois E. Myers and Rebecca Sharpless (UK, 2007), 69.

6 For example, ‘Oral History in a Digital Age’ http://ohda.matrix.msu.edu/

While using text surrogates is an established tradition, the oral history community is beginning to question the privileging of this text-based approach.
This is evident, for example, in the UK's Oral History Society’s 2016 conference call for papers, which stated that the ‘auditory dimension of oral history [has] for decades [been] notoriously underused’.7 While this move is welcome, it remains the case that, as per Clement et al.’s 2014 survey of the field, currently ‘there are few means for humanists interested in accessing and analyzing [spoken] word audio collections to use and to understand how to use advanced technologies for analyzing sound’.8 Moreover, these technologies have the potential to help resolve some of the backlog in archives and libraries of ‘un-described, though digitized, audio collections’.9

7 “Beyond Text in the Digital Age? Oral history, images and the Written Word,” Oral History Society, 2016 Conference CFP: https://web.archive.org/web/20161214045140/http://www.ohs.org.uk/conferences/2016-conference-beyond-text-in-the-digital-age/ (accessed 27 June 2017)

8 Tanya E. Clement, David Tcheng, Loretta Auvil and Tony Borries, “High Performance Sound Technologies for Access and Scholarship (HiPSTAS) in the Digital Humanities,” Proceedings of the Association for Information Science and Technology 51 (2014): 1–10. doi:10.1002/meet.2014.14505101042

9 Ibid.

It is in this context, therefore, that we decided to explore the potential for direct audio analysis of oral history interviews. This work represents a move towards analyzing oral content in the context in which it was created. It also challenges the privileging of text, as it focuses on extracting information from the audio signal directly. We are particularly interested in how such techniques could complement the semantic content obtained through manual or automated transcription. On the basis that comparable methods have been developed for digital recordings of music for some time, we explored the field of MIR for possible solutions. We are explicitly carrying out a study of computational techniques for the analysis of oral history records with the aim of extracting quantitative results to assist research. The MIR techniques that we use create quantitative information (i.e. timestamps that locate specific features and/or events within the audio) that could enhance and stimulate new directions in the qualitative research of others.

MIR draws from digital audio signal processing, pattern recognition, psychology of perception, software system design, and machine learning to develop algorithms that enable computers to ‘listen’ to and abstract high-level, musically meaningful information from low-level audio signals. Just as human listeners can recognize pitch, tempo, chords, genre, song structure, etc., MIR algorithms – to a greater or lesser degree – are capable of recognizing and extracting similar information, enabling systems to perform extensive sorting, searching, music recommendation, metadata generation, and transcription on vast data sets.
Deployed initially in musicology research and more recently for automatic recommender systems, the research potential of MIR tools for non-musical audio data mining is being recognized but has yet to be fully explored in the humanities. We chose to develop a one-day workshop related to this topic because the approach allowed us to explore our hypothesis and methods with different users, both expert and novice, from different disciplines (digital humanities and oral history), and to garner important, domain-specific feedback.

3 Experimentation: workshops and method

The workshops were intentionally experimental in nature (especially from a content analysis perspective), but were developed and delivered with a number of use cases in mind. We framed these use cases around three distinct contexts: the digital object, the content of the interview and the environment in which the interviews were carried out. Upon completion of the workshops we revisited these use cases. The following questions represent a series of potential applications for the use of MIR in the context of analyzing oral history collections. They are based on a synthesis of both our initial scoping work and our interactions with workshop attendees. A known problem within oral history and digital humanities is the time- and resource-intensive process of cataloguing and analyzing oral history collections. Therefore, although for practical reasons small collections were used in the workshops, the use cases developed and methods adopted are fully scalable:

1. Context – the object:
1.1. What technical metadata or technical information can we automatically extract from a digital audio file?
1.2. Can this new information enhance what we know about an object and improve search and discoverability?
1.3. Can we detect the use of different recording devices as a means of clustering and classifying two temporally distinct data sets?
2. Context – the content (i.e. what type of content analysis can we carry out):
2.1. What descriptive metadata can we automatically extract from the digital audio file? For example, can we create a feature which distinguishes interviewer from interviewee? Could we use this to automatically detect a specific voice within a collection?
2.2. Can we reveal anything about the relationship or dynamic between interviewer and interviewee? For example, can we detect overlaps or interruptions by the interviewer? Can this reveal anything about gender roles and/or behaviors?
2.3. Can we augment our ability to detect emotion by analyzing changes in rhythm, timbre, tone, tempo? Is it therefore possible to identify song, poetry, speech, crying, laughter, etc.?
2.4. Can we automatically cluster acoustically similar audio/material/objects? For which properties might this be most robust?
2.5. Can we use techniques from musical analysis to reveal structure in spoken audio, for example to pull apart different voices, and how might this be useful for oral history collections?
3. Context – the environment:
3.1. Can we detect any environmental features in the audio stream? What might this tell us about where the interview took place?
3.2. Can we use source separation, developed to separate parts (e.g. drums, vocals, keyboards in pop music), to pull apart intertwined ‘voices’ or ‘noises’? Can we use this to remove background noise that provides context to recordings? How might this affect the analysis of interviews?

Enabling these kinds of preprocessing and description-oriented steps affords new possibilities in oral history research and archival management. For example, they enable access to under-described repositories, such as the wealth of content created by the YouTube generation, opening new opportunities for empirical analysis and for supporting qualitative research (e.g. gender studies).

The first workshop, ‘Music Information Retrieval Algorithms for Oral History Collections’, was facilitated by the authors in July 2016 at the Digital Humanities 2016 conference.10 The workshops introduced participants to specialist software libraries and applications used by MIR researchers.11 All tools used in the workshop are freely available and cross-platform, meaning our examples are extendable, reusable and shareable. We used open source Python libraries for audio feature analysis, maths and machine learning. Additionally, we packaged all dependencies in a Virtual Machine (VM) for ease of accessibility (see section 3.3). All Python code was written and executed in the Jupyter Notebook environment. Jupyter enabled us to develop pre-written examples that supported participants from various backgrounds: those new to coding could immediately engage in working examples, whilst those with more technical experience could edit and explore the code as they wished. In these exploratory sessions we worked with digital audio files from the ‘Archive of Resistance’: a growing collection of oral history content related to forms of resistance in history (for example, British Special Operations Executive operations during World War II) that is held at the Keep, an archive located near the University of Sussex.

10 See http://dh2016.adho.org/workshops/

11 See ‘Listening for Oral History’, https://github.com/algolistening/MachineListeningforOralHistory, and ‘Music Information Retrieval Algorithms for Oral History Collections’ in Zenodo (July 2016), available at https://zenodo.org/record/58336#.WdOghLzyt24
3.1 Introduction to MIR and technologies used

During the last decade content-based music information retrieval has moved from a small field of study to a vibrant international research community12 whose work has increasing application across the music and sound industries.13 Driven by the growth of digital and online audio archives, tools developed in MIR enable musically meaningful information to be extracted directly from a digital audio signal by computational means, offering an automated, audio-based alternative to text-based tagging (the latter of which is common to both spoken word and music collections).14 For example, digital audio files can be automatically described using high-level musical features such as melodic contour, tempo or even “danceability”.15 These features are designed to enable automatic genre recognition or instrument classification, which in turn support archive management and recommender services.

12 http://www.ismir.net/

13 http://the.echonest.com/

14 Downie, J. Stephen. "Music information retrieval." Annual Review of Information Science and Technology 37, no. 1 (2003): 295-340.

15 http://the.echonest.com/app/danceability-index/

Applications of these methods in musical research and in industry include:16

- Music identification (commonly associated with software applications such as Shazam and SoundHound), plagiarism detection and copyright monitoring to ensure correct attribution of musical rights, and identification of live vs. studio recordings for database normalization and near-duplicate results elimination.
- Mood, style, genre, composer or instrumental matching for search, recommendation and organization of musical archives.
- Music vs. speech detection for radio broadcast segmentation and archive cataloguing.

16 Michael A. Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes and Malcolm Slaney, "Content-based music information retrieval: Current directions and future challenges," Proceedings of the IEEE 96, no. 4 (2008): 668-696.

Techniques are numerous and rapidly evolving, but most methods work by extracting low-level audio features and combining these with domain-specific knowledge (for example, that hip-hop generally has fewer beats per minute than dubstep) to create models from which more musically meaningful descriptors can be built and – in turn – tempo, or melody, and ultimately genre, composer, etc. might be identified. Low-level features are essentially statistical summaries of audio in terms of the distribution of energy across time, or frequency range. Some features might equate to perceptual characteristics such as the pitch, brightness or loudness of a sound; others, such as MFCC (Mel-frequency Cepstral Coefficients), provide computationally powerful timbral descriptions but have less obvious direct perceptual correlates. Such low-level features can then be used to create methods to find sonically salient events, such as an onset detector, to identify when an instrument or voice starts playing. This low-level information can then be combined with domain-specific knowledge – such as the assumption that note onsets occur at regular intervals in most music – to create a tempo detector. In turn, this might be used to inform musical genre recognition, in the knowledge – as above – that hip-hop generally has fewer beats per minute than dubstep.
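To give a concrete flavour of this chain from low-level features to higher-level estimates, the minimal sketch below (illustrative only, not code from our workbooks) uses the open source librosa library that also underpins the workbooks described in Section 3.2. The filename interview.wav is a placeholder, and since tempo is a musical construct the estimate here simply demonstrates the pipeline rather than describing speech.

```python
# Illustrative sketch: from low-level onsets to a tempo estimate with librosa.
# "interview.wav" is a placeholder filename, not a file from the workshop corpus.
import librosa

y, sr = librosa.load("interview.wav", sr=None)        # samples and native sample rate

# Onset detection: find the moments where a new sound event begins.
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
print("First onsets (seconds):", onset_times[:5])

# Tempo estimation: combine onset information with the musical prior that
# onsets tend to recur at roughly regular intervals.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print("Estimated tempo (BPM):", tempo)
```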
Just as these low-level features can be combined and constrained to create high-level information with many applications in engaging with and managing music archives, we are interested in the possibility that information of interest to historians and digital humanists might be discoverable in a digital audio file that would be missed by the analysis of semantic, textual surrogates alone. Whilst no off-the-shelf tools exist for such analysis yet, the open, experimental ethos of digital audio and machine learning research cultures means that there are many accessible software toolkits available which enable rapid experimentation.

3.2 Learning MIR in Jupyter Workbooks

Content-based MIR combines methods of digital signal processing, machine learning and music theory, which in turn draw upon significant perceptual, mathematical and programming knowledge and experience. Together these are skills that can take years to acquire. We wished to provide sufficient insight into the core concepts and techniques to inspire the imaginations of humanities researchers – with very mixed technical experience and interests – in a single-day workshop. Fortunately, many of the complex technical and conceptual underpinnings can be readily grasped with audio-visual illustration, especially if they can be interactively explored. We therefore chose a constructionist approach in the form of hands-on workshops where participants learned through exploring interactive workbooks containing a mix of text-based information, audio-visual illustration and executable, editable code. This meant participants could work through carefully designed examples and learn by editing and exploring the code, all without having to grasp the mathematical bases of the ideas.

Figure 1: Screenshot of a workbook that introduces participants to some basic methods (reading and loading digital audio files).

Jupyter Notebooks were used to present example code in interactive workbooks which combined formatted (Markdown) text, executable code and detailed, dynamic audio-visual illustration. Jupyter provides a rich and supportive architecture for interactive computing, including a powerful interactive shell, support for interactive data visualization and GUI toolkits, flexible, embeddable interpreters and high-performance tools for parallel computing. For novice and expert users alike it offers an interactive coding environment ideal for teaching, learning and rapidly experimenting with and sharing ideas.
Executable code was written in Python, a human-readable general purpose programming language which is fast becoming the primary choice in data science, as well as in computer science education in general. A vibrant, active community of users contributes to the well-maintained open-source libraries which we used in the workbooks. These include: librosa (for music and audio analysis), matplotlib and IPython display (for visualisation), scikit-learn (for machine learning), and the SciPy and NumPy mathematical libraries (see Resources for a full list).
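As an indication of the kind of cell a workbook might contain when these libraries are combined, a minimal sketch along the following lines loads a recording, displays its spectrogram and embeds a playback widget. The filename interview.wav is a placeholder, and the calls assume a reasonably recent librosa release (display and decibel-conversion functions have been renamed across versions).

```python
# Minimal sketch of a workbook-style cell: load, visualize and play a recording.
# "interview.wav" is a placeholder filename, not a file from the workshop corpus.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio

y, sr = librosa.load("interview.wav", sr=None)        # samples and native sample rate

# Spectrogram: short-time Fourier transform, converted to decibels for display.
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()

Audio(data=y, rate=sr)                                # in-notebook playback widget
```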
3.3 Sharing workbooks in a Virtual Machine: reducing barriers to participation

Workshop participants were humanities scholars from a range of backgrounds, with differing levels of programming experience and computing knowledge. The requirement to install and configure the necessary collection of developer tools on a disparate selection of participant laptops had the potential to consume significant amounts of workshop time, to increase differences in participant progression through the schedule of activities, and to diminish the amount of time available to explore MIR techniques. To avoid this, we created server- and virtual machine (VM)-based Python development environments for the workshop sessions. This approach reduced technological barriers to participation.

VM images were created and distributed on USB memory sticks. Installation of the developer tools, sample digital audio files and a minimal host operating system (Lubuntu 32-bit) resulted in a VM image size of about 8GB. Oracle VM VirtualBox was selected as the technology to implement the VM because the software is free and cross-platform.17 The main drawbacks of the approach were the large amount of storage needed on user machines, and the requirement for authors to create content far enough ahead of the event that it could be distributed with the VM image.

17 https://www.virtualbox.org/

In response to the first drawback – the requirement of 8GB of available disk space is a barrier to adoption for some users – a server-based alternative was also developed. The local computing requirement for the server-based approach is a modern web browser to run the Jupyter Notebooks and a terminal program capable of implementing the network communication method used for the service, such as a Secure Shell (SSH) tunnel. The reason for using SSH is that it simplifies security concerns relating to the provision of unrestricted access to a Python development environment across the internet. Tunneling through SSH is not necessary for secure access; in our case it provided a technique that did not impose restrictions on contributor code development methods. This is a trade-off between simplicity and restriction of user behavior.

4 Workbook content

The workbooks were designed to be taught across a full day. In the morning session, they were used to introduce participants to the key concepts of coding, digital audio and audio features. In the afternoon, they were used by participants to apply these ideas and methods to an illustrative example.

Workbook One introduced basic Python and the Jupyter Notebook, with interactive exercises to familiarize participants with navigating the environment, executing code, carrying out basic mathematical operations and getting help. Workbook Two introduced the fundamental practical tools and ideas necessary to work with digital audio: loading, playing and visualizing digital audio files, understanding how audio is represented digitally, and visualizing and analyzing the frequency content of audio files. Workbook Three used plotting and listening to develop an intuitive understanding of audio features, as well as introducing practical tools and existing libraries used to inspect digital audio files and extract audio features.

The worked example in Workbook Three demonstrates how simple, low-level audio features (spectral bandwidth and the average number of zero-crossings) can be used to distinguish between recordings of female and male interviewees. The two interviews provided were 10-minute interviews from the ‘Archive of Resistance’: one of a French woman and one of an English man. The participants used the workbook to split the digital audio files from these interviews into one-second chunks and then extracted a range of illustrative audio features. Finally, unsupervised clustering (k-means) was applied and the results for different pairs of features plotted to see which most successfully separated the two files. We found that even without clustering, the files could be separated with just two audio features.

Figure 2: Scatter plot showing spectral bandwidth versus zero-crossing rate for all 1200 one-second chunks of two 10-minute interviews.

Figure 2 is a scatter plot that shows spectral bandwidth versus zero-crossing rate for all 1200 one-second chunks of the two ten-minute interviews. Segments from recordings of the male speaker are colored blue, the female speaker’s segments red. The two clusters are quite distinct, making it simple to automatically separate the segments of the two files. This demonstrates how low-level features can be used to identify recordings according to distinct characteristics of a speaker’s voice. Note that both recordings also contain a male interviewer. In this example, only two files were used, but the approach is scalable to large data sets, demonstrating how audio feature analysis might be used to sort and explore unlabeled archives. Large-scale tests would be necessary to prove the generalizability of these results.
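A simplified sketch of this procedure (not the workbook code itself) is given below: it splits two recordings into one-second chunks, computes the two features named above for each chunk, and applies k-means clustering with scikit-learn. The filenames are placeholders for the two interview files.

```python
# Sketch of the Workbook Three procedure (simplified, not the workbook code itself):
# split two interviews into one-second chunks, extract two low-level features,
# cluster with k-means and plot. Filenames are placeholders.
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def chunk_features(path, seconds=1.0):
    """Return one (zero-crossing rate, spectral bandwidth) pair per one-second chunk."""
    y, sr = librosa.load(path, sr=None)
    n = int(sr * seconds)
    rows = []
    for start in range(0, len(y) - n + 1, n):
        chunk = y[start:start + n]
        zcr = librosa.feature.zero_crossing_rate(chunk).mean()
        bw = librosa.feature.spectral_bandwidth(y=chunk, sr=sr).mean()
        rows.append((zcr, bw))
    return np.array(rows)

files = {"interview_female.wav": 0, "interview_male.wav": 1}   # placeholder filenames
X, source = [], []
for path, label in files.items():
    feats = chunk_features(path)
    X.append(feats)
    source += [label] * len(feats)
X = np.vstack(X)
source = np.array(source)

# Unsupervised clustering, then check how closely clusters track the source files
# (labels may be flipped, so take the better of the two assignments).
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(X)
agreement = (clusters == source).mean()
print("cluster vs. file agreement: %.2f" % max(agreement, 1 - agreement))

plt.scatter(X[:, 0], X[:, 1], c=source, cmap="coolwarm", s=8)  # color by source file
plt.xlabel("zero-crossing rate")
plt.ylabel("spectral bandwidth (Hz)")
plt.title("One-second chunks from two interviews")
plt.show()
```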
Nevertheless, this example illustrates that simple feature analysis holds promise for meeting several of the use cases listed in Section 3. Because different recording devices create digital audio files with differing acoustic profiles, this approach has potential – for example – to reveal information about the recording device used (use case 1.3). Identifying interviewee-specific characteristics suggests a route to automated content analysis: the identification of gender provides useful metadata (use case 2.1) that could underpin further gender-specific analyses (use case 2.2) and potentially be extended to other personal characteristics (use case 2.4).

The final workbook explored how changes in textural information within a sound could be analyzed to identify the points at which the speaker changed from interviewer to interviewee. In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The mel scale spaces frequency bands to approximate the human auditory system. The coefficients of the MFC (MFCCs) do not intuitively correlate with perceptual properties of sound; however, they do consistently reflect timbral characteristics. If we calculate the MFCCs for short segments, or frames, of audio throughout a file, changes in their values across the file reveal points of timbral, or textural, change. By plotting a two-dimensional self-similarity matrix – in which each frame is compared to every other frame – we can visualize periods of similarity and change. In musical applications, this technique is used to identify structures such as changes between, and repetitions of, verse and chorus; in spoken word interviews, it allows us to observe changes in texture which reflect transitions between interviewer and interviewee. This example demonstrates how changes within a single file might be used to reveal changes in speaker characteristics: in this case, who the speaker is (use case 2.4). In combination with successful gender identification, this could be applied to use case 2.2, enabling large-scale analysis of interviewer-interviewee dynamics. A similar method could potentially be developed to identify changes in rhythm or timbre (use case 2.3).
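A simplified sketch of such a self-similarity matrix (again illustrative, not the workbook code) is shown below: MFCCs are computed with librosa, averaged into roughly one-second blocks so that the matrix remains small for a long interview, and compared pairwise using a cosine distance. The filename is a placeholder, and the block length and distance measure are arbitrary choices.

```python
# Sketch: timbral self-similarity of a single interview via MFCCs.
# "interview.wav" is a placeholder filename; parameters are illustrative.
import numpy as np
import librosa
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

y, sr = librosa.load("interview.wav", sr=None)

# MFCCs per short frame, then averaged into roughly one-second blocks
# so the matrix stays manageable for a long recording.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape: (13, n_frames)
frames_per_block = int(sr / 512)                           # ~1 s at librosa's default hop of 512
n_blocks = mfcc.shape[1] // frames_per_block
blocks = np.array([mfcc[:, i * frames_per_block:(i + 1) * frames_per_block].mean(axis=1)
                   for i in range(n_blocks)])

# Pairwise distances between blocks: with this colormap, dark regions of low
# distance indicate stretches of similar timbre (e.g. one speaker holding the floor).
distances = cdist(blocks, blocks, metric="cosine")
plt.imshow(distances, origin="lower", cmap="magma")
plt.xlabel("time (approx. seconds)")
plt.ylabel("time (approx. seconds)")
plt.title("MFCC self-similarity matrix")
plt.colorbar(label="cosine distance")
plt.show()
```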
5 Findings and lessons learned

The examples in the workbooks illustrated that relatively simple audio analysis could be used to provide useful extra-semantic insights into oral history interviews. However, the degree to which participants grasped the possibilities was directly related to their familiarity with, firstly, digital audio and, secondly, digital methods in general. Those participants who were familiar with basic digital audio concepts and programming techniques (as was the case with many participants at the Digital Humanities 2016 workshop) recognized the potential of this approach, particularly those who worked with large audio archives. Other participants – those who had not previously engaged with computational methods, or done any coding of any kind – found it more difficult to imagine wider usage. This was particularly true for those who worked with very small sets of recordings, for which this type of analysis is largely irrelevant.

Whilst the process of developing and facilitating both workshops indicated that MIR methods and technologies can be usefully applied to digital audio files that contain spoken word, adoption of these techniques is likely to be amongst existing computationally literate communities. For some, understanding how to interpret the visual display of audio files (e.g. the spectrogram) was challenging: ‘I found it hard to translate spectrograms and plots to observations about the interviews’.18

18 MIR for Oral History Collections (feedback)

Our workbooks allowed participants to carry out sophisticated, complex analysis, yet many participants found it difficult to envisage or imagine questions beyond those that we posed or included in the workbooks. This difficulty is linked to a number of factors. First, the workbooks in effect hide the complexities of a number of different tasks, so that while participants could execute a piece of code and get results, this reduced understanding of the methods, the capabilities of the software libraries used and the code developed. Second, while the workbooks scaffolded learning, some participants – especially those new to programming – experienced a steep learning curve. This was especially evident in the second cohort/workshop, which mostly consisted of PhD students and researchers using oral history as a method in their work. Indeed, a portion of this cohort had little to no experience of the term “digital humanities”, or indeed of the methods used in this domain. From our perspective it was interesting to discover that many oral history practitioners in this session still use the audio-to-text method as standard procedure. Therefore, working with the audio or digital files in a computational manner was completely unfamiliar territory.

However, even though some participants found the workbooks challenging, they indicated to us that by working through the concepts, ideas and indeed the use cases, they felt inspired to re-think their current forms of analysis and to investigate how they might incorporate new forms of computational analysis. One participant, working on an historical collection of oral history interviews, reported that they could see how using audio analysis might help to reverse engineer the methodology or order of the original interviews. Other participants noted that, from an archival perspective, techniques used to cluster “related” content might help with cataloguing collections by creating new forms of metadata. Many participants remarked that although they did not envisage learning the skills necessary to carry out this kind of analysis themselves, having seen the potential, they could see value in collaborating with data scientists to explore new approaches, something they had not previously considered.

These remarks made us reflect on how best such methods could be introduced into research communities with little or no prior engagement with computational methods.
A common solution is to create a package with a graphical user interface and presets, which users can employ without conceptual or technical knowledge; yet the real potential of such methods can only be realized through hands-on, bespoke experimentation with specific real-world research questions. Our decision to present participants with pre-written executable code was intended as a compromise between these two positions.

In terms of the technological set-up, with further time spent on preparation it would be preferable to develop a service to support the workshop exercises without tunneling the connection, which would result in more reliable delivery of the service. HTTP(S) communication is resilient to the fluctuating quality that is common on public Wi-Fi networks, as it does not require an uninterrupted connection; instead, connections are created and destroyed with every interaction. Provision of server-based development environments is a good fit for cloud-based computing infrastructures. The cost of running the cloud servers used for the workshops was less than £1 (approximately $1.30) for each event. In hindsight, this means that server allocation should have been increased to improve service reliability. During both workshops broken connections were experienced and servers crashed; however, user experience was preserved by monitoring the cloud servers, supporting the participants and re-establishing broken connections as quickly as possible. The lesson learned with regard to connection and server stability was the extent of the variation in computing resource requirements across the activities and participants. The lesson learned with regard to the provision of virtual machines was that container technology (Docker) should be used instead: Docker overcomes variations in user software configuration without the need to distribute a full operating system to every user.

Conclusion

Our overall aim for this experimental project was to help the digital humanities and oral history communities explore alternatives to the use of textual surrogates in oral history. Using off-the-shelf tools, we created and disseminated online interactive workbooks which demonstrate how generic audio analysis methods can be used to extract extra-semantic information from digital audio files. This approach might be used to complement traditional semantic analyses, providing automation of existing methods (metadata) or potentially new levels of analysis, such as interviewee-interviewer dynamics. By running participatory workshops, we tested the response of a wide range of humanists interested in oral history collections. The workshops demonstrated that this approach might be of great interest to DH researchers working with large audio databases, but is unlikely to be rapidly taken up by those working with small data sets, or by those with a preference for manual methods. Our work suggests great potential for audio analysis in oral history.
Refinement of methods to meet the use cases outlined in Section 3 will require systematic research on a wide range of large oral history archives in order to establish how well this work can be generalized and extended. In terms of future adoption in digital humanities communities, as with all computational analyses, a balance must be sought between providing ready-to-use tools with a low barrier to entry and nurturing a wider technical and conceptual understanding, such that members of the community may build and develop their own methods. As computational literacy grows amongst research communities, we see potential for novel applications of these methods in the future.

Bibliography

Bertin-Mahieux, T., Ellis, D.P., Whitman, B. and Lamere, P. “The Million Song Dataset.” ISMIR 2, no. 9 (2011).

Boyd, Doug. "OHMS: Enhancing Access to Oral History for Free." Oral History Review 40, no. 1 (2013): 95–106. doi:10.1093/ohr/oht031

Casey, Michael A., Veltkamp, Remco, Goto, Masataka, Leman, Marc, Rhodes, Christophe and Slaney, Malcolm. "Content-based music information retrieval: Current directions and future challenges." Proceedings of the IEEE 96, no. 4 (2008): 668-696.

Clement, Tanya, Kraus, Kari, Sayers, Jentery and Trettien, Whitney. “: The Intersections of Sound and Method.” Proceedings of Digital Humanities 2014, Lausanne, Switzerland. https://terpconnect.umd.edu/~oard/pdf/dh14.pdf (accessed August 14, 2017)

Clement, Tanya E., Tcheng, David, Auvil, Loretta and Borries, Tony. “High Performance Sound Technologies for Access and Scholarship (HiPSTAS) in the Digital Humanities.” Proceedings of the Association for Information Science and Technology 51 (2014): 1–10. doi:10.1002/meet.2014.14505101042

Downie, J. Stephen. "Music information retrieval." Annual Review of Information Science and Technology 37, no. 1 (2003): 295-340.

Grele, Ronald J. “Oral History as Evidence.” In History of Oral History: Foundations and Methodology, edited by Thomas L. Carlton, Lois E. Myers and Rebecca Sharpless (UK, 2007), 69.

Kluyver, Thomas, Ragan-Kelley, Benjamin, Pérez, Fernando, Granger, Brian, Bussonnier, Matthias, Frederic, Jonathan, Kelley, Kyle, Hamrick, Jessica, Grout, Jason, Corlay, Sylvain, Ivanov, Paul, Avila, Damián, Abdalla, Safia, and Willing, Carol (Jupyter Development Team). “Jupyter Notebooks – a publishing format for reproducible computational workflows.” Positioning and Power in Academic Publishing: Players, Agents and Agendas (2016): 87-90. doi: 10.3233/978-1-61499-649-1-87

Tzanetakis, George and Cook, Perry. “Musical genre classification of audio signals.” IEEE Transactions on Speech and Audio Processing 10, no. 5 (2002): 293-302.
Resources

All workbooks, data and slides from the workshops are deposited in both GitHub and Zenodo:

- ‘Machine Listening for Oral History’, GitHub: https://github.com/algolistening/MachineListeningforOralHistory
- Eldridge, A., Kiefer, C., Webb, S., Jackson, B., & Baker, J. (2016, July). Music Information Retrieval Algorithms for Oral History Collections. Zenodo. http://doi.org/10.5281/zenodo.58336

Python libraries used:

https://www.scipy.org/
http://scikit-learn.org/
https://matplotlib.org/
https://ipython.org/ipython-doc/3/api/generated/IPython.display.html
https://librosa.github.io/librosa/

Other:

https://www.docker.com/