Mining oral history collections using music information
retrieval methods
Article (Accepted Version)
http://sro.sussex.ac.uk
Webb, Sharon, Kiefer, Chris, Jackson, Ben, Baker, James and Eldridge, Alice (2017) Mining oral
history collections using music information retrieval methods. Music Reference Services
Quarterly, 20 (3-4). pp. 168-183. ISSN 1058-8167
This version is available from Sussex Research Online: http://sro.sussex.ac.uk/id/eprint/71250/
This document is made available in accordance with publisher policies and may differ from the
published version or from the version of record. If you wish to cite this item you are advised to
consult the publisher’s version. Please see the URL above for details on accessing the published
version.
Copyright and reuse:
Sussex Research Online is a digital repository of the research output of the University.
Copyright and all moral rights to the version of the paper presented here belong to the individual
author(s) and/or other copyright owners. To the extent reasonable and practicable, the material
made available in SRO has been checked for eligibility before being made available.
Copies of full text items generally can be reproduced, displayed or performed and given to third
parties in any format or medium for personal research or study, educational, or not-for-profit
purposes without prior permission or charge, provided that the authors, title and full bibliographic
details are credited, a hyperlink and/or URL is given for the original metadata page and the
content is not changed in any way.
The Version of Record of this manuscript has been published and is available in the Journal,
Music Reference Services Quarterly, November 2017, http://dx.doi.org/10.1080/10588167.2017.1404307
Abstract
Recent work at the Sussex Humanities Lab, a digital humanities research program at the
University of Sussex, has sought to address an identified gap in the provision and use of audio
feature analysis for spoken word collections. Traditionally, oral history methodologies and
practices have placed emphasis on working with transcribed textual surrogates, rather than the
digital audio files created during the interview process. This provides pragmatic access to the
basic semantic content, but forecloses access to other potentially meaningful aural information;
our work explores methods for recovering this extra-semantic information by working
with the audio directly. Audio analysis tools, such as those developed within the established field
of Music Information Retrieval (MIR), provide this opportunity. This paper describes the
application of audio analysis techniques and methods to spoken word collections. We
demonstrate an approach using freely available audio and data analysis tools, which have been
explored and evaluated in two workshops. We hope to inspire new forms of content analysis
which complement semantic analysis with investigation into the more nuanced properties carried
in audio signals.
Mining oral history collections using music information retrieval methods
Webb, S., Kiefer, C., Jackson, B., Baker, J. & Eldridge, A.
Music Reference Services Quarterly (Taylor and Francis)
1 Introduction
The Sussex Humanities Lab is a multidisciplinary research program tasked with embedding
digital humanities into research and teaching practices across the University of Sussex. As a
multidisciplinary team we have unique access to varied expertise and skills that enable us to
carry out experimental work in an agile and proficient manner. One experimental project
problematized the predominant approach within digital humanities – a largely text based domain
– to treat digital audio files as text.1 We applied Music Information Retrieval (hereafter MIR)
techniques to oral history interviews in order to develop new, complementary, approaches to text
based methods of extracting semantic information from spoken word collections. As an
established field, with established methods, the MIR community provides open source tools,
code and libraries to work through our hypothesis, to treat audio as audio, and to help us work
through and establish its practical application to spoken word collections. Having established the
potential utility of MIR techniques to problems in both oral history and the digital humanities,
we developed a workshop framework that aimed at exploring the utility of this approach for a
variety of humanities scholars.
1 Notable exceptions include, Tanya Clement and Stephen McLaughlin, “Measured Applause: Toward a Cultural
Analysis of Audio Collections,” Cultural Analytics 1, no .1 (2016), http://culturalanalytics.org/2016/05/measured-
applause-toward-a-cultural-analysis-of-audio-collections/; Tanya Clement, Kari Kraus, Jentery Sayers and Whitney
Trettien, “The Intersections of Sound and Method,” Proceedings of Digital
Humanities 2014. Lausanne, Switzerland, https://terpconnect.umd.edu/~oard/pdf/dh14.pdf
Taking oral history collections from the University of Sussex ‘Archive of Resistance’ as
a test case, we led two distinct groups, at two separate workshops, through the process of using
MIR approaches to categorize, sort and discover audio collections. This process enabled us to:
- Build a set of python workbooks that provide a conceptual and practical introduction to
the application of MIR techniques (e.g. feature extraction and clustering) to spoken word
collections.
- Work through, develop and amend use cases.
- Learn lessons, from two distinct communities and perspectives, about the potential benefits
– or otherwise – of our approach.
Both workshops, the first at Digital Humanities 2016 (Krakow, July 2016) and the second at
London College of Communication (March 2017),2 provided points of clarification and
discussion that enabled us to identify areas that require work. This article is therefore not a final
report on our findings, instead it is an attempt to capture the hypothesis and problem statement,
the experimentation and methodology used, and our preliminary findings. It also describes a
method for workshop facilitation that utilizes a) virtual environments to reduce setup time for
participants and facilitators and b) Jupyter Notebooks to enable participants to run sophisticated
and complex code in a supported, learning environment.3
This article proceeds in five parts. First, in order to provide some context to this work, we
provide some background information on the Sussex Humanities Lab. Parts two, three, and four
2 ‘Data-Mining the Audio of Oral History: A Workshop in Music Information Retrieval’ at London College of
Communication (March 2017) https://web.archive.org/web/20171003144121/http://www.techne.ac.uk/for-
students/techne-events/apr-2015/data-mining-the-audio-of-oral-history-a-workshop-in-music-information-retrieval
(accessed 3 Oct. 2017)
3 Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan
Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla,
Carol Willing, Jupyter Development Team, “Jupyter Notebooks – a publishing format for reproducible
computational workflows,” Positioning and Power in Academic Publishing: Players, Agents and Agendas (2016).
87-90. doi: 10.3233/978-1-61499-649-1-87
consider our hypothesis and motivations, the workshops we developed and the technologies we
used. The fifth and final part outlines our preliminary findings, both from mining oral history
collections using audio feature analysis and from delivering workshops on MIR in a digital
humanities context.
1.2 Background
The authors are current or former members of the Sussex Humanities Lab (hereafter
SHL): a four-year university program, launched in 2015 at the University of Sussex, which seeks
to intervene in the digital humanities. It is a team of 31 faculty, researchers, PhD students and
technical and management staff, working in a state-of-the-art space – the Digital Humanities Lab.
SHL collaborates with a network of associates across and beyond the university nationally and
internationally and is radically cross-disciplinary in its approach. The aim of SHL is to engage
with the myriad of new and developing technologies to explore the benefits these offer to
humanities research and to ask: what will technology do to the arts and humanities? To achieve
this, SHL is divided into four named strands of activity: digital history and digital archiving;
digital media and computational culture; digital technologies and digital performance; digital
lives and digital memory. However, the intention is to make sure that our research crosses and
links these strands, to develop fruitful methodological and conceptual intersections. The work
described here grows from this multidisciplinary ethos, since the project combines the diverse
interests and expertise of the authors. It stems from the inherently collaborative environment
facilitated by SHL and is influenced by two strands in particular: digital history and digital
archiving, and digital technologies and digital performance.
2 Hypothesis: problem statement and motivation
Oral history best practice publications and resources often focus on the application and
use of digital methods and tools to create, store and manage audio, audio-visual, and subsequent
text files. They recommend, for example, standards for file formats, metadata and text encoding,
software for audio to text conversion, and database and content management systems. And whilst
the privileged position of text has been challenged,4 the majority of oral history projects still rely
on the creation of transcripts to carry out analysis using digital tools and methods. This focus on
textual surrogates rather than audio sources denies – according to Alessandro Portelli – the
‘orality of the oral source’.5 It also denies – or at least underplays – the inherently interpretative
nature of transcription.
Of course, textual encodings or transcripts of oral history interviews do have advantages:
they are easier to anonymize, distribute, store and retrieve than digital audio files, and there are
established techniques for analyzing them as text and/or data. But as a consequence of this
privileging of the “text”, a significant proportion of oral history collections and the tools
provided to navigate and analyze them do not support navigation or analysis of the digital audio
files captured during interviews. Instead, they focus on how to record oral history interviews, the
management of digital files and the creation of transcripts using both semi-automated audio to
text tools and manual transcription.6
4 For example, Doug Boyd, "OHMS: Enhancing Access to Oral History for Free," Oral History Review, Winter-
Spring, 40 no.1 (2013). doi:10.1093/ohr/oht031
5 Ronald Grele, “Oral History as Evidence,” in History of Oral History: Foundations and Methodology. Edited by
Thomas L. Charlton, Lois E. Myers, and Rebecca Sharpless (UK, 2007), 69.
6 For example, ‘Oral History in a Digital Age’ http://ohda.matrix.msu.edu/
While using text surrogates is an established tradition, the oral history community is
beginning to question the privilege of this text based approach. This is evident, for example, in
the UK's Oral History Society’s 2016 conference call for papers, which stated the ‘auditory
dimension of oral history [has] for decades [been] notoriously underused’.7 While this move is
welcome, it remains true that, as per Clement et al.’s 2014 survey of the field, currently ‘there are few
means for humanists interested in accessing and analyzing [spoken] word audio collections to
use and to understand how to use advanced technologies for analyzing sound’.8 Moreover, these
technologies have the potential to help resolve some of the backlog in archives and libraries of
‘un-described, though digitized, audio collections’.9 It is from this context, therefore, that we
decided to explore the potential for direct audio analysis of oral history interviews.
This work represents a move towards analyzing oral content in the context in which it was
created. It also challenges the privilege of text, as it focuses on extracting information from
the audio signal directly. We are particularly interested in how such techniques could
complement the semantic content obtained through manual or automated transcription. On the
basis that comparable methods have been developed for digital recordings of music for some
time, we explored the field of MIR for possible solutions. We are explicitly carrying out a study
of computational techniques for the analysis of oral history records with the aim of extracting
quantitative results to assist research. The MIR techniques that we use create quantitative
7 “Beyond Text in the Digital Age? Oral history, images and the Written Word” Oral History Society, 2016
Conference CFP: https://web.archive.org/web/20161214045140/http://www.ohs.org.uk/conferences/2016-
conference-beyond-text-in-the-digital-age/ accessed 27th June 2017
8 Tanya E. Clement, David Tcheng, Loretta Auvil and Tony Borries, “High Performance Sound Technologies for
Access and Scholarship (HiPSTAS) in the Digital Humanities,” Proceedings of the Association for Information
Science and Technology 51 (2014) 1–10 doi:10.1002/meet.2014.14505101042
9 Ibid.
information (i.e. timestamps that locate specific features and/or events within the audio) that
could enhance and stimulate new directions in the qualitative research of others.
MIR draws from digital audio signal processing, pattern recognition, psychology of
perception, software system design, and machine learning to develop algorithms that enable
computers to ‘listen’ to and abstract high-level, musically meaningful information from low-
level audio signals. Just as human listeners can recognize pitch, tempo, chords, genre, song
structure, etc., MIR algorithms – to a greater or lesser degree – are capable of recognizing and
extracting similar information, enabling systems to perform extensive sorting, searching, music
recommendation, metadata generation, and transcription on vast data sets. Deployed initially in
musicology research and more recently for automatic recommender systems, the research
potential for MIR tools in non-musical audio data mining is being recognized but has yet to be
fully explored in the humanities.
We chose to develop a one-day workshop related to this topic because the approach
allowed us to explore our hypothesis and methods with different users, both expert and novice,
from different disciplines – digital humanities and oral history – and garner important, domain-
specific feedback.
3 Experimentation: workshops and method
The workshops were intentionally experimental in nature (especially from a content
analysis perspective), but were developed and delivered with a number of use cases in mind. We
framed these use cases around three distinct contexts: the digital object, the content of the
interview and the environment in which the interviews were carried out. Upon completion of the
workshops we revisited these use cases. The following questions represent a series of potential
applications for the use of MIR in the context of analyzing oral history collections. They are
based on a synthesis of both our initial scoping work and our interactions with workshop
attendees. A known problem within oral history and digital humanities is the time- and resource-
intensive process of cataloguing and analyzing oral history collections. Therefore, although for
practical reasons small collections were used in the workshops, the use cases developed and
methods adopted are fully scalable:
1. Context - the object:
1.1. What technical metadata or technical information can we automatically extract
from a digital audio file?
1.2. Can this new information enhance what we know about an object and improve
search and discoverability?
1.3. Can we detect the use of different recording devices as a means of clustering and
classifying two temporally distinct data sets?
2. Context - the content (i.e. what type of content analysis can we carry out):
2.1. What descriptive metadata can we automatically extract from the digital audio
file? For example, can we create a feature which distinguishes interviewer from
interviewee? Could we use this to automatically detect a specific voice within a
collection?
2.2. Can we reveal anything about the relationship or dynamic between interviewer
and interviewee? For example, can we detect overlaps or interruptions by the
interviewer? Can this reveal anything about gender roles and/or behaviors?
2.3. Can we augment our ability to detect emotion by analyzing changes in rhythm,
timbre, tone, tempo? Is it therefore possible to identify song, poetry, speech,
crying, laughter, etc.?
2.4. Can we automatically cluster acoustically similar audio/material/objects? For
which properties might this be most robust?
2.5. Can we use techniques from musical analysis to reveal structure in spoken audio,
for example to pull apart different voices, and how might this be useful for oral
history collections?
3. Context- the environment:
3.1. Can we detect any environmental features in the audio stream? What might this
tell us about where the interview took place?
3.2. Can we use source separation, developed to separate parts (e.g. drums, vocals,
keyboards in pop music), to pull apart intertwined ‘voices’ or ‘noises’? Can we
use this to remove background noise that provides context to recordings? How
might this affect the analysis of interviews?
Enabling these kinds of preprocessing and descriptive steps affords new
possibilities in oral history research and archival management. For example, they enable access
to under-described repositories, such as the wealth of content created by the YouTube generation,
and open new opportunities for empirical analysis and for supporting qualitative research (e.g.
gender studies).
The first workshop, ‘Music Information Retrieval Algorithms for Oral History
Collections’, was facilitated by the authors in July 2016 at the Digital Humanities 2016
conference.10 The workshops introduced participants to specialist software libraries and
applications used by MIR researchers.11 All tools used in the workshop are freely available and
cross-platform, meaning our examples are extendable, reusable and shareable. We used open
source Python libraries for audio feature analysis, maths and machine learning. Additionally we
packaged all dependencies in a Virtual Machine (VM) for ease of accessibility (see section 3.3).
All Python code was written and executed in the Jupyter Notebook environment. Jupyter enabled
us to develop pre-written examples that supported participants from various backgrounds: those
new to coding could immediately engage in working examples, whilst those with more technical
experience could edit and explore the code as they wished. In these exploratory sessions we
worked with digital audio files from the ‘Archive of Resistance’: a growing collection of oral
history content related to forms of resistance in history (for example, British Special Operations
Executive operations during World War II) that is held at the Keep, an archive located near the
University of Sussex.
3.1 Introduction to MIR and technologies used
During the last decade content-based music information retrieval has moved from a small
field of study to a vibrant international research community12 whose work has increasing
application across music and sound industries.13 Driven by the growth of digital and online audio
10 See http://dh2016.adho.org/workshops/
11 See ‘Listening for Oral History’ https://github.com/algolistening/MachineListeningforOralHistory and ‘Music
Information Retrieval Algorithms for Oral History Collections’ in Zenodo (July 2016) available at
https://zenodo.org/record/58336#.WdOghLzyt24
12 http://www.ismir.net/
13 http://the.echonest.com/
archives, tools developed in MIR enable musically-meaningful information to be extracted
directly from a digital audio signal by computational means, offering an automated, audio-based
alternative to text-based tagging (the latter of which is common to both spoken word and music
collections).14 For example, digital audio files can be automatically described using high level
musical features such as melodic contour, tempo or even “danceability”.15 These features are
designed to enable automatic genre recognition or instrument classification, which in turn
support archive management and recommender services. Applications of these methods in
musical research and in industry include:16
- Music identification (commonly associated with software applications such as Shazam and
SoundHound), plagiarism detection and copyright monitoring to ensure correct attribution
of musical rights, identification of live vs studio recordings, for database normalization and
near-duplicate results elimination.
- Mood, style, genre, composer or instrumental matching for search, recommender and
organization of musical archives.
- Music vs speech detection for radio broadcast segmentation and archive cataloguing.
Techniques are numerous and rapidly evolving, but most methods work by extracting
low-level audio features and combining these with domain specific knowledge (for example, that
hip-hop generally has fewer beats per minute than dubstep) to create models from which more
musically-meaningful descriptors can be built and – in turn – tempo, or melody, and ultimately
14 J. Stephen Downie, "Music Information Retrieval," Annual Review of Information Science and Technology 37, no.
1 (2003): 295-340.
15 http://the.echonest.com/app/danceability-index/
16 Michael A. Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes and Malcolm Slaney,
"Content-based music information retrieval: Current directions and future challenges," Proceedings of the IEEE 96,
no. 4 (2008): 668-696.
genre, composer, etc. might be identified. Low level features are essentially statistical
summaries of audio in terms of the distribution of energy across time or frequency range. Some
features might equate to perceptual characteristics such as pitch, brightness or loudness of a
sound; others, such as MFCC (Mel-frequency Cepstral Coefficients), provide computationally
powerful timbral descriptions but have less obvious direct perceptual correlates. Such low level
features can then be used to create methods to find sonically-salient events, such as an onset
detector, to identify when an instrument or voice starts playing. This low level information can
then be combined with domain specific knowledge – such as the assumption that note onsets
occur at regular intervals in most music – to create a tempo detector. In turn, this might be used
to inform musical genre recognition, in the knowledge – as above – that hip-hop generally has
fewer beats per minute than dubstep.
Just as these low level features can be combined and constrained to create high-level
information with many applications in engaging with and managing music archives, we are
interested in the possibility that information of interest to historians and digital humanists might
be discoverable in a digital audio file, that would be missed by the analysis of semantic, textual
surrogates alone. Whilst no off-the-shelf tools exist for such analysis yet, the open, experimental
ethos of digital audio and machine learning research cultures means that there are many
accessible software tool kits available which enable rapid experimentation.
3.2 Learning MIR in Jupyter Workbooks
Content based MIR combines methods of digital signal processing, machine learning and
music theory, which in turn draw upon significant perceptual, mathematical, and programming
knowledge and experience. Together these are skills that can take years to acquire. We wished
to provide sufficient insight into the core concepts and techniques so as to inspire the
imaginations of humanities researchers – with very mixed technical experience and interests – in
a single day workshop. Fortunately, many of the complex technical and conceptual
underpinnings can be readily grasped with audio-visual illustration, especially if they can be
interactively explored. We therefore chose a constructionist approach in the form of hands-on
workshops where participants learned through exploring interactive workbooks containing a mix
of text-based information, audio-visual illustration and executable, editable code. This meant
participants could work through carefully designed examples and learn by editing and exploring
the code, all without having to grasp the mathematical bases of the ideas.
Figure 1: Screenshot of a workbook that introduces participants to some basic methods
(reading and loading digital audio files).
Jupyter Notebooks were used to present example code in interactive workbooks which
combined formatted (mark-down) text, executable code and detailed, dynamic audio-visual
illustration. Jupyter provides a rich and supportive architecture for interactive computing,
including a powerful interactive shell, support for interactive data visualization and GUI toolkits,
flexible, embeddable interpreters and high performance tools for parallel computing. For novice
and expert users alike it offers an interactive coding environment ideal for teaching, learning and
rapidly experimenting with and sharing ideas. Executable code was written in Python, a
human-readable general-purpose programming language which is fast becoming the primary
choice in data science, as well as in computer science education in general. A vibrant, active
community of users contribute to well-maintained open-source libraries which we used in the
workbooks. These include: librosa (for music and audio analysis), matplotlib and ipython display
(for visualisation), scikit-learn (for machine learning), and SciPy and NumPy mathematical
libraries (see Resources for a full list).
3.3 Sharing workbooks in a Virtual Machine: Reducing barriers to participation
Workshop participants were humanities scholars from a range of backgrounds, with
differing levels of programming experience and computing knowledge. The requirement to
install and configure the necessary collection of developer tools on a disparate selection of
participant laptops had the potential to consume significant amounts of workshop time, to widen
differences in participant progression through the schedule of activities, and to diminish the
amount of time available to explore MIR techniques. To avoid this, we created server and
virtual machine (VM) based Python development environments for the workshop sessions. This
approach reduced technological barriers to participation.
VM images were created and distributed on USB memory sticks. Installation of the
developer tools, sample digital audio files and a minimal host operating system (Lubuntu 32 bit)
resulted in a VM image size of about 8GB. Oracle VM VirtualBox was selected as the
technology to run the VM; it is free and cross-platform.17 The main
drawbacks of the approach were the large amount of storage needed on user machines, and the
requirement for authors to create content far enough ahead of the event so that it could be
distributed with the VM image.
In response to the first drawback – the requirement of 8GB of available disk space is a
barrier to adoption for some users – a server-based alternative was also developed. The local
17 https://www.virtualbox.org/
computing requirement for the server-based approach is a modern web browser to run the
Jupyter Notebooks and a terminal program capable of implementing the network communication
method used for the service, such as a Secure Shell (SSH) tunnel. SSH was used because it
simplifies the security concerns involved in providing unrestricted access to a Python
development environment across the internet. Tunneling through SSH is not necessary for secure
access; in our case, however, it provided a technique that did not impose restrictions on how
contributors developed code. This is a trade-off between simplicity and restriction of user behavior.
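For a participant, establishing such a tunnel amounts to a single command. The host name, user name and port below are illustrative placeholders, not the actual workshop configuration:

```shell
# Forward local port 8888 to the Jupyter server on the remote machine.
# Host, user name and port are illustrative placeholders.
ssh -N -L 8888:localhost:8888 workshop-user@workshop-server.example.org
```

The notebook is then reachable locally at http://localhost:8888, while the server itself need expose nothing beyond SSH to the public internet.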
4 Workbook content
The workbooks were designed to be taught across a full day. In the morning session, they
were used to introduce participants to the key concepts of coding, digital audio and audio
features. In the afternoon, they were used by participants to apply these ideas and methods to an
illustrative example. Workbook One introduced basic Python and the Jupyter notebook, with
interactive exercises to familiarize participants with navigating the environment, executing code,
carrying out basic mathematical operations and getting help. Workbook Two introduced the
fundamental practical tools and ideas necessary to work with digital audio. These included
loading, playing and visualizing digital audio files and introducing both ways of understanding
how audio is represented digitally and ways to visualize and analyze frequency content of audio
files. Workbook Three used plotting and listening to develop an intuitive understanding of audio
features, as well as introducing practical tools and existing libraries used to inspect digital audio
files and extract audio features. The worked example in Workbook Three demonstrates
how simple, low-level audio features (spectral bandwidth and the average number of zero-
crossings) can be used to distinguish between recordings of female and male interviewees. The
two interviews provided were 10-minute interviews from the ‘Archive of Resistance’: one with a
French woman and one with an English man. The participants used the workbook to split the digital
audio files from these interviews into one second chunks and then extracted a range of
illustrative audio features. Finally, unsupervised clustering (k-means) was applied and the results
of different pairs of features plotted to see which most successfully separated the two files. We
found that even without clustering, the files could be separated with just two audio features.
Figure 2: Scatter plot showing spectral bandwidth versus zero-crossing for all 1200 one
second chunks of two 10 minute interviews.
Figure 2 is a scatter plot that shows spectral bandwidth versus zero-crossing for all 1200 one-
second chunks of the two ten-minute interviews. Segments from recordings of the male speaker
are colored blue; the female speaker’s segments are red. The two clusters are quite distinct,
making it simple to automatically separate the segments of the two files. This demonstrates how
low-level features can be used to identify recordings according to distinct characteristics of a
speaker’s voice. Note that both recordings also contain a male interviewer.
In this example, only two files were used, but the approach is scalable to large data sets,
demonstrating how audio feature analysis might be used to sort and explore unlabeled archives.
Large scale tests would be necessary to prove the generalizability of these results. Nevertheless,
this example illustrates that simple feature analysis holds promise for meeting several of the use
cases listed in Section 3. Because different recording devices create digital audio files with
differing acoustic profiles, this approach has potential – for example – to reveal information
about the content (use case 1.3). Identifying interviewee-specific characteristics suggests a route
to automated content analysis: the identification of gender provides useful metadata (use case
2.1) that could underpin further gender-specific analyses (use case 2.2) and potentially be
extended to other personal characteristics (use case 2.4).
The final workbook explored how changes in textural information within a sound could
be analyzed to identify the points at which the speaker changed from interviewer to interviewee.
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term
power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a
nonlinear mel scale of frequency. The mel scale spaces frequency bands to approximate the
human auditory system. The coefficients of the MFC (MFCCs) do not intuitively correlate with
perceptual properties of sound; however, they do consistently reflect timbral characteristics. If we
calculate the MFCCs for short segments, or frames, of audio throughout the files, changes in
values throughout the file reveal points of timbral, or textural, change. By plotting a two-
dimensional self-similarity matrix, in which each frame is compared to every other frame, we
can visualize periods of similarity and change. In musical applications, this
technique is used to identify structures such as changes between and repetitions of verse and
chorus; in spoken word interviews, this allows us to observe changes in texture which reflect
transitions between interviewer and interviewee.
This example demonstrates how changes within a single file might be used to reveal
changes in speaker characteristics: in this case who the speaker is (use-case 2.4). In combination
with successful gender identification, this could be applied to use case 2.2, enabling large scale
analysis of interviewer-interviewee dynamics. A similar method could potentially be developed
to identify changes in rhythm or timbre (use case 2.3).
5 Findings and lessons learned
The examples in the workbooks illustrated that relatively simple audio analysis could be
used to provide useful extra-semantic insights into oral history interviews. However, the degree
to which participants grasped the possibilities was directly related to their familiarity with,
firstly, digital audio and, secondly, digital methods in general. Those participants who were familiar with
basic digital audio concepts and programming techniques (as was the case with many
participants at the Digital Humanities 2016 workshop), recognized the potential of this approach,
particularly those who worked with large audio archives. Other participants, those who had not
previously engaged with computational methods or done any coding of any kind, found it more
difficult to imagine wider usage. This was particularly true for those who worked with very small
sets of recordings, for which this type of analysis is largely irrelevant.
Whilst the process of developing and facilitating both workshops indicated that MIR
methods and technologies can be usefully applied to digital audio files that contain spoken word,
adoption of these techniques is likely to be concentrated amongst existing computationally literate
communities. For some, understanding how to interpret the visual display of audio files (e.g. the
spectrogram) was challenging: ‘I found it hard to translate spectrograms and plots to
observations about the interviews’.18
Our workbooks allowed participants to carry out sophisticated, complex analysis, yet
many participants found it difficult to envisage or imagine questions beyond those that we posed
or included in the workbooks. This difficulty is linked to a number of factors. First, the
workbooks in effect hide the complexities of a number of different tasks, so that while
18 MIR for Oral History Collections (feedback)
participants could execute a piece of code and get results, this came at the cost of a reduced
understanding of the methods, of the capabilities of the software libraries used, and of the code developed. Second,
while the workbooks scaffolded learning, some participants – especially those new to
programming – experienced a steep learning curve. This was especially evident in the second
cohort/workshop, which mostly consisted of PhD students and researchers using oral history as a
method in their work. Indeed, a portion of this cohort had little to no experience of the term
“digital humanities”, or indeed of the methods used in this domain. From our perspective it was
interesting to discover that many oral history practitioners in this session still use the audio to
text method as standard procedure. Therefore, working with the audio or digital files in a
computational manner was completely unfamiliar territory. However, even though some
participants found the workbooks challenging, they indicated to us that by working through the
concepts, ideas and indeed the use cases, they felt inspired to re-think their current forms of
analysis and to investigate how they might incorporate new forms of computational analysis.
One participant, working on an historical collection of oral history interviews, reported that they
could see how using audio analysis might help to reverse engineer the methodology or order of
the original interviews. Other participants noted that from an archival perspective, techniques
used to cluster “related” content might help with cataloguing collections by creating new forms
of metadata. Many participants remarked that although they did not envisage learning the skills
necessary to carry out this kind of analysis, having seen the potential, they could see value in
collaborating with data scientists to explore new approaches, something they had not previously
considered.
These remarks made us reflect on how best such methods could be introduced into
research communities with little or no prior engagement with computational methods. A
common solution is to create a package with a graphical user interface and presets, which users
can employ without conceptual or technical knowledge, yet the real potential of such methods
can only be realized through hands-on, bespoke experimentation with specific real-world
research questions. Our decision to present participants with pre-written executable code was
intended as a compromise between these two positions.
In terms of the technological set up, with further time spent on preparation it would be
preferable to develop a service to support the workshop exercises without tunneling the
connection, which would result in more reliable delivery of the service. HTTP(S)
communication is resilient to the fluctuating quality common in public Wi-Fi networks because
it does not require an uninterrupted connection; instead, connections are created and destroyed
with every interaction.
Provision of server-based development environments is a good fit for cloud-based
computing infrastructures. The cost of running the cloud servers used for the workshops was less
than £1 (approximately $1.30) for each event. In hindsight, given this low cost, server allocation
should have been increased to improve service reliability. During both workshops broken
connections were experienced and servers crashed; however, user experience was preserved by
monitoring the cloud servers, supporting the participants and re-establishing broken connections
as quickly as possible. The lesson learned with regard to connection and server stability was the
extent to which computing resource requirements varied across the activities and participants.
The lesson learned with regard to the provision of virtual machines was that container
technology (Docker) would have been a better choice: Docker overcomes variations in user
software configuration without the need to distribute a full operating system to every user.
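A Docker-based setup could be sketched as follows; the base image, package list and directory layout are illustrative assumptions rather than a tested workshop configuration:

```dockerfile
# Illustrative sketch: bundle the workshop toolchain in a container image
FROM python:3.9-slim
# libsndfile and ffmpeg are needed by librosa for audio decoding
RUN apt-get update && apt-get install -y --no-install-recommends \
        libsndfile1 ffmpeg && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir jupyter librosa scikit-learn matplotlib scipy numpy
COPY workbooks/ /home/workshop/workbooks/
WORKDIR /home/workshop
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

Participants would then need only Docker itself installed; running the image with port 8888 published would give every machine an identical notebook environment regardless of host operating system.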
Conclusion
Our overall aim for this experimental project was to help the digital humanities and oral
history community explore alternatives to the use of textual surrogates in oral history. Using off-
the-shelf tools, we created and disseminated online interactive workbooks which demonstrate
how generic audio analysis methods can be used to extract extra-semantic information from
digital audio files. This approach might be used to complement traditional semantic analyses,
providing automation of existing methods (metadata) or potentially new levels of analysis, such
as interviewee-interviewer dynamics. By running participatory workshops, we tested the
response of a wide range of humanists interested in oral history collections. The workshops
demonstrated that this approach might be of great interest to DH researchers working with large
audio databases, but is unlikely to be rapidly taken up by those working with small data sets or
by those with a preference for manual methods.
Our work suggests great potential for audio analysis in oral history. Refinement of
methods to meet the use cases outlined in Section 3 will require systematic research on a wide
range of large oral history archives in order to establish how well this work can be generalized
and extended. In terms of future adoption in digital humanities communities, as with all
computational analyses, a balance must be sought between providing ready-to-use tools with a
low barrier to entry and nurturing a wider technical and conceptual understanding, such that
members of the community may build and develop their own methods. As computational
literacy grows amongst research communities, we see potential for novel applications of these
methods in the future.
Bibliography
Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian and Lamere, Paul. “The Million
Song Dataset.” Proceedings of the 12th International Society for Music Information
Retrieval Conference (ISMIR 2011).
Boyd, Doug. "OHMS: Enhancing Access to Oral History for Free." Oral History Review
40, no. 1 (2013): 95–106. doi:10.1093/ohr/oht031
Casey, Michael A., Veltkamp, Remco, Goto, Masataka, Leman, Marc, Rhodes, Christophe
and Slaney, Malcolm. "Content-based music information retrieval: Current directions and
future challenges." Proceedings of the IEEE 96, no. 4 (2008): 668–696.
Clement, Tanya, Kraus, Kari, Sayers, J. and Trettien, Whitney “: The Intersections of Sound and Method.” Proceedings of Digital
Humanities 2014. Lausanne, Switzerland https://terpconnect.umd.edu/~oard/pdf/dh14.pdf
(accessed August 14, 2017)
Clement, Tanya E., Tcheng, David, Auvil, Loretta and Borries, Tony “High Performance
Sound Technologies for Access and Scholarship (HiPSTAS) in the Digital Humanities.”
Proceedings of the Association for Information Science and Technology 51 (2014):1–10
doi:10.1002/meet.2014.14505101042
Downie, J. Stephen. "Music information retrieval." Annual Review of Information Science
and Technology 37, no. 1 (2003): 295–340.
Grele, Ronald J., “Oral History as Evidence.” In History of Oral History: Foundations and
Methodology Thomas L. Carlton, Lois E. Myers and Rebecca Sharpless (eds) (UK, 2007),
69.
Kluyver, Thomas, Ragan-Kelley, Benjamin, Pérez, Fernando, Granger, Brian, Bussonnier,
Matthias, Frederic, Jonathan, Kelley, Kyle, Hamrick, Jessica, Grout, Jason, Corlay,
Sylvain, Ivanov, Paul, Avila, Damián, Abdalla, Safia, and Willing, Carol (Jupyter
Development Team) “Jupyter Notebooks – a publishing format for reproducible
computational workflows.” Positioning and Power in Academic Publishing: Players,
Agents and Agendas (2016):87-90. doi: 10.3233/978-1-61499-649-1-87
Tzanetakis, George and Cook, Perry. “Musical genre classification of audio signals.” IEEE
Transactions on Speech and Audio Processing 10, no. 5 (2002): 293–302.
Resources
All workbooks, data and slides from the workshops are deposited in both GitHub and Zenodo:
● ‘Machine Listening for Oral History’, GitHub:
https://github.com/algolistening/MachineListeningforOralHistory
● Eldridge, A., Kiefer, C., Webb, S., Jackson, B., & Baker, J. (2016, July). Music
Information Retrieval Algorithms for Oral History Collections.
Zenodo. http://doi.org/10.5281/zenodo.58336
Python libraries used:
https://www.scipy.org/
http://scikit-learn.org/
https://matplotlib.org/
https://ipython.org/ipython-doc/3/api/generated/IPython.display.html
https://librosa.github.io/librosa/
Other:
https://www.docker.com/