The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval

Bob L. Sturm∗

Early version (pre-print). Published in: Journal of New Music Research, 43(2), 147-172, 2014. https://doi.org/10.1080/09298215.2014.894533

Abstract

A decade has passed since the first review of research on a "flagship application" of music information retrieval (MIR): the problem of music genre recognition (MGR). During this time, about 500 works addressing MGR have been published, and at least 10 campaigns have been run to evaluate MGR systems, which makes MGR one of the most researched areas of MIR. So, where does MGR lie now? We show that in spite of this massive amount of work, MGR does not lie far from where it began, and the paramount reason for this is that most evaluation in MGR lacks validity. We perform a case study of all published research using the most-used benchmark dataset in MGR during the past decade: GTZAN. We show that none of the evaluations in these many works is valid to produce conclusions with respect to recognizing genre, i.e., that a system is using criteria relevant to recognize genre. In fact, the problems of validity in evaluation also affect research in music emotion recognition and autotagging. We conclude by discussing the implications of our work for MGR and MIR in the next ten years.

1 Introduction

"Representing Musical Genre: A State of the Art" (Aucouturier and Pachet, 2003) was published a decade ago now, a few years after the problem of music genre recognition (MGR) was designated a "flagship application" of music information retrieval (MIR) (Aucouturier and Pampalk, 2008).
During that time, there have been at least four other reviews of research on MGR (Scaringella et al., 2006; Dannenberg, 2010; Fu et al., 2011; Sturm, 2012b), nearly 500 published works considering MGR, and at least 10 organized campaigns to evaluate systems proposed for MGR: ISMIR 2004,1 ISMIS 2011,2 and MIREX 2005, 2007-13.3 Figure 1 shows the massive amount of published work on MGR since that of Matityaho and Furst (1995).4

∗ Audio Analysis Lab, AD:MT, Aalborg University Copenhagen, A.C. Meyers Vænge 15, DK-2450 Copenhagen, Denmark, (+45) 99407633, e-mail: bst@create.aau.dk. BLS is supported in part by Independent Postdoc Grant 11-105218 from Det Frie Forskningsråd.
1 http://ismir2004.ismir.net/genre_contest/index.htm
2 http://tunedit.org/challenge/music-retrieval
3 http://www.music-ir.org/mirex/wiki/MIREX_HOME
4 The bibliography and spreadsheet that we use to generate this figure are available here: http://imi.aau.dk/~bst/software.

[Figure 1: Annual numbers of published works in MGR with experimental components, divided into publication venues.]

So, where does MGR now lie? How much progress has been made? One might find an indication by looking at how the construction and performance of systems designed for MGR have changed during the past decade. The reviews by Aucouturier and Pachet (2003), Scaringella et al. (2006) and Fu et al. (2011) all describe a variety of approaches to feature extraction and machine learning that have been explored for MGR, and provide rough comparisons of how MGR systems perform on benchmark datasets and in evaluation campaigns. Conclusions from these evaluations as a whole have been drawn about research progress on the problem of MGR. For instance, Bergstra et al. (2006) — authors of the MGR system having the highest accuracy in MIREX 2005 — write, "Given the steady and significant improvement in classification performance ... we wonder if automatic methods are not already more efficient at learning genres than some people."

In Fig. 2, we plot the highest reported classification accuracies of about 100 MGR systems all evaluated in the benchmark dataset GTZAN (Tzanetakis and Cook, 2002).5 We see there to be many classification accuracies higher than the 61% reported by Tzanetakis and Cook (2002). The maximum of each year shows progress up to 2009, which then appears to stop. Similarly, Humphrey et al. (2013) look at the best accuracies in the MGR task of MIREX from 2007 to 2012 and suggest that progress "is decelerating, if not altogether stalled."

5 The dataset can be downloaded from here: http://marsyas.info/download/data_sets

In spite of all these observations, however, might it be that the apparent progress in MGR is an illusion? Our exhaustive survey (2012b) of nearly 500 publications about MGR shows: of ten evaluation designs used in MGR, one in particular (which we call Classify) appears in 91% of work having an experimental component; and the most-used public dataset is GTZAN, appearing in the evaluations of about 100 published works (23%). These findings are not promising. First, Classify using a dataset having independent variables that are not controlled cannot provide any valid evidence to conclude upon the extent to which an MGR system is recognizing the genres used by music (2013a; 2013b).
In other words, just because an MGR system reproduces all labels of a dataset does not then mean it is making decisions by using criteria relevant to genre (e.g., instrumentation, composition, subject matter). In fact, we have clearly shown (2012c; 2013g) that an MGR system can produce a high accuracy using confounded factors; and when these confounds break, its performance plummets. Second, we have found (2012a; 2013d) that GTZAN has several faults, namely, repetitions, mislabelings, and distortions. Because all points in Fig. 2 use Classify in GTZAN, their meaningfulness is thus "doubly questionable." What valid conclusions, then, can one draw from all this work?

[Figure 2: Highest classification accuracies (%) using all GTZAN as a function of publication year. Solid gray line is our estimate of the "perfect" accuracy in Table 4. Six "×" denote incorrect results, discussed in Section 4.5.]

Our "state of the art" here attempts to serve a different function than all previous reviews of MGR (Aucouturier and Pachet, 2003; Scaringella et al., 2006; Dannenberg, 2010; Fu et al., 2011). First, it aims not to summarize the variety of features and machine learning approaches used in MGR systems over the past ten years, but to look closely at how to interpret Fig. 2. Indeed, we have a best case scenario for drawing conclusions: these evaluations of many different systems use the same kind of evaluation (Classify) and the same dataset (GTZAN). Unfortunately, we find the faults in GTZAN impose an insurmountable impediment to interpreting Fig. 2. It is tempting to think that, since each of these MGR systems faces the same faults in GTZAN, and since those faults are exemplary of real-world data, the results in Fig. 2 are still meaningful for comparing systems. We show this argument to be wrong: the faults simply do not affect the performance of MGR systems in the same ways. Second, this article aims not to focus on MGR, but to look more broadly at how the practice of evaluation in the past decade of MGR research can inform the practice of evaluation in the next decade of MIR research. An obvious lesson from GTZAN is that a researcher must know their data, know real data has faults, and know faults have real impacts on evaluation; but we also distill five other important "guidelines": 1) Define problems with use cases and formalism; 2) Design valid and relevant experiments; 3) Perform deep system analysis; 4) Acknowledge limitations and proceed with skepticism; and 5) Make reproducible work reproducible.

In the next section, we briefly review "the problem of music genre recognition." The third section provides a comprehensive survey of how GTZAN has been and is being used in MGR research. In the fourth section, we analyze GTZAN, and identify several of its faults. In the fifth section, we test the real effects of its faults on the evaluation of several categorically different MGR systems. Finally, we conclude with a discussion of the implications of this work on future work in MGR, and in MIR more broadly.
2 The Problem of Music Genre Recognition (in brief)

It is rare to find in any reference of our survey (2012b) a formal or explicit definition of MGR; and few works explicitly define "genre," instead deferring to describing why genre is useful, e.g., to explore music collections (Aucouturier and Pachet, 2003). Aucouturier and Pachet (2003) describe MGR as "extracting genre information automatically from the audio signal." Scaringella et al. (2006) and Tzanetakis and Cook (2002) both mention automatically arranging music titles in genre taxonomies. Since 91% of MGR evaluation employs Classify in datasets with uncontrolled independent variables (Sturm, 2012b), the majority of MGR research implicitly interprets MGR as reproducing by any means possible the "ground truth" genre labels of a music dataset (Sturm, 2013b). Almost all work in our survey (2012b) thus treats genre in an Aristotelean way, assuming music belongs to categories like a specimen belongs to a species, which belongs to a genus, and so on.

So, what is genre? The work of Fabbri (1980, 1999) — dealt with in depth by Kemp (2004), and cited by only a few MGR works, e.g., McKay and Fujinaga (2006) and Craft (2007) — essentially conceives of music genre as "a set of musical events (real or possible) whose course is governed by a definite set of socially accepted rules" (1980). This "course" applies to the "musical events (real or possible)", where a "musical event" Fabbri defines as being a "type of activity ... involving sound" (1980). Adopting the language of set theory, Fabbri speaks of "genres" and "sub-genres" as unions and subsets of genres, each having well-defined boundaries, at least for some "community." He goes so far as to suggest one can construct a matrix of rules crossed with subgenres of a genre, with each entry showing the applicability of a particular rule for a particular subgenre. By consequence, producing and using such a master list amounts to nothing more than Aristotelean categorization: to which set does a piece of music belong?

A view of genre alternative to that of Fabbri is provided by Frow (2005), even though his perspective is that of literature and not music. Frow argues that "genre" is a dynamic collection of rules and criteria specifying how a person approaches, interprets, describes, uses, judges, and so on, forms of human communication in particular contexts. These rules and criteria consequently precipitate from human communication as an aid for people to interact with information and with each other in the world.6 For instance, genre helps a person to read, use, and judge a published work as a scientific article, as opposed to a newspaper column. Genre helps an author write and sell a written work as a publish-worthy scientific article, as opposed to a newspaper column. This is not just to say that the "medium is the message" (McLuhan, 1964), but also that a person reading, using, and judging a published scientific journal article is using cues — both intrinsic and extrinsic — to justify their particular treatment of it as a scientific journal article, and not as a newspaper column (although it could be read as a newspaper column in a different context). Frow's view of genre, then, suggests that music does not belong to genre, but that the human creation and consumption of music in particular contexts necessarily use genre.

6 Fabbri (1999) also notes that genre helps "to speed up communication."
Hence, instead of "to which genre does a piece of music belong," the meaningful question for Frow is, "what genres should I use to interpret, listen to, and describe a piece of music in a particular context?" In Frow's conception of genre, it becomes clear why any attempt at building a taxonomy or a list of characteristics for categorizing music into musical genres will fail: they lack the human and the context, both necessary aspects of genre.

Essentially unchallenged in MGR is the base assumption that the nature of "genre" is such that music belongs to categories, and that it makes sense to talk about "boundaries" between these categories. Perhaps it is a testament to the pervasiveness of the Aristotelean categorization of the world (Bowker and Star, 1999) that alternative conceptions of music genre like Frow's have by and large gone unnoticed in MGR. Of course, some have argued that MGR is ill-defined (Pachet and Cazaly, 2000; Aucouturier and Pachet, 2003; McKay and Fujinaga, 2006), that some genre categories are arbitrary and unique to each person (Craft, 2007; Sordo et al., 2008), that they are motivated by industry (Aucouturier and Pachet, 2003), and/or that they come in large part from domains outside the purview of signal processing (Craft, 2007; Wiggins, 2009). Processing only sampled music signals necessarily ignores the "extrinsic" properties that contribute to judgements of genre (Aucouturier and Pachet, 2003; McKay and Fujinaga, 2006; Wiggins, 2009). Hence, it is of little surprise that the subjectivity and commercial motivations of genre wreak havoc on any MGR system.

The value of MGR has also been debated. Some have argued that the value of producing genre labels for music already labeled by record companies is limited. Some have suggested that MGR is instead encompassed by other pursuits, e.g., music similarity (Pampalk, 2006), music autotagging (Aucouturier and Pampalk, 2008; Fu et al., 2011), or extracting "musically relevant data" (Serra et al., 2013). Others have argued that, in spite of its subjective nature, there is evidence that genre is not so arbitrary (Lippens et al., 2004; Gjerdingen and Perrott, 2008; Sordo et al., 2008), that some genres can be defined by specific criteria (Barbedo and Lopes, 2007), and that research on MGR is still a worthy pursuit (McKay and Fujinaga, 2006; Scaringella et al., 2006). Many works also suggest that MGR provides a convenient testbed for comparing new audio features and/or machine learning approaches, e.g., Andén and Mallat (2011); Andén and Mallat (2013).

While most of the work we survey (2012b) implicitly poses MGR as an Aristotelean categorization, the goal posed by Matityaho and Furst (1995) comes closer to the conception of Frow, and to one that is much more useful than the reproduction of "ground truth" genre labels: "[building] a phenomenological model that [imitates] the human ability to distinguish between music [genres]." Although it is missing the context essential to genre for Frow, and although the empirical work of Matityaho and Furst (1995) still treats genre in an Aristotelean manner, this goal places humans at the center, establishes the goal as imitation, and prescribes the use of humans for evaluation. This has led us to define the "principal goals of MGR" (2013b): to imitate the human ability to organize, recognize, distinguish between, and imitate genres used by music.
"Organization" implies finding and expressing characteristics used by particular genres, e.g., "blues is strophic"; "recognition" implies identification of genres, e.g., "that sounds like blues because ..."; "distinguishing" implies describing why, or the extents to which, some music uses some genres but not others, e.g., "that does not sound like blues because ..."; and "imitation" implies being able to exemplify the use of particular genres, e.g., "play the Bach piece as if it is blues." Stated in such a way, MGR becomes much richer than generating genre-indicative labels for music signals.

3 A Survey and Analysis of GTZAN

In this section, we look closely at GTZAN since its creation a decade ago: how it has been used, of what it is composed, what its faults are, how its faults affect evaluation, and what this implies for MGR in the past decade, and GTZAN in the next decade.

3.1 How has GTZAN been used?

Figure 3 shows how the number of publications that use GTZAN has increased since its creation in 2002. We see that it has been used more in the past three years than in its first eight years. The next most-used public dataset is ISMIR2004,7 which was created for the MGR task of ISMIR 2004. That dataset appears in 75 works, 30 of which use GTZAN as well. Of the 100 works that use GTZAN, 49 of them use only GTZAN.8

7 http://kom.aau.dk/~jhj/files/ismir2004genre/
8 All relevant references are available at: http://imi.aau.dk/~bst/software.

[Figure 3: Annual numbers of published works in MGR with experimental components, divided into ones that use and do not use GTZAN.]

In our review of evaluation in MGR (2012b), we delimit ten different evaluation designs. Of the 100 works using GTZAN, 96 employ the evaluation design Classify (an excerpt is assigned labels, which are compared against a "ground truth"). In seven works, GTZAN is used with the evaluation design Retrieve (a query is used to find similar music, and the labels of the retrieved items are compared); and one work (Barreira et al., 2011) uses GTZAN with the evaluation design Cluster (clustering of the dataset and then a comparison of the labels in the resulting clusters). Our work (2012c) uses Compose, where an MGR system generates new music it scores as highly representative of each GTZAN category, which we then test for identifiability using a formal listening test. We find a few other uses of GTZAN. Markov and Matsui (2012a,b) learn bases from GTZAN, and then apply these codebooks to classify the genres of ISMIR2004. In our work (2013e), we train classifiers using ISMIR2004, and then attempt to detect all excerpts in GTZAN Classical (and two in GTZAN Jazz). As described above, Fig. 2 shows the highest classification accuracies reported in the papers that consider the 10-class problem of GTZAN (we remove duplicated experiments, e.g., Lidy (2006) contains the results reported in Lidy and Rauber (2005)).

Among the 100 published works using GTZAN, we find only five outside ours (2012c; 2013a; 2013f; 2013e) that indicate someone has listened to at least some of GTZAN. The first appears to be Li and Sleep (2005), who find that "two closely numbered files in each genre tend to sound similar than the files numbered far [apart]." Bergstra et al. (2006) note that, "To our ears, the examples are well-labeled ...
our impression from listening to the music is that no artist appears twice." This is contradicted by Seyerlehner et al. (2010), who predict "an artist effect ... as listening to some of the songs reveals that some artists are represented with several songs." In his doctoral dissertation, Seyerlehner (2010) infers there to be a significant replication of artists in GTZAN because of how classifiers trained and tested with GTZAN perform as compared to other artist-filtered datasets. Furthermore, very few people have mentioned specific faults in GTZAN: Hartmann (2011) notes finding seven duplicates; and Li and Chan (2011) — who have manually estimated keys for all GTZAN excerpts9 — remember hearing some repetitions.10 Hence, it appears that GTZAN has by and large been assumed to have satisfactory integrity for MGR evaluation.

9 Available here: http://visal.cs.cityu.edu.hk/downloads/#gtzankeys
10 Personal communication.

3.2 What is in GTZAN?

Until our work (2012a), GTZAN never had metadata identifying its contents because those details were not assembled during compilation. Hence, every MGR evaluation using GTZAN, save ours (2012c; 2013a; 2013e; 2013d), has not been able to take into account its contents. We now identify the excerpts in GTZAN, determine how music by specific artists composes each category, and survey the tags people have applied to the music and/or artist in order to obtain an idea of what each GTZAN category means.

We use the Echo Nest Musical Fingerprinter (ENMFP)11 to generate a fingerprint of an excerpt in GTZAN and then to query the Echo Nest database having over 30,000,000 songs (at the time of this writing). The second column of Table 1 shows that this identifies only 606 of the 1000 excerpts. We manually correct titles and artists as much as possible, e.g., we reduce "River Rat Jimmy (Album Version)" to "River Rat Jimmy"; and "Bach - The #1 Bach Album (Disc 2) - 13 - Ich steh mit einem Fuss im Grabe, BWV 156 Sinfonia" to "Ich steh mit einem Fuss im Grabe, BWV 156 Sinfonia"; and we correct "Leonard Bernstein [Piano], Rhapsody in Blue" to "George Gershwin" and "Rhapsody in Blue." We find four misidentifications: Country 1512 is misidentified as being by Waylon Jennings (it is by George Jones); Pop 65 is misidentified as being by Mariah Carey (it is by Prince); Disco 79 is misidentified as "Love Games" by Gazeebo (it is "Love Is Just The Game" by Peter Brown); and Metal 39 is Metallica playing "Star Wars Imperial March," but ENMFP identifies it as a track on a CD for improving sleep.13 We then manually identify 313 more excerpts, but have yet to identify the remaining 81 excerpts.14

11 http://developer.echonest.com
12 This is the file "country.00015.wav" in GTZAN.
13 "Power Nap" by J. S. Epperson (Binaural Beats Entrainment), 2010.
14 Our machine-readable index of this metadata is available at: http://imi.aau.dk/~bst/software.

Label     | ENMFP | self  | # songs (# tags) | # artists (# tags)
Blues     |  63   |  100  |  75 (2904)       |  25 (2061)
Classical |  63   |   97  |  12 (400)        |  85 (4044)
Country   |  54   |   95  |  81 (2475)       |  13 (585)
Disco     |  52   |   90  |  82 (5041)       |   8 (194)
Hip hop   |  64   |   96  |  90 (6289)       |   5 (263)
Jazz      |  65   |   80  |  54 (1127)       |  26 (2102)
Metal     |  65   |   83  |  73 (5579)       |  10 (786)
Pop       |  59   |   96  |  89 (7173)       |   7 (665)
Reggae    |  54   |   82  |  73 (4352)       |   9 (616)
Rock      |  67   |  100  |  99 (7227)       |   1 (100)
Total     | 60.6% | 91.9% | 72.8% (42567)    | 18.9% (11416)

Table 1: For each category of GTZAN: number of excerpts we identify by fingerprint (ENMFP) and then by searching manually (self); number of songs tagged in last.fm (and number of those tags having "count" larger than 0); and, for songs not found, number of artists tagged in last.fm (and number of tags having "count" larger than 0). Retrieved Dec. 25, 2012, 21h.

Figure 4 shows how each GTZAN category is composed of music by particular artists.
We see only nine artists represented in GTZAN Blues. GTZAN Reggae is the category with the most excerpts from a single artist: 35 excerpts of Bob Marley. The category with the most artist diversity appears to be Disco, where we find at least 55 different artists. From this, we can bound the number of artists in GTZAN, which has until this time been unknown (Seyerlehner, 2010; Seyerlehner et al., 2010). Assuming each unidentified excerpt comes from different artists than those we have already identified, the total number of artists represented in GTZAN cannot be larger than 329. If all unidentified excerpts are from the artists we have already identified, then the smallest this number can be is 248.

We now wish to determine the content composing each GTZAN category. In our previous analysis of GTZAN (2012a), we assume that since there is a category called, e.g., "Country," then GTZAN Country excerpts should possess typical and distinguishing characteristics of music using the country genre (Ammer, 2004): stringed instruments such as guitar, mandolin, banjo; emphasized "twang" in playing and singing; lyrics about patriotism, hard work and hard times; and so on. This led us to the claim that at least seven GTZAN Country excerpts are mislabeled because they exemplify few of these characteristics. We find other work that assumes the genres of music datasets overlap because they share the same genre labels, e.g., the taxonomies of Moerchen et al. (2006) and Guaus (2009). In this work, however, we do not make the assumption that the excerpts in GTZAN Country possess the typical and distinguishing characteristics of music using the country genre. In other words, we now consider a GTZAN category name as "short hand" for the collection of consistent and/or contradictory concepts and criteria, both objective and subjective, that Tzanetakis employed in assembling music excerpts to form that GTZAN category.

To obtain an idea of the content composing each GTZAN category, we query the application programming interface provided by last.fm, and retrieve the "tags" that users of the service (whom we will call "taggers") have entered for each song or artist we identify in GTZAN. A tag is a word or phrase a tagger associates with an artist, song, album, and so on, for any number of reasons (Bertin-Mahieux et al., 2010), e.g., to make a music collection more useful to themselves, to help others discover new music, or to promote their own music or criticize music they do not like. A past analysis of last.fm tags (Bertin-Mahieux et al., 2008) finds that they are most often genre labels (e.g., "blues"), but they can also be instrumentation (e.g., "female vocalists"), tempo (e.g., "120 bpm"), mood (e.g., "happy"), how the tagger uses the music (e.g., "exercise"), lyrics (e.g., "fa la la la la"), the band (e.g., "The Rolling Stones"), or something else (e.g., "favorite song of all time") (Law, 2011).
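The tag retrieval just described can be sketched as follows. This is a minimal illustration in Python, not the code used for our analysis; it assumes the public last.fm API methods track.getTopTags and artist.getTopTags, and LASTFM_KEY is a hypothetical placeholder for an API key.

import requests

API_URL = "http://ws.audioscrobbler.com/2.0/"
LASTFM_KEY = "..."  # hypothetical placeholder for an API key

def top_tags(artist, track=None):
    # Request the top tags of a song if a track title is given, otherwise of the artist.
    method = "track.getTopTags" if track else "artist.getTopTags"
    params = {"method": method, "artist": artist, "api_key": LASTFM_KEY, "format": "json"}
    if track:
        params["track"] = track
    data = requests.get(API_URL, params=params, timeout=10).json()
    tags = data.get("toptags", {}).get("tag", [])
    # Each tag comes with a "count" between 0 and 100; keep only those with non-zero count.
    return [(t["name"], int(t["count"])) for t in tags if int(t["count"]) > 0]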
[Figure 4: Artist composition of each GTZAN category. We do not include unidentified excerpts.]

[Figure 5: Category top tags for GTZAN from last.fm.]

The collection of tags by last.fm is far from being a controlled process. Any given tagger is not necessarily well-versed in musicology, knowledgeable about the history of musical styles, able to correctly recognize particular instruments, or even acting in a benevolent manner; and any group of taggers is not necessarily using the same criteria when they all decide on tagging a song with a particular tag appearing indicative of, e.g., genre or emotion. For these reasons, there is apprehension about using such a resource for music information research (Aucouturier, 2009). However, last.fm tags number in the millions and come from tens of thousands of users; and, furthermore, each tag is accompanied by a "count" parameter reflecting the percentage of taggers of a song or artist that choose that particular tag (Levy and Sandler, 2009). A tag for a song having a count of 100 means that tag is selected by all taggers of that song (even if there is only one tagger), and 0 means the tag is applied by the fewest (last.fm rounds down all percentages less than 1).
We cannot assume that each tag selected by a tagger for a given song is done independently of those given by previous taggers, but it is not unreasonable to interpret a tag with a high count as suggesting a kind of consensus that last.fm taggers find the tag very relevant for that song, whatever that may mean. Though last.fm tags have found use in other MIR research (Bertin-Mahieux et al., 2008; Barrington et al., 2008; Levy and Sandler, 2009), we proceed cautiously about interpreting the meanings behind the last.fm tags retrieved for the identified contents in GTZAN. With respect to our goals here, we assume some tags are meant by taggers to be descriptive of a song or artist.

Using the index of GTZAN we create above, the fourth column of Table 1 shows the number of songs in GTZAN we identify that have last.fm tags, and the number of tags with non-zero count (we keep only those tags with counts greater than 0). When we do not find tags for a song, we request instead the tags for the artist. For instance, though we identify all 100 excerpts in Blues, only 75 of the songs are tagged on last.fm. Of these, we get 2,904 tags with non-zero counts. For the remaining 25 songs, we retrieve 2,061 tags from those given to the artists. We thus assume the tags that taggers give to an artist would also be given to the particular song. Furthermore, since each GTZAN excerpt is a 30 s excerpt of a song, we assume the tags given the song or artist would also be given to the excerpt.

With our considerations and reservations clearly stated, we now attempt to shed light on the content of each GTZAN category using the last.fm tags. First, we define the top tags of a song or artist as those that have counts above 50. To compare tags, we remove spaces, hyphens, and capitalization from all of them. For example, the tags "Hip hop", "hip-hop" and "hip hop" all become "hiphop." Then, we find for each unique top tag in a GTZAN category the percentage of identified excerpts in that category having that top tag. We define as category top tags all those unique top tags of a GTZAN category that are represented by at least 5% of the excerpts identified in that category. With the numbers of excerpts we have identified in each category, this means a category top tag is one appearing as a top tag in at least 4 excerpts of a GTZAN category.

Figure 5 shows the resulting category top tags for GTZAN. We can see large differences in the numbers of category top tags, where GTZAN Rock has the most, and GTZAN Classical and GTZAN Hip hop have the least. As seen in Table 1, most of the tags for excerpts in GTZAN Classical come from those that taggers have entered for artists, which explains why the tag "composer" appears. Most of the category top tags appear indicative of genre, and the one appearing most often in each GTZAN category is the category label, except for Metal. We also see signs of a specific tagger in three category top tags of GTZAN Jazz.
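A minimal sketch of the category-top-tag computation defined above; excerpt_tags is a hypothetical mapping from a GTZAN category to the (tag, count) lists of its identified excerpts, and the thresholds follow the definitions in the text.

from collections import Counter

def normalize(tag):
    # "Hip hop", "hip-hop" and "hip hop" all become "hiphop".
    return tag.lower().replace(" ", "").replace("-", "")

def category_top_tags(excerpt_tags, top_threshold=50, support=0.05):
    # excerpt_tags maps a GTZAN category to a list of excerpts,
    # each excerpt being a list of (tag, count) pairs.
    result = {}
    for category, excerpts in excerpt_tags.items():
        counts = Counter()
        for tags in excerpts:
            # The top tags of one excerpt: normalized tags with count above 50.
            counts.update({normalize(t) for t, c in tags if c > top_threshold})
        n_identified = len(excerpts)
        # Category top tags: top tags appearing for at least 5% of the identified excerpts.
        result[category] = sorted(t for t, k in counts.items() if k / n_identified >= support)
    return result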
With our considerations above, and by listening to the entire dataset, it does not seem unreasonable to make some claims about the content of each GTZAN category. For instance, the category top tags of GTZAN Blues reveal its contents to include a large amount of music tagged "cajun" and "zydeco" by a majority of taggers. GTZAN Disco appears broader than dance music from the late seventies (Shapiro, 2005), also including a significant amount of music that has been tagged "80s", "pop" and "funk" by a majority of taggers. Listening to the excerpts in these categories confirms these observations.

3.3 What faults does GTZAN have?

We now delimit three kinds of faults in GTZAN: repetitions, mislabelings, and distortions. Table 2 summarizes these, which we reproduce online with sound examples.15

15 http://imi.aau.dk/~bst/research/GTZANtable2

3.3.1 Repetitions

We consider four kinds of repetition, from high to low specificity: exact, recording, artist, and version. We define an exact repetition as when two excerpts are the same to such a degree that their time-frequency fingerprints are the same. This means the excerpts are not only extracted from the same recording, they are essentially the same excerpt, either sample for sample (up to a multiplicative factor), or displaced in time by only a small amount. To find exact repetitions, we implement a simplified version of the Shazam fingerprint (Wang, 2003). This means we compute sets of anchors in the time-frequency plane, compare hashes of the anchors of every pair of excerpts, and then listen to those excerpts sharing many of the same anchors to confirm them to be exact replicas. Figure 6 shows the clear appearance of exact repetitions in GTZAN Jazz. The second column of Table 2 lists these. Our comparison of hashes across categories reveals one exact repetition in two categories: the same excerpt of "Tie Your Mother Down" by Queen appears as Rock 58 and Metal 16. In total, we find 50 exact repetitions in GTZAN.

[Figure 6: Taken from GTZAN Jazz, exact repetitions appear clearly with pair-wise comparisons of their fingerprint hashes. The darker a square, the higher the number of matching hashes.]

We define a recording repetition as when two excerpts come from the same recording, but are not detected with the fingerprint comparison detailed above. We find these by artist name and song repetitions in the index we create above, and by listening. For instance, Country 8 and 51 are both from "Never Knew Lonely" by Vince Gill, but excerpt 8 comes from later in the recording than excerpt 51. The second and third columns of Table 2 show the excerpts we suspect of coming from the same recording. We find GTZAN Pop has the most exact and suspected recording repetitions (16): "Lady Marmalade" sung by Christina Aguilera et al., as well as "Bootylicious" by Destiny's Child, each appear four times. In total, we find 21 suspected recording repetitions in GTZAN.

The last two kinds of repetitions are not necessarily "faults", but we show in Section 3.4 why they must be taken into consideration when using GTZAN. We define artist repetition as excerpts performed by the same artist. We find these easily using the index we create above. Figure 4 and Table 2 show how every GTZAN category has artist repetition. Finally, we define a version repetition as when two excerpts are of the same song but performed differently. This could be a studio version, a live version, performed by the same or different artists (covers), or possibly a remix. We identify these with the index we create above, and then confirm by listening. For instance, Classical 44 and 48 are from "Rhapsody in Blue" by George Gershwin, but presumably performed by different orchestras. Metal 33 is "Enter Sandman" by Metallica, and Metal 74 is a parody of it. Pop 16 and 17 are both "I can't get no satisfaction" by Britney Spears, but the latter is performed live. In total, we find 13 version repetitions in GTZAN.
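The exact-repetition detection described above can be sketched roughly as follows — an illustration under our own assumptions about peak picking and hashing, not the implementation used here: spectrogram peaks serve as anchors, nearby anchor pairs are hashed, and pairs of excerpts sharing many hashes are flagged for listening.

import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def anchor_hashes(y, fs=22050, n_fft=1024, fan_out=5):
    # Anchors: local maxima of the magnitude spectrogram above its mean level.
    f, t, Z = stft(y, fs=fs, nperseg=n_fft)
    S = np.abs(Z)
    peaks = np.argwhere((S == maximum_filter(S, size=(15, 15))) & (S > S.mean()))
    peaks = peaks[np.argsort(peaks[:, 1])]  # sort anchors by time frame
    hashes = set()
    for i, (f1, t1) in enumerate(peaks):
        # Pair each anchor with a few of its successors; the hash is (f1, f2, time difference).
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.add((int(f1), int(f2), int(t2 - t1)))
    return hashes

def matching_hashes(y1, y2):
    # A large number of shared hashes flags a pair of excerpts for listening.
    return len(anchor_hashes(y1) & anchor_hashes(y2))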
3.3.2 Potential mislabelings

The collection of concepts and criteria Tzanetakis used to assemble GTZAN is of course unobservable; and in some sense, the use of GTZAN for training and testing an MGR system aims to reproduce or uncover it by reverse engineering. Regardless of whether GTZAN was constructed in a well-definable manner, or whether it makes sense to restrict the membership of an excerpt of music to one GTZAN category, we are interested here in a different question: are any excerpts mislabeled? In other words, we wish to determine whether the excerpts might be arranged in a way such that their membership to one GTZAN category, as well as their exclusion from every other GTZAN category, is in some sense "less debatable" to a system, human or artificial.

Toward this end, we previously (2012a) considered two kinds of mislabelings in GTZAN: "contentious" and "conspicuous." We based these upon non-concrete criteria formed loosely around musicological principles associated with the names of the categories in GTZAN. Now that we have shown in Section 3.2 that those names do not necessarily reflect the content of the categories, we instead consider here the content of each GTZAN category. In summary, we identify excerpts that might be "better" placed in another category, placed across several categories, or excluded from the dataset altogether, by comparing their tags to the category top tags of each GTZAN category.

We consider an excerpt potentially mislabeled if none of its tags match the category top tags of its category, or if the "number of votes" for its own category is equal to or less than the "number of votes" for another category. We define the number of votes in a category as the sum of the counts of tags for an excerpt matching category top tags. (We consider a match as when the category top tag appears in a tag, e.g., "blues" appears in "blues guitar".) As an example, consider the category top tags of GTZAN Country in Fig. 5, and the pairs of tags and counts we find for Country 39 ("Johnnie Can't Dance" by Wayne Toups & Zydecajun): {("zydeco", 100), ("cajun", 50), ("folk", 33), ("louisiana", 33), ("bayou", 16), ("cruise", 16), ("Accordeon", 16), ("New Orleans", 16), ("accordion", 16), ("everything", 16), ("boogie", 16), ("dance", 16), ("swamp", 16), ("country", 16), ("french", 16), ("rock", 16)}. We find its first two tags among the category top tags of GTZAN Blues, and so the number of votes for GTZAN Blues is 100 + 50 = 150. We only find one of its tags ("country") among the category top tags of GTZAN Country, and so the number of votes there is 16. Hence, we argue that Country 39 is potentially mislabeled. Listening to the excerpt in relation to all the others in GTZAN Country, as well as the music tagged "cajun" and "zydeco" in GTZAN Blues, also supports this conclusion. It is important to emphasize that we are not claiming Country 39 should be labeled GTZAN Blues, but only that if an MGR system learning from GTZAN Blues labels Country 39 as "blues", then that might not be considered a mistake.

Of the excerpts we have identified in GTZAN, we find 13 with no tags that match the category top tags of their category, 43 that have more votes in the category top tags of a different category, and 18 that have the same number of votes in their own category and at least one other. Of these, 20 appear to be due to a lack of tags, and two appear to be a result of bad tags. For instance, among the tags with count of 100 for "Can't Do Nuttin' For Ya, Man!" by Public Enemy (Hip hop 93) are "glam rock", "Symphonic Rock", "instrumental rock", "New York Punk" and "Progressive rock." We list the remaining 52 potential misclassifications in Table 2.
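A minimal sketch, reusing normalize() and the category top tags from the earlier sketch, of the vote counting used to flag potentially mislabeled excerpts; it illustrates the rule stated above and is not the script used to produce Table 2.

def votes(tags, top_tags):
    # A match is when a category top tag appears inside the normalized tag,
    # e.g., "blues" matches "blues guitar".
    return sum(c for t, c in tags if any(top in normalize(t) for top in top_tags))

def potentially_mislabeled(tags, own_category, cat_top):
    # cat_top: category top tags per GTZAN category, as computed in the earlier sketch.
    v = {category: votes(tags, top) for category, top in cat_top.items()}
    # Flagged if no tag matches the excerpt's own category, or another category gets at
    # least as many votes; e.g., Country 39 gets 150 votes for Blues but only 16 for Country.
    return v[own_category] == 0 or any(v[c] >= v[own_category] for c in v if c != own_category)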
As we have pointed out (2012a; 2012c), GTZAN has other potential problems with excerpts in its categories. Disco 47 is of Barbra Streisand and Donna Summer singing the introduction to "No More Tears", but an excerpt from a later point in this song might better exemplify GTZAN Disco. Hip hop 44 is of "Guantanamera" by Wyclef Jean, but the majority of the excerpt is a sample of musicians playing the traditional Cuban song "Guantanamera"; hence, GTZAN Hip hop might or might not be appropriate. Hip hop 5 and 30 are Drum and Bass dance music that are quite unique among the rest of the excerpts; and Reggae 51 and 55 are electronic dance music that are again quite contrasting to the other excerpts. Finally, Reggae 73 and 74 are exact replicas of "Hip-Hopera" by Bounty Killer, which might be more appropriate in GTZAN Hip hop. We do not include these as potential mislabelings in Table 2.

3.3.3 Distortions

The last column of Table 2 lists some distortions we find by listening to every excerpt in GTZAN. This dataset was purposely created to have a variety of fidelities in the excerpts (Tzanetakis and Cook, 2002); however, one of the excerpts (Reggae 86) is so severely distorted that the value of its last 25 seconds is debatable.

Blues — Artist repetitions: John Lee Hooker (0-11); Robert Johnson (12-28); Kelly Joe Phelps (29-39); Stevie Ray Vaughan (40-49); Magic Slim (50-60); Clifton Chenier (61-72); Buckwheat Zydeco (73-84); Hot Toddy (85-97); Albert Collins (98,99).

Classical — Exact/recording repetitions: (42,53), (51,80). Artist repetitions: J. S. Bach (00-10); Mozart (11-29); Debussy (30-33); Ravel (34-37); Dutilleux (38-41); Schubert (63-67); Haydn (68-76); Grainger (82-88); Vivaldi (89-99); and others. Version repetitions: (44,48). Distortions: static (49).

Country — Exact/recording repetitions: (08,51), (52,60). Artist repetitions: Willie Nelson (19,26,65-80); Vince Gill (50-64); Brad Paisley (81-93); George Strait (94-99); and others. Version repetitions: (46,47). Potential mislabelings: Ray Peterson "Tell Laura I Love Her" (20); Burt Bacharach "Raindrops Keep Falling on my Head" (21); Karl Denver "Love Me With All Your Heart" (22); Wayne Toups & Zydecajun "Johnnie Can't Dance" (39); Johnny Preston "Running Bear" (48). Distortions: static distortion (2).

Disco — Exact/recording repetitions: (50,51,70), (55,60,89), (71,74), (98,99), (38,78). Artist repetitions: Gloria Gaynor (1,21,38,78); Ottawan (17,24,44,45); The Gibson Brothers (22,28,30,35,37); KC and The Sunshine Band (49-51,70,71,73,74); ABBA (67,68,72); and others. Version repetitions: (66,69). Potential mislabelings: Boz Scaggs "Lowdown" (00); Cheryl Lynn "Encore" (11); The Sugarhill Gang "Rapper's Delight" (27); Evelyn Thomas "Heartless" (29); Barbra Streisand and Donna Summer "No More Tears (Enough Is Enough)" (47); Tom Tom Club "Wordy Rappinghood" (85); Blondie "Heart Of Glass" (92); Bronski Beat "WHY?" (94). Distortions: clipping distortion (63).

Hip hop — Exact/recording repetitions: (39,45), (76,78), (01,42), (46,65), (47,67), (48,68), (49,69), (50,72). Artist repetitions: Wu-Tang Clan (1,7,41,42); Beastie Boys (8-25); A Tribe Called Quest (46-51,62-75); Cypress Hill (55-61); Public Enemy (81-98); and others. Version repetitions: (02,32). Potential mislabelings: 3LW "No More (Baby I'ma Do Right)" (26); Aaliyah "Try Again" (29); Pink "Can't Take Me Home" (31); Lauryn Hill "Ex-Factor" (40). Distortions: clipping distortion (3,5); skip at start (38).

Jazz — Exact/recording repetitions: (33,51), (34,53), (35,55), (36,58), (37,60), (38,62), (39,65), (40,67), (42,68), (43,69), (44,70), (45,71), (46,72). Artist repetitions: James Carter (2-10); Joe Lovano (11-24); Branford Marsalis Trio (25-32); Coleman Hawkins (33-46,51,53,55,57,58,60,62,65,67-72); Dexter Gordon (73-77); Miles Davis (87-92); Joe Henderson (94-99); and others. Potential mislabelings: Leonard Bernstein "On the Town: Three Dance Episodes, Mvt. 1" (00) and "Symphonic dances from West Side Story, Prologue" (01). Distortions: clipping distortion (52,54,66).

Metal — Exact/recording repetitions: (04,13), (34,94), (40,61), (41,62), (42,63), (43,64), (44,65), (45,66); (58) is Rock (16). Artist repetitions: Dark Tranquillity (12-15); Dio (40-45,61-66); The New Bomb Turks (46-57); Queen (58-60); Metallica (33,38,73,75,78,83,87); Iron Maiden (2,5,34,92-94); Rage Against the Machine (95-99); and others. Version repetitions: (33,74); (85) is Ozzy Osbourne covering Disco (14). Potential mislabelings: Creed "I'm Eighteen" (21); Living Colour "Glamour Boys" (29); The New Bomb Turks "Hammerless Nail" (46), "Jukebox Lean" (50), "Jeers of a Clown" (51); Queen "Tie Your Mother Down" (58), "Tear it up" (59), "We Will Rock You" (60). Distortions: clipping distortion (33,73,84).

Pop — Exact/recording repetitions: (15,22), (30,31), (45,46), (47,80), (52,57), (54,60), (56,59), (67,71), (87,90), (68,73), (15,21,22), (47,48,51), (52,54), (57,60). Artist repetitions: Mandy Moore (00,87-96); Mariah Carey (2,97-99); Alanis Morissette (3-9); Celine Dion (11,39,40); Britney Spears (15-38); Christina Aguilera (44-51,80); Destiny's Child (52-62); Janet Jackson (67-73); Jennifer Lopez (74-78,82); Madonna (84-86); and others. Version repetitions: (10,14), (16,17), (74,77), (75,82), (88,89), (93,94). Potential mislabelings: Diana Ross "Ain't No Mountain High Enough" (63); Prince "The Beautiful Ones" (65); Kate Bush "Cloudbusting" (79); Ladysmith Black Mambazo "Leaning On The Everlasting Arm" (81); Madonna "Cherish" (86). Distortions: (37) is from the same recording as (15,21,22) but with sound effects.

Reggae — Exact/recording repetitions: (03,54), (05,56), (08,57), (10,60), (13,58), (41,69), (73,74), (80,81,82), (75,91,92), (07,59), (33,44), (85,96). Artist repetitions: Bob Marley (00-27,54-60); Dennis Brown (46-48,64-68,71); Prince Buster (85,94-99); Burning Spear (33,42,44,50,63); Gregory Isaacs (70,76-78); and others. Version repetitions: (23,55). Potential mislabelings: Pras "Ghetto Supastar (That Is What You Are)" (52); Marcia Griffiths "Electric Boogie" (88). Distortions: last 25 s of (86) are useless.

Rock — Exact/recording repetitions: (16) is Metal (58). Artist repetitions: Morphine (0-9); Ani DiFranco (10-15); Queen (16-26); The Rolling Stones (28-31,33,35,37); Led Zeppelin (32,39-48); Simple Minds (49-56); Sting (57-63); Jethro Tull (64-70); Simply Red (71-78); Survivor (79-85); The Stone Roses (91-99). Potential mislabelings: Morphine "Hanging On A Curtain" (9); Queen "(You're So Square) Baby I Don't Care" (20); Billy Joel "Movin' Out" (36); Guns N' Roses "Knockin' On Heaven's Door" (38); Led Zeppelin "The Song Remains The Same" (39), "The Crunge" (40), "Dancing Days" (41), "The Ocean" (43), "Ten Years Gone" (45), "Night Flight" (46), "The Wanton Song" (47), "Boogie With Stu" (48); Simply Red "She's Got It Bad" (75), "Wonderland" (78); Survivor "Is This Love" (80), "Desperate Dreams" (83), "How Much Love" (84); The Tokens "The Lion Sleeps Tonight" (90). Distortions: jitter (27).

Table 2: The repetitions, potential mislabelings and distortions we find in GTZAN. Excerpt numbers are in parentheses. Exact repetitions are those excerpts that are the same with respect to a comparison of their time-frequency content. Recording repetitions are those excerpts we suspect of coming from the same recording. Artist repetitions are those excerpts featuring the same artists. Version repetitions are covers of the same song. Potential mislabelings are excerpts we argue are misplaced with regards to the category top tags of their GTZAN category. Distortions are those excerpts that we regard as having a significant amount of distortion. This table can be auditioned online at http://imi.aau.dk/~bst/research/GTZANtable2.

3.4 How do the faults of GTZAN affect evaluation?

We now study how the faults of GTZAN affect the evaluation of MGR systems, e.g., the classification accuracies in the nearly 100 published works seen in Fig. 2.
Through our experiments below, we see the following two claims are false: 1) "all MGR systems and evaluations are affected in the same ways by the faults of GTZAN"; and 2) "the performances of all MGR systems in GTZAN, working with the same data and faults, are still meaningfully comparable." Thus, regardless of how the systems are performing the task, the results in Fig. 2 cannot be meaningfully interpreted.

It is not difficult to predict how some faults can affect the evaluations of MGR systems built using different approaches. For instance, when exact replicas are distributed across train and test sets, the evaluation of some systems can be more biased than others: a nearest neighbor classifier will find features in the training set with zero distance to the test feature, while a Bayesian classifier with a parametric model may not so strongly benefit when its model parameters are estimated from all training features. If there are replicas in the test set only, then they will bias an estimate of a figure of merit because they are not independent tests — if one is classified (in)correctly then its replicas are also classified (in)correctly. In addition to exact repetitions, we show above that the number of artists in GTZAN is at most 329. Thus, as Seyerlehner (2010) and Seyerlehner et al. (2010) predict for GTZAN, its use in evaluating systems will be biased due to the artist effect (Pampalk et al., 2005; Flexer, 2007; Flexer and Schnitzer, 2009, 2010), i.e., the observation that a music similarity system can perform significantly worse when artists are disjoint in training and test datasets than when they are not. Since all results in Fig. 2 come from evaluations without using an artist filter, they are quite likely to be optimistic.

To investigate the faults of GTZAN, we create several MGR systems using a variety of feature extraction and machine learning approaches paired with different training data in GTZAN. We define a system not as an abstract proposal of a machine learning method, a feature description, and so on, but as a real and working implementation of the components necessary to produce an output from an input (Sturm, 2013c). A system, in other words, has already been trained, and might be likened to a "black box" operated by a customer in an environment, according to some instructions. In our case, each system specifies that the customer input 30 s of monophonic audio data uniformly sampled at 22050 Hz, for which it outputs one of the ten GTZAN category names.

Some systems we create by combining the same features with three classifiers (Duin et al., 2007): nearest neighbor (NN), minimum distance (MD), and minimum Mahalanobis distance (MMD) (Theodoridis and Koutroumbas, 2009). Each of these systems creates feature vectors from a 30 s excerpt in the following way. For each 46.4 ms frame, with a hop of half that, it computes: 13 MFCCs using the approach by Slaney (1998), zero crossings, and spectral centroid and rolloff. For each 130 consecutive frames, it computes the mean and variance of each dimension, thus producing nine 32-dimensional feature vectors. From the feature vectors of the training data, the system finds a normalization transformation that maps each dimension to [0, 1], i.e., the minimum of a dimension is subtracted from all values in that dimension, and then those are divided by the maximum magnitude difference between any two values. Each system applies the same transformation to the feature vectors extracted from an input.
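A minimal sketch of this feature pipeline, under some assumptions and not the implementation evaluated here: librosa's MFCCs stand in for the Slaney (1998) approach, and the zero-crossing rate stands in for the zero-crossing count.

import numpy as np
import librosa

def baseline_features(path, sr=22050, n_fft=1024, hop=512, texture=130):
    # 46.4 ms frames (1024 samples at 22050 Hz) with a hop of half a frame.
    y, _ = librosa.load(path, sr=sr, mono=True, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    frames = np.vstack([mfcc, zcr, cent, roll])  # 16 descriptors per frame
    vectors = []
    for w in range(frames.shape[1] // texture):  # nine texture windows of 130 frames for 30 s
        block = frames[:, w * texture:(w + 1) * texture]
        vectors.append(np.hstack([block.mean(axis=1), block.var(axis=1)]))  # 32 dimensions
    return np.array(vectors)

def fit_normalizer(train_vectors):
    # Map each dimension of the training features to [0, 1]; apply the same map to test inputs.
    lo = train_vectors.min(axis=0)
    rng = np.maximum(train_vectors.max(axis=0) - lo, 1e-12)
    return lambda X: (X - lo) / rng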
Each classifier chooses a label for an excerpt as follows: NN selects the class by majority vote, and breaks ties by selecting randomly among the tied classes; MD and MMD both select the label with the maximum log posterior sum over the nine feature vectors. Both MD and MMD model the feature vectors as independent and identically distributed multivariate Gaussian.

While the above approaches provide "baseline" systems, we also use two state-of-the-art approaches that produce MGR systems measured to have high classification accuracies in GTZAN. The first is SRCAM — proposed by Panagakis et al. (2009b) but modified by us (2013a) — which uses psychoacoustically-motivated features of 768 dimensions. SRCAM classifies an excerpt by using sparse representation classification (Wright et al., 2009). We implement this using the SPGL1 solver (van den Berg and Friedlander, 2008) with at most 200 iterations, and define ε² = 0.01. The second approach is MAPsCAT, which uses feature vectors of "scattering transform" coefficients (Andén and Mallat, 2011). This produces 40 feature vectors of 469 dimensions. MAPsCAT models the features in the training set as independent and identically distributed multivariate Gaussian, and computes for a test feature the log posterior in each class. We define all classes equally likely for MAPsCAT, as well as for MD and MMD. As for the baseline systems above, systems built using SRCAM and MAPsCAT normalize input feature vectors according to the training set. (See Sturm (2013a) for further details of SRCAM and MAPsCAT.)
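A minimal sketch — our own reading of the rule above, not the evaluated implementations — of the MD and MMD decisions: each class is modelled as a single multivariate Gaussian, and the class maximizing the summed log posterior over an excerpt's nine feature vectors is chosen.

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(features_by_class, identity_cov=False):
    # features_by_class maps a GTZAN label to an (n x 32) array of normalized training vectors.
    models = {}
    for label, X in features_by_class.items():
        mu = X.mean(axis=0)
        # MD amounts to an identity covariance (Euclidean distance to the class mean);
        # MMD uses the estimated class covariance (Mahalanobis distance).
        cov = np.eye(X.shape[1]) if identity_cov else np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models[label] = (mu, cov)
    return models

def classify_excerpt(models, vectors):
    # vectors: the nine 32-dimensional feature vectors of one excerpt; with equal priors,
    # summing log likelihoods over the nine vectors orders the classes as the log posterior does.
    scores = {label: multivariate_normal.logpdf(vectors, mean=mu, cov=cov).sum()
              for label, (mu, cov) in models.items()}
    return max(scores, key=scores.get)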
We use four different partition strategies to create train and test datasets from GTZAN: ten realizations of standard non-stratified 2fCV (ST); ST without the 67 exact and recording repetitions and 2 distortions (ST'); a non-stratified 2fCV with artist filtering (AF); and AF without the 67 exact and recording repetitions and 2 distortions (AF'). Table 3 shows the composition of each fold in AF in terms of artists. We created AF manually to ensure that: 1) each GTZAN category is approximately balanced in terms of the number of training and testing excerpts; and 2) each fold of a GTZAN category has music tagged with category top tags (Fig. 5). For instance, the two Blues folds have 46 and 54 excerpts, and both have music tagged "blues" and "zydeco." Unless otherwise noted, we do not take into account any potential mislabelings. In total, we train and test 220 MGR systems.

GTZAN Category | Fold 1 | Fold 2
Blues | John Lee Hooker, Kelly Joe Phelps, Buckwheat Zydeco, Magic Slim & The Teardrops | Robert Johnson, Stevie Ray Vaughan, Clifton Chenier, Hot Toddy, Albert Collins
Classical | J. S. Bach, Percy Grainger, Maurice Ravel, Henri Dutilleux, Tchaikovsky, Franz Schubert, Leonard Bernstein, misc. | Beethoven, Franz Joseph Haydn, Mozart, Vivaldi, Claude Debussy, misc.
Country | Shania Twain, Johnny Cash, Willie Nelson, misc. | Brad Paisley, George Strait, Vince Gill, misc.
Disco | Donna Summer, KC and The Sunshine Band, Ottawan, The Gibson Brothers, Heatwave, Evelyn Thomas, misc. | Carl Douglas, Village People, The Trammps, Earth Wind and Fire, Boney M., ABBA, Gloria Gaynor, misc.
Hip hop | De La Soul, Ice Cube, Wu-Tang Clan, Cypress Hill, Beastie Boys, 50 Cent, Eminem, misc. | A Tribe Called Quest, Public Enemy, Lauryn Hill, Wyclef Jean
Jazz | Leonard Bernstein, Coleman Hawkins, Branford Marsalis Trio, misc. | James Carter, Joe Lovano, Dexter Gordon, Tony Williams, Miles Davis, Joe Henderson, misc.
Metal | Judas Priest, Black Sabbath, Queen, Dio, Def Leppard, Rage Against the Machine, Guns N' Roses, New Bomb Turks, misc. | AC/DC, Dark Tranquillity, Iron Maiden, Ozzy Osbourne, Metallica, misc.
Pop | Mariah Carey, Celine Dion, Britney Spears, Alanis Morissette, Christina Aguilera, misc. | Destiny's Child, Mandy Moore, Jennifer Lopez, Janet Jackson, Madonna, misc.
Reggae | Burning Spear, Desmond Dekker, Jimmy Cliff, Bounty Killer, Dennis Brown, Gregory Isaacs, Ini Kamoze, misc. | Peter Tosh, Prince Buster, Bob Marley, Lauryn Hill, misc.
Rock | Sting, Simply Red, Queen, Survivor, Guns N' Roses, The Stone Roses, misc. | The Rolling Stones, Ani DiFranco, Led Zeppelin, Simple Minds, Morphine, misc.

Table 3: Composition of each fold of the artist filter partitioning (500 excerpts in each). Italicized artists appear in two GTZAN categories.

We look at several figures of merit computed from a comparison of the outputs of the systems to the "ground truth" of testing datasets: confusion, precision, recall, F-score, and classification accuracy. Define the set of GTZAN categories G. Consider that we input to a system N(g) excerpts from GTZAN category g ∈ G, and that of these the system categorizes M(g as r) ≤ N(g) as category r ∈ G. We define the confusion of GTZAN category g as r for a system

  C(g as r) := M(g as r) / N(g).   (1)

The recall for GTZAN category g of a system is then C(g as g). We define the normalized accuracy of a system by

  A := \frac{1}{|G|} \sum_{g \in G} C(g as g).   (2)

We use normalized accuracy because the number of inputs in each GTZAN category may not be equal in a test set. We define the precision of a system for GTZAN category g as

  P(g) := M(g as g) / \sum_{r \in G} M(r as g).   (3)

Finally, we define the F-score of a system for GTZAN category g as

  F(g) := 2 P(g) C(g as g) / [P(g) + C(g as g)].   (4)

To test for significant differences in the performance between two systems in the same test dataset, we build a contingency table (Salzberg, 1997). Define the random variable N to be the number of times the two systems choose different categories, but one is correct. Let t12 be the number of trials for which system 1 is correct but system 2 is wrong. Thus, N − t12 is the number of observations for which system 2 is correct but system 1 is wrong. Define the random variable T12 from which t12 is a sample. The null hypothesis is that the systems perform equally well given N = n, i.e., E[T12|N = n] = n/2, in which case T12 is distributed binomially, i.e.,

  p_{T_{12}|N=n}(t) = \binom{n}{t} (0.5)^n, \quad 0 \le t \le n.   (5)

The probability we observe a particular performance given that the systems actually perform equally well is

  p := P[T_{12} \le \min(t_{12}, n - t_{12})] + P[T_{12} \ge \max(t_{12}, n - t_{12})]
     = \sum_{t=0}^{\min(t_{12}, n - t_{12})} p_{T_{12}|N=n}(t) + \sum_{t=\max(t_{12}, n - t_{12})}^{n} p_{T_{12}|N=n}(t).   (6)

We define statistical significance as α = 0.05, and reject the null hypothesis if p < α.
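A minimal sketch of the test in (5)-(6); for example, binomial_test(30, 40) yields a p-value far below α = 0.05, rejecting the null hypothesis of equal performance.

from math import comb

def binomial_test(t12, n, alpha=0.05):
    # t12: disagreements where system 1 is correct and system 2 is wrong, out of n disagreements.
    pmf = lambda t: comb(n, t) * 0.5 ** n       # (5) under the null hypothesis
    lo, hi = min(t12, n - t12), max(t12, n - t12)
    p = sum(pmf(t) for t in range(0, lo + 1)) + sum(pmf(t) for t in range(hi, n + 1))  # (6)
    return p, p < alpha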
[Figure 7: Normalized accuracy (2) of each approach (x-axis) for each fold (left and right) of the different partition strategies (legend: ST, ST', AF, AF'). One standard deviation above and below the mean is shown for ST and ST'.]

Figure 7 shows a summary of the normalized accuracies of all our MGR systems, with respect to the five approaches and four partition strategies we use. As each partition breaks GTZAN into two folds, the left and right end points of a segment correspond to using the first or second fold for training, and the other for testing. For the ten random partitions of ST and ST', we show the mean normalized accuracy, and a vertical line segment of two standard deviations centered on the mean. It is immediately clear that the faults of GTZAN affect an estimate of the classification accuracy for an MGR system. For systems created with all approaches except NN, the differences between conditions ST and ST' are small. As we predict above, the performance of systems created using NN appears to benefit more than the others from the exact and recording repetition faults of GTZAN, which boost their mean normalized accuracy from below 0.6 to 0.63. Between conditions AF and AF', we see that removing the repetitions produces very little change, presumably because the artist filter keeps exact and recording repetitions from being split across train and test datasets. Most clearly, however, we see for all systems a large decrease in performance between conditions ST and AF, and that each system is affected to a different degree. The difference in normalized accuracy between conditions ST and AF appears smallest for MD (7 points) and largest for MAPsCAT (25 points). Since we have thus found systems with performance evaluations affected to different degrees by the faults in GTZAN — systems using NN are hurt by removing the repetitions while those using MMD are not — we have disproven the claim that the faults of GTZAN affect all MGR systems in the same ways.

When we test for significant differences in performance between all pairs of systems in the conditions ST and ST', only for those created with MMD and NN do we fail to reject the null hypothesis. In terms of classification accuracy in the condition ST, and the binomial test described above, we can say with statistical significance: those systems created using MD perform worse than those created using MMD and NN, which perform worse than those created using MAPsCAT, which perform worse than those created using SRCAM. The systems created using SRCAM and MAPsCAT perform significantly better than the baseline systems. In the conditions AF and AF', however, we fail to reject the null hypothesis for systems created using MAPsCAT and MD, and for systems created using MAPsCAT and MMD. MAPsCAT, measured as performing significantly better than the baseline in the conditions ST and ST' — and hence motivating conclusions that the features it uses are superior for MGR (Andén and Mallat, 2011; Andén and Mallat, 2013) — now performs no better than the baseline in the conditions AF and AF'. Therefore, this disproves that the performances of all MGR systems in GTZAN — all results shown in Fig. 2 — are still meaningfully comparable for MGR.

Table 4: The worst figures of merit (×10⁻²) of a "perfect" classifier evaluated with GTZAN, which takes into account the 52 potential mislabelings in Table 2, and assuming that the 81 excerpts we have yet to identify have "correct" labels. Rows are the labels chosen by the classifier; columns are GTZAN categories.

GTZAN category: Blues  Classical  Country  Disco  Hip hop  Jazz  Metal  Pop  Reggae  Rock  | Precision
Blues             100          0        1      0        0     0      0    0       0     0  |      99.0
Classical           0        100        0      0        0     2      0    0       0     0  |      98.0
Country             0          0       95      0        0     0      0    0       0     1  |      99.0
Disco               0          0        0     92        1     0      0    2       0     0  |      97.0
Hip hop             0          0        0      1       96     0      0    0       1     0  |      98.0
Jazz                0          0        0      0        0    98      0    0       0     0  |     100.0
Metal               0          0        0      0        0     0     92    0       0    13  |      87.6
Pop                 0          0        0    2.5        3     0      3   95       1     3  |      88.3
Reggae              0          0        0      0        0     0      0    0      98     1  |      99.0
Rock                0          0        4    4.5        0     0      5    3       0    82  |      83.3
F-score          99.5       99.0     96.9   94.4     97.2  99.0   89.8 91.6    98.5  82.7  | Acc: 94.8

Some of them benefit significantly more
than others due to the faults in GTZAN, but which ones they are cannot be known. We now focus our analysis upon systems created using SRCAM. Figure 8 shows averaged figures of merit for systems built using SRCAM and the 10 realizations of ST and ST', as well as for the single partition AF'. Between systems built in conditions ST and ST', we see very little change in the recalls for the GTZAN categories with the fewest exact and recording repetitions: Blues, Classical and Rock. However, for the GTZAN categories having the most exact and recording repetitions, we find large changes in recall. In fact, Fig. 9 shows that the number of exact and recording repetitions in a GTZAN category is correlated with a decrease in the recall of SRCAM. This makes sense because SRCAM is like adaptive nearest neighbors (Noorzad and Sturm, 2012). When evaluating the systems we created using SRCAM in the condition AF', Fig. 8 shows the mean classification accuracy is 22 points lower than that in condition ST. With respect to F-score, excerpts in GTZAN Classical and Metal suffer the least; but we see decreases for all other categories of at least 10 points, e.g., 62 points for GTZAN Blues, 29 points for GTZAN Reggae, and 25 points for GTZAN Jazz. We see little change in classification accuracy between conditions AF' and AF (not shown).

[Figure 8: panels (a) SRCAM, ST; (b) SRCAM, ST'; (c) SRCAM, AF'. 100× confusion, precision (Pr), F-score (F), and normalized accuracy (bottom right corner) for systems built using SRCAM, trained and tested in ST and ST' (averaged over 10 realizations), as well as in AF' (with artist filtering and without replicas). Columns are "true" labels; rows are predictions. Darkness of square corresponds to value. Labels: Blues (bl), Classical (cl), Country (co), Disco (di), Hip hop (hi), Jazz (ja), Metal (me), Pop (po), Reggae (re), Rock (ro).]

The question remains: what are the worst figures of merit that a "perfect" MGR system can obtain in GTZAN? We first assume that the 81 excerpts yet to be identified are "correct" in their categorization, and that the last 25 seconds of Reggae 86 are ignored because of severe distortion. Then, we assume a "perfect" system categorizes all GTZAN excerpts in their own categories, except for the 52 potential misclassifications in Table 2. For each of those excerpts, we consider that the system chooses the other GTZAN categories in which its number of votes is higher than, or as high as, the number of votes in its own category. For instance, Country 39 has its largest vote in GTZAN Blues, so we add one to the Blues row in the GTZAN Country column. When a vote is split between K categories, we add 1/K in the relevant positions of the confusion table. For instance, Disco 11 has the same number of votes in GTZAN Disco, Pop and Rock, so we add 0.5 to the Pop and to the Rock rows of the GTZAN Disco column. Table 4 shows the worst figures of merit we expect for this "perfect" system, which are also superimposed as the thick gray line in Fig. 2. If the figures of merit of an MGR system tested in GTZAN are better than this, it might actually be performing worse than the "perfect" system. Indeed, we have found that the classification accuracies for six MGR approaches appearing close to this limit (each marked by an "x" in Fig. 2) are due instead to inappropriate evaluation designs (discussed further in Section 4.5).
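The bookkeeping behind Table 4 can be sketched as follows. The list of potentially mislabeled excerpts and their winning categories comes from Table 2 (not reproduced here), so the arguments below are placeholders for that data.

```python
import numpy as np

def worst_case_confusion(categories, n_excerpts, mislabeled):
    """Confusion counts of a "perfect" classifier that labels every excerpt with its
    GTZAN category except the potentially mislabeled ones. `mislabeled` is a list of
    (gtzan_category, winning_categories) pairs, where winning_categories are the other
    categories with at least as many votes as the excerpt's own category.
    Rows are chosen labels; columns are GTZAN categories (as in Table 4)."""
    idx = {c: i for i, c in enumerate(categories)}
    M = np.diag(np.asarray(n_excerpts, dtype=float))  # start with every excerpt "correct"
    for own, winners in mislabeled:
        j = idx[own]
        M[j, j] -= 1.0                     # the excerpt leaves its own cell ...
        for w in winners:                  # ... and is split evenly over the K winners
            M[idx[w], j] += 1.0 / len(winners)
    return M
```

For example, the Country 39 and Disco 11 cases above correspond to the pairs ("Country", ["Blues"]) and ("Disco", ["Pop", "Rock"]).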
[Figure 9: Scatter plot of percent change in mean figures of merit (FoM) for SRCAM between the conditions ST and ST' (Fig. 8) as a function of the number of exact and recording repetitions in a GTZAN category. Legend: Recall, Precision, F-score.]

3.5 Discussion

Though it may not have been originally designed to be the benchmark dataset for MGR, GTZAN currently occupies that position by virtue of its use in some of the early influential works (Tzanetakis and Cook, 2002), its continued widespread use, and its public availability. It was, after all, one of the first publicly available datasets in MIR. Altogether, GTZAN appears more than any other dataset in the evaluations of nearly 100 MGR publications (Sturm, 2012b). It is thus fair to claim that researchers who have developed MGR systems have used, or been advised to use, GTZAN for evaluating success. For instance, high classification accuracy in GTZAN has been argued as indicating the superiority of approaches proposed for MGR (Li and Ogihara, 2006; Bergstra et al., 2006; Panagakis et al., 2009b). Furthermore, performance in GTZAN has been used to argue for the relevance of features to MGR, e.g., "the integration of timbre features with temporal features are important for genre classification" (Fu et al., 2011). Despite its ten-year history spanning most of the research in MGR since the review by Aucouturier and Pachet (2003), only five works indicate someone has taken a look at what is in GTZAN. Of these, only one (Li and Chan, 2011) implicitly reports listening to all of GTZAN; but another completely misses the mark in its assessment of the integrity of GTZAN (Bergstra et al., 2006). When GTZAN has been used, it has been used for the most part without question.

GTZAN has never had metadata identifying its contents, but our work finally fills this gap, and shows the extent to which GTZAN has faults. We find 50 exact repetitions, 21 suspected recording repetitions, and a significant amount of artist repetition. Our analysis of the content of each GTZAN category shows the excerpts in each are more diverse than what the category names suggest, e.g., the GTZAN Blues excerpts include music using the blues genre, as well as the zydeco and cajun genres. We find potential mislabelings, e.g., two excerpts of orchestral music by Leonard Bernstein appear in GTZAN Jazz, but might be better placed with the other orchestral excerpts of Bernstein in GTZAN Classical; an excerpt by Wayne Toups & Zydecajun appears in GTZAN Country, but might be better placed with the other cajun and zydeco music in GTZAN Blues. Finally, we find some excerpts suffer from distortions, such as jitter, clipping, and extreme degradation (Reggae 86).

Having proven that GTZAN is flawed, we then sought to measure the real effects of its faults on the evaluation of MGR systems. Using three baseline and two state-of-the-art approaches, we disproved the claims that all MGR systems are affected in the same ways by the faults of GTZAN, and that the performances of MGR systems in GTZAN are still meaningfully comparable no matter how the systems are performing the task.
In other words, evaluating systems in GTZAN with Classify, without taking its contents into consideration, provides numbers that may be precise and lend themselves to a variety of formal statistical tests, but that are nonetheless irrelevant for judging which system can satisfy the success criteria of some real-world use case requiring musical intelligence, not to mention whether the system demonstrates any capacity for recognizing genre at all. This lack of validity in standard MGR evaluation means one cannot simply say, "94.8 in GTZAN is the new 100," and proceed with "business as usual." Since all results in Fig. 2 come from experimental conditions where independent variables are not controlled, e.g., by artist filtering, the "progress" in MGR seen over the years is very suspect. Systems built from MAPsCAT and SRCAM, previously evaluated in GTZAN to have classification accuracies of 83% without considering the faults of GTZAN (Sturm, 2012c, 2013a), now lie at the bottom of Fig. 2, below 56%. Where the performances of all the other systems lie, we do not yet know. The incredible influence of the other uncontrolled independent variables confounded with the GTZAN category names becomes clear with irrelevant transformations (Sturm, 2013g).

Some may argue that our litany of faults in GTZAN ignores what they say is the most serious problem with a dataset like GTZAN: that it is too small to produce meaningful results. In some respects, this is justified. While personal music collections may number thousands of pieces of music, commercial datasets and library archives number in the millions. The 1000 excerpts of GTZAN are most definitely an insufficient random sample of the population of excerpts "exemplary" of the kinds of music between which one may wish an MGR system to discriminate if it is to be useful for some use case. Hence, one might argue, it is unreasonably optimistic to assume an MGR system can learn from a fold of GTZAN those rules and characteristics people use (or at least Tzanetakis used) when describing music as belonging to or demonstrating aspects of the particular genres used by music in GTZAN. Such warranted skepticism, compounded with the pairing of well-defined algorithms and an ill-defined problem, highlights the absurdity of interpreting the results in Fig. 2 as indicating real progress is being made in suffusing computers with musical intelligence.

One might hold little hope that GTZAN could ever be useful for tasks such as evaluating systems for MGR, audio similarity, autotagging, and the like. Some might call for GTZAN to be "banished" — although we have yet to find any paper that says so. There are, however, many ways to evaluate an MGR system using GTZAN. Indeed, its faults are representative of data in the real world, and they can be used in the service of evaluation (Sturm, 2013a,g). For instance, by using the GTZAN dataset, we (2012c; 2013a; 2013g) perform several different experiments to illuminate the (in)sanity of a system's internal model of music genre. In one experiment (2012c; 2013a), we look at the kinds of pathological errors of an MGR system rather than what it labels "correctly," and design a reduced Turing test to determine how well the "wrong" labels it selects imitate human choices. In another experiment (2012c; 2013g), we attempt to fool a system into selecting any genre label for the same piece of music by changing factors that are irrelevant to genre, e.g., by subtle time-invariant filtering.
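A minimal sketch of such a probe, in Python with scipy: apply a mild linear time-invariant filter to an excerpt and check whether a system's label changes. The classify function, the excerpt, and the filter coefficients below are hypothetical stand-ins, not those used in the cited experiments.

```python
import numpy as np
from scipy import signal

def probe_with_lti_filter(classify, excerpt, b, a=(1.0,)):
    """Return the labels a system gives an excerpt before and after a (subtle)
    linear time-invariant filter; `classify` maps an audio array to a genre label."""
    filtered = signal.lfilter(b, a, excerpt)
    return classify(excerpt), classify(filtered)

# Example: a gentle FIR emphasis that a listener would not hear as a change of genre
# (coefficients chosen for illustration only).
b = np.array([1.0, -0.2])
# before, after = probe_with_lti_filter(my_mgr_system, excerpt_audio, b)
```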
In another experiment (2012c), we have an MGR system compose music excerpts it hears as highly representative of the "genres of GTZAN," and then we perform a formal listening test to determine whether those genres are recognizable. The lesson is not to banish GTZAN, but to use it with full consideration of its musical content. It is currently not clear what makes one dataset "better" than another for MGR, or whether any are free of the kinds of faults in GTZAN; but at least now with GTZAN, one has a manageable, public, and finally well-studied dataset.

4 Implications for future work in MGR and MIR

The most immediate point to come from this work is perhaps that one should not take for granted the integrity of any given dataset, even when it has been used in a large amount of research. Any researcher evaluating an MIR system using a dataset must know the data, know that real data has faults, and know that such faults have real impacts. However, there are five other points important for the next ten years of MIR. First, problems in MIR should be unambiguously defined through specifying use cases and with formalism. Second, experiments must be designed, implemented, and analyzed such that they have validity and relevance with respect to the intended scientific questions. Third, system analysis should be deeper than just evaluation. Fourth, the limitations of all experiments should be acknowledged, and appropriate degrees of skepticism should be applied. Finally, as a large part of MIR research is essentially algorithm- and data-based, it is imperative to make all such work reproducible.

4.1 Define problems with use cases and formalism.

As mentioned in Section 2, few works of our survey (2012b) define "the problem of music genre recognition," but nearly all of them treat genre as categories to which objects belong, separated by boundaries, whether crisp or fuzzy, objective or subjective, but recoverable by machine learning algorithms. The monopoly of this Aristotelean viewpoint of genre is evinced by the kind of evaluation performed in MGR work (Sturm, 2012b): 397 of 435 published works pose as relevant the comparison of labels generated by an MGR system with a "ground truth" (Classify). All but ten of these works consider MGR as a single-label classification problem. We have posed (2013b) "the principal goal of MGR," but the discussion of what its principal goal is might be premature without the consideration of well-defined use cases.

The parameters and requirements of a music recommendation service are certainly different from those of a person wanting to organize their home media collection, which are different again from those of a music library looking to serve the information needs of musicologists. The light provided by a use case can help define a problem, and show how the performance of a solution should be measured; however, only a few works in the MGR literature consider use cases. In the same direction, the MIReS Roadmap (Serra et al., 2013) recommends the development of "meaningful evaluation tasks." A use case helps specify the solutions, and what is to be tested by evaluations. One good example is provided by one of the first works in MGR: Dannenberg et al. (1997) seeks to address the specific scenario of interactive music performance between a musician and a machine listener. It poses several requirements: the musician must play styles consistently; the classification must take no longer than five seconds; and the number of false positives should be minimized. Dannenberg et al.
evaluate solutions not only in the laboratory, but also in the intended situation. This work shows how a use case can help define a problem, and the success criteria for a solution.

To define a problem fully and in no uncertain terms, it must be formalized. Formalization is a tool that can disambiguate all parts of a problem in order to clarify assumptions, highlight the existence of contradictions and other problems, suggest and analyze solutions, and provide a path for designing, implementing and analyzing evaluations. By and large, such formalization remains implicit in much evaluation in MIR, which has been uncritically and/or unknowingly borrowed from other disciplines, like machine learning and text information retrieval (Sturm, 2013c; Urbano et al., 2013). We have attempted to finally make this formalism explicit (2013c): we define what a system is, and what it means to analyze one; we use the formalism of the design and analysis of experiments, e.g., Bailey (2008), to dissect an evaluation into its aims, parts, design, execution, and analysis; and we show how an evaluation in MIR is an experiment, and thus makes several assumptions. When this formalism is made explicit, it becomes clear why the systematic and rigorous evaluations performed using standardized datasets in many MIREX tasks can still fail to be scientific evaluation (Dougherty and Dalton, 2013; Sturm, 2013c). Consequently, what must be done in order to address this becomes clearer.

4.2 Design valid and relevant experiments.

No result in Fig. 2 provides a reasonable direction for addressing a use case that requires an artificial system having musical intelligence. That an MGR system evaluated using Classify achieves even a perfect classification accuracy in a dataset having uncontrolled independent variables provides no logical support for the claim that the system is using criteria relevant to the meaning behind those labels (Sturm, 2012c, 2013a,b). The same is true of the results from the past several years of the MIREX MGR task, the MIREX audio mood classification task, and the various MIREX autotagging tasks (Sturm, 2013c). An unwary company looking to build and sell a solution for a particular use case requiring a system capable of music intelligence will be misled by all these results when following the intuition that a system with a classification accuracy of over 0.8 must be "better" than one below 0.7. This is not to say that none of the systems in Fig. 2 are using relevant criteria to decide on the genre labels (e.g., instrumentation, rhythm, form), but that such a conclusion does not follow from the experiment, no matter its outcome. In short, none of this evaluation is valid.

In discussing the music intelligence of a system, and the lack of validity in evaluation for measuring it, we might be mistaken as misrepresenting the purpose of machine learning, or overstating its capabilities. Of course, machine learning algorithms are agnostic to the intentions of an engineer employing them, and are unperturbable by any accusation of being a "horse" (Sturm, 2013g). Machine learning does not create systems that think or behave as humans, or mimic the decision processes of humans. However, there are different use cases where "by any means" is or is not acceptable.
Even when a problem is underspecified, e.g., "The mob boss asked me to make the guy disappear," not just any solution is acceptable, e.g., "so, I hired a magician." If one seeks only to give the illusion that a robot can recognize genre and emotion in music (Xia et al., 2012), or that a musical water fountain is optimal (Park et al., 2011), then "whatever works" may be fine. However, if one seeks to make a tool that is useful to a musicologist for exploring stylistic influence in a music archive, then a system that uses musically meaningful criteria to make judgements is likely preferable to one that works by confounded but irrelevant factors.

Validity in experiments is elegantly demonstrated by the case of "Clever Hans" (Pfungst, 1911; Sturm, 2013g). Briefly, Hans was a horse that appeared able to perform complex arithmetic feats, as well as many other tasks that require humans to think abstractly. Many people were convinced of Hans' abilities; and a group of intellectuals was unable to answer the question, "Is Hans capable of abstract thought?" It was not until valid experiments were designed and implemented to test well-posed and scientific questions that Hans was definitively proven to be incapable of arithmetic — as well as of his many other intellectual feats — and only appeared so because he was responding to unintended but confounded cues of whoever asked him a question (Pfungst, 1911). The question of Hans' ability in arithmetic is not validly addressed by asking Hans more arithmetic problems and comparing his answers with the "ground truth" (Classify), but by designing valid and relevant experiments in which careful control is exercised over independent variables.

Confounds in machine learning experiments create serious barriers to drawing conclusions from evaluations — which can also be seen as the problem of "overfitting." Consider that we wish to create a classifier to predict the sexes of people. We begin by randomly sampling 48,842 people from around the United States of America (USA) in 1996. Consider that we are not aware that the sex of a person is determined (for the most part) by the presence or absence of the "Y" chromosome, and so we ask about age, education, occupation, marital status, relationship to spouse, and many other things that might be relevant, and create a multivariate dataset from the entries. [16] As is common in machine learning, we split this into a training dataset (2/3) and a testing dataset (1/3), and then perform feature selection (Theodoridis and Koutroumbas, 2009) using the training dataset to find the best features for predicting sex with respect to, e.g., maximizing the ratio of intra- to inter-class separation, or by considering the Fisher discriminant. Finally, we create several classifiers using these feature combinations and the training dataset, evaluate each using Classify on the testing dataset, and compute its normalized classification accuracy (accuracy). Figure 10 plots the accuracies of the classifiers as a function of the number of best features. We find that with the "relationship" feature [17] we can build a classifier with an accuracy of almost 0.7 in the testing dataset. If we add to this the "occupation" feature [18] then accuracy increases to about 0.8 in the test dataset. If we also consider the "marital-status" feature [19] then we can build a classifier having an accuracy that exceeds 0.85 in the test dataset.

[Figure 10: Normalized classification accuracy of the male/female classifier (in the test dataset) as a function of the number of features selected by maximizing inter- and intra-class separation; the x-axis labels give the cumulative feature sets: relationship ∪ occupation ∪ marital-status ∪ hours-per-week ∪ capital-loss ∪ age ∪ workclass ∪ education ∪ education-num ∪ race. The three features producing the highest classification accuracy in the training set are "relationship", "occupation" and "marital-status."]

[16] We use here the multivariate dataset, http://archive.ics.uci.edu/ml/datasets/Adult
[17] This feature takes a value in {"Wife", "Husband", "Unmarried", "Child"}.
[18] This feature takes a value in {"Tech-support", "Craft-repair", "Other-service", "Sales", "Exec-managerial", "Prof-specialty", "Handlers-cleaners", "Machine-op-inspct", "Adm-clerical", "Farming-fishing", "Transport-moving", "Priv-house-serv", "Protective-serv", "Armed-Forces"}.
[19] This feature takes a value in {"Married", "Divorced", "Never-married", "Widowed", "Separated"}.
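A minimal sketch of this kind of experiment follows, in Python with pandas and scikit-learn. It is not the exact feature-selection pipeline described above: it assumes the UCI "adult.data" file (footnote 16) is available locally with the column names from the accompanying "adult.names" file, and it simply trains a classifier to predict sex from the two confounded features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
        "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
        "hours-per-week", "native-country", "income"]
df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

# Predict sex from two features that are merely confounded with it in 1996 USA.
X, y = df[["relationship", "occupation"]], df["sex"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("normalized accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```

Here balanced_accuracy_score plays the role of the normalized accuracy of (2); the exact numbers will differ from Fig. 10, since the feature-selection step is omitted.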
Since we believe random classifiers would have an accuracy of 0.5 in each case, we might claim from these results, "Our systems predict the sex of a person better than random." We might go further and conclude, "The two most relevant features for predicting the sex of a person are 'relationship' and 'occupation'."

Now consider these results while knowing how sex is truly determined. While it is clear that the "relationship" feature is a direct signifier of sex given that a person is married, we know that the "occupation" feature is irrelevant to the sex of any person. It is only a factor that happens to be confounded in our dataset with the attribute of interest through the economic and societal norms of the population from which we collected samples, i.e., 1996 USA. If we test our trained systems in populations with economics and societal norms different from those of 1996 USA, the confounds could break and performance thus degrade. Furthermore, the "occupation" feature only appears to be relevant to our task because: 1) it is a confounding factor more correlated with sex than the others; and 2) it is among the attributes we chose for this task from the outset, few of which have any relevance to sex. Hence, due to confounds — independent variables that are not controlled in the experimental design — the evaluation lacks validity, and neither one of the conclusions can follow from the experiment.

Our work joins the growing chorus of concerns about validity in MIR evaluation. A number of interesting and practical works have been produced by Urbano et al. (Urbano et al., 2011; Urbano, 2011; Urbano et al., 2012, 2013). The work in Urbano et al. (2011) is concerned with evaluating the performance of audio similarity algorithms, and is extended in Urbano et al. (2012) to explore how well an MIR system must perform in a particular evaluation in order to be potentially useful in the real world. Urbano (2011) deals specifically with the question of experimental validity of evaluation in MIR; and Urbano et al. (2013) extend this to show that evaluation of evaluation remains neglected by the MIR community. Aucouturier and Bigand (2013) look at reasons why MIR has failed to acquire attention from outside computer science, and argue it is due in no small part to its lack of a scientific approach.
While evaluation is also cited in the MIReS Roadmap (Serra et al., 2013) as a current area of concern, such concerns are by no means only recent. The initiatives behind MIREX were motivated in part to address the fact that evaluation in early MIR research did not facilitate comparisons with new methods (Downie, 2003, 2004). Since 2005, MIREX has run numerous systematic, rigorous and standardized evaluation campaigns for many different tasks (Downie, 2008; Downie et al., 2010), which has undoubtedly increased the visibility of MIR (Cunningham et al., 2012). Furthermore, past MIR evaluation has stumbled upon the observation that evaluations splitting music from the same albums (Mandel and Ellis, 2005) or artists (Pampalk et al., 2005) across training and testing datasets produce inflated performance estimates compared to when the training and testing datasets do not share songs from the same album or by the same artist. Just as we have shown above for GTZAN, the effects of this can be quite significant (Flexer, 2007), and can now finally be explained through our work as being a result of confounds.

Evaluation in MIR is not just a question of what figure of merit to use, but of how to draw a valid conclusion by designing, implementing, and analyzing an experiment relevant to the intended scientific question (Sturm, 2013c). For any experiment, thorough consideration must be given to its "materials," "treatments," "design," "responses," "measurements," and "models" of those measurements. Beginning ideally with a well-specified and scientific question, the experimentalist is faced with innumerable ways of attempting to answer that question. However, a constant threat to the validity of an experiment is the selection of the components of an experiment because of convenience, and not because they actually address the scientific question. Another threat is the persuasion of precision. Richard Hamming has argued, [20] "There is a confusion between what is reliably measured, and what is relevant. ... What can be measured precisely or reliably does not mean it is relevant." Another threat is the blind application of statistics, resulting in "errors of the third kind" (Kimball, 1957): correct answers to the wrong questions. Hand (1994) shows why one must compare the intended scientific questions to the statistical questions that are actually answered. While the absence of formal statistics in MIR research has been highlighted by Flexer (2006), statistics is agnostic to the quality of the data, the intention of the experiment, and the ability of the experimentalist. Statistics provides powerful tools, but none of them can rescue an invalid evaluation.

[20] R. Hamming, "You get what you measure", lecture at Naval Postgraduate School, June 1995. http://www.youtube.com/watch?v=LNhcaVi3zPA

One might be inclined to say that, since one problem with Fig. 2 is that GTZAN has faults that are not considered in any of the results, better datasets could help. However, that a dataset is larger and more modern than GTZAN does not mean it is free of the same kinds of faults found in GTZAN. Furthermore, that a dataset is large does not free the designer of an MIR system of the necessarily difficult task of designing, implementing, and analyzing an evaluation having the validity to conclude — if such a thing is indeed relevant to a use case — whether the decisions and behaviors of a system are related to the musical content that is supposedly behind those decisions (Sturm, 2013b,g). We have shown instead that the significant root of this problem, of which data plays a role, is in the validity and relevance of experiments (Sturm, 2013c,b).
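As one concrete control of the kind just discussed, a grouped split keeps all excerpts by the same artist on one side of a partition. A minimal sketch, assuming scikit-learn and a list giving the artist of each excerpt; note that the AF partition of Section 3 was instead built manually so that categories and tags stay balanced, which this automated split does not guarantee.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def artist_filtered_folds(artists, n_folds=2):
    """Yield (train_indices, test_indices) pairs in which no artist has excerpts on
    both sides of the split; `artists` gives one artist label per excerpt."""
    artists = np.asarray(artists)
    placeholder = np.zeros((len(artists), 1))  # GroupKFold only needs the groups here
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(placeholder, groups=artists):
        yield train_idx, test_idx
```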
All independent variables must be accounted for in an evaluation, by controlling the experimental conditions, and by guaranteeing that enough material is sampled in a random enough way that the effects of the independent variables that are out of reach of the experimentalist will satisfy the assumptions of the measurement model (Sturm, 2013c). What amount of data is necessary depends on the scientific question of interest; but even a million songs may not be enough (Bertin-Mahieux et al., 2011).

An excellent example providing a model for the design, implementation, analysis and description of valid and relevant scientific experiments is that of Chase (2001) — which complements the earlier and pioneering results of Porter and Neuringer (1984). This work shows the effort necessary to plan and execute experiments over a duration of about two years such that waste is avoided and results are maximized. Chase designs four experiments to answer two specific questions: 1) Can fish learn to discriminate between complex auditory stimuli that humans can also discriminate? 2) If so, does this behavior generalize to novel auditory stimuli with the same discriminability? Chase uses three koi fish as the material of the experiment, and specifies as the stimuli (treatments) recorded musical audio, both real and synthesized, in the categories "blues" and "classical." Chase designs the experiments to have hundreds of trials, some of which investigate potential confounds, probe the generalizability of the discrimination of each fish, confirm results when the reinforcement stimuli are switched, and so on. From these experiments, one can conclude whether fish can indeed learn to discriminate between complex auditory stimuli like music styles.

Unfortunately, much evaluation in MGR, music emotion recognition, and autotagging has been just the kind of show performed by Clever Hans (Pfungst, 1911): certainly, an evaluation observed by an audience and done not to deceive, but entirely irrelevant for proving any capacity of these systems for musical intelligence. This has thus contributed to the training of horses of many exotic breeds, and competitions to show whose horse is better-trained; but, aside from this pageantry, one does not know whether any meaningful problem has ever been addressed in it all. The design and implementation of a valid experiment requires creativity and effort that cannot be substituted with collecting more data; and a flawed design or implementation cannot be rescued by statistics. If one can elicit any response from a system by changing irrelevant factors, then one has a horse (Sturm, 2013g).

4.3 Perform system analysis deeper than just evaluation.

In our work (2013c), we define a system ("a connected set of interacting and interdependent components that together address a goal") and system analysis (addressing questions and hypotheses related to the past, present and future of a system), and highlight the key role played by evaluation ("a 'fact-finding campaign' intended to address a number of relevant questions and/or hypotheses related to the goal of a system"). A system analysis that only addresses a question of how well a system reproduces the labels of a testing dataset is quite shallow and of limited use.
Such an analysis does not produce knowledge about how the system is operating, or whether its performance is satisfactory with respect to some use case. More importantly, an opportunity is lost for determining how a system can be improved, or adapted to different use cases. One of the shortcomings of evaluation as currently practiced in MIR is that system analysis remains shallow, and does not help one to understand or improve systems.

The MIReS Roadmap (Serra et al., 2013) also points to the need for deeper system analysis. One challenge it identifies is the design of evaluation that provides "qualitative insights on how to improve [systems]." Another challenge is that entire systems should be evaluated, and not just their unconnected components. In a similar light, Urbano et al. (2013) discuss how the development cycle that is common in Information Retrieval research is unsatisfactorily practiced in MIR: the definition of a task in MIR often remains contrived or artificial, the development of solutions often remains artificial, the evaluation of solutions is often unsatisfactory, the interpretation of evaluation results is often biased and unilluminating, the lack of valid results leads to improving solutions "blindly," and returning to the beginning of the cycle rarely leads to a qualitative improvement of research.

4.4 Acknowledge limitations and proceed with skepticism.

The fourth point to come from this work is that the limitations of experiments should be clearly acknowledged. In her experiments with music-discriminating fish, Chase (2001) is not terse when mentioning their limitations:

    Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time ... In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. ... [A] constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis ours)

Chase thereby recognizes that, with the exception of timbre (experiment 3), her experiments do not validly answer questions about what the fish are using in order to make their decisions, or even whether those criteria are related in any way to how the music was categorized in the first place. In fact, her experiments were not designed to answer such a question. This same limitation exists for all MIR experiments that use Classify (Sturm, 2013a,c,b,g). That an MIR system is able to perfectly reproduce the "ground truth" labels of a testing dataset gives no reason to believe the system is using criteria relevant to the meaning of those labels. The existence of many independent variables in the complex signals of datasets like GTZAN, and the lack of their control in any evaluation using them, severely limits what one can say from the results. Even if one is to be conservative and restrict the conclusion to only that music in GTZAN, or ISMIR2004, we have shown (2013g) that even such a conclusion is not valid. [21]

[21] What conclusion is valid in this case has yet to be determined.

Other limitations come from the measurement model specified by the experimental design.
Any estimate of the responses from the measurements in an experiment (Sturm, 2013c) must also be accompanied by its quality (confidence interval), and the conditions under which it holds, e.g., the "simple textbook model" (Bailey, 2008). (This is not the variance of the measurements!) With random-, fixed- and mixed-effects models, these estimates generally have wider confidence intervals than with the standard textbook model, i.e., the estimates have more uncertainty (Bailey, 2008). In the field of bioinformatics, an evaluation that does not give the confidence interval and the conditions under which it holds has been called "scientifically vacuous" (Dougherty and Dalton, 2013).

It is of course beyond the scope of this paper whether or not an artificial system can learn, from the few examples in GTZAN, the meaning behind the various tags shown in Fig. 5; but that SRCAM is able to achieve a classification accuracy of over 0.75 from only 50 training feature vectors of 768 dimensions in each of 10 classes is quite impressive — almost miraculous. To then claim, "SRCAM is recognizing music genre," is as remarkable a claim as, "Hans the horse can add." Any artificial system purported to be capable of, e.g., identifying thrilling Disco music with male lead vocals that is useful for driving, or romantic Pop music with positive feelings and backing vocals that is useful for getting ready to go out, should be approached with as much skepticism as Pfungst approached Hans (Pfungst, 1911). "An extraordinary claim requires extraordinary proof" (Truzzi, 1978).

4.5 Make reproducible work reproducible.

The final implication of our work here is in the reproducibility of experiments. One of the hallmarks of modern science is the reproduction of the results of an experiment by an independent party (with no vested interests). The six "×" in Fig. 2 come from our attempts at reproducing the work in several publications. In Sturm and Noorzad (2012), we detail our troubles in reproducing the work of Panagakis et al. (2009b). Through our collaboration with those authors, they found their results in that paper were due to a mistake in the experimental procedure. This also affected other published results of theirs (Panagakis et al., 2009a; Panagakis and Kotropoulos, 2010). Our trouble in reproducing the results of Marques et al. (2011a) led to the discovery that the authors had accidentally used the training dataset as the testing dataset. [22] Our significant trouble in reproducing the results of Bağci and Erzin (2007) led us to analyze the algorithm (Sturm and Gouyon, 2013), which reveals that their results could not have been due to the proper operation of their system. Finally, we have found the approach and results of Chang et al. (2010) to be irreproducible, and determined that they are contradicted by several well-established principles (Sturm, 2013f). In only one of these cases (Marques et al., 2011a) did we find the code to be available and accessible enough for easily recreating the published experiments. For the others, we had to make many assumptions and decisions from the beginning, sometimes in consultation with the authors, to ultimately find that the description in the paper does not match what was produced, or that the published results are in error.

[22] Personal communication with J. P. Papa.
Among other things, “reproducible research” (Vandewalle et al., 2009) is motivated by the requirements of modern science, and by the algorithm-centric nature of disciplines like signal processing. Vandewalle et al. (2009) define a piece of research “reproducible” if: “all information relevant to the work, including, but not limited to, text, data and code, is made available, such that an independent researcher can reproduce the results.” This obviously places an extraordinary burden on a researcher, and is impossible when data is privately owned; but when work can be reproduced, then it should be reproducible. Making work reproducible can not only increase citations, but also greatly aid in peer review. For instance, the negative results in our work (Sturm and Gouyon, 2013) were greeted by skepticism in peer review; but, because we made entirely reproducible the results of our experiments and the figures in our paper, the reviewers were able to easily verify our analysis, and explore the problem themselves. The results of all our work, and the software to produce them, are available online for anyone else to run and modify.23 The MIReS Roadmap (Serra et al., 2013) also highlights the need for reproducibility in MIR. Great progress is being achieved by projects such as SoundSoftwatre.24 Vandewalle et al. (2009) provides an excellent guide of how to make research reproducible. 5 Conclusions Our “state of the art” here takes a much different approach than the previous four reviews of MGR (Aucouturier and Pachet, 2003; Scaringella et al., 2006; Dannenberg, 2010; Fu et al., 22Personal communication with J. P. Papa. 23http://imi.aau.dk/~bst/software/index.html 24http://soundsoftware.ac.uk 30 D R A FT 2011). It aims not to summarize the variety of features and machine learning approaches used in MGR systems in the ten years since Aucouturier and Pachet (2003), or the new datasets available and how “ground truths” are generated. It is not concerned with the validity, well-posedness, value, usefulness, or applicability of MGR; or whether MGR is “replaced by,” or used in the service of, e.g., music similarity, autotagging, or the like. These are comprehensively addressed in other works, e.g., (McKay and Fujinaga, 2006; Craft et al., 2007; Craft, 2007; Wiggins, 2009; Sturm, 2012c,b, 2013a,b,g). It is not concerned with how, or even whether it is possible, to create “faultless” datasets for MGR, music similarity, autotagging, and the like. Instead, this article is concerned with evaluations performed in the past ten years of work in MGR, and its implications for the next ten years of work in MIR. It aims to look closely at whether the results of Fig. 2 imply any progress has been made in solving the problem. If one is to look at Fig. 2 as a “map” showing which pairing of audio features with machine learning algorithms is successful, or which is better than another, or whether progress has been made in charting an unknown land, then one must defend such conclusions as being valid with these evaluations — which are certainly rigorous, systematic, and standardized evaluations, but that may not address the scientific question of interest. One simply cannot take for granted that a systematic, standardized and rigorous evaluation design using a benchmark dataset produces scientific results, let alone results that reflect anything to do with music intelligence. 
Questions of music intelligence cannot be addressed with validity by an experimental design that counts the number of matches between the labels output by a system and the "ground truth" of a testing dataset (Classify), but does not account for the numerous independent variables of a testing dataset with flawed integrity. Only a few works have sought to determine whether the decisions made by an MGR system come from anything at all relevant to genre (Sturm, 2012c, 2013a), e.g., musicological dimensions such as instrumentation, rhythm (Dixon et al., 2004), harmony (Anglade et al., 2009), compositional form, and lyrics (Mayer et al., 2008). Metaphorically, whereas only a few have been using a tape measure, a level and a theodolite to survey the land, the points of Fig. 2 have been found using microscopes and scales — certainly scientific instruments, but irrelevant to answering the questions of interest. It may be the culmination of a decade of work, but Fig. 2 is no map of reality.

These problems do not just exist for research in MGR, but are also seen in research on music emotion recognition (Sturm, 2013c,b) and autotagging (Marques et al., 2011b; Gouyon et al., 2013). These kinds of tasks constitute more than half of the evaluation tasks in MIREX since 2005, and are, in essence, representative of the IR in MIR. How, then, has this happened, and why does it continue? First, as we have shown for GTZAN (Sturm, 2012a, 2013d), many researchers assume a dataset is a good dataset because many others use it. Second, as we have shown for Classify (Sturm, 2013c,b), many researchers assume that evaluation that is standard in machine learning or information retrieval, or that is implemented to be systematic and rigorous, is thus relevant for MIR and is scientific. Third, researchers fail to define proper use cases, so problems and success criteria remain by and large ill-defined — or, as in the case of GTZAN, defined by and large by the artificial construction of a dataset. Fourth, the foundation of evaluation in MIR remains obscure because it lacks an explicit formalism in its design and analysis of systems and experiments (Urbano et al., 2013; Sturm, 2013c). Hence, evaluations in all of MIR are accompanied by many assumptions that remain implicit, but which directly affect the conclusions, relevance, and validity of experiments. Because these assumptions remain implicit, researchers fail to acknowledge the limitations of experiments, and are persuaded that their solutions are actually addressing the problem, or that their evaluation is measuring success — the Clever Hans Effect (Pfungst, 1911; Sturm, 2013g). While we are not claiming that all evaluation in MIR lacks validity, we do claim there is a "crisis of evaluation" in MIR more severe than what is reflected by the MIReS Roadmap (Serra et al., 2013). The situation in MIR is reminiscent of computer science in the late 1970s.
In an editorial from 1979 (McCracken et al., 1979), the President, Vice-President and Secretary of the Association for Computing Machinery acknowledge that "experimental computer science" (to distinguish it from theoretical computer science) is in a crisis, in part because there is "a severe shortage of computer scientists engaged in, or qualified for, experimental research in computer science." This led practitioners of the field to define more formally the aims of experimental computer science and how they relate to the scientific pursuit of knowledge (Denning, 1980), and to find model examples of scientific practice in their field (Denning, 1981). Six years later, Basili et al. (1986) contributed a review of fundamentals in the design and analysis of experiments, how it has been applied to and benefited software engineering, and suggestions for continued improvement. Almost a decade after this, however, Fenton et al. (1994) argue "there are far too few examples of moderately effective research in software engineering ... much of what we believe about which approaches are best is based on anecdotes, gut feelings, expert opinions, and flawed research, not on careful, rigorous software-engineering experimentation." This is not a surprise to Fenton, however, since "topics like experimental design, statistical analysis, and measurement principles" remain neglected in contemporary computer science educations (Fenton et al., 1994). The conversation is still continuing (Feitelson, 2006). In summary, even statistics is not immune to these problems (Hand, 1994). Donald Preece provides a stratospherically high-level summary of Hand (1994): "[it speaks on] the questions that the researcher wishes to consider ...: (a) how do I obtain a statistically significant result?; (b) how do I get my paper published?; (c) when will I be promoted?"

There are no shortcuts to the design, implementation, and analysis of a valid experiment. It requires hard work, creativity, and intellectual postures that are uncomfortable for many. Good examples abound, however. Chase (2001) shows the time and effort necessary to scientifically and efficiently answer real questions, and exemplifies the kinds of considerations that are taken for granted in disciplines where thousands of experiments can be run in minutes on machines that do not currently need positive reinforcement. Just as Pfungst does with Hans (Pfungst, 1911), Chase scientifically tests whether Oro, Beauty and Pepi are actually discriminating styles of music. Neither Pfungst nor Chase give their subjects free passes of accountability. Why, then, should it be any different for artificial systems?

Acknowledgments

Thanks to: Fabien Gouyon, Nick Collins, Arthur Flexer, Mark Plumbley, Geraint Wiggins, Mark Levy, Roger Dean, Julián Urbano, Alan Marsden, Lars Kai Hansen, Jan Larsen, Mads G. Christensen, Sergios Theodoridis, Aggelos Pikrakis, Dan Stowell, Rémi Gribonval, Geoffrey Peeters, Diemo Schwarz, Roger Dannenberg, Bernard Mont-Reynaud, Gaël Richard, Rolf Bardeli, Jort Gemmeke, Curtis Roads, Stephen Pope, George Tzanetakis, Constantine Kotropoulos, Yannis Panagakis, Ulaş Bağci, Engin Erzin, and João Paulo Papa for illuminating discussions about these topics (which does not mean any endorse the ideas herein). Mads G. Christensen, Nick Collins, Cynthia Liem, and Clemens Hage helped identify several excerpts in GTZAN, and my wife Carla Sturm endured my repeated listening to all of its excerpts.
Thanks to the many, many associate editors and anonymous reviewers for the comments that helped move this work closer and closer to being publishable.

References

Ammer, C. (2004). Dictionary of Music. The Facts on File, Inc., New York, NY, USA, 4th edition.
Andén, J. and Mallat, S. (2011). Multiscale scattering for audio classification. In Proc. ISMIR, pages 657–662.
Andén, J. and Mallat, S. (2013). Deep scattering spectrum. http://arxiv.org/abs/1304.6763.
Anglade, A., Ramirez, R., and Dixon, S. (2009). Genre classification using harmony rules induced from automatic chord transcriptions. In Proc. ISMIR.
Aucouturier, J.-J. (2009). Sounds like teen spirit: Computational insights into the grounding of everyday musical terms. In Minett, J. and Wang, W., editors, Language, Evolution and the Brain: Frontiers in Linguistic Series. Academia Sinica Press.
Aucouturier, J.-J. and Bigand, E. (2013). Seven problems that keep MIR from attracting the interest of cognition and neuroscience. J. Intell. Info. Systems.
Aucouturier, J.-J. and Pachet, F. (2003). Representing music genre: A state of the art. J. New Music Research, 32(1):83–93.
Aucouturier, J.-J. and Pampalk, E. (2008). Introduction – from genres to tags: A little epistemology of music information retrieval research. J. New Music Research, 37(2):87–92.
Bailey, R. A. (2008). Design of comparative experiments. Cambridge University Press.
Barbedo, J. G. A. and Lopes, A. (2007). Automatic genre classification of musical signals. EURASIP J. Adv. Sig. Process.
Barreira, L., Cavaco, S., and da Silva, J. (2011). Unsupervised music genre classification with a model-based approach. In Proc. Portuguese Conf. Progress Artificial Intell., pages 268–281.
Barrington, L., Yazdani, M., Turnbull, D., and Lanckriet, G. R. G. (2008). Combining feature kernels for semantic music retrieval. In Proc. ISMIR, pages 614–619.
Basili, V., Selby, R., and Hutchens, D. (1986). Experimentation in software engineering. IEEE Trans. Software Eng., 12(7):733–743.
Bağci, U. and Erzin, E. (2007). Automatic classification of musical genres using inter-genre similarity. IEEE Signal Proc. Letters, 14(8):521–524.
Bergstra, J., Casagrande, N., Erhan, D., Eck, D., and Kégl, B. (2006). Aggregate features and AdaBoost for music classification. Machine Learning, 65(2-3):473–484.
Bertin-Mahieux, T., Eck, D., Maillet, F., and Lamere, P. (2008). Autotagger: A model for predicting social tags from acoustic features on large music databases. J. New Music Research, 37(2):115–135.
Bertin-Mahieux, T., Eck, D., and Mandel, M. (2010). Automatic tagging of audio: The state-of-the-art. In Wang, W., editor, Machine Audition: Principles, Algorithms and Systems. IGI Publishing.
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The million song dataset. In Proc. ISMIR.
Bowker, G. C. and Star, S. L. (1999). Sorting things out: Classification and its consequences. The MIT Press.
Chang, K., Jang, J.-S. R., and Iliopoulos, C. S. (2010). Music genre classification via compressive sampling. In Proc. ISMIR, pages 387–392, Amsterdam, The Netherlands.
Chase, A. (2001). Music discriminations by carp "Cyprinus carpio". Animal Learning & Behavior, 29(4):336–353.
Craft, A. (2007). The role of culture in the music genre classification task: human behaviour and its effect on methodology and evaluation. Technical report, Queen Mary University of London.
Craft, A., Wiggins, G. A., and Crawford, T. (2007). How many beans make five? The consensus problem in music-genre classification and a new evaluation method for single-genre categorisation systems. In Proc. ISMIR.
Cunningham, S. J., Bainbridge, D., and Downie, J. S. (2012). The impact of MIREX on scholarly research. In Proc. ISMIR.
Dannenberg, R. B. (2010). Style in music. In Argamon, S., Burns, K., and Dubnov, S., editors, The Structure of Style, pages 45–57. Springer Berlin Heidelberg.
Dannenberg, R. B., Thom, B., and Watson, D. (1997). A machine learning approach to musical style recognition. In Proc. ICMC, pages 344–347.
Denning, P. J. (1980). What is experimental computer science? Communications of the ACM, 23(10):543–544.
Denning, P. J. (1981). Performance analysis: experimental computer science at its best. Communications of the ACM, 24(11):725–727.
Dixon, S., Gouyon, F., and Widmer, G. (2004). Towards characterisation of music via rhythmic patterns. In Proc. ISMIR, pages 509–517, Barcelona, Spain.
Dougherty, E. R. and Dalton, L. A. (2013). Scientific knowledge is possible with small-sample classification. EURASIP J. Bioinformatics and Systems Biology, 10.
Downie, J., Ehmann, A., Bay, M., and Jones, M. (2010). The music information retrieval evaluation exchange: Some observations and insights. In Ras, Z. and Wieczorkowska, A., editors, Advances in Music Information Retrieval, pages 93–115. Springer Berlin / Heidelberg.
Downie, J. S. (2003). Toward the scientific evaluation of music information retrieval systems. In Proc. ISMIR, Baltimore, USA.
Downie, J. S. (2004). The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal, 28(2):12–23.
Downie, J. S. (2008). The music information retrieval evaluation exchange (2005–2007): a window into music information retrieval research. Acoustical Science and Tech., 29(4):247–255.
Duin, R. P. W., Juszczak, P., de Ridder, D., Paclik, P., Pekalska, E., and Tax, D. M. J. (2007). PR-Tools 4.1, a Matlab toolbox for pattern recognition. Delft University of Technology. http://prtools.org.
Fabbri, F. (1980). A theory of musical genres: Two applications. In Proc. Int. Conf. Popular Music Studies, Amsterdam, The Netherlands.
Fabbri, F. (1999). Browsing musical spaces. In Proc. IASPM, Amsterdam, The Netherlands.
Feitelson, D. G. (2006). Experimental computer science: The need for a cultural change. Technical report, The Hebrew University of Jerusalem.
Fenton, N., Pfleeger, S. L., and Glass, R. L. (1994). Science and substance: a challenge to software engineers. IEEE Software, 11(4):86–95.
Flexer, A. (2006). Statistical evaluation of music information retrieval experiments. J. New Music Research, 35(2):113–120.
Flexer, A. (2007). A closer look on artist filters for musical genre classification. In Proc. ISMIR, Vienna, Austria.
Flexer, A. and Schnitzer, D. (2009). Album and artist effects for audio similarity at the scale of the web. In Proc. SMC, pages 59–64, Porto, Portugal.
Flexer, A. and Schnitzer, D. (2010). Effects of album and artist filters in audio similarity computed for very large music databases. Computer Music J., 34(3):20–28.
Frow, J. (2005). Genre. Routledge, New York, NY, USA.
Fu, Z., Lu, G., Ting, K. M., and Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Trans. Multimedia, 13(2):303–319.
Gjerdingen, R. O. and Perrott, D. (2008). Scanning the dial: The rapid recognition of music genres. J. New Music Research, 37(2):93–100.
Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., and Langlois, T. (2013). On evaluation in music autotagging research. (submitted).
Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., and Langlois, T. (2013). On evaluation in music autotagging research. (submitted).
Guaus, E. (2009). Audio content processing for automatic music genre classification: Descriptors, databases, and classifiers. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain.
Hand, D. J. (1994). Deconstructing statistical questions. J. Royal Statist. Soc. A (Statistics in Society), 157(3):317–356.
Hartmann, M. A. (2011). Testing a spectral-based feature set for audio genre classification. Master's thesis, University of Jyväskylä.
Humphrey, E. J., Bello, J. P., and LeCun, Y. (2013). Feature learning and deep architectures: New directions for music informatics. J. Intell. Info. Systems, 41(3):461–481.
Kemp, C. (2004). Towards a holistic interpretation of musical genre classification. Master's thesis, University of Jyväskylä.
Kimball, A. W. (1957). Errors of the third kind in statistical consulting. J. American Statistical Assoc., 52(278):133–142.
Law, E. (2011). Human computation for music classification. In Li, T., Ogihara, M., and Tzanetakis, G., editors, Music Data Mining, pages 281–301. CRC Press.
Levy, M. and Sandler, M. (2009). Music information retrieval using social tags and audio. IEEE Trans. Multimedia, 11(3):383–395.
Li, M. and Sleep, R. (2005). Genre classification via an LZ78-based string kernel. In Proc. ISMIR.
Li, T. and Chan, A. (2011). Genre classification and the invariance of MFCC features to key and tempo. In Proc. Int. Conf. MultiMedia Modeling, Taipei, China.
Li, T. and Ogihara, M. (2006). Toward intelligent music information retrieval. IEEE Trans. Multimedia, 8(3):564–574.
Lidy, T. (2006). Evaluation of new audio features and their utilization in novel music retrieval applications. Master's thesis, Vienna University of Tech.
Lidy, T. and Rauber, A. (2005). Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proc. ISMIR.
Lippens, S., Martens, J., and De Mulder, T. (2004). A comparison of human and automatic musical genre classification. In Proc. ICASSP, pages 233–236.
Mandel, M. and Ellis, D. P. W. (2005). Song-level features and support vector machines for music classification. In Proc. ISMIR, London, U.K.
Markov, K. and Matsui, T. (2012a). Music genre classification using self-taught learning via sparse coding. In Proc. ICASSP, pages 1929–1932.
Markov, K. and Matsui, T. (2012b). Nonnegative matrix factorization based self-taught learning with application to music genre classification. In Proc. IEEE Int. Workshop Machine Learn. Signal Process., pages 1–5.
Marques, C., Guilherme, I. R., Nakamura, R. Y. M., and Papa, J. P. (2011a). New trends in musical genre classification using optimum-path forest. In Proc. ISMIR.
Marques, G., Domingues, M., Langlois, T., and Gouyon, F. (2011b). Three current issues in music autotagging. In Proc. ISMIR.
Matityaho, B. and Furst, M. (1995). Neural network based model for classification of music type. In Proc. Conv. Electrical and Elect. Eng. in Israel, pages 1–5.
Mayer, R., Neumayer, R., and Rauber, A. (2008). Rhyme and style features for musical genre classification by song lyrics. In Proc. ISMIR.
McCracken, D. D., Denning, P. J., and Brandin, D. H. (1979). An ACM executive committee position on the crisis in experimental computer science. Communications of the ACM, 22(9):503–504.
McKay, C. and Fujinaga, I. (2006). Music genre classification: Is it worth pursuing and how can it be improved? In Proc. ISMIR, Victoria, Canada.
McLuhan, M. (1964). Understanding Media: The Extensions of Man. The MIT Press, 1994 edition.
Moerchen, F., Mierswa, I., and Ultsch, A. (2006). Understandable models of music collections based on exhaustive feature generation with temporal statistics. In Proc. Int. Conf. Knowledge Discovery and Data Mining.
Noorzad, P. and Sturm, B. L. (2012). Regression with sparse approximations of data. In Proc. EUSIPCO, Bucharest, Romania.
Pachet, F. and Cazaly, D. (2000). A taxonomy of musical genres. In Proc. Content-based Multimedia Information Access Conference, Paris, France.
Pampalk, E. (2006). Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Tech., Vienna, Austria.
Pampalk, E., Flexer, A., and Widmer, G. (2005). Improvements of audio-based music similarity and genre classification. In Proc. ISMIR, pages 628–633, London, U.K.
Panagakis, Y. and Kotropoulos, C. (2010). Music genre classification via topology preserving non-negative tensor factorization and sparse representations. In Proc. ICASSP, pages 249–252.
Panagakis, Y., Kotropoulos, C., and Arce, G. R. (2009a). Music genre classification using locality preserving non-negative tensor factorization and sparse representations. In Proc. ISMIR, pages 249–254, Kobe, Japan.
Panagakis, Y., Kotropoulos, C., and Arce, G. R. (2009b). Music genre classification via sparse representations of auditory temporal modulations. In Proc. EUSIPCO.
Park, S., Park, J., and Sim, K. (2011). Optimization system of musical expression for the music genre classification. In Proc. Int. Conf. Control, Auto. Syst., pages 1644–1648.
Pfungst, O. (1911). Clever Hans (The Horse of Mr. Von Osten): A Contribution to Experimental Animal and Human Psychology. Henry Holt, New York.
Porter, D. and Neuringer, A. (1984). Music discriminations by pigeons. J. Experimental Psychology: Animal Behavior Processes, 10(2):138–148.
Salzberg, S. L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328.
Scaringella, N., Zoia, G., and Mlynek, D. (2006). Automatic genre classification of music content: A survey. IEEE Signal Process. Mag., 23(2):133–141.
Serra, X., Magas, M., Benetos, E., Chudy, M., Dixon, S., Flexer, A., Gómez, E., Gouyon, F., Herrera, P., Jordà, S., Paytuvi, O., Peeters, G., Schlüter, J., Vinet, H., and Widmer, G. (2013). Roadmap for Music Information ReSearch. Creative Commons.
Seyerlehner, K. (2010). Content-based Music Recommender Systems: Beyond Simple Frame-level Audio Similarity. PhD thesis, Johannes Kepler University, Linz, Austria.
Seyerlehner, K., Widmer, G., and Pohle, T. (2010). Fusing block-level features for music similarity estimation. In Proc. DAFx.
Shapiro, P. (2005). Turn the Beat Around: The Secret History of Disco. Faber & Faber, London, U.K.
Slaney, M. (1998). Auditory toolbox. Technical report, Interval Research Corporation.
Sordo, M., Celma, O., Blech, M., and Guaus, E. (2008). The quest for musical genres: Do the experts and the wisdom of crowds agree? In Proc. ISMIR.
Sturm, B. L. (2012a). An analysis of the GTZAN music genre dataset. In Proc. ACM MIRUM Workshop, Nara, Japan.
Sturm, B. L. (2012b). A survey of evaluation in music genre recognition. In Proc. Adaptive Multimedia Retrieval, Copenhagen, Denmark.
Sturm, B. L. (2012c). Two systems for automatic music genre recognition: What are they really recognizing? In Proc. ACM MIRUM Workshop, Nara, Japan.
Sturm, B. L. (2013a). Classification accuracy is not enough: On the evaluation of music genre recognition systems. J. Intell. Info. Systems, 41(3):371–406.
Sturm, B. L. (2013b). Evaluating music emotion recognition: Lessons from music genre recognition? In Proc. ICME.
Sturm, B. L. (2013c). Formalizing evaluation in music information retrieval: A look at the MIREX automatic mood classification task. In Proc. CMMR.
Sturm, B. L. (2013d). The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. http://arxiv.org/abs/1306.1461.
Sturm, B. L. (2013e). Music genre recognition with risk and rejection. In Proc. ICME.
Sturm, B. L. (2013f). On music genre classification via compressive sampling. In Proc. ICME.
Sturm, B. L. (2013g). A simple method to determine if a music information retrieval system is a horse. (submitted).
Sturm, B. L. and Gouyon, F. (2013). Revisiting inter-genre similarity. IEEE Signal Proc. Letters, 20(11):1050–1053.
Sturm, B. L. and Noorzad, P. (2012). On automatic music genre recognition by sparse representation classification using auditory temporal modulations. In Proc. CMMR, London, UK.
Theodoridis, S. and Koutroumbas, K. (2009). Pattern Recognition. Academic Press, Elsevier, Amsterdam, The Netherlands, 4th edition.
Truzzi, M. (1978). On the extraordinary: An attempt at clarification. Zetetic Scholar, 1:11–22.
Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Trans. Speech Audio Process., 10(5):293–302.
Urbano, J. (2011). Information retrieval meta-evaluation: Challenges and opportunities in the music domain. In Proc. ISMIR, pages 609–614.
Urbano, J., McFee, B., Downie, J. S., and Schedl, M. (2012). How significant is statistically significant? The case of audio music similarity and retrieval. In Proc. ISMIR.
Urbano, J., Mónica, M., and Morato, J. (2011). Audio music similarity and retrieval: Evaluation power and stability. In Proc. ISMIR, pages 597–602.
Urbano, J., Schedl, M., and Serra, X. (2013). Evaluation in music information retrieval. J. Intell. Info. Systems (in press).
van den Berg, E. and Friedlander, M. P. (2008). Probing the Pareto frontier for basis pursuit solutions. SIAM J. on Scientific Computing, 31(2):890–912.
Vandewalle, P., Kovacevic, J., and Vetterli, M. (2009). Reproducible research in signal processing – what, why, and how. IEEE Signal Process. Mag., 26(3):37–47.
Wang, A. (2003). An industrial strength audio search algorithm. In Proc. ISMIR, Baltimore, Maryland, USA.
Wiggins, G. A. (2009). Semantic gap?? Schemantic schmap!! Methodological considerations in the scientific study of music. In Proc. IEEE Int. Symp. Multimedia, pages 477–482.
Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., and Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Machine Intell., 31(2):210–227.
Xia, G., Tay, J., Dannenberg, R., and Veloso, M. (2012). Autonomous robot dancing driven by beats and emotions of music. In Proc. Int. Conf. Autonomous Agents Multiagent Syst., pages 205–212, Richland, SC.