Local-Shapelets for Fast Classification of Spectrographic Measurements

Daniel Gordon a,c,∗, Danny Hendler a, Aryeh Kontorovich a, Lior Rokach b,c

a Department of Computer Science, Ben-Gurion University of The Negev, Be'er Sheva 84105, Israel
b Department of Information Systems Engineering, Ben-Gurion University of The Negev, Be'er Sheva 84105, Israel
c Telekom Innovation Laboratories, Ben-Gurion University of The Negev, Be'er Sheva 84105, Israel

Abstract

Spectroscopy is widely used in the food industry as a time-efficient alternative to chemical testing. Lightning-monitoring systems also employ spectroscopic measurements. The latter application is important as it can help predict the occurrence of severe storms, such as tornadoes.

The shapelet-based classification method is particularly well-suited for spectroscopic data sets. This technique for classifying time series extracts patterns unique to each class. A significant downside of this approach is the time required to build the classification tree. In addition, for high-throughput applications the classification time of long time series is prohibitive. Although some progress has been made in terms of reducing the time complexity of building shapelet-based models, the problem of reducing classification time has remained an open challenge.

We address this challenge by introducing local-shapelets. This variant of the shapelet method restricts the search for a match between shapelets and time series to the vicinity of the location from which each shapelet was extracted. This significantly reduces the time required to examine each shapelet during both the learning and classification phases. Classification based on local-shapelets is well-suited for spectroscopic data sets as these are typically very tightly aligned. Our experimental results on such data sets demonstrate that the new approach reduces learning and classification time by two orders of magnitude while retaining the accuracy of regular (non-local) shapelets-based classification. In addition, we provide some theoretical justification for local-shapelets.

Keywords: Spectrography, time series, classification, shapelets, local

Research highlights
• We present an algorithm for classifying spectrographic measurements.
• The concept of locality is introduced into an established time series algorithm.
• A technique for estimating a tolerance parameter is presented.
• Learning and classification times are reduced by two orders of magnitude.
• Accuracy levels are retained.

∗ Corresponding author. Tel: +972 (0)86428782; fax: +972 (0)86477650.
Email addresses: gordonda@cs.bgu.ac.il (Daniel Gordon), hendlerd@cs.bgu.ac.il (Danny Hendler), karyeh@cs.bgu.ac.il (Aryeh Kontorovich), liorrk@bgu.ac.il (Lior Rokach)

Preprint submitted to Expert Systems with Applications, November 13, 2014

1. Introduction

Spectroscopy is a field devoted to the study and characterization of physical systems by measuring the electromagnetic frequencies they absorb or emit (Herrmann and Onkelinx, 1986). Items differing in their chemical composition or molecular bonds absorb or emit light at different wavelengths, leaving a different spectroscopic fingerprint and thus enabling differentiation between them. For example, Al-Jowder et al. (2002) used mid-infrared spectroscopy to detect meat adulteration by comparing the spectra of adulterated meat with that of unadulterated meat.
A study by Briandet et al. (1996) discriminated between two different types of coffee beans (Arabica and Robusta) using mid-infrared spectroscopy. Other methods for distinguishing between different types of food exist, which are based on wet chemical analysis (Bicchi et al., 1993; Lumley, 1996; Sharma et al., 1994). The advantages of spectroscopy over wet chemical analysis are its simplicity (Briandet et al., 1996) and speed of response.

Spectroscopic measurements are also generated by systems monitoring lightning (Eads et al., 2002). This application is important as the relative percentages of different types of lightning can indicate the outbreak of severe storms, such as tornadoes.

In addition to laboratory research, spectroscopic equipment is starting to be mass produced for everyday use, allowing anyone to analyze their surroundings with the aid of spectroscopic measurements (SCIO, 2014). The measurements are uploaded to a cloud service where they are analyzed, and the results of the analysis are then made available. As the service is cloud based, it requires algorithms with high throughput to enable a quick response to high volumes of user queries.

The outcome of the spectroscopic analysis of a physical system is a vector in which each index represents a frequency and each value is the measured intensity of that frequency. The representations of spectroscopic measurements and time series are identical (Ye and Keogh, 2011a), as the only explicit data are the measurements and the meaning of each measurement is defined by its location in the vector. This equivalence allows the application of time series classification methods to the field of spectroscopy. A previous experimental study (Hills et al., 2013) showed that the shapelet-based classification method is particularly suited for data sets from the field of spectroscopy, as it achieved a higher accuracy than other machine-learning classification methods.

Recently, Ye and Keogh (2011a) introduced the shapelets approach for classifying time series. A shapelet is a subsequence extracted from one of the time series in the data set. The shapelet is chosen by its ability to distinguish between time series from different classes. A test time series is classified based on its distance from the shapelet. In the case of multiple shapelets, these form the nodes of a classification tree. The intuition behind this approach is that the pattern best separating the classes is not necessarily an entire time series. Rather, a certain subsequence may best describe the discriminating pattern. Ye and Keogh's algorithm considers all possible subsequences in the training set in order to identify those shapelets that yield the optimal split. Throughout the rest of this paper, we will refer to this algorithm as the YK-algorithm.

Two key advantages of classification with shapelets are the accuracy and interpretability of the induced classification model, as it supplies information on the patterns characteristic of the different classes (Ye and Keogh, 2011a). A significant downside of this approach is the time required for building the classification tree (Hills et al., 2013; Mueen et al., 2011; Rakthanmanon and Keogh, 2013).

Figure 1: Examples of two data sets (Coffee and Wheat) from the field of spectroscopy; panel (a) shows the Coffee data set and panel (b) the Wheat data set (x-axis: time series index, y-axis: absorbance). The time series of each class are vertically separated and in different colors.
As shown, the examples of each class are tightly aligned, i.e., similar patterns are exhibited at similar locations along the x-axis.

The search for the best shapelet requires examining all subsequences of all lengths from all the time series in the training set, and for each shapelet calculating its distance to each time series at all possible locations. Even for small data sets, this process has a time scale of days, and for large data sets, the time scale becomes one of years. Hence, the original implementation on commonly available hardware is only practical on the smallest of data sets. Additionally, for high-throughput applications, the classification time may be prohibitively expensive for long time series. This is because at each node of the tree, all possible matches between the node's shapelet and the time series to be classified need to be examined.

1.1. Our Contributions

Our goal was to reduce both learning and classification time without impairing the accuracy of the induced model, by exploiting a feature common to spectroscopic data – the localized nature of information in the time series. In the YK-algorithm, no importance is attributed to the location from which the shapelet was extracted. Hence, the best match between a shapelet and a time series is searched for anywhere in the time series. We observed that for many data sets from the field of spectroscopy, time series from the same class show similar behavior patterns at similar locations along the frequency axis. Fig. 1 presents examples of two data sets which strongly support this insight.

Based on this insight, we propose a new property as part of the definition of a shapelet, derived from the location in the time series from which the shapelet was extracted. This property limits the scope of the search for the best match of a shapelet to a time series to the vicinity of the location from which the shapelet was extracted. The assumption of locality is justified as spectroscopic measurements of items with similar properties should have very similar spectroscopic fingerprints, especially in areas characteristic of a specimen which are not expected to be contaminated. Our current implementation assumes that all time series are of equal length.

Although the time series are generally aligned, some allowance for misalignment is necessary. We therefore introduce a method for learning the misalignment characteristic of a data set.

We evaluate our approach on data sets from the field of spectroscopy, and show that local-shapelets can reduce learning and classification time (especially for data sets with long time series) by over two orders of magnitude without impairing accuracy. For reproducibility, we have made all our code available online (local shapelets, 2014).

The rest of the article is organized as follows: First we present basic definitions required for understanding the article and briefly describe the YK-algorithm in Sect. 2. Then we present related work (Sect. 3), followed by a description of local-shapelets and our proposed method for determining the range to examine (Sect. 4). Next we present our experimental evaluation (Sect. 5), followed by a brief statistical analysis which provides a theoretical justification for our local-shapelets approach (Sect. 6). Finally, we summarize our results and present additional research directions to pursue (Sect. 7).
2. Background

Here we present a number of definitions necessary for the proper understanding of this article and a short description of the original shapelet algorithm, as it is the basis of our work.

2.1. Definitions

Definition 1. A time series T of length m is a series of m consecutive, equally spaced measurements: $T = t_0, t_1, \ldots, t_{m-1}$.

Definition 2. A subsequence S of length k extracted from time series T of length m at index i, such that k ≤ m, is a series of k consecutive measurements: $S = t_i, t_{i+1}, \ldots, t_{i+k-1}$.

Definition 3. The Euclidean distance between two time series T, R of length m is:
$$d_E(T, R) = \sqrt{\sum_{i=0}^{m-1} (t_i - r_i)^2}.$$

Definition 4. The Euclidean distance between a time series T of length m and a subsequence S of length k, such that k ≤ m, is:
$$d_E(T, S) = \min_{i \in [0,\, m-k]} d_E(T[i : i+k-1], S).$$
This is the minimal distance between S and all subsequences of length k in T.

Definition 5. Given a data set D, with c classes and n examples, each class i with $n_i$ examples, the entropy is:
$$Ent(D) = -\sum_{i=0}^{c-1} \frac{n_i}{n} \log \frac{n_i}{n}.$$

Intuitively, a data set's entropy is a measure of its class-homogeneity. Larger homogeneity corresponds to smaller entropy values. Specifically, the smallest entropy value (0) corresponds to a data set in which all members belong to the same class.

Definition 6. Given a data set D with n examples, split into two subsets D1 and D2 containing n1 and n2 examples respectively, such that D1 ∪ D2 = D and D1 ∩ D2 = ∅, the information gain (IG) is:
$$IG(D, D_1, D_2) = Ent(D) - \left(\frac{n_1}{n} Ent(D_1) + \frac{n_2}{n} Ent(D_2)\right).$$

Intuitively, information gain is a measure of the class-homogeneity induced by a split of data set D. The larger the information gain, the larger the decrease in entropy of the split data set w.r.t. D, in turn implying better class-homogeneity. As we will soon see, each shapelet induces a data set split, and its effectiveness is measured by the split's information gain.

2.2. The YK-Algorithm

For completeness, we briefly present the YK-algorithm as presented in Ye and Keogh (2011a). First, we present the algorithm for two classes; we then extend the description to a multi-class data set.

Let D be a data set with two classes and n time series. The YK-algorithm examines all possible subsequences of every length (from a minimal length, usually 3, to a maximal length which is usually the length of the shortest time series) from every time series. For each subsequence S, the distance to each time series is calculated, as defined in Definition 4. Then, the time series are ordered by their distance from S. Using this induced order, the average distance of every two adjacent time series to S is calculated. We will refer to this average distance as the splitting distance. Each of the n − 1 splitting distances defines two subsets, one containing all time series with a distance to S smaller than or equal to the splitting distance, and the other containing all time series with a distance to S greater than the splitting distance. For every possible split into two subsets, the information gain is calculated (see Definition 6). If the current information gain is better than the best so far, the shapelet is kept along with the corresponding splitting distance. Ties are broken by keeping the shapelet that induces the larger average distance between the two subsets, which is referred to as the margin. After checking all possible subsequences, the best shapelet and the corresponding splitting distance are returned.
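To make Definitions 4-6 and the splitting procedure concrete, the following Python sketch computes the subsequence distance, the entropy, and the information gain induced by a candidate splitting distance. This is an illustration under our own naming, not the authors' implementation, and the local normalization used in the original code (see the implementation notes below) is omitted for brevity.

```python
import numpy as np

def subsequence_distance(T, S):
    """Definition 4: minimal Euclidean distance between subsequence S and
    all subsequences of length len(S) in time series T (both 1-D arrays)."""
    k = len(S)
    return min(np.linalg.norm(T[i:i + k] - S) for i in range(len(T) - k + 1))

def entropy(labels):
    """Definition 5: class entropy of the label multiset."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def information_gain(labels, distances, split_dist):
    """Definition 6: IG of the two-way split induced by split_dist."""
    labels, distances = np.asarray(labels), np.asarray(distances)
    left = labels[distances <= split_dist]
    right = labels[distances > split_dist]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_split(labels, distances):
    """Scan the n-1 candidate splitting distances (midpoints of adjacent
    sorted distances) and return the best (IG, splitting distance) pair."""
    d = np.sort(np.asarray(distances))
    best_ig, best_dist = -1.0, None
    for a, b in zip(d[:-1], d[1:]):
        if a == b:  # identical distances induce no new split
            continue
        ig = information_gain(labels, distances, (a + b) / 2.0)
        if ig > best_ig:
            best_ig, best_dist = ig, (a + b) / 2.0
    return best_ig, best_dist
```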
This method can be easily extended to a multi-class problem by building a tree, with a shapelet and splitting distance in each node. A new node receives one of the two subsets created by the shapelet found by the node above it, and learns the best shapelet and splitting distance for this subset of time series. The stopping criterion for this recursive algorithm is that all the time series in the subset belong to one class. Two important implementation issues are that all distance calculations are computed after local normalization (Goldin and Kanellakis, 1995) and that the margin is normalized by dividing it by the length of the subsequence.

Classification of a time series T is accomplished by traveling down the tree. At each node, the distance of the shapelet S associated with the node to T is calculated. The node decides to which of its child nodes T should be directed, depending on whether its distance from S is smaller or greater than the splitting distance. When T reaches a leaf, it is assigned the class associated with this leaf.
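The traversal just described can be sketched as follows; the node representation is our own simplification (the z-normalization applied in the original implementation is again omitted):

```python
import numpy as np

class ShapeletNode:
    """Hypothetical node of the induced classification tree: internal nodes
    hold a shapelet S and a splitting distance d; leaves hold a class label."""
    def __init__(self, shapelet=None, split_dist=None,
                 left=None, right=None, label=None):
        self.shapelet = shapelet      # subsequence S (None at a leaf)
        self.split_dist = split_dist  # splitting distance d
        self.left = left              # subtree for distance <= d
        self.right = right            # subtree for distance > d
        self.label = label            # class label (set at leaves only)

def classify(node, T):
    """Walk down the tree: at each node compute d_E(T, S) (Definition 4)
    and branch on the node's splitting distance."""
    while node.label is None:
        k = len(node.shapelet)
        dist = min(np.linalg.norm(T[i:i + k] - node.shapelet)
                   for i in range(len(T) - k + 1))
        node = node.left if dist <= node.split_dist else node.right
    return node.label
```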
2.2.1. Time Complexity of the YK-Algorithm

Let m denote the length of a time series and let us assume all time series are of equal length. Assuming all shapelet lengths from 3 to m are examined, the number of different shapelets to examine in a single time series is $\sum_{i=3}^{m} i = O(m^2)$. Let n denote the number of time series in data set D. The number of shapelets to examine in the entire data set is $O(nm^2)$.

When searching for the minimal distance between a shapelet S of length k and a time series T, the distance of S to all subsequences of length k in T needs to be calculated (see Definition 4). The time complexity of this operation is $O(m^2)$. Calculating the distance of S to all time series requires $O(nm^2)$ calculations. The total number of calculations for all shapelets and time series is therefore $O(n^2m^4)$, which explains the formidable time required for learning a model even for small data sets.

3. Related Work

The time complexity of the YK-algorithm is formidable (see Sect. 2.2.1), leading to a large number of attempts to reduce it. As we will show in this section, none of the previous approaches utilized the location from which the shapelet was extracted to reduce the time required to learn a model. In addition, most of these approaches do not reduce the time required to classify a time series.

The first attempt to reduce the time complexity of the YK-algorithm was introduced in the paper first presenting shapelets (Ye and Keogh, 2011a). Two optimizations were suggested. The first optimizes the distance calculation of a shapelet to a time series: the calculation is terminated if it exceeds the minimum distance found so far between the current shapelet and time series. This optimization was coined early-abandon. The second optimization (named entropy-pruning) checks whether the most optimistic IG possible, given the distances of a shapelet to time series already computed, can be better than the best IG found so far. If the IG cannot be improved, the shapelet under examination is discarded. As pointed out by Lines et al. (2012), this optimization requires testing $O(2^c)$ different possibilities (c is the number of classes in the data set), which can greatly reduce its effectiveness when the number of classes is large.

Later, Mueen et al. (2011) introduced additional optimizations. One optimization manages to compute the distance of a shapelet to a time series in constant time by precomputing necessary statistics. This optimization reduces the time complexity to $O(n^2m^3)$. A major downside is that for each pair of time series, a matrix of size $m^2$ needs to be maintained. This leads to a total space complexity of $O((nm)^2)$, which is untenable for large data sets (Gordon et al., 2015; Rakthanmanon and Keogh, 2013). A second optimization discards shapelets similar to shapelets that were already discarded. A disadvantage of this approach is the large time overhead when applied to data sets with a large number of classes.

Two recent attempts managed to dramatically reduce the time complexity of learning a shapelet-based model. The first method (Rakthanmanon and Keogh, 2013) quickly picks out a small number of shapelets of each shapelet length which seem able to effectively divide the data set into its classes. Then only this subset of shapelets is fully analyzed. This approach reduces the time complexity to $O(nm^3)$ but requires a considerable amount of space to accommodate this reduction in time complexity. We will refer to this solution as the hashing-algorithm. A second approach (Gordon et al., 2015), named SALSA-R, randomly samples a constant number of shapelets (10,000) which are examined, reducing the time complexity to $O(10{,}000 \times nm^2)$ with no excess space requirements.

As shown, none of the aforementioned optimizations utilize the location from which the shapelet was extracted to reduce the time complexity of the learning process. In addition, none of them significantly reduces classification time.

Xing et al. (2011) introduced an optimization for fast classification of streaming time series with the aid of shapelets. Their main idea was to prefer shapelets which appear early in a time series over shapelets which appear later. This ensures that time series can be classified quickly once initial measurements have arrived, with no need to wait for further measurements. They named this new type of shapelet local-shapelets. Although both works use the same name, the concepts differ. The motivation of Xing et al. was to classify streaming time series as early as possible, while ours is to optimize learning and classification time of non-streaming data sets. The implementations also differ: Xing et al. did not preserve the location from which the shapelet was extracted, whereas our method exploits this location, with no limitation on where a shapelet may be extracted from.

4. Local-Shapelets

In the material that follows, the basic idea of local-shapelets is described, as well as modifications which make it useful in practice.

Definition 7. A shapelet is a tuple $\langle \vec{S}, d \rangle$. $\vec{S}$ is a series of consecutive measurements extracted from one of the time series in the data set and d is a cutoff distance. Time series with a distance to $\vec{S}$ smaller than or equal to d traverse one side of the tree, while time series with a distance to $\vec{S}$ greater than d traverse the other side of the tree.

Definition 8. A local-shapelet is a tuple $\langle \vec{S}, d, i \rangle$. $\vec{S}$ and d are as in Definition 7; i is the location from which the local-shapelet was extracted.

Unlike a shapelet, a local-shapelet contains information regarding the location from which it was extracted, which is utilized when calculating the distance of the local-shapelet to a time series.
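Definitions 7 and 8 translate directly into a small data structure; the field names below are ours, not the authors':

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Shapelet:            # Definition 7: <S, d>
    S: np.ndarray          # the extracted subsequence
    d: float               # cutoff (splitting) distance

@dataclass
class LocalShapelet(Shapelet):  # Definition 8: <S, d, i>
    i: int                 # index in the source time series from which S was extracted
```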
Definition 9. The Euclidean distance between a time series T of length m and a local-shapelet $\langle \vec{S}, d, i \rangle$ of length k, such that k ≤ m, given x which defines a range around i, is:
$$d_E(T, S) = \min_{j \in [\max(0,\, i-x),\, \min(i+x,\, m-k)]} d_E(T[j : j+k-1], S).$$

The distance between T and a local-shapelet $\langle \vec{S}, d, i \rangle$ adds a constraint on the subsequences of T to which the distance of subsequence $\vec{S}$ is calculated. Instead of calculating the distance of $\vec{S}$ to all subsequences of length k in T, the distance of $\vec{S}$ is calculated only to subsequences in the vicinity of the location i from which $\vec{S}$ was extracted. This vicinity is defined by the constant x. Thus, the time required for calculating the distance of a shapelet to a time series is reduced.

4.1. The Tolerance Range

A naïve implementation of local-shapelets is to calculate the distance of a shapelet only to the single subsequence in the time series at the exact location i from which it was extracted. This approach may have a detrimental impact on the learning process, as exemplified in Fig. 2, which presents two time series from the same class of the data set Lightning7. As is clearly shown, the characteristic spike may appear at slightly different locations. Restricting the distance calculation of a shapelet to the exact location from which it was extracted would cause characteristic patterns to be overlooked.

Figure 2: Two examples from different classes of the Lightning7 data set illustrating the need for a tolerance range; panel (a) shows an example from class 2 and panel (b) an example from class 3 (x-axis: time series index, y-axis: absorbance). Without a tolerance range, the spike characteristic of each class cannot be utilized, as its location is different in different time series.

To accommodate this issue, some tolerance, which we will refer to as the radius, needs to be added to the index i, such that the distance of a local-shapelet to a time series T is the minimum distance to all subsequences starting in the range [i − radius, i + radius] (see Definition 9). We will refer to this range as the tolerance range.

In food spectroscopy there are many sources of noise, such as differences in the residual water content of the freeze-dried samples (Briandet et al., 1996). This noise may cause distortions in the pattern generated but will not cause a shift in the frequencies emitted; therefore we set the tolerance range to 0. In lightning spectroscopy, interferences such as frequency-dependent dispersion induced by the ionosphere (Moore et al., 1995) may lead to a shift in the frequencies measured, requiring the introduction of a tolerance range greater than 0.
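A Python sketch of Definition 9, with the tolerance range expressed as a radius around the extraction index i (our own rendering; as in the earlier sketches, normalization is omitted):

```python
import numpy as np

def local_distance(T, S, i, radius):
    """Definition 9: minimal Euclidean distance between the subsequence S of
    a local-shapelet (extracted at index i) and time series T, examining only
    start positions within `radius` of i."""
    k = len(S)
    lo = max(0, i - radius)
    hi = min(i + radius, len(T) - k)  # inclusive upper bound
    return min(np.linalg.norm(T[j:j + k] - S) for j in range(lo, hi + 1))
```

With radius = 0, only the subsequence starting at i itself is examined; this is the setting used below for the food spectroscopy data sets.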
We present a method for computing the value of the tolerance required, described in Procedure 1. This is achieved by experimentally observing the tolerance required for a small number of subsequences. Procedure 1 splits each time series into ten equal, consecutive and disjoint subsequences (line 1). From each class (C) and for each subsequence location, a single subsequence (subseq) is extracted from a randomly selected time series (rand-ts) (lines 2-6). Then the distance of subseq to all time series in C is calculated using the global method for calculating distances (see Definition 4). The location of the best match of subseq to each time series is recorded in locations (line 8). Before utilizing the information available in the list of locations, it is necessary to filter out values which are obviously non-characteristic of the data set (line 9).

Procedure 1 Algorithm for computing the tolerance characteristic of a data set
Compute-Tolerance(D) {Input is a data set}
1: start-indices ← indices, s.t. time series will be split into 10 subsequences which are as equal in length as possible
2: for each class C in D do
3:   for each start-index i in start-indices do
4:     subseq-length ← length of subsequence
5:     rand-ts ← randomly selected time series from C
6:     subseq ← rand-ts[i : i+subseq-length-1]
7:     for each time series t in class C do
8:       locations ← append(location-of-best-distance(t, subseq))
9:     locations ← outlier-filter(locations) {Using IQR-filtering}
10:    Radius_{C,i} ← max(locations) − min(locations)
11:   Radius_C ← min(Radius_{C,i})
12: tolerance ← max(Radius_C)
13: return tolerance

A major motivation for filtering is that a larger tolerance leads to an increase in learning and classification time, as more distance calculations are required for each shapelet. Filtering out locations which are atypical should not impair accuracy significantly. For example, suppose we receive the following list of locations: 1, 30, 30, 31, 32, 34, 37, 38, 40, 40, 41, 81. It is quite clear that the values 1 and 81 are outliers and should be filtered out. Without filtering, the size of the range is 81, while after filtering, the size of the range is only 12.

Our method for outlier filtering is based on the interquartile range (IQR). Using this method, the first ($Q_1$) and third ($Q_3$) quartiles are calculated and the IQR is computed as $IQR = Q_3 - Q_1$. All values greater than $Q_3 + 3 \times IQR$ or smaller than $Q_1 - 3 \times IQR$ are filtered out (a sketch appears at the end of this subsection). Although it is customary to use 1.5 as the multiplication factor, we chose a multiplication factor of 3 so as not to filter out too many values, which may lead to overfitting. Our tuning of the multiplication factor to a value of 3 is a heuristic, as it cannot guarantee total avoidance of overfitting; other parameters, such as the model complexity, also play a part. Three advantages of IQR filtering are that it is simple, that it is nonparametric, and that it does not automatically drop extreme values if they are similar to the rest.

Once we have filtered out the outliers, the characteristic radius as reflected by this subsequence of class C is calculated and recorded (line 10). After a radius for each of the subsequences of a class has been calculated, the minimum of all these radii is selected to represent the locality of the class (line 11). We chose the minimal radius as this leads to the smallest number of distance calculations of a shapelet to a time series, which promises the best possible speedup in runtime. The last stage is the selection of a single radius as the tolerance for the data set. We chose the maximum radius over all classes (line 12), as the tolerance must accommodate the most loosely localized class. Otherwise, for some of the classes, the ultimate matches may reside outside of the recommended range and would not be examined.

The time complexity of this phase is negligible, as only 10 subsequences per class are examined and the distance of each subsequence is calculated only to time series of its class.
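The outlier-filter of line 9 can be sketched as follows. The quartile convention is not specified in the text; the sketch uses NumPy's default linear interpolation, under which the worked example above behaves as described:

```python
import numpy as np

def iqr_filter(locations, factor=3.0):
    """Procedure 1, line 9: drop best-match locations outside
    [Q1 - factor*IQR, Q3 + factor*IQR]; the paper uses factor 3
    rather than the customary 1.5."""
    locations = np.asarray(locations)
    q1, q3 = np.percentile(locations, [25, 75])
    iqr = q3 - q1
    keep = (locations >= q1 - factor * iqr) & (locations <= q3 + factor * iqr)
    return locations[keep]

# The example above: 1 and 81 fall outside the fences and are dropped.
print(iqr_filter([1, 30, 30, 31, 32, 34, 37, 38, 40, 40, 41, 81]))
# -> [30 30 31 32 34 37 38 40 40 41]
```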
4.2. Random Selection of Shapelets

For completeness, we present the procedure used by SALSA-R for randomly selecting shapelets in Procedure 2. We chose a distribution similar to the uniform distribution but simpler to implement.

Procedure 2 Algorithm for randomly selecting shapelets
extract-shapelet(D, min-sh-length) {Input is a data set and the minimum length of a shapelet}
1: num-ts ← number-of-time-series-in-data-set(D)
2: ts-index ← random-selection(0, num-ts)
3: highest-index ← time-series-length − min-sh-length + 1
4: sh-index ← random-selection(0, highest-index)
5: longest-possible-shapelet ← time-series-length − sh-index + 1
6: sh-length ← random-selection(min-sh-length, longest-possible-shapelet)
7: sh ← extract-shapelet(D, ts-index, sh-index, sh-length)
8: return sh

Procedure 2 describes the method for randomly selecting the next shapelet to examine. First (line 2), the time series from which to extract the shapelet is chosen. Then (lines 3-4), the index in the time series from which to extract the shapelet is randomly generated. The range of possible indices is between 0 and the last index from which the shortest possible shapelet can be extracted. Last (lines 5-6), the length of the shapelet is randomly selected. The upper limit on the length of the shapelet (longest-possible-shapelet) is calculated based on the location from which the shapelet is to be extracted (line 5). The function random-selection(a, b) randomly selects an integer from the range [a, b−1] with a uniform distribution.
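In Python, Procedure 2 amounts to three uniform draws; random.randrange(a, b) has exactly the semantics of random-selection(a, b), returning an integer in [a, b−1]. The rendering below (names are ours) also returns the start index, which a local-shapelet retains as its locality property i:

```python
import random

def extract_random_shapelet(dataset, min_sh_length=3):
    """Procedure 2: pick a random time series, start index and length,
    and return the subsequence together with its extraction index."""
    ts = dataset[random.randrange(0, len(dataset))]        # lines 1-2
    m = len(ts)
    sh_index = random.randrange(0, m - min_sh_length + 1)  # lines 3-4
    sh_length = random.randrange(min_sh_length, m - sh_index + 1)  # lines 5-6
    return ts[sh_index:sh_index + sh_length], sh_index     # lines 7-8
```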
5. Experimental Results

Our goal is to show that for data sets from the field of spectroscopy, the usage of local-shapelets reduces the time complexity during both the training and classification phases in comparison with global-shapelets (i.e., non-local shapelets), without degrading accuracy. First, we present the data sets from the field of spectroscopy with which we evaluated local-shapelets. Then, we establish the utility of our method (Procedure 1) for calculating the tolerance range. In the next phase of our evaluation, we compare local-shapelets vs. global-shapelets within the YK-algorithm. In the last phase, we re-implement SALSA-R to utilize the locality of shapelets, and compare accuracy and run-time with the original implementation of SALSA-R and the hashing-algorithm. We ran all experiments on an Intel Xeon E5620 computer comprising two 2.40GHz processors with 24GB of RAM, running a 64-bit version of Windows Server 2008 R2.

5.1. Description of Data Sets

Our experiments were conducted on a collection of 6 data sets from the field of spectroscopy, available online (Ye and Keogh, 2011b; Keogh et al., 2014). The collection contains four data sets from the field of food spectroscopy (Beef, Coffee, OliveOil, Wheat) and two from the field of lightning spectroscopy (Lightning2, Lightning7). Table 1 contains information on the number of examples in the training and test sets, the number of classes, the length of the time series and the tolerance range used. All data sets were already split into train and test sets. We preserved the original division into train and test sets to allow easy reproduction of our results, as well as a fair comparison with other published results.

As the test sets are very small, our initial measurements of classification times were inaccurate, due to minor overheads which dampened the effect of locality on the outcome and due to the inaccuracy of computer time measurements at small time scales. We solved this by enlarging each test set to a size of 1GB. This was done by duplicating examples. To ensure that the accuracy obtained would be identical to that on the original data set, we duplicated each example an equal number of times. As the classifier is deterministic, the classification of an example will always be identical no matter how many times it appears. Therefore, the proportion of correct classifications out of all classification examples will not change and the accuracy will remain the same.

Table 1: Description of the data sets

data set     train set size   test set size   num. classes   time series length   tolerance
Beef         30               252,510         5              470                  0
Coffee       28               383,264         2              286                  0
Lightning2   60               106,140         2              637                  0
Lightning7   70               209,510         7              319                  78
OliveOil     30               208,770         4              570                  0
Wheat        49               115,434         7              1,050                0

5.1.1. Food Spectrographs

Beef. The beef data set (Al-Jowder et al., 2002) contains the spectral absorbance of one type of beef cut (silverside). One class contains the spectral absorbance of the beef cut without any contaminants. Each of the other four classes contains the spectral absorbance of the beef cut contaminated with a different type of offal (kidney, liver, heart and tripe), which is cheaper than the declared beef cut.

Coffee. The coffee data set (Briandet et al., 1996) contains the spectral absorbance of instant coffee made from two different types of coffee beans, Arabica and Robusta. Coffee from Arabica beans is more highly regarded, as it has a finer and more pronounced taste. Approximately 90% of world coffee production is from Arabica and another 9% is from Robusta. As the price of Arabica is higher than that of Robusta, it is important to be able to distinguish between them even after the long process required to produce instant coffee.

OliveOil. Olive oil samples from four different European countries (Greece, Italy, Portugal and Spain) were collected (Tapp et al., 2003). The classification task is to discern the country of origin using the spectrograph of the olive oil sample.

Wheat. This data set (Ye and Keogh, 2011a) consists of spectrographs of wheat grown during the years 1998-2005. There are a number of different types of wheat in the data set, but the class was assigned based only on the year in which the wheat was grown and not on the type of wheat.

5.1.2. Lightning Spectrographs

Data on frequencies emitted during lightning events were collected, and a Fourier transform was then applied to produce spectrographs (Eads et al., 2002). The lightning events were categorized into 7 different classes differing in the charge of the lightning (positive or negative), whether the event was gradual or abrupt, and whether the event was intra-cloud or from cloud to ground. The original authors of this data set (Eads et al., 2002) note that there is a large inter-class variation and intra-class similarity. The data set Lightning7 contains examples of all 7 classes, while Lightning2 is a simpler binary problem of distinguishing between cloud-to-ground events and intra-cloud events.

5.2. Utility of Tolerance Range Calculation

The last column of Table 1 presents the tolerance range used during our experiments. For the four data sets of food spectroscopy, the value of the tolerance range was not calculated using Procedure 1; rather, we set it to 0 based on prior knowledge of this domain. A comparison of the tolerance ranges based on prior knowledge with those recommended by Procedure 1 shows large agreement. For three data sets (Coffee, OliveOil and Wheat) the values are identical. For the Beef data set, although the calculated tolerance range was not exactly 0, it was very close, with a value of 2. Our procedure also succeeds when a tolerance range greater than 0 is required: when applied to the Lightning7 data set, for which it is clear that a tolerance range larger than 0 is required (as shown in Fig. 2), the calculated tolerance range is 78.
These findings show that our method for predicting the tolerance range successfully estimates the required range.

5.3. Local YK-Algorithm

Here we show that local-shapelets reduce time consumption without impairing accuracy. We compare local and global shapelets within the YK-algorithm framework, the initial implementation of the shapelet approach, which examines all possible shapelets. Due to the large time and space requirements of the YK-algorithm, we could not collect results for the Lightning2 data set. For these experiments, we used the original code of Ye and Keogh (2011a).

As pointed out by Hills et al. (2013), the entropy-pruning optimization (see Sect. 3) has an overhead which grows exponentially with the number of classes in the data set. We encountered this experimentally with the Wheat data set, which has 7 classes. The original implementation did not finish examining all shapelets for the first node of the tree after 5 days, while the same implementation without the entropy-pruning optimization finished learning the whole tree in this period of time. Therefore, we conducted the experiments using the original code without the entropy-pruning optimization.

Results of our experiments are presented in Table 2. Each pair of columns compares the results of global-shapelets vs. local-shapelets: the first comparison is the accuracy achieved, the second is the time required to learn a model, and the third is the time required to classify the test set. On almost all measures, the local approach outperforms the global approach (the single exception is accuracy on the Coffee data set). The average improvement in accuracy is 8%, the average speedup during the learning phase is 9.5, and during the classification phase it is 80.

Table 2: Global YK-algorithm vs. Local YK-algorithm

             accuracy (%)      learning time (sec)    classification time (sec)
data set     global   local    global     local       global    local
Beef         46.67    56.67    24,307     1,364       76.03     1.10
Coffee       96.43    92.86    1,466      407         36.79     1.54
Lightning7   43.84    53.42    98,636     67,660      109.75    59.58
OliveOil     33.33    50.00    17,448     2,287       46.85     1.01
Wheat        57.71    65.43    433,630    25,221      158.36    0.59

5.4. Local-SALSA-R

In this set of experiments, we re-implemented SALSA-R to use local-shapelets instead of global-shapelets. The number of shapelets randomly selected was set to 10,000, which was found
A one sided Wilcoxon-test (Wilcoxon, 1945) affirms our claim that local-shapelets are significantly faster than global-shapelets during the learning phase (p-value = 0.016 for both global-SALSA-R and the hashing-algorithm) and the classification phase (p-value = 0.016 for global-SALSA-R and p-value = 0.031 for the hashing-algorithm). On average, local-SALSA-R reduces the time required to learn a model by a factor of 120 and 440 in comparison with global- SALSA-R and the hashing-algorithm, respectively. During classification, local-SALSA-R is 180 times faster than global-SALSA-R and 110 times faster than the hashing-algorithm, on average. Table 4: Comparison of learning and classification time of local-SALSA-R with that of global-SALSA-R and the hashing-algorithm data set learning time (sec) classification time (sec) global- hashing- local- global- hashing- local- SALSA-R algorithm SALSA-R SALSA-R algorithm SALSA-R Beef 44.98 169.77 0.47 87.46 104.87 0.51 Coffee 16.15 12.81 0.26 31.61 34.06 0.47 Lighting2 164.09 539.11 1.00 144.78 86.10 0.85 Lighting7 56.51 205.17 35.39 152.61 95.97 111.30 OliveOil 59.39 114.37 0.50 65.52 45.27 0.51 Wheat 315.19 1693.48 1.17 231.46 78.71 0.42 Analytically, it is easy to argue that the longer the time series, the greater the time saved by using local-shapelets, as the number of distance calculations avoided increases. We confirmed this experimentally as can be seen in Fig. 3. The figure shows the ratio between the time required by global-SALSA-R and local-SALSA-R as a function of the length of the time series for all data sets for which the tolerance range used was 0 (all data sets apart from Lightning7). This 13 figure clearly confirms that the ratio increases with the length of the time series. We chose to compare local-SALSA-R with global-SALSA-R and not with the hashing-algorithm as their implementation apart from the aspect of locality is identical, allowing isolation of locality as the only parameter influencing the outcome. Time series length T im e s p e e d u p 0 200 400 600 800 1000 0 1 0 0 3 0 0 5 0 0 Learning time Classification time Figure 3: Each point is the ratio of the time required by global-SALSA-R and local-SALSA-R. One plot is for learning times and the second for classification times. As can be seen the ratio increases with the length of the time series. 6. Statistical analysis The approach proposed in this paper was mainly motivated by algorithmic considerations: restricting the search to a small subset of the possible shapelet locations significantly speeds up both training and classification. In this section, we will argue that as a by-product, our approach offers statistical advantages as well. By restricting the number of features, we are constraining the complexity of the hypothesis class. As we show below, hypothesis classes of low complexity require fewer training examples to attain a certain accuracy level. The argument is made precise in the language of learning theory. A learner faced with a classification task is trying to learn a function g : X → {−1,1}, where X is the instance space (in our case, it is the set of all possible time series). The learner gets to observe example-label pairs (Xi,Yi) ∈ X× {−1,1} generated iid from some unknown distribution P over X× {−1,1}. This corresponds to the intuition that the training labels may be noisy, and indeed, there may be no “correct” classifier g : X → {−1,1} that achieves perfect accuracy. 
6. Statistical analysis

The approach proposed in this paper was mainly motivated by algorithmic considerations: restricting the search to a small subset of the possible shapelet locations significantly speeds up both training and classification. In this section, we argue that, as a by-product, our approach offers statistical advantages as well. By restricting the number of features, we are constraining the complexity of the hypothesis class. As we show below, hypothesis classes of low complexity require fewer training examples to attain a certain accuracy level.

The argument is made precise in the language of learning theory. A learner faced with a classification task is trying to learn a function g : X → {−1, 1}, where X is the instance space (in our case, the set of all possible time series). The learner gets to observe example-label pairs $(X_i, Y_i) \in X \times \{-1, 1\}$ generated iid from some unknown distribution P over $X \times \{-1, 1\}$. This corresponds to the intuition that the training labels may be noisy; indeed, there may be no "correct" classifier g : X → {−1, 1} that achieves perfect accuracy. Although there are universal approximators capable of fitting arbitrary labeled samples, if unconstrained they will necessarily overfit (Devroye et al., 1996). Hence, when choosing the learning model, its richness (i.e., hypothesis complexity) must be taken into account.

The learner's n observed labeled examples $(X_i, Y_i)$ constitute the training set, based on which it will produce a hypothesis g : X → {−1, 1}. We will denote by H the collection of all admissible hypotheses (formally, $H \subseteq 2^X$) and associate with every h ∈ H two key quantities: its sample (or training) error,
$$\widehat{\mathrm{err}}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\},$$
and its generalization error,
$$\mathrm{err}(h) = \mathbb{E}\left[\mathbf{1}\{h(X) \neq Y\}\right] = P(h(X) \neq Y).$$

In words, $\widehat{\mathrm{err}}(h)$ is the relative fraction of mistakes that h makes on the training set, while $\mathrm{err}(h)$ is the probability that h makes a mistake on a freshly drawn (X, Y) pair, crucially, drawn from the same distribution used to generate the training set. Note that while the typical goal is to guarantee a small generalization error, the latter quantity cannot be computed without knowledge of the sampling distribution. Instead, the readily computable $\widehat{\mathrm{err}}(h)$ may be used (under certain conditions, detailed below) as a proxy for $\mathrm{err}(h)$.

In this setting, the learner's task is twofold: (i) algorithmic: efficiently find an h ∈ H for which $\widehat{\mathrm{err}}(h)$ is small, and (ii) statistical: guarantee that, with high probability, $\mathrm{err}(h)$ will not be much greater than $\widehat{\mathrm{err}}(h)$, regardless of which h ∈ H the learner chooses. The foregoing sections were devoted to the algorithmic aspects; we focus on the statistical one here. In the case of finite H, a particularly simple connection exists between $\mathrm{err}(h)$ and $\widehat{\mathrm{err}}(h)$:

Theorem 1 (Mohri et al. (2012)). Suppose that |H| < ∞ and the learner observes a training set consisting of n examples. Then, for any δ > 0, we have that
$$\mathrm{err}(h) \leq \widehat{\mathrm{err}}(h) + \sqrt{\frac{\log|H| + \log(1/\delta)}{2n}} \qquad (1)$$
holds with probability at least 1 − δ, uniformly over all h ∈ H.

For simplicity, let us consider the case where $H = 2^X$ (i.e., H consists of all possible binary functions). In this case, $|H| = 2^{|X|}$, and hence even a modest reduction of the instance space (by reducing the feature set, for example) can have a noticeable effect on the second term on the right-hand side of (1), and hence yield a faster convergence rate. The basic features used in this paper are distances from a shapelet to a time series. It is precisely this feature set that gets reduced when our algorithm considers only a subset of the possible locations. This observation provides a statistical justification for our local-shapelets approach, in addition to the algorithmic speedup.
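As a toy numeric illustration of the bound (the values of log |H| below are hypothetical, chosen only to show the direction of the effect), the following snippet evaluates the second term of (1):

```python
import math

def generalization_gap(log_H, n, delta=0.05):
    """Second term of bound (1): sqrt((log|H| + log(1/delta)) / (2n))."""
    return math.sqrt((log_H + math.log(1.0 / delta)) / (2.0 * n))

n = 60  # e.g., the training-set size of Lightning2 (Table 1)
print(generalization_gap(log_H=50.0, n=n))  # larger hypothesis class
print(generalization_gap(log_H=10.0, n=n))  # restricted (local) hypothesis class
```

Shrinking the hypothesis class (smaller log |H|) tightens the bound for the same training-set size.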
7. Conclusions

The objective of our investigation was to utilize the localization of characteristic patterns in spectrographic measurements within the shapelet algorithm, which has previously been found to be suited to this domain (Hills et al., 2013). Our adaptation of the shapelet algorithm reduces the number of distance calculations and thus shortens the time required to train a model and classify examples.

As pointed out by Ye and Keogh (2011a), one important advantage of the shapelet approach is its interpretability, i.e., the process extracts informative subsequences representative of each class. The chosen shapelets provide insights into patterns characteristic of each class. One such example can be seen in Fig. 4. The shapelet, shown as a dashed red line, is overlaid on each of the two time series at the location from which it was extracted and represents a pattern characteristic of class 2. In addition to identifying the discriminative pattern, local-shapelets also identify the discriminative frequencies. We compared the frequencies found to be discriminative by the shapelet with those found to be discriminative by Briandet et al. (1996) and found that they coincide.

Figure 4: Interpretability of local-shapelets (x-axis: time series index, y-axis: absorbance). Two time series from the coffee data set are presented in black; each time series is from a different class. Only part of each time series is presented so as to focus on the important details. Overlaid on each of the time series is the shapelet selected by local-SALSA-R as most discriminative, appearing as a dashed red line. As can be seen, the shapelet represents time series from class 2 (Robusta coffee beans).

The algorithm we presented searches for matches of a shapelet on a time series only in the vicinity of the location from which the shapelet was originally extracted. We proposed an algorithm (Procedure 1) for calculating the exact vicinity based on properties of each class in the data set. Our main result is that local-shapelets can indeed speed up both the learning and classification phases 100-fold when using SALSA-R, without impairing accuracy. We also show that our estimation of the vicinity to examine is quite accurate.

This research can be extended in many ways. It may be interesting to explore possible trade-offs between speedup and accuracy as a function of the vicinity examined. Another research direction is the application of local-shapelets to domains other than spectroscopy, which may also require adapting our algorithm to time series of different lengths.

References

R. Herrmann, C. Onkelinx, Quantities and units in clinical chemistry: Nebulizer and flame properties in flame emission and absorption spectrometry (Recommendations 1986), Pure and Applied Chemistry 58 (12) (1986) 1737–1742.
O. Al-Jowder, E. K. Kemsley, R. H. Wilson, Detection of adulteration in cooked meat products by mid-infrared spectroscopy, Journal of Agricultural and Food Chemistry 50 (6) (2002) 1325–1329.
R. Briandet, E. K. Kemsley, R. H. Wilson, Discrimination of Arabica and Robusta in instant coffee by Fourier transform infrared spectroscopy and chemometrics, Journal of Agricultural and Food Chemistry 44 (1) (1996) 170–174.
C. P. Bicchi, A. E. Binello, M. M. Legovich, G. M. Pellegrino, A. C. Vanni, Characterization of roasted coffee by S-HSGC and HPLC-UV and principal component analysis, Journal of Agricultural and Food Chemistry 41 (12) (1993) 2324–2328.
I. Lumley, Authenticity of meat and meat products, in: Food Authentication, Springer, 108–139, 1996.
N. Sharma, A. Srivastava, J. Gill, D. Joshi, Differentiation of meat from food animals by enzyme assay, Food Control 5 (4) (1994) 219–221.
D. R. Eads, D. Hill, S. Davis, S. J. Perkins, J. Ma, R. B. Porter, J. P. Theiler, Genetic algorithms and support vector machines for time series classification, in: International Symposium on Optical Science and Technology, International Society for Optics and Photonics, 74–85, 2002.
SCIO, http://www.consumerphysics.com/myscio/, 2014.
L. Ye, E. Keogh, Time series shapelets: a novel technique that allows accurate, interpretable and fast classification, Data Mining and Knowledge Discovery (2011a) 1–34.
J. Hills, J. Lines, E. Baranauskas, J. Mapp, A. Bagnall, Classification of time series by shapelet transformation, Data Mining and Knowledge Discovery (2013) 1–31.
A. Mueen, E. Keogh, N. Young, Logical-shapelets: an expressive primitive for time series classification, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 1154–1162, 2011.
T. Rakthanmanon, E. Keogh, Fast Shapelets: A Scalable Algorithm for Discovering Time Series Shapelets, in: Proceedings of the Thirteenth SIAM Conference on Data Mining (SDM), SIAM, 668–676, 2013.
local shapelets, ftp://www.ise.bgu.ac.il/, 2014.
D. Goldin, P. Kanellakis, On similarity queries for time-series data: Constraint specification and implementation, in: Principles and Practice of Constraint Programming CP'95, Springer, 137–153, 1995.
J. Lines, L. M. Davis, J. Hills, A. Bagnall, A shapelet transform for time series classification, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 289–297, 2012.
D. Gordon, D. Hendler, L. Rokach, Fast and Space-Efficient Shapelets-Based Time-Series Classification, Intelligent Data Analysis 19 (5), in press.
Z. Xing, J. Pei, S. Y. Philip, K. Wang, Extracting Interpretable Features for Early Classification on Time Series, in: Eleventh SIAM International Conference on Data Mining (SDM), SIAM, 247–258, 2011.
K. R. Moore, P. C. Blain, S. D. Briles, R. G. Jones, Classification of RF transients in space using digital signal processing and neural network techniques, in: SPIE's 1995 Symposium on OE/Aerospace Sensing and Dual Use Photonics, International Society for Optics and Photonics, 995–1006, 1995.
L. Ye, E. Keogh, shapelet data sets, http://alumni.cs.ucr.edu/~lexiangy/shapelet.html, 2011b.
E. Keogh, Q. Zhu, B. Hu, H. Y., X. Xi, L. Wei, C. A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage, www.cs.ucr.edu/~eamonn/time_series_data/, 2014.
H. S. Tapp, M. Defernez, E. K. Kemsley, FTIR spectroscopy and multivariate analysis can distinguish the geographic origin of extra virgin olive oils, Journal of Agricultural and Food Chemistry 51 (21) (2003) 6110–6115.
M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (200) (1937) 675–701.
F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1 (6) (1945) 80–83.
L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, vol. 31 of Applications of Mathematics (New York), Springer-Verlag, New York, ISBN 0-387-94618-7, 1996.
M. Mohri, A. Rostamizadeh, A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012.