Connection Science
2010, 1–21, iFirst
DOI: 10.1080/09540090903398039

Sentence-processing in echo state networks: a qualitative analysis by finite state machine extraction

Stefan L. Frankᵃ* and Henrik Jacobssonᵇ†

ᵃInstitute for Logic, Language and Computation, University of Amsterdam, P.O. Box 94242, 1090 GE, Amsterdam, The Netherlands; ᵇGerman Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany

(Received 31 July 2009; final version received 31 August 2009)

It has been shown that the ability of echo state networks (ESNs) to generalise in a sentence-processing task can be increased by adjusting their input connection weights to the training data. We present a qualitative analysis of the effect of such weight adjustment on an ESN that is trained to perform the next-word prediction task. Our analysis makes use of CrySSMEx, an algorithm for extracting finite state machines (FSMs) from data about the inputs, internal states, and outputs of recurrent neural networks that process symbol sequences. We find that the ESN with adjusted input weights yields a concise and comprehensible FSM. In contrast, the standard ESN, which shows poor generalisation, results in a massive and complex FSM. The extracted FSMs show how the two networks differ behaviourally.
Moreover, poor generalisation is shown to correspond to a highly fragmented quantisation of the network's state space. Such findings indicate that CrySSMEx can be a useful tool for analysing ESN sentence processing.

Keywords: echo state networks; sentence processing; rule extraction; finite state machines

*Corresponding author. Email: S.L.Frank@uva.nl
†Present address: Google, Zurich, Switzerland (since October 2008).

1. Introduction

Echo state networks (ESNs; Jaeger 2001, 2003) have recently gained popularity as a recurrent neural network (RNN) model for time-series processing. The great advantage of ESNs over more traditional recurrent networks, such as the simple recurrent network (SRN; Elman 1990), is that the weights of their input and recurrent connections remain fixed at their initial random values. Only the output connection weights are trained, which can be done very efficiently by linear regression, without the need for any iterative gradient-descent method (such as the well-known backpropagation algorithm) for finding proper weights.

Although ESNs have been shown to be useful for several applications, attempts to use them for modelling sentence processing in the field of cognitive science have resulted in mixed outcomes. Whereas one experiment (Tong, Bickett, Christiansen, and Cottrell 2007) showed ESNs to perform at least as well as SRNs in a next-word prediction task, others (Čerňanský and Tiňo 2007; Frank 2006a) have found ESNs to generalise relatively poorly. One way of increasing an ESN's ability to generalise is by adding a hidden layer in between the recurrent and output layers (Frank 2006a, b). Unfortunately, this also adds another set of connection weights to train, doing away with the network's efficient trainability. Alternatively, ESN generalisation can be improved by adjusting the input connection weights to the training input. As shown by Čerňanský, Makula, and Beňušková (2008) and Frank and Čerňanský (2008), this can be done efficiently by one-shot, unsupervised learning. The resulting network can then be trained on the target output just like any ESN, but with a greatly increased ability to generalise.

In this paper, we analyse the difference between two types of networks: the standard ESN with random input weights and the alternative network with adjusted input weights mentioned above. Following Frank and Čerňanský (2008), we call the latter model ESN+. Both networks are trained on the next-word prediction task in sentence processing. Comparisons of the networks' performance on sentences that differ markedly from the training sentences, reported by Frank and Čerňanský, showed that ESN+ generalises much better than does ESN. Here, we extend this quantitative comparison with a qualitative analysis of the networks' behaviour and of the structure of their state spaces. This is done by applying the CrySSMEx algorithm (Jacobsson 2006a) for extracting finite state machines (FSMs) from the networks' input symbols, recurrent-layer states, and output symbols. In this way, the difference between ESN and ESN+ becomes apparent not only from their performance levels, but also from the extracted FSMs, which allow for a more detailed look into the behaviour of the networks.
Moreover, the quantisation of the networks' state spaces into FSM states, as computed by CrySSMEx, provides insight into the underlying cause of the networks' divergent abilities to generalise.

The next section describes the CrySSMEx algorithm. Following this, we present details of our simulations: the language on which the networks were trained and tested, the architecture of the networks, and the difference between ESN and ESN+. Section 4 shows the FSMs that were extracted from the two networks, as well as the corresponding state-space quantisations, at least to the extent that this turned out to be feasible. These results are interpreted and discussed in Section 5.

2. Analysing recurrent neural networks

A fundamental problem for qualitative analyses of RNNs is that even small networks can form complex dynamical systems. Given a single problem instance and a fixed network topology, a wide range of unique behaviours may result from training the network. Understanding each network's idiosyncratic behaviour can take a lot of effort. Alternatively, a statistical analysis (i.e., over populations of RNNs) could be performed. Such an analysis is usually not preferred, however, because the networks' individual characteristics are typically lost.

Countless analysis methods have been developed in parallel with new RNN algorithms and architectures (for a brief list, see Jacobsson 2005). In most cases, trained network instances are analysed on the basis of their hidden activations (e.g., Elman 1990). However, the analysis often requires deep, or even complete, knowledge of the domain itself. As a typical and excellent example of this, Bodén and Wiles (2000) used several tools to analyse, quantify, and group qualitatively different RNNs. The networks were trained on context-free and context-sensitive formal languages. Since the RNNs were fed long monotonous sequences, their behaviours were largely governed by the dynamics near the fixed points of the state space. The analysis took advantage of this by using the properties of the eigenvalues of the Jacobian matrix near the fixed point as a basic taxonomy. The analysis was complemented with state-space plots, which was only feasible because the number of state nodes was limited to two. Although the results were interesting (showing previously unknown regularities in the dynamics of the RNNs), they could only be obtained because the authors defined the domain precisely and (presumably) knew what they were looking for. Moreover, the network's dynamics were visually accessible.

2.1. Rule extraction from recurrent neural networks

As discussed above, many researchers employ an analysis method that is tailored to the specific network and domain under investigation. This form of ad hoc analysis is perhaps the most common, and may also be necessary to uncover the specifics of the solutions implemented by RNNs; solutions that may very well be counterintuitive and interesting in many ways. However, one class of analysis methods, rule extraction, represents a broader and more generic approach to the problem of neural network analysis and has become a research field in its own right (Andrews, Diederich, and Tickle 1995; Jacobsson 2005). These techniques are typically developed to be portable and as widely applicable as possible (i.e., independent of network type and domain).
The quality of a rule extractor is typically measured by three criteria (adapted from Andrews et al. 1995): rule accuracy, in terms of how well the rules perform on the test set; rule fidelity, that is, how well the rules mimic the behaviour of the RNN; and rule comprehensibility, roughly corresponding to the readability of the extracted rules and/or the size of the rule set.

Existing techniques for extracting rules (represented as FSMs) from RNNs were surveyed by Jacobsson (2005). It turned out that, despite a great deal of diversity, all algorithms have four constituents in common:

(1) Quantisation of the continuous state space of the RNN (e.g., by self-organising maps or k-means), resulting in a discrete set of states.
(2) Generation of internal states and outputs by feeding input patterns to the RNN.
(3) Construction of rules from the observed state transitions, usually resulting in deterministic automata.
(4) Minimisation of the rule set, using some standard algorithm (see Hopcroft and Ullman 1979).

As argued by Jacobsson (2005), none of the existing techniques attempted to merge these four parts in any principled manner. For example, the surveyed techniques implemented the state-space quantisation through standard techniques, completely independently of machine minimisation, even though many clustering algorithms are based on the merging of clusters that are equivalent (Mirkin 1996), in a manner much like the merging of computationally equivalent states by FSM minimisation algorithms. Moreover, rule-extraction techniques could typically only be successfully applied to low-dimensional networks (up to three dimensions). In fact, basic plots of the state space were often more informative than the extracted machines themselves.
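As an illustration of these four constituents, the following sketch reads a (possibly non-deterministic) automaton off a recorded run of a trained RNN. It is not any of the surveyed algorithms: the quantiser is an off-the-shelf k-means, the minimisation step is only indicated, and all function and variable names are our own.

import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans  # an off-the-shelf quantiser, as in constituent (1)

def extract_fsm(inputs, states, outputs, n_states=8):
    """Read a finite state machine off a recorded RNN run (inputs/outputs are symbol
    sequences, states the corresponding state vectors; constituent (2) is assumed done)."""
    # (1) Quantise the continuous state space into a discrete set of states.
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(np.asarray(states))
    # (3) Construct rules from the observed state transitions.
    transitions = defaultdict(set)   # (state, next input symbol) -> possible next states
    emissions = defaultdict(set)     # state -> output symbols observed in that state
    for t in range(len(inputs) - 1):
        emissions[labels[t]].add(outputs[t])
        transitions[(labels[t], inputs[t + 1])].add(labels[t + 1])
    # (4) Minimisation (merging computationally equivalent states) would follow here.
    return transitions, emissions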
2.2. CrySSMEx

A basic problem when using standard clustering algorithms for quantising the state space of an RNN, or of any other dynamical system, is that the state space should not primarily be treated as a Euclidean space: since the state of an RNN governs its current and future behaviour, small distances in the state space may have a large impact on the functioning of the network, whereas large state-space distances may affect the network only minimally. The algorithm we use here, CrySSMEx (the Crystallising Substochastic Sequential Machine Extractor)¹ (Jacobsson 2006a), does not suffer from this problem because it takes into account the dynamical rather than the static geometric properties of the state space.

2.2.1. Finite state machine extraction

The CrySSMEx algorithm (outlined in Algorithm 1) is based on the combination of the above-mentioned four constituents. As an RNN operates on training or test data, CrySSMEx extracts a probabilistic FSM given the network's inputs, internal states, and outputs. These data (denoted Ω in Algorithm 1) are sampled while the network processes sequences of inputs, as it does during training or testing. Starting with a single-state FSM, CrySSMEx iteratively adjusts the FSM by splitting and merging states (i.e., changing the quantisation of the RNN's state space), until the FSM is fully deterministic.

CrySSMEx(Ω)
Input: Time-series data Ω, containing the sequence of input symbols, output symbols, and state vectors generated by the RNN when processing test data
Output: A deterministic machine M mimicking the RNN (if successful)
begin
    Let M be the stochastic machine based on Ω resulting from an unquantised state space (i.e., only one state);
    repeat
        Select state vectors from Ω corresponding to indeterministic states in M;
        Update the state quantiser by splitting the RNN state space according to the selected data;
        Create M using the new state quantiser;
        if M has equivalent states then
            Merge equivalent states;
        end
    until M is deterministic (success) or the number of iterations exceeds a predefined limit;
    return M;
end

Algorithm 1. A simplified description of the main loop of CrySSMEx.

M is created from the sequence of observed RNN inputs, outputs, and states contained in Ω by quantisation of the state space. This quantisation is iteratively refined, resulting in a sequence of increasingly accurate FSMs being extracted. In each iteration, the current FSM's 'weakest' points, which are its most non-deterministic states, are targeted for improvement by actively selecting the corresponding RNN states as the data to be used for refining the quantisation. The algorithm also keeps the extracted model minimal by merging regions in the state space that correspond to states that are computationally equivalent in the FSM.

It was shown (Jacobsson 2006a; Jacobsson, Frank, and Federici 2007) that CrySSMEx can efficiently extract probabilistic (or, in favourable circumstances, deterministic) FSMs from RNNs trained on deeply embedded grammars (a^n b^n), high-dimensional RNNs (10^3 internal nodes), ESNs, and chaotic systems. Moreover, the algorithm was successfully adapted to extract either Moore- or Mealy-type machines (Jacobsson et al. 2007). In a Mealy machine, the output of the network and the extracted machine is defined over transitions between states. In a Moore machine, on the other hand, the output is defined as a function of the current state and input (Hopcroft and Ullman 1979). Depending on the domain and network, either Moore or Mealy models will be more compact and/or comprehensible. For the current application, we chose to extract Moore machines, since these correspond more closely to the ESNs, whose output is indeed a function of the current state and input.

2.2.2. State-space quantisation

To merge and split the state space, CrySSMEx creates a number of simple vector quantisers, arranged hierarchically in a graph (Figure 8 in Section 4.3 shows an example). Each split corresponds to a node in the graph, describing how the state space is separated into increasingly smaller regions. The algorithm that creates this arrangement of quantisers was called the Crystalline Vector Quantiser (CVQ) in Jacobsson (2006a). Roughly speaking, it generates the so-called CVQ graph by creating enough simple vector quantisers to split the state space properly. Consequently, the number of required quantisers is inversely proportional to the quality of their operation on the data set under investigation. The vector quantiser in Jacobsson (2006a) was chosen to be simple and parameter-free, showing that the extracted machines were not impaired by an inappropriate choice of underlying quantisers. However, this quantiser is very sensitive to outliers or skewed distributions, typically making the resulting CVQ graph larger than would be implied by the number of iterations of CrySSMEx alone.
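The split/merge loop of Algorithm 1 can be mimicked in a few lines of Python. The sketch below is not the CrySSMEx implementation (see Note 1 for that): the CVQ is replaced by a flat list of region centroids, non-deterministic regions are simply split in two, and no equivalence merging is performed.

import numpy as np

def refine_until_deterministic(inputs, states, outputs, max_iter=20):
    """Crude stand-in for the Algorithm 1 loop: refine a state-space quantisation until the
    induced machine is deterministic, or give up after max_iter iterations."""
    X = np.asarray(states, dtype=float)
    centroids = [X.mean(axis=0)]                 # one region = a single-state machine
    for _ in range(max_iter):
        dists = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (state, next input) -> set of observed (next state, next output) pairs
        machine = {}
        for t in range(len(inputs) - 1):
            key = (int(labels[t]), inputs[t + 1])
            machine.setdefault(key, set()).add((int(labels[t + 1]), outputs[t + 1]))
        bad = {s for (s, _), successors in machine.items() if len(successors) > 1}
        if not bad:
            return machine, centroids            # fully deterministic: success
        for s in bad:                            # split each 'weak' region into two halves
            pts = X[labels == s]
            lo, hi = pts.min(axis=0), pts.max(axis=0)
            centroids[s] = lo + 0.25 * (hi - lo)
            centroids.append(lo + 0.75 * (hi - lo))
    return machine, centroids                    # stochastic approximation if not converged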
2.2.3. Scaling properties

Prior to CrySSMEx, rule extraction from RNNs had only been applied to networks with just a handful of internal units. Whether rule extraction is feasible depends of course on the network's size, but even more strongly on its dynamics: a chaotic network with just one unit requires more FSM states than a very large but non-chaotic network (for a deeper discussion of this issue, see Jacobsson 2006a). CrySSMEx can easily handle much larger and more chaotic networks because it generates a sequence of stochastic FSMs that form increasingly precise approximations of the network's behaviour. If obtaining a full (i.e., deterministic) FSM description of the network is not feasible, the algorithm can be stopped after any iteration, yielding an FSM that may at least be sufficiently complete. Moreover, since CrySSMEx currently (and quite deliberately) uses very simple vector quantisation for analysing the networks' state space, it is reasonable to assume that the algorithm's scaling properties could be further improved by using optimised algorithms (e.g., support vector machines). However, an investigation of this issue is beyond the scope of the current paper.

It is less clear how well CrySSMEx scales up when the number of input and output symbols increases. The problem is essentially one of data scarcity: in every state there will be many possible input symbols for which a prediction needs to be modelled. How to deal with this is an open issue, but as argued elsewhere (Jacobsson 2006b), one could try an active learning approach in which the extraction algorithm interacts with the network directly to gather more data when needed.

3. Method

This section presents the methodological details of our simulations. First, in Section 3.1, we present the language processed by the networks, as well as the specific division into two distinct sets of sentences: those used for training and those used for testing. As explained in Section 3.2, two ESNs were trained to predict the upcoming word at each point in a sentence. That section also explains how prediction performance is assessed, and how the networks' outputs are discretised for CrySSMEx analysis.

3.1. The language

3.1.1. Lexicon

The semi-natural language used by Frank and Čerňanský (2008) has a lexicon of 26 words, containing 12 plural nouns (e.g., girls, boys), 10 transitive verbs (e.g., chase, see), two prepositions (from and with), one relative clause marker (that), and an end-of-sentence marker denoted [end], which is also considered a word. The nouns are divided into three groups: female nouns (e.g., women), male nouns (e.g., men), and animal nouns (e.g., bats). As explained below, this distinction is only relevant for distinguishing between training and test sentences. Since the language has only syntax and no semantics, the names of words within each syntactic category are irrelevant and are only provided to make the sentences more readable.

3.1.2. Grammar

Sentences are generated by the grammar in Table 1.
The simplest sentences, such as mice love cats [end], are just four words long, but longer sentences can be constructed by adding one or more prepositional phrases or relative clauses. Relative clauses come in two types: subject-relative clauses (SRCs, as in mice that love cats ...) and object-relative clauses (ORCs, as in mice that cats love ...). Since SRCs can themselves contain a relative clause (as in mice that love cats that dogs avoid [end]), there is no upper bound on sentence length.

Table 1. Probabilistic context-free grammar of the language.

S     → NPsubj V NPobj [end]
NPr   → Nr (0.7) | Nr SRC (0.06) | Nr ORC (0.09) | Nr PPr (0.15)
SRC   → that V NPobj
ORC   → that Nsubj V
PPr   → from NPr | with NPr
Nr    → Nfem | Nmale | Nanim
Nfem  → women | girls | sisters
Nmale → men | boys | brothers
Nanim → bats | giraffes | elephants | dogs | cats | mice
V     → chase | see | swing | love | avoid | follow | hate | hit | eat | like

Note: The variable r denotes a noun's grammatical role (subject or object). The probabilities of different productions are equal, except for NP, where they are given in parentheses.

3.1.3. Training and test sentences

We adopted not only the language used by Frank and Čerňanský (2008), but also their division of sentences into subsets used for training and testing. The common approach is to set aside a random sample of possible inputs, and train only on the remaining sentences. However, Frank and Čerňanský's specific goal was to investigate whether ESNs can generalise to sentences that contain words occurring at positions they did not occupy during training.² Therefore, specific groups of sentences were withheld during training, such that some nouns only occurred in particular grammatical roles: male nouns were banned from the subject position and female nouns did not occur in the object position (Table 1 shows which positions are considered to hold subject nouns and which hold object nouns). The resulting training set consisted of 5000 sentences, with an average length of 5.93 words.

Frank and Čerňanský (2008) tested networks on eight different types of sentences, but we will restrict our qualitative analysis to just one of these. The test set consisted of all 2700 sentences with the structure 'Nmale V Nfem that Nmale V [end]'. That is, all test sentences had one ORC, which modified the second noun. Note that animal nouns do not occur in test sentences, and that the roles of male and female nouns are reversed compared with training sentences: male nouns are in the subject position whereas female nouns are objects. This means that all test sentences differ quite strongly from the training sentences. The first word of a test sentence is always a male noun, which never occurred in the sentence-initial position during training. This makes generalisation to test sentences much more challenging than would have been the case if training sentences had been selected by an unrestricted random sample.
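For concreteness, the grammar of Table 1 and the restriction on noun roles can be turned into a small sentence generator. The sketch below is ours, not the authors' generation procedure; in particular, it imposes the role restriction directly while sampling, whereas the paper only states which sentence types were withheld from training.

import random

N_FEM = ['women', 'girls', 'sisters']
N_MALE = ['men', 'boys', 'brothers']
N_ANIM = ['bats', 'giraffes', 'elephants', 'dogs', 'cats', 'mice']
VERBS = ['chase', 'see', 'swing', 'love', 'avoid', 'follow', 'hate', 'hit', 'eat', 'like']

def noun(role, training=True):
    # In training sentences, male nouns never occur as subjects and female nouns never as objects.
    if not training:
        allowed = N_FEM + N_MALE + N_ANIM
    elif role == 'subj':
        allowed = N_FEM + N_ANIM
    else:
        allowed = N_MALE + N_ANIM
    return [random.choice(allowed)]

def NP(role, training=True):
    head, r = noun(role, training), random.random()
    if r < 0.70:                                                           # Nr        (0.70)
        return head
    if r < 0.76:                                                           # Nr SRC    (0.06)
        return head + ['that', random.choice(VERBS)] + NP('obj', training)
    if r < 0.85:                                                           # Nr ORC    (0.09)
        return head + ['that'] + noun('subj', training) + [random.choice(VERBS)]
    return head + [random.choice(['from', 'with'])] + NP(role, training)   # Nr PPr    (0.15)

def sentence(training=True):
    """S -> NPsubj V NPobj [end]"""
    return NP('subj', training) + [random.choice(VERBS)] + NP('obj', training) + ['[end]']

print(' '.join(sentence()))   # e.g. 'girls see mice that dogs chase [end]'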
3.2. The echo state networks

3.2.1. Architecture

We investigated the behaviour of two ESNs with identical architectures. As shown in Figure 1, words are represented locally on the 26-unit input layer, which feeds activation to a recurrent layer, called the 'dynamical reservoir' (DR) as is common in the ESN literature. The DR also receives its own activation from the previous time step and feeds activation to the output layer. The output layer, like the input layer, has 26 units, each representing one word.

Figure 1. Architecture of ESN(+). Weights of connections between DR units (W_dr; solid arrow) are random and fixed; weights of connections from DR to output units (W_out; dotted arrow) are adapted to training data and task; weights of connections from input to DR units (W_in; dashed arrow) are random and fixed in ESN, but adapted to training data in ESN+.

Input. If word i forms the input to the ESN at time step t, the input activation vector a_in(t) = (a_1, ..., a_26) has a_i(t) = 1 and a_j(t) = 0 for all j ≠ i. The weights of connections from input units to the DR are collected in the 100 × 26 input weight matrix W_in. In most ESNs, these weights are chosen randomly, but here we use a different approach in which the input weights depend on the training data. In the ESN+ model, most input weights are 0, but for DR units j = 1, ..., 26, the weight of the connection from input unit i to DR unit j equals

w_{j,i} = 2N × (N(i,j) + N(j,i)) / (N(i) N(j)),

where N is the number of word tokens in the training data, N(i) is the number of times word i occurs in the training data, and N(i,j) is the number of times word j directly follows word i. We adopted this particular approach from Frank and Čerňanský (2008), who used it because it was found by Bullinaria and Levy (2007) to yield word representations that could be successfully applied to several syntactic and semantic tasks. Moreover, it is extremely simple, allowing for fast computation of the input weights.

Choosing W_in in this manner results in the encoding of syntactic category information in the input weights. If we take column vector i of W_in (i.e., the weights of connections emanating from input unit i) to represent word i, it turns out that representations of words of the same category cluster together. As Figure 2 shows, there is a clear separation between nouns, verbs, prepositions, that, and [end]. In spite of the distinctive roles of female, male, and animal nouns in the training sentences, these three groups of nouns are more similar to one another than to other syntactic categories.³ This increases ESN+ performance on test sentences because words within a syntactic category are equivalent according to the grammar of Table 1 (see Frank and Čerňanský 2008, for a comprehensive discussion).

Figure 2. Hierarchical cluster analysis of word representation vectors (i.e., input weights).

ESN+ is compared with a standard ESN, that is, one with random input weights. However, a fair comparison between the two networks requires that the ESN's input weights are at least distributed similarly to those of ESN+. For this reason, the ESN's input weights w_{j,i} (for j = 1, ..., 26) are a random permutation of the corresponding weights of ESN+. For j > 26, the ESN's input weights are 0, as in ESN+. Note that this does away with the syntactic information present in W_in, while making sure that the distribution of input weights is the same for ESN and ESN+.
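A sketch of how the two input weight matrices might be constructed from a training corpus of word indices. This is our own rendering of the description above: it uses the co-occurrence formula as printed, assumes every word occurs at least once in the corpus, and permutes the whole 26 × 26 weight block for the standard ESN, since the paper does not specify the exact permutation scheme.

import numpy as np

def esn_plus_input_weights(token_ids, n_words=26, n_reservoir=100):
    """W_in for ESN+: the first 26 rows follow the bigram-based formula above; the remaining
    DR units receive no input. token_ids is the training corpus as word indices (0..25)."""
    token_ids = np.asarray(token_ids)
    N = len(token_ids)
    unigram = np.bincount(token_ids, minlength=n_words).astype(float)   # N(i)
    bigram = np.zeros((n_words, n_words))                               # N(i, j): j follows i
    for a, b in zip(token_ids[:-1], token_ids[1:]):
        bigram[a, b] += 1
    W_in = np.zeros((n_reservoir, n_words))
    # symmetric in i and j, so no transpose is needed for the [j, i] orientation
    W_in[:n_words, :] = 2 * N * (bigram + bigram.T) / np.outer(unigram, unigram)
    return W_in

def esn_input_weights(W_in_plus, seed=0):
    """W_in for the standard ESN: the same weights, randomly permuted within the 26 x 26
    block, which destroys the syntactic information but keeps the weight distribution."""
    rng = np.random.default_rng(seed)
    W = W_in_plus.copy()
    W[:26, :] = rng.permutation(W[:26, :].ravel()).reshape(26, 26)
    return W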
Dynamical reservoir. The two networks, ESN and ESN+, have identical DRs. The DR has 100 units, connected to each other with weights collected in the 100 × 100 matrix W_dr. The DR is sparsely connected in that 85% of the values in W_dr are 0. All other values are taken randomly from a uniform distribution centred at 0, after which they are linearly rescaled such that the spectral radius of W_dr (i.e., its largest eigenvalue) equals 0.95.

The DR's activation vector at time step t, denoted a_dr(t) ∈ [0, 1]^100, is computed at each time step by

a_dr(t) = f_dr(W_dr a_dr(t − 1) + W_in a_in(t)),     (1)

where a_dr(t − 1) is the DR state at the previous time step (with a_dr(0) = 0.5 at the beginning of each sentence) and f_dr is the logistic function.

Output. The weights of connections from DR units to the 26 output units are collected in the 26 × 100 matrix W_out. Each output unit i also receives a bias activation b_i. The output at time step t equals

a_out(t) = f_out(W_out a_dr(t) + b),     (2)

where b is the vector of bias activations and f_out is the softmax function:

f_{i,out}(x) = e^{x_i} / Σ_j e^{x_j},     (3)

due to which the outputs are positive and sum to 1, making a_out a probability distribution. That is, the output value a_{out,i}(t) is the network's estimate of the probability that word i will be the input at time step t + 1.

3.2.2. Network training

As explained by Jaeger (2001), useful output connection weights and biases, W_out and b, are easy to find without any iterative training method. Ideally, a 26 × (N − 1) target matrix U = (u(1), u(2), ..., u(N − 1)) is constructed, where each column vector u(t) has a value of 1 at the single element corresponding to the input word at t + 1, and all other elements are 0. That is, the vector u(t) forms the correct prediction of the input at t + 1. Next, the complete training sequence (excluding the last word) is run through the DR, according to Equation (1). The resulting vectors a_dr(t) are collected in a 101 × (N − 1) matrix A. The connection weight matrix and bias vector are now computed by

W_out = f_out^{−1}(U) A^+,     b = f_out^{−1}(U) 1,     (4)

where A^+ is the pseudoinverse of A, and 1 is an (N − 1)-element column vector consisting of 1s.

An obvious problem with this method is that the softmax function (Equation (3)) does not have an inverse f_out^{−1}, and even if it did, the values 0 and 1 would lie outside the inverse's domain. In the past, this problem was avoided either by applying a less efficient training method that does not involve f_out^{−1} (Frank 2006a, b), or by taking f_out to be the identity function. In the latter case, the resulting a_out does not form a probability distribution, so an additional transformation is needed to remove negative values and make sure the activations sum to 1 (Čerňanský and Tiňo 2007, 2008; Čerňanský et al. 2008; Frank and Čerňanský 2008).

Here, we do take f_out to be the softmax function and train the network using Equation (4). This is possible because, although f_out has no inverse in general, a proper f_out^{−1}(u) does exist for particular target vectors u. In our case, each u = (u_1, ..., u_n)^T has one element u_c arbitrarily close to 1, whereas all other elements have an arbitrarily small but positive value ε < n^{−1} (where n is the number of words in the language, that is, n = 26 here). As derived in Appendix 1, the inverse softmax in that case equals

f_{i,out}^{−1}(u) = ln(ε^{−1} − n + 1) if i = c, and f_{i,out}^{−1}(u) = 0 if i ≠ c.

We take ε = 0.01, making f_{c,out}^{−1}(u) = ln(75) ≈ 4.317.
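The procedure of Equations (1)–(4) is compact enough to sketch in full. The code below is our reading of it, not the authors' implementation: the reservoir helper and the handling of sentence boundaries (resetting the DR to 0.5) are assumptions based on the description above, and the sketch uses the raw 100-dimensional DR states rather than the 101-row matrix A mentioned in the text.

import numpy as np

def make_reservoir(n=100, density=0.15, spectral_radius=0.95, seed=1):
    """Sparse random W_dr, rescaled so that its spectral radius (largest absolute
    eigenvalue) equals 0.95, as described in Section 3.2.1."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (n, n)) * (rng.random((n, n)) < density)
    return W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_reservoir(W_dr, W_in, word_ids, n_words=26, end_id=25):
    """Equation (1): collect the DR states for a word sequence, resetting a_dr to 0.5
    after each [end] marker (sentence boundary)."""
    a = np.full(W_dr.shape[0], 0.5)
    states = []
    for w in word_ids:
        a_in = np.zeros(n_words)
        a_in[w] = 1.0                                   # localist word coding
        a = logistic(W_dr @ a + W_in @ a_in)
        states.append(a.copy())
        if w == end_id:
            a = np.full(W_dr.shape[0], 0.5)
    return np.column_stack(states)                      # column t is a_dr(t)

def train_readout(A, next_word_ids, n_words=26, eps=0.01):
    """Equation (4): W_out = f_out^{-1}(U) A+ and b = f_out^{-1}(U) 1, where column t of
    f_out^{-1}(U) is ln(eps^{-1} - n + 1) at the correct next word and 0 elsewhere."""
    T = A.shape[1]
    F = np.zeros((n_words, T))
    F[np.asarray(next_word_ids), np.arange(T)] = np.log(1.0 / eps - n_words + 1)  # ln(75) here
    return F @ np.linalg.pinv(A), F @ np.ones(T)

def predict(W_out, b, a_dr):
    """Equations (2) and (3): softmax readout, a probability distribution over the 26 words."""
    x = W_out @ a_dr + b
    e = np.exp(x - x.max())
    return e / e.sum()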
3.2.3. Network analysis

Prediction performance. Since the grammar of the language is known, the true probability of occurrence of each word at each point in the sentence can be computed. The prediction performance of the networks is rated by comparing their output vectors, which can be interpreted as probability distributions over words, to the true distributions according to the grammar of Table 1. More precisely, we take the Kullback–Leibler divergence from the output distribution to the true distribution as a measure of the network's prediction error.

Extracting FSMs. Extracting an FSM from an RNN requires symbolic input and output, since an FSM's transitions and states come with a finite set of unique labels. The network inputs in our simulations represent words, so denoting them by symbols is straightforward. The networks' outputs, however, are not symbols but probability distributions. These continuous-valued vectors need to be discretised into symbols before CrySSMEx can be applied.

The particular choice of output symbols is very important for the analysis. In the extreme case that all network outputs are given the same label, the resulting machine will have only one state and be completely uninformative. In contrast, choosing an output discretisation that is too fine-grained yields 'bloated' and incomprehensible FSMs. Therefore, the granularity of the discretisation should be fine enough for the symbols to be meaningful, but no finer than required for the issue under investigation. Here, we make use of the fact that all words within each of the five syntactic categories (noun, verb, preposition, that, and [end]) are equivalent according to the language's grammar (Table 1). This means that syntactic categories are just as meaningful as words, making a discretisation of output vectors into individual words overly fine-grained. We therefore turn these vectors into symbols corresponding to the five syntactic categories. This is done by summing the activations of all output units representing words from the same category. The category with the highest total activation (i.e., the most probable one) is considered to be the output symbol of the network.

Summing over the words of a category does not give an unfair advantage to categories with many words (such as the nouns) over those with few words (such as the single-word category 'relative clause marker'). This is because, if some noun is possible according to the grammar, any noun is possible. Therefore, the network needs to spread the total probability mass for 'nouns' over all individual nouns, whereas the total probability of the 'relative clause marker' goes to the single output unit representing the word that.
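Both the error measure and the output discretisation can be written down directly. In the sketch below, the word-to-category mapping and the output-unit ordering are our own assumptions, and so is the direction of the Kullback–Leibler divergence, since the wording above can be read either way.

import numpy as np

# Assumed output-unit ordering and word-to-category mapping (word lists as in Table 1).
WORDS = ['women', 'girls', 'sisters', 'men', 'boys', 'brothers',
         'bats', 'giraffes', 'elephants', 'dogs', 'cats', 'mice',
         'chase', 'see', 'swing', 'love', 'avoid', 'follow', 'hate', 'hit', 'eat', 'like',
         'from', 'with', 'that', '[end]']
CATEGORY = dict([(w, 'N') for w in WORDS[:12]] + [(w, 'V') for w in WORDS[12:22]] +
                [('from', 'prep'), ('with', 'prep'), ('that', 'that'), ('[end]', '[end]')])

def kl_divergence(p_true, q_output, tiny=1e-12):
    """Prediction error: KL divergence between the true next-word distribution (from the
    grammar) and the network's output distribution."""
    p, q = np.asarray(p_true), np.asarray(q_output) + tiny
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def output_symbol(a_out):
    """Discretise an output vector into the syntactic category with the largest summed activation."""
    totals = {}
    for w, activation in zip(WORDS, a_out):
        totals[CATEGORY[w]] = totals.get(CATEGORY[w], 0.0) + activation
    return max(totals, key=totals.get)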
4. Results

4.1. Network performance

The networks' prediction error on test sentences is shown in the left panel of Figure 3. As expected, ESN+ does better than the standard ESN. This is true in particular at the sentence-final verb, that is, when having to predict [end].

Figure 3. Prediction error of ESN and ESN+ at each word of test sentences (left) and training sentences with the same structure (right). Results are averaged over sentences; the ones shown on the x-axes are just examples.

Eleven of the 5000 training sentences are of the form 'Nfem V Nmale that Nfem V [end]', that is, they have the same structure as the test sentences. The right panel of Figure 3 shows the networks' performance on these training sentences. Again, ESN+ does better than ESN, but the difference is much smaller than it was for test sentences. Interestingly, ESN+ performs nearly as well on test sentences as on training sentences, indicating that it has reached the highest level of generalisation that might be expected. The standard ESN, on the other hand, performs badly at several points of test sentences compared with the same points in training sentences.

4.2. Extracted FSMs

The quantitative findings presented above are not particularly surprising, as they basically form a replication of Frank and Čerňanský (2008). The contribution of this paper lies in the qualitative CrySSMEx analysis of the two trained networks and of the difference between them. Both networks were given all 2700 test sentences as well as all 2700 sentences of the form 'Nfem V Nmale that Nfem V [end]'. These latter sentences could have been in the training set but, as mentioned above, only 11 of them were. During processing of the sentences, the networks' inputs, internal states, and discretised outputs were recorded. Next, CrySSMEx was applied to these data, for ESN and ESN+ separately.

4.2.1. ESN+

Figure 4 shows the FSM that CrySSMEx extracted from the ESN+ data after three iterations. This FSM is fully deterministic, indicating that CrySSMEx converged at this iteration. The FSM has six states, indicated by the circles numbered 0–5. The machine moves from one state to the next when it receives as input one of the words in the box connecting the two states. At the beginning of a sentence, the FSM must be in State 0, since this is the only state that is entered when receiving the input symbol [end]. Moreover, there is no other way to enter this state, showing that it does not occur at any other point in the sentence. When in this state, the network predicts the following input to be a noun, and indeed, all sentences of the language start with a noun, moving the FSM into State 1. Here, it predicts the next word to be a verb, irrespective of whether the input was a male or a female noun. This is remarkable because in the network's training data, a verb can only follow a male noun in some SRC constructions, such as mice that like men chase cats [end]. Only 98 times in the 5000 training sentences did a verb directly follow a male noun, and never at the beginning of a sentence.

Figure 4. Final finite state machine extracted from ESN+. Circles denote network states. The symbol inside a circle is the output symbol (syntactic category) generated from that state. Words in boxes are the input symbols (words) that move the automaton from one state to the next.

The correct prediction of a verb in State 1 must therefore result from the previous input being a 'subject' and not just a 'male noun'. That is, ESN+ is more sensitive to the grammatical role of the noun than to its identity. Otherwise, it would not have predicted a verb to come next. In fact, the FSM of Figure 4 makes no distinction whatsoever between male and female nouns: whenever a noun allows for a particular state transition, all nouns do. This means that all nouns are equivalent to the FSM, even though male and female nouns are restricted to particular grammatical roles in training sentences. In other words, the network from which the FSM was extracted shows perfect generalisation to test sentences.⁴

After receiving the sentence's second word (a verb), the FSM moves to State 2, where it predicts the next input to be a noun. Indeed a noun occurs next, moving the FSM to State 3. Here, the end-of-sentence marker is predicted, which is incorrect in the sense that the sentence is not yet over. However, it is a grammatically correct prediction: [end] can indeed follow 'N V N'. After receiving that, the FSM is in State 4, where the next input is predicted to be a noun, that is, an ORC is expected. It may seem surprising that the FSM never expects an SRC at this point (i.e., it never predicts a verb) even though that would be perfectly grammatical. Presumably, a noun is always expected here because ORCs occur more often than SRCs: according to the grammar of Table 1, ORCs are 50% more frequent than SRCs. The next two inputs are a noun (moving the system into State 5) and a (correctly predicted) verb. The FSM ends up in State 3, where it correctly predicts [end]; upon receiving [end], it is back in its starting state. No error (i.e., no grammatically incorrect prediction) was made in processing any of the 5400 sentences.
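The machine of Figure 4 is small enough to write out as a lookup table and run directly. The sketch below is our transcription of the figure (with the individual nouns and verbs collapsed into their categories); it is not output of CrySSMEx itself.

NOUNS = {'boys', 'brothers', 'men', 'girls', 'sisters', 'women'}
VERBS = {'avoid', 'bump', 'chase', 'eat', 'like', 'follow', 'hit', 'love', 'see', 'swing'}

OUTPUT = {0: 'N', 1: 'V', 2: 'N', 3: '[end]', 4: 'N', 5: 'V'}     # Moore output of each state

TRANSITIONS = {                       # (state, input category) -> next state, as in Figure 4
    (0, 'N'): 1, (1, 'V'): 2, (2, 'N'): 3, (3, '[end]'): 0,
    (3, 'that'): 4, (4, 'N'): 5, (5, 'V'): 3,
}

def category(word):
    return 'N' if word in NOUNS else 'V' if word in VERBS else word   # 'that' and '[end]'

def run(sentence, state=0):
    """Feed a word sequence through the machine; return the category predicted after each word."""
    predictions = []
    for word in sentence.split():
        state = TRANSITIONS[(state, category(word))]
        predictions.append(OUTPUT[state])
    return predictions

print(run('boys like girls that men see [end]'))
# -> ['V', 'N', '[end]', 'N', 'V', '[end]', 'N']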
4.2.2. Echo state networks

First iteration. Figure 5 shows the FSM extracted from the ESN data after the first CrySSMEx iteration. It has only three states at this point, but CrySSMEx has not yet terminated, as is apparent from the FSM being stochastic: in many cases, a particular input word licenses several transitions from the same state. For example, if the FSM is in State 0 and receives the word brothers, it moves to State 1 (where it predicts a verb) in only 40% of the cases. Alternatively, it can remain in State 0 and predict another noun to come next. That prediction is ungrammatical because a noun can never be directly followed by another noun. Yet, in over 29% of the cases, receiving a noun in State 0 results in the ungrammatical prediction of a noun.

Looking at Figure 3, ESN seems to perform only slightly worse than ESN+ after noun inputs. We can now conclude that this is somewhat misleading. The small quantitative advantage of ESN+ over ESN (at noun inputs) in fact corresponds to a very large qualitative improvement: whereas ESN+ makes no errors, ESN often predicts a noun to follow a noun. It is important to keep in mind that this is not an error of the extracted FSM. It correctly describes the behaviour of the ESN, which erroneously predicts a noun to follow a noun. Although the FSM is correct in this sense, it is not complete: for instance, it must be in State 0 at the beginning of a sentence (i.e., after processing [end]), but that same state can also be entered after processing a noun or verb.

Another error that can be observed in the FSM of Figure 5 is the occasional prediction of a verb (i.e., being in State 1) after processing a verb.
Although the language does allow for a verb to be directly followed by another (e.g., mice that men like chase cats [end]), this is not possible in any of the 'N V N that N V [end]' sentences processed by the FSM. Possibly, this error accounts for ESN's low prediction performance (Figure 3) at the sentence-final verb, that is, when [end] is the only correct prediction. Further insight into this error can be obtained from the FSM extracted at the second CrySSMEx iteration.

Figure 5. First (stochastic) finite state machine extracted from ESN. If a particular input word licenses more than one transition from a state, the probability of a transition is given after the word.

Second iteration. In its second iteration, CrySSMEx splits the three states into seven, resulting in the FSM shown in Figure 6. The fact that it is not deterministic shows that CrySSMEx has not yet converged, so more iterations are needed for further refinement.

Figure 6. Second finite-state machine extracted from ESN.

Although this FSM has only one more state than the one extracted from the ESN+ data (Figure 4), it is much more complex, making it difficult to interpret in full. For this reason, we will not discuss it in depth, but only point out how it sheds light on ESN's low prediction performance at the sentence-final verb. The FSM also shows why there is a large difference between the performance on training and test sentences at this point.

To begin with, note that State 0 must be the sentence-initial state, as it is the only state entered after the input [end]. If the input is a potential training sentence (i.e., 'Nfem V Nmale that Nfem V [end]'), the path through the machine's states is easy to follow. The first word, a female noun, must move the FSM into State 1, where a verb is correctly predicted. The incoming verb will nearly always result in State 2, predicting a noun. The following input is a male noun, moving the FSM into State 6. Here, the prediction [end] is grammatical, although the actual next input is that, resulting in State 3 in as much as 95% of the cases.
The incoming female noun will most often put the FSM into State 5, although States 2 and 4 are also possible. At this point, the only grammatically correct prediction is [end], that is, the FSM should move into State 6. Indeed, a large majority of verb inputs at State 5 yield State 6, but some errors (i.e., a noun prediction at State 2) are also possible. If the FSM was not in State 5 but in State 4, it cannot end up in the correct State 6. Likewise, if the machine was in State 2 (rather than 4 or 5), a verb input is very unlikely to move it to State 6.

In short, there is not much opportunity for errors in training sentences, except at the sentence-final verb. This finding corresponds to ESN's prediction performance on training sentences, as displayed in Figure 3. But how about test sentences? Recall that after processing the sentence-final verb, the FSM should be in State 6. This state can be entered from States 1–3 and 5, but only from State 5 is it likely that a verb input results in State 6. From States 1–3, there is only a very small probability that a verb moves the FSM into State 6. Therefore, we can safely conclude that the machine should be in State 5 immediately before the sentence-final verb arrives. However, note that State 5 can only be entered when the input is a female noun, whereas the actual input at this point in test sentences is always a male noun. Hence, when the input is a test sentence, the FSM is not in State 5 when it needs to be. Consequently, it will hardly ever predict [end] when it needs to. This explains the large difference in ESN prediction error between training and test sentences at the sentence-final verb.

Further iterations. As CrySSMEx continues to extract increasingly deterministic FSMs from the ESN, the number of states rises sharply, as can be seen in Figure 7. After 12 iterations, CrySSMEx has converged on an FSM with as many as 190 states.

Figure 7. Number of states in the FSMs extracted from ESN, after each CrySSMEx iteration.
4.3. State-space quantisation

So far, we have only looked into the extracted machines themselves: machines that, to a large extent, reflect the networks' behaviour and grammatical correctness rather than their internal dynamics. In addition to these FSMs, CrySSMEx generates hierarchical descriptions of the networks' state-space quantisations. Since these CVQ graphs form a rough description of the layout of the state space, they potentially hold important qualitative information about the networks' dynamics.

The CVQ graph describing the state space of ESN+, displayed in Figure 8, shows that ESN+ is trivially mapped onto an FSM: it takes at most five quantisations to determine the FSM state for any state vector in the network. For the state space of ESN, the situation is remarkably different. Figure 9 shows just a small part of the CVQ graph corresponding to the first FSM extracted from the ESN data (i.e., the one in Figure 5). This FSM has fewer states than the final machine extracted from ESN+, yet the CVQ graph is immensely more complex. In short, this tells us that the state space of ESN is much more fragmented than that of ESN+. Consequently, CrySSMEx needs to work a lot harder to render an FSM description for ESN than for ESN+.

Figure 8. Arrangement of Vector Quantisers (VQ) for splitting the ESN+ state space (corresponding to the FSM of Figure 4). The FSM state to which a given state vector belongs is decided by starting at the root VQ and following the graph according to the winning model vector of each VQ. Arrows with a dot denote a merger of states.
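The lookup that Figure 8 describes, from a DR state vector down to an FSM state, can be sketched as a small recursive structure. The class below is our own illustration over a toy two-dimensional state space (rather than the 100-dimensional DR) and is not the CVQ implementation; state merging is represented simply by two branches carrying the same label.

import numpy as np

class VQNode:
    """One simple vector quantiser in a CVQ-style graph: each model vector either points to a
    child quantiser or ends in an FSM state label."""
    def __init__(self, model_vectors, children):
        self.model_vectors = np.asarray(model_vectors)   # one row per region
        self.children = children                         # VQNode or int (FSM state) per row

    def state_of(self, state_vector):
        winner = int(np.argmin(((self.model_vectors - state_vector) ** 2).sum(axis=1)))
        child = self.children[winner]
        return child.state_of(state_vector) if isinstance(child, VQNode) else child

# Toy two-level graph: the root splits the space left/right, the left region is refined further,
# and both of its sub-regions map to FSM state 0 (a 'merger' in the sense of Figure 8).
leaf = VQNode([[0.2, 0.2], [0.2, 0.8]], [0, 0])
root = VQNode([[0.25, 0.5], [0.75, 0.5]], [leaf, 1])
print(root.state_of(np.array([0.1, 0.9])))   # -> 0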
5. Discussion

5.1. Comparing network behaviours

We have presented the first ever application of CrySSMEx for a qualitative comparison of the behaviours of two networks. The algorithm proved to provide more insight than was obtained from a merely quantitative investigation of network performance. The difference in performance between ESN and ESN+ turned out to be the consequence of a much more significant, qualitative difference in their behaviours. Comparing the two extracted FSMs in Figures 4 and 6, it becomes clear that the two networks implement very different machines. In fact, it is hard to imagine that the RNNs that gave rise to these FSMs are nearly identical and process the same input sentences. The only difference between ESN+ and ESN is that the former has input connection weights that were adapted to the training data, whereas the latter uses a random permutation of these. Apparently, this small difference has immense consequences for the networks' dynamics and behaviour: ESN+ implements a straightforward FSM with six states, whereas the FSM extracted from ESN has 190. Even if we restrict ourselves to the non-deterministic, seven-state FSM after the second iteration of CrySSMEx, the behaviour of ESN already turns out to be much more complex than that of ESN+.

Looking at Figures 6 and 7, it becomes clear that a system that should be relatively simple (processing just one type of seven-word sentence) can implement an oversized FSM. The primary reason why the ESN data causes the extracted FSMs to grow so large is that CrySSMEx will blindly keep refining the extracted rules, also with respect to undesired behaviour of the network. In this case, the desired grammar, as instantiated by ESN+, is computationally much simpler than the erratic one instantiated by ESN.

Since CrySSMEx is ignorant about which behaviour is desired and which is not, it cannot be blamed for creating huge FSMs. It merely shows us the networks' behaviour. The more complex that behaviour, the more complex the extracted FSM. Thanks to the iterative approach of the algorithm, however, the FSMs extracted in the first few iterations can be quite informative. Although they are incomplete, they may provide a comprehensible 'summary' of the complete, but incomprehensible, deterministic FSM that is implemented by the network.

5.2. Comparing network state spaces

In an early comparison between SRNs and FSMs (Servan-Schreiber, Cleeremans, and McClelland 1991), the term graded state machine was coined to describe the type of system embodied by an SRN. The difference with an FSM is that a graded state machine has (possibly infinitely many) non-discrete states. Each point in an SRN's state space can be considered a state of the machine, and states that are near to each other in the state space tend to (but do not necessarily) result from similar inputs and have similar effects on the network.

ESN+ makes use of this graded nature of the state space by using inputs that are not truly symbolic. As pointed out by Frank and Čerňanský (2008), the vector of weights emanating from a particular input unit can be viewed as the representation of the word corresponding to that input, and adapting these weights results in representations that are analogical rather than symbolic: similarities among word representations are analogical to similarities among the grammatical properties of the represented words. More specifically, words from the same syntactic category are represented by vectors that are more similar to each other than vectors representing words from different categories, as is also apparent from Figure 2.

Each input word drives the activation pattern in the ESN's dynamical reservoir towards an attractor point that is unique for that input (Tiňo, Čerňanský, and Beňušková 2004). Because of the analogical nature of word representations in ESN+, the attractor points associated with words from the same syntactic category will be closer together than those of words from different categories. As a result, FSM states that are functionally equivalent correspond to ESN+ state-space points that are near to one another. Such clustering of equivalent states facilitates state-space quantisation. Presumably, this is why CrySSMEx converges after just three iterations when processing the ESN+ data. Moreover, the state-space clustering improves generalisation. This is because processing a test sentence leads to a state-space trajectory that visits the same clusters that were encountered during ESN+ training. In other words, new sentences result in not-so-new internal network states.

For the ESN, however, the situation is radically different. Since its input weights are fixed at random values (i.e., it uses symbolic rather than analogical word representations), the distribution of attractors in its state space is basically random. Even two states that are functionally equivalent can correspond to distant points in the state space.
As a result, many splits and merges are required for a meaningful quantisation, delaying CrySSMEx convergence and resulting in the highly complex CVQ graph, part of which is displayed in Figure 9. Also, generalisation is impaired because the state-space trajectory resulting from a new sentence is largely unrelated to what was encountered during ESN training.

Figure 9. Small fragment of the arrangement of Vector Quantisers (VQ) for splitting the ESN state space (corresponding to the FSM of Figure 5).

To summarise, we have found the impoverished generalisation of ESN to be related to the relative complexity of its internal dynamics, as apparent from the size of its CVQ graph. As mentioned in Section 2.2.2, the depth and size of CVQ graphs are partly governed by the quality of the underlying quantisation algorithm: the worse the quantiser, the larger the CVQ graph. Hence, the size of the CVQ graph for ESN (Figure 9) is partly due to the simplicity of the quantiser that was used, rather than to the actual dynamics of the system. However, the immense difference between the CVQ graphs for ESN and ESN+ cannot all be blamed on the quantiser's suboptimality. Therefore, even without a more sophisticated quantisation algorithm, we can provide a reasonable suggestion about what goes on inside the networks, and how the dynamics of the two networks underlies their quantitative and qualitative differences.

6. Conclusion

CrySSMEx allows finite state descriptions to be generated from high-dimensional and complex ESNs, opening up a new window into the internal workings of specific ESN instances. In this paper, we have only scratched the surface of the interesting dynamics of ESNs. Only a small step was made towards understanding exactly how and why ESN+ manages to utilise its state space in a manner that gives rise to a deeper correspondence to the intended grammar. It is our hope and conjecture that CrySSMEx may provide much deeper insights into these systems than presented in this paper, insights that may lead to further improvements beyond those of the ESN+ model.

Acknowledgements

This research was supported by grant 277-70-006 from the Netherlands Organisation for Scientific Research (NWO) and by EU grant FP6-004250-IP.

Notes

1. CrySSMEx can be downloaded from http://cryssmex.sourceforge.net/
2. This is because it has been argued that people only need to experience a word in one grammatical position to generalise it to novel positions. Consequently, neural networks should also have this ability in order to be regarded as cognitive models of human language processing (Hadley 1994).
3. Note that all nouns within each of the three subcategories (as well as all verbs) are interchangeable in the training sentences. As a result, their representations will become more and more similar as the number of training sentences is increased.
4. That is, when outputs are discretised by syntactic category. As Figure 3 shows, the output vectors of ESN+ are not identical to the true probability distributions.

References
Andrews, R., Diederich, J., and Tickle, A.B. (1995), ‘Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks’, Knowledge-Based Systems, 8, 373–389.
Bodén, M., and Wiles, J. (2000), ‘Context-Free and Context-Sensitive Dynamics in Recurrent Neural Networks’, Connection Science, 12, 196–210.
Bullinaria, J.A., and Levy, J.P. (2007), ‘Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study’, Behavior Research Methods, 39, 510–526.
Čerňanský, M., and Tiňo, P. (2007), ‘Comparison of Echo State Networks with Simple Recurrent Networks and Variable-Length Markov Models on Symbolic Sequences’, in Artificial Neural Networks – ICANN 2007, Part I (Vol. 4668), eds. J.M. de Sá, L.A. Alexandre, W. Duch, and D.P. Mandic, Lecture Notes in Computer Science, Berlin: Springer, pp. 618–627.
Čerňanský, M., and Tiňo, P. (2008), ‘Processing Symbolic Sequences Using Echo-State Networks’, in From Associations to Rules: Proceedings of the 10th Neural Computation and Psychology Workshop, eds. R.M. French and E. Thomas, Singapore: World Scientific, pp. 153–159.
Čerňanský, M., Makula, M., and Beňušková, Ľ. (2008), ‘Improving the State Space Organization of Untrained Recurrent Networks’, in Proceedings of the 15th International Conference on Neural Information Processing, Auckland, New Zealand.
Elman, J.L. (1990), ‘Finding Structure in Time’, Cognitive Science, 14, 179–211.
Frank, S.L. (2006a), ‘Learn More by Training Less: Systematicity in Sentence Processing by Recurrent Networks’, Connection Science, 18, 287–302.
Frank, S.L. (2006b), ‘Strong Systematicity in Sentence Processing by an Echo State Network’, in Artificial Neural Networks – ICANN 2006, Part I (Vol. 4131), eds. S. Kollias, A. Stafylopatis, W. Duch, and E. Oja, Lecture Notes in Computer Science, Berlin: Springer, pp. 505–514.
Frank, S.L., and Čerňanský, M. (2008), ‘Generalization and Systematicity in Echo State Networks’, in Proceedings of the 30th Annual Conference of the Cognitive Science Society, eds. B.C. Love, K. McRae, and V.M. Sloutsky, Austin, TX: Cognitive Science Society, pp. 733–738.
Hadley, R.F. (1994), ‘Systematicity in Connectionist Language Learning’, Mind and Language, 9, 247–272.
Hopcroft, J., and Ullman, J.D. (1979), Introduction to Automata Theory, Languages, and Computation, Reading, MA: Addison-Wesley.
Jacobsson, H. (2005), ‘Rule Extraction from Recurrent Neural Networks: A Taxonomy and Review’, Neural Computation, 17, 1223–1263.
Jacobsson, H. (2006a), ‘The Crystallizing Substochastic Sequential Machine Extractor: CrySSMEx’, Neural Computation, 18, 2211–2255.
Jacobsson, H. (2006b), ‘Rule Extraction from Recurrent Neural Networks’, unpublished doctoral dissertation, Department of Computer Science, University of Sheffield, Sheffield, UK.
Jacobsson, H., Frank, S.L., and Federici, D. (2007), ‘Automated Abstraction of Dynamic Neural Systems for Natural Language Processing’, in Proceedings of the International Joint Conference on Neural Networks, Orlando, FL, pp. 1446–1451.
Jaeger, H. (2001), ‘The “Echo State” Approach to Analysing and Training Recurrent Neural Networks’, GMD Report No. 148, GMD – German National Research Institute for Computer Science, http://www.faculty.iu-bremen.de/hjaeger/pubs/EchoStatesTechRep.pdf.
Jaeger, H. (2003), ‘Adaptive Nonlinear System Identification with Echo State Networks’, in Advances in Neural Information Processing Systems (Vol. 15), eds. S. Becker, S. Thrun, and K. Obermayer, Cambridge, MA: MIT Press, pp. 593–600.
Mirkin, B. (1996), Mathematical Classification and Clustering (Vol. 11), Dordrecht, The Netherlands: Kluwer Academic Publishers.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J.L. (1991), ‘Graded State Machines: The Representation of Temporal Contingencies in Simple Recurrent Networks’, Machine Learning, 7, 161–193.
Tiňo, P., Čerňanský, M., and Beňušková, Ľ. (2004), ‘Markovian Architectural Bias of Recurrent Neural Networks’, IEEE Transactions on Neural Networks, 15, 6–15.
Tong, M.H., Bickett, A.D., Christiansen, E.M., and Cottrell, G.W. (2007), ‘Learning Grammatical Structure with Echo State Networks’, Neural Networks, 20, 424–432.

Appendix 1. Inverse of softmax

The softmax function (Equation (3)) does not have an inverse in general. However, it is possible to define a proper inverse for the particular target vectors that arise in our ESN training procedure.

We have a network with $n$ output units, one of which (called unit $c$) represents the correct output for each input vector. In the corresponding target output vector $u = (u_1, \ldots, u_n)$, element $c$ should have a large value (i.e., close to 1), whereas all other elements should have small values (i.e., close to 0). That is,
\[
u_j = \begin{cases} 1 - \epsilon_c & \text{if } j = c \\ \epsilon_i & \text{if } j \neq c. \end{cases}
\]
Ideally, $\epsilon_c = \epsilon_i = 0$, making $u_c = 1$ and $u_i = 0$. However, the softmax inverse will be applied to $u$ and its domain does not contain 0 and 1. Therefore, we take $\epsilon_c, \epsilon_i > 0$. Also, it is desired that $u_c > u_i$, so $1 - \epsilon_c > \epsilon_i$. From here on, we denote by $i$ all output units that are not $c$, so always $i \neq c$. Note that we assume that the target values $\epsilon_i$ are equal for all $i$.

We are looking for the inverse of the softmax function, that is, we want to find $x_i$ and $x_c$ such that
\[
\frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} = \epsilon_i \quad \text{and} \quad \frac{e^{x_c}}{\sum_{j=1}^{n} e^{x_j}} = 1 - \epsilon_c. \tag{A1}
\]
First, note that $\sum_{j=1}^{n} e^{x_j} = e^{x_c} + (n-1)e^{x_i}$. Therefore, Equation (A1) becomes
\[
e^{x_i} = \epsilon_i e^{x_c} + \epsilon_i (n-1) e^{x_i} \quad \text{and} \quad e^{x_c} = (1 - \epsilon_c) e^{x_c} + (1 - \epsilon_c)(n-1) e^{x_i}
\]
\[
\iff e^{x_c} = e^{x_i} \left(\frac{1 - n\epsilon_i + \epsilon_i}{\epsilon_i}\right) \quad \text{and} \quad e^{x_c} = e^{x_i} \left(\frac{(1 - \epsilon_c)(n-1)}{\epsilon_c}\right)
\]
\[
\iff x_c = \ln\left(\frac{1 - n\epsilon_i + \epsilon_i}{\epsilon_i}\right) + x_i \quad \text{and} \quad x_c = \ln\left(\frac{(1 - \epsilon_c)(n-1)}{\epsilon_c}\right) + x_i. \tag{A2}
\]
Since $u_c > u_i$, we must have $x_c > x_i$, so
\[
\frac{1 - n\epsilon_i + \epsilon_i}{\epsilon_i} > 1 \iff \epsilon_i < n^{-1}.
\]
From Equation (A2), it is clear that
\[
\frac{1 - n\epsilon_i + \epsilon_i}{\epsilon_i} = \frac{(1 - \epsilon_c)(n-1)}{\epsilon_c},
\]
which yields $\epsilon_c = (n-1)\epsilon_i$. This means that $\epsilon_i$ and $\epsilon_c$ cannot be set independently: choosing some minimum target value $\epsilon_i < n^{-1}$ fixes $\epsilon_c$ as well. The total target output is
\[
\sum_j u_j = (1 - \epsilon_c) + (n-1)\epsilon_i = 1 - (n-1)\epsilon_i + (n-1)\epsilon_i = 1.
\]
Therefore, the target vector forms a probability distribution. By choosing a low enough value for $\epsilon_i$, the difference between $x_c$ and $x_i$ is fixed as in Equation (A2): $x_c = \ln(\epsilon_i^{-1} - n + 1) + x_i$.

As it turns out, it is only this difference that matters in practice: adding a constant value $y$ to $x_c$ and $x_i$ does not change the network's output. Let $W_{y,\mathrm{out}}$ denote the output connection weights resulting from this addition of $y$ (for convenience, we ignore the bias vector). As in Equation (4) of Section 3.2.2, $A$ is the matrix of DR states resulting from the training inputs, and $U$ is the matrix of corresponding target outputs. The resulting connection weights are
\[
W_{y,\mathrm{out}} = \left(f_{\mathrm{out}}^{-1}(U) + y\right) A^{+} = f_{\mathrm{out}}^{-1}(U)A^{+} + yA^{+} = W_{\mathrm{out}} + yA^{+}.
\]
After training, the network receives an input resulting in the DR state $a_{\mathrm{dr}}$.
Its output becomes (see Equation (2) in Section 3.2.1)
\[
f_{\mathrm{out}}(W_{y,\mathrm{out}} a_{\mathrm{dr}}) = f_{\mathrm{out}}\left(W_{\mathrm{out}} a_{\mathrm{dr}} + y\,|A^{+} a_{\mathrm{dr}}|\right),
\]
which equals $f_{\mathrm{out}}(W_{\mathrm{out}} a_{\mathrm{dr}})$ because the softmax function is translation invariant (i.e., $f_{j,\mathrm{out}}(x + a) = f_{j,\mathrm{out}}(x)$). Since it does not matter whether the connection weights are $W_{\mathrm{out}}$ or $W_{y,\mathrm{out}}$, adding $y$ to $x_c$ and $x_i$ has no effect. We can therefore simply take $x_i = 0$.

To summarise, all that is needed for training an ESN with the softmax output activation is to choose an $\epsilon_i < n^{-1}$. The softmax inverse of the target outputs then equals
\[
f_{j,\mathrm{out}}^{-1}(u) = \begin{cases} \ln(\epsilon_i^{-1} - n + 1) & \text{if } j = c \\ 0 & \text{if } j \neq c. \end{cases}
\]
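This closed form is easy to check numerically. The following Python sketch (our illustration of the derivation above, not part of the training procedure; the function softmax and all variable names are ours) builds the target vector $u$ for a chosen $\epsilon_i$, applies the inverse given above, and verifies that the softmax maps it back to $u$ and is unaffected by adding a constant:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the maximum for numerical stability
    return e / e.sum()

def softmax_inverse_targets(n, c, eps_i):
    # Inverse-softmax targets for 'correct unit' c among n outputs,
    # following the closed form above: x_c = ln(1/eps_i - n + 1), x_i = 0.
    assert 0.0 < eps_i < 1.0 / n, "the derivation requires eps_i < 1/n"
    x = np.zeros(n)
    x[c] = np.log(1.0 / eps_i - n + 1)
    return x

n, c, eps_i = 10, 3, 0.01
eps_c = (n - 1) * eps_i                   # fixed by the choice of eps_i

u = np.full(n, eps_i)                     # target vector: eps_i everywhere ...
u[c] = 1.0 - eps_c                        # ... except 1 - eps_c at the correct unit

x = softmax_inverse_targets(n, c, eps_i)
assert np.allclose(softmax(x), u)         # softmax of the inverse recovers u
assert np.isclose(u.sum(), 1.0)           # u is a probability distribution
assert np.allclose(softmax(x + 5.0), u)   # translation invariance of softmax

The last assertion illustrates why only the difference between $x_c$ and $x_i$ matters, so that $x_i$ can be set to zero.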