Modeling Word Forms Using Latent Underlying Morphs and Phonology

Ryan Cotterell, Nanyun Peng, and Jason Eisner
Department of Computer Science, Johns Hopkins University
{ryan.cotterell,npeng1,eisner}@jhu.edu

Abstract

The observed pronunciations or spellings of words are often explained as arising from the "underlying forms" of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

1 Introduction

How is plurality expressed in English? Comparing cats ([kæts]), dogs ([dOgz]), and quizzes ([kwIzIz]), the plural morpheme evidently has at least three pronunciations ([s], [z], [Iz]) and at least two spellings (-s and -es). Also, considering singular quiz, perhaps the "short exam" morpheme has multiple spellings (quizz-, quiz-).

Fortunately, languages are systematic. The realization of a morpheme may vary by context but is largely predictable from context, in a way that generalizes across morphemes. In fact, generative linguists traditionally posit that each morpheme of a language has a single representation shared across all contexts (Jakobson, 1948; Kenstowicz and Kisseberth, 1979, chapter 6). However, this string is a latent variable that is never observed. Variation appears when the phonology of the language maps these underlying representations (URs)—in context—to surface representations (SRs) that may be easier to pronounce. The phonology is usually described by a grammar that may consist of either rewrite rules (Chomsky and Halle, 1968) or ranked constraints (Prince and Smolensky, 2004). We will review this framework in section 2.

The upshot is that the observed words in a language are supposed to be explainable in terms of a smaller underlying lexicon of morphemes, plus a phonology. Our goal in this paper is to recover the lexicon and phonology (enabling generalization to new words). This is difficult even when we are told which morphemes are expressed by each word, because the unknown underlying forms of the morphemes must cooperate properly with one another and with the unknown phonological rules to produce the observed results. Because of these interactions, we must reconstruct everything jointly. We regard this as a problem of inference in a directed graphical model, as sketched in Figure 1.

This is a natural problem for computational linguistics. Phonology students are trained to puzzle out solutions for small datasets by hand. Children apparently solve it at the scale of an entire language. Phonologists would like to have grammars for many languages, not just to study each language but also to understand universal principles and differences among related languages. Automatic procedures would recover such grammars.
They would also allow comprehensive evaluation and comparison of different phonological theories (i.e., what inductive biases are useful?), and would suggest models of human language learning.

Solving this problem is also practically important for NLP. What we recover is a model that can generate and help analyze novel word forms,1 which abound in morphologically complex languages. Our approach is designed to model surface pronunciations (as needed for text-to-speech and ASR). It might also be applied in practice to model surface spellings (as needed for MT on text). Good morphological analysis has been used to improve NLP tasks such as machine translation, parsing, and NER (Fraser et al., 2012; Hohensee and Bender, 2012; Yeniterzi, 2011).

1 An analyzer would require a prior over possible analyses. Our present model defines just the corresponding likelihoods, i.e., the probability of the observed word given each analysis.

[Figure 1 (diagram): four layers for the example morphemes /rizajgn/, /z/, /eIS@n/, and /dæmn/—(1) morpheme URs, combined by concatenation into (2) word URs, mapped by the phonology (a PFST) to (3) word SRs, mapped by phonetics to (4) word observations for resignation, resigns, damns, and damnation.]
Figure 1: Our model as a Bayesian network, in which surface forms arise from applying phonology to a concatenation of underlying forms. Shaded nodes show the observed surface forms for four words: resignation, resigns, damns, and damnation. The graphical model encodes their morphological relationships using latent forms. Each morpheme UR at layer 1 is generated by the lexicon model Mφ (a probabilistic finite-state automaton). These are concatenated into various word URs at layer 2. Each SR at layer 3 is generated using the phonology model Sθ (a probabilistic finite-state transducer). Layer 4 derives observable phonetic forms from layer 3. This deletes unpronounced symbols such as syllable boundaries, and translates the phonemes into an observed phonetic, articulatory, or acoustic representation. However, our present paper simply merges layers 3 and 4: our layer 3 does not currently make use of any unpronounced symbols (e.g., syllable boundaries) and we observe it directly.

Using loopy belief propagation, this paper attacks larger-scale learning problems than prior work on this task (section 8). We also develop a new evaluation paradigm that examines how well an inferred grammar predicts held-out SRs. Unlike previous algorithms, we do not pre-restrict the possible URs for each morpheme to a small or structured finite set, but use weighted finite-state machines to reason about the infinite space of all strings. Our graphical model captures the standard assumption that each morpheme has a single UR, unlike some probabilistic learners. However, we do not try to learn traditional ordered rules or constraint rankings like previous methods. We just search directly for a probabilistic finite-state transducer that captures likely UR-to-SR mappings.

2 Formal Framework

We urge the reader to begin by examining Figure 1, which summarizes our modeling approach through an example. The upcoming sections then give a formal treatment with details and discussion. Section 2 describes the random variables in Figure 1's Bayesian network, while section 3 describes its conditional probability distributions. Sections 4–5 give inference and learning methods.

A morpheme is a lexical entry that pairs form with content (Saussure, 1916). Its form is a morph—a string of phonemes.
Its content is a bundle of syntactic and/or semantic properties.2 Note that in this paper, we are nonstandardly using "morph" to denote an underlying form. We assume that all underlying and surface representations can be encoded as strings, over respective alphabets Σu and Σs. This would be possible even for autosegmental representations (Kornai, 1995).

2 This paper does not deal with the content. However, note that a single morpheme might specify a conjunction or disjunction of multiple properties, leading to morphological phenomena such as fusion, suppletion, or syncretism.

A language's phonological system thus consists of the following components. We denote each important set by a calligraphic letter. We use the corresponding uppercase letter to denote a function to that set, the corresponding lowercase letter as a variable that ranges over the set's elements, and a distinguished typeface for specific elements.

• A is a set of abstract morphemes such as QUIZ and PLURAL. These are atoms, not strings.
• M = Σu∗ is the space of possible morphs: concrete UR strings such as /kwIz/ or /z/.
• M : A → M is the lexicon that maps each morpheme a to an underlying morph m = M(a). We will find M(a) for each a.
• U = (Σu ∪ {#})∗ is the space of underlying representations for words, such as /kwIz#z/.
• U : M∗ → U combines morphs. A word is specified by a sequence of morphemes ~a = a1, a2, . . ., with concrete forms mi = M(ai). That word's underlying form is then u = U(m1, m2, . . .) ∈ U.
• S = Σs∗ is the space of surface representations for words, such as [kwIzIz].
• S : U → S is the phonology. It maps an underlying form u to its surface form s. We will find this function S along with M.

We assume in this paper that U simply concatenates the sequence of morphs, separating them by the morph boundary symbol #: u = U(m1, m2, . . .) = m1#m2# · · · . However, see section 4.3 for generalizations.

The overall system serves to map an (abstract) morpheme sequence ~a ∈ A∗ to a surface word s ∈ S. Crucially, S acts on the underlying form u of the entire word, not one morph at a time. Hence its effect on a morph may depend on context, as we saw for English pluralization. For example, S(/kwIz#s/) = [kwIzIz]—or if we were to apply our model to orthography, S(/quiz#s/) = [quizzes]. S produces a single well-formed surface form, which is not arbitrarily segmented as [quiz-zes] or [quizz-es] or [quizze-s].

3 Probability Model

Our goal is to reconstruct the lexicon M and morphophonology S for a given language. We therefore define prior probability distributions over them. (We assume Σu, Σs, A, U are given.)

For each morpheme a ∈ A, we model the morph M(a) ∈ M as an IID sample from a probability distribution Mφ(m).3 This model describes what sort of underlying forms appear in the language's lexicon. The phonology is probabilistic in a similar way. For a word with underlying form u ∈ U, we presume that the surface form S(u) is a sample from a conditional distribution Sθ(s | u). This single sample appears in the lexical entry of the word type and is reused for all tokens of that word.

The parameter vectors φ and θ are specific to the language being generated. Thus, under our generative story, a language is created as follows:

1. Sample φ and θ from priors (see section 3.4).
2. For each a ∈ A, sample M(a) ∼ Mφ.
3. Whenever a new abstract word ~a = a1, a2 · · · must be pronounced for the first time, construct u as described in section 2, and sample S(u) ∼ Sθ(· | u). Reuse this S(u) in future.

3 See section 3.3 for a generalization to Mφ(m | a).
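To make the generative story concrete, here is a minimal illustrative sketch in Python of steps 1–3, under the simplest assumptions used later in the paper: a 0-gram lexicon model Mφ (section 3.3) and, purely for illustration, a placeholder phonology that copies each underlying phoneme with high probability and otherwise substitutes a random one. The toy alphabet, the sample_phonology stand-in, and all specific values are assumptions for exposition, not the trained Sθ of section 3.2.

import random

SIGMA_U = list("adegijmnorstuz@I")   # toy underlying phoneme alphabet (assumed)
SIGMA_S = list(SIGMA_U)              # toy surface alphabet (assumed)

def sample_morph(phi, rng):
    # Step 2: draw a morph from the 0-gram lexicon model M_phi.
    # Length is geometric with mean 1/phi - 1; each phoneme is uniform.
    m = []
    while rng.random() > phi:        # halt with probability phi
        m.append(rng.choice(SIGMA_U))
    return "".join(m)

def sample_phonology(u, rng, copy_prob=0.9):
    # Stand-in for S_theta(. | u): NOT the paper's contextual edit PFST,
    # just an illustrative noisy copy that erases the boundary symbol #.
    s = []
    for ph in u:
        if ph == "#":
            continue                 # boundary is unpronounced
        s.append(ph if rng.random() < copy_prob else rng.choice(SIGMA_S))
    return "".join(s)

def generate_language(abstract_words, phi=0.2, seed=0):
    # Steps 1-3: sample a lexicon, then pronounce each abstract word once
    # and cache (reuse) its surface form for all future tokens.
    rng = random.Random(seed)
    lexicon = {}                     # M: morpheme -> morph (single UR each)
    surface = {}                     # cache of S(u), reused per word type
    for word in abstract_words:      # word = tuple of abstract morphemes
        for a in word:
            lexicon.setdefault(a, sample_morph(phi, rng))
        u = "#".join(lexicon[a] for a in word)   # U = concatenation with #
        if word not in surface:
            surface[word] = sample_phonology(u, rng)
    return lexicon, surface

lexicon, surface = generate_language([("damn", "PL"), ("damn", "ATION")])
print(lexicon, surface)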
Note that we have not specified a probability distribution over abstract words ~a, since in this paper, these sequences will always be observed. Such a distribution might be influenced by the semantic and syntactic content of the morphemes. We would need it to recover the abstract words if they were unobserved, e.g., when analyzing novel word forms or attempting unsupervised training.

3.1 Discussion: Why probability?

A language's lexicon M and morphophonology S are deterministic, in that each morpheme has a single underlying form and each word has a single surface form. The point of the language-specific distributions Mφ and Sθ is to aid recovery of these forms by capturing regularities in M and S.

In particular, Sθ constitutes a theory of the regular phonology of the language. Its high-probability sound changes are the "regular" ones, while irregularities and exceptions can be explained as occasional lower-probability choices. We prefer a theory Sθ that has high likelihood, i.e., it assigns high probability (≈ 1) to each observed form s given its underlying u. In linguistic terms, we prefer predictive theories that require few exceptions.

In the linguistic community, the primary motivation for probabilistic models of phonology (Pierrehumbert, 2003) has been to explain "soft" phenomena: synchronic variation (Sankoff, 1978; Boersma and Hayes, 2001) or graded acceptability judgments on novel surface forms (Hayes and Wilson, 2008). These applications are orthogonal to our motivation, as we do not observe any variation or gradience in our present experiments. Fundamentally, we use probabilities to measure irregularity—which simply means unpredictability and is a matter of degree. Our objective function will quantitatively favor explanations that show greater regularity (Eisner, 2002b).

A probabilistic treatment also allows relatively simple learning methods (e.g., Boersma and Hayes (2001)) since inference never has to backtrack from a contradiction. Our method searches a continuous space of phonologies Sθ, all of which are consistent with every mapping S. That is, we always have Sθ(s | u) > 0 for all u, s, so our current guess of Sθ is always capable of explaining the observed words, albeit perhaps with low probability. Our EM learner tunes Sθ (and Mφ) so as to raise the probability of the observed surface forms, marginalizing over the reconstructed lexicon M of underlying forms. We do warn that EM can get stuck at a local optimum; random restarts and simulated annealing are ways to escape such low-likelihood solutions, much as backtracking escapes zero-likelihood solutions.

[Figure 2 (diagram): the underlying string and the surface string written so far, with labeled spans for the input left context, next input character, input right context, output left context, and output right context, and the stochastic choice of edit in context C.]
Figure 2: Illustration of a contextual edit process as it pronounces the English word wetter by transducing the underlying /wEt#@r/ (after erasing #) to the surface [wER@r]. At the point shown, it is applying the "intervocalic alveolar flapping" rule, replacing /t/ in this context by applying SUBST(R).

3.2 Mapping URs to SRs: The phonology Sθ

We currently model Sθ(s | u) as the probability that a left-to-right stochastic contextual edit process (Figure 2) would edit u into s. This probability is a sum over all edit sequences that produce s from u—that is, all s-to-u alignments.

Stochastic contextual edit processes were described by Cotterell et al. (2014).
Such a pro- cess writes surface string s ∈ Σ∗s while reading the underlying string u ∈ Σ∗u . If the process has so far consumed some prefix of the input and pro- duced some prefix of the output, it will next make a stochastic choice among 2|Σs| + 1 possible ed- its. Edits of the form SUBST(c) or INSERT(c) (for c ∈ Σs) append c to the output string. Edits of the form SUBST(c) or DELETE will (also) consume the next input phoneme; if no input phonemes remain, the only possible edits are INSERT(c) or HALT. The stochastic choice of edit, given context, is governed by a conditional log-linear distribution with feature weight vector θ. The feature functions may look at a bounded amount of left and right input context, as well as left output context. Our feature functions are described in section 6. Our normalized probabilities Sθ(s | u) can be computed by a weighted finite-state transducer, a crucial computational property that we will ex- ploit in section 4.2. As Cotterell et al. (2014) explain, the price is that our model is left/right- asymmetric. The inability to condition directly on the right output context arises from local normal- ization, just like “label bias” in maximum entropy Markov models (McCallum et al., 2000). With certain fancier approaches to modeling Sθ, which we leave to future work, this effect could be miti- gated while preserving the transducer property. 3.3 Generating URs: The lexicon model Mφ In our present experiments, we use a very simple lexicon model Mφ, so that the burden falls on the phonology Sθ to account for any language-specific regularities in surface forms. This corresponds to the “Richness of the Base” principle advocated by some phonologists (Prince and Smolensky, 2004), and seems to yield good generalization for us. We say all URs of the same length have the same probability, and the length is geometrically dis- tributed with mean (1/φ) − 1. This is a 0-gram model with a single parameter φ ∈ (0, 1], namely Mφ(m) = ((1 −φ)/|Σu|)|m| ·φ. It would be straightforward to experiment with other divisions of labor between the lexicon model and phonology model. A 1-gram model for Mφ would also model which underlying phonemes are common in the lexicon. A 2-gram model would model the “underlying phonotactics” of morphs, though phonological processes would still be needed at morph boundaries. Such models are the probabilistic analogue of morpheme struc- ture constraints. We could further generalize from Mφ(m) to Mφ(m | a), to allow the shape of the morph m to be influenced by a’s content. For example, Mφ(m | a) for English might describe how nouns tend to have underlying stress on the first syllable; similarly, Mφ(m | a) for Arabic might capture the fact that underlying stems tend to consist of 3 consonants; and across languages, Mφ(m | a) would prefer affixes to be short. Note that we will always learn a language’s Mφ jointly with its actual lexicon M. Loosely speak- ing, the parameter vector φ is found from easily reconstructed URs in M; then Mφ serves as a prior that can help us reconstruct more difficult URs. 3.4 Prior Over the Parameters For φ, which is a scalar under our 0-gram model, our prior is uniform over (0, 1]. We place a spher- ical Gaussian prior on the vector θ, with mean ~0 and a variance σ2 tuned by coarse grid search on dev data (see captions of Figures 3–4). The Gaussian favors phonologies that are sim- ple in the sense that they have few strongly weighted features. 
A grammar that refers once to the natural class of voiced consonants (section 6), which captures a generalization, is preferred to an equally descriptive grammar that refers separately to several specific voiced consonants. If it is hard to tell whether a change applies to round or back vowels (because these properties are strongly cor- related in the training data), then the prior resists grammars that make an arbitrary choice. It prefers to “spread the blame” by giving half the weight to each feature. The change is still probable for round back vowels, and moderately probable for other vowels that are either round or back. 4 Inference We are given a training set of surface word forms s that realize known abstract words ~a. We aim to reconstruct the underlying morphs m and words u, and predict new surface word forms s. 4.1 A Bayesian network For fixed θ and φ, this task can be regarded as marginal inference in a Bayesian network (Pearl, 1988). Figure 1 displays part of a network that en- codes the modeling assumptions of section 3. The nodes at layers 1, 2, and 3 of this network repre- sent string-valued random variables in M, U, and S respectively. Each variable’s distribution is con- ditioned on the values of its parents, if any. In particular, layer 1 represents the unknown M(a) for various a. Notice that each M(a) is softly constrained by the prior Mφ, and also by the fact that it must help produce various observed surface words via Sθ. Each underlying word u at level 2 is a concate- nation of its underlying morphs M(ai) at level 1. Thus, the topology at levels 1–2 is given by super- vision. We would have to learn this topology if the word’s morphemes ai were not known. Our approach captures the unbounded genera- tive capacity of language. In contrast to Dreyer and Eisner (2009) (see section 8), we have defined a directed graphical model. Hence new unob- served descendants can be added without chang- ing the posterior distribution over the existing vari- ables. So our finite network can be viewed as a subgraph of an infinite graph. That is, we make no closed-vocabulary assumption, but implicitly in- clude (and predict the surface forms of) any un- observed words that could result from combining morphemes, even morphemes not in our dataset. While the present paper focuses on word types, we could extend the model to consider tokens as well. In Figure 1, each phonological surface type at layer 3 could be observed to generate 0 or more noisy phonetic tokens at layer 4, in contexts that call for the morphemes expressed by that type. 4.2 Loopy belief propagation The top two layers of Figure 1 include a long undirected cycle (involving all 8 nodes and all 8 edges shown). On such “loopy” graphical models, exact inference is in general uncomputable when the random variables are string-valued. However, Dreyer and Eisner (2009) showed how to substi- tute a popular approximate joint inference method, loopy belief propagation (Murphy et al., 1999). Qualitatively, what does this do on Figure 1?4 Let u denote the leftmost layer-2 node. Midway through loopy BP, u is not yet sure of its value, but is receiving suggestions from its neighbors. The stem UR immediately above u would like u to start with something like /rizajgn#/.5 Meanwhile, the word SR immediately below u encourages u to be any UR that would have a high probability (under Sθ) of surfacing as [rEzIgn#eIS@n]. 
So u tries to meet both requirements, guessing that its value might be something like /rizajgn#eIS@n/ (the product of this string’s scores under the two mes- sages to u is relatively high). Now, for U to have produced something like /rizajgn#eIS@n/ by stem- suffix concatenation, the suffix’s UR must have been something like /eIS@n/. u sends a message saying so to the third node in layer 1. This induces that node (the suffix UR) to inform the rightmost layer-2 node that it probably ends in /#eIS@n/ as well—and so forth, iterating until convergence. Formally, the loopy BP algorithm iteratively updates messages and beliefs. Each is a func- tion that scores possible strings (or string tuples). Dreyer and Eisner (2009)’s key insight is that these messages and beliefs can be represented using weighted finite-state machines (WFSMs), and fur- thermore, loopy BP can compute all of its updates using standard polytime finite-state constructions. 4.3 Discussion: The finite-state requirement The above results hold when the “factors” that de- fine the graphical model are themselves expressed 4Loopy BP actually passes messages on a factor graph de- rived from Figure 1. However, in this informal paragraph we will speak as if it were passing messages on Figure 1 directly. 5Because that stem UR thinks its own value is something like /rizajgn/—based on the messages that it is currently re- ceiving from related forms such as /rizajgn#z/, and from Mφ. as WFSMs. This is true in our model. The fac- tors of section 4.1 correspond to the conditional distributions Mφ, U, and Sθ that respectively se- lect values for nodes at layers 1, 2, and 3 given the values at their parents. As section 3 models these, for any φ and θ, we can represent Mφ as a 1-tape WFSM (acceptor), U as a multi-tape WFSM, and Sθ as a 2-tape WFSM (transducer).6 Any other WFSMs could be substituted. We are on rather firm ground in restricting to finite-state (regular) models of Sθ. The apparent regularity of natural-language phonology was first observed by Johnson (1972), so computational phonology has generally preferred grammar formalisms that compile into (unweighted) finite-state machines, whether the formalism is based on rewrite rules (Kaplan and Kay, 1994) or constraints (Eisner, 2002a; Riggle, 2004). Similarly, U could be any multi-tape finite-state relation,7 not just concatenation as assumed in sec- tion 2. This would allow our framework to handle templatic morphology (Hulden, 2009), infixation, or circumfixation. Although only regular factors are allowed in our graphical model, a loopy graphical model with multiple such factors can actually capture non- regular phenomena, for example by using auxil- iary variables (Dreyer and Eisner, 2009, §3.4). Ap- proximate inference then proceeds by loopy BP on this model. In particular, reduplication is not reg- ular if unbounded, but we can adopt morphologi- cal doubling theory (Inkelas and Zoll, 2005) and model it by having U concatenate two copies of the same morph. During inference of URs, this morph exchanges messages with two substrings of the underlying word. 6Mφ has a single state, with halt probability φ and the remaining probability 1 − φ divided among self-loop arcs labeled with the phonemes in Σu. U must concatenate k morphs by copying all of tape 1, then tape 2, etc., to tape k + 1: this is easily done using k + 1 states, and arcs of probability 1. Sθ is constructed as in Cotterell et al. (2014). 7In general, a U factor enforces u = U(m1, . . . 
,mk), so it is a degree-(k + 1) factor, represented by a (k + 1)- tape WFSM connecting these variables (Dreyer and Eis- ner, 2009). If one’s finite-state library is limited to 2- tape WFSMs, then one can simulate the U factor us- ing (1) an auxiliary string variable π encoding the path through U, (2) a unary factor weighting π according to U, (3) a set of binary factors relating π to each of u,m1, . . . ,mk. The standard case u = m1# . . . #mk can be handled more easily. Given factor U’s incoming mes- sages µ·→U , each being a 1-tape WFSM, compute its loopy BP outgoing messages µU→u = µm1→U # · · ·#µmk→U and (e.g.) µU→m2 = range(µu→U ◦ ((µm1→U # × �) Σ∗u (#µm3→U # · · ·#µmk→U × �))). Similarly, we can manipulate the graphical model structure to encode cyclic phonology—i.e., concatenating a word SR with a derivational affix UR and passing the result through Sθ once again. An alternative is to encode this hierarchical struc- ture into the word UR u, by encoding level-1 and level-2 boundaries with different symbols. A sin- gle application of Sθ can treat these boundaries differently: for example, by implementing cyclic phonology as a composition of two transductions. 4.4 Loopy BP implementation details Each loopy BP message to or from a random variable is a 1-tape WFSM (acceptor) that scores all possible values of that variable (given by the set M, U, or S: see section 2). We initialized each message to the uniform distribution.8 We then updated the messages serially, alternating be- tween upward and downward sweeps through the Bayesian network. After 10 iterations we stopped and computed the final belief at each variable. A complication is that a popular affix such as /z/ (in layer 1) receives messages from hundreds of words that realize that affix. Loopy BP obtains that affix’s belief and outgoing messages by intersect- ing all these WFSMs—which can lead to astro- nomically large results and runtimes. We address this for now with a simple pruning approximation where at each variable m, we dynamically restrict to a finite support set of plausible values for m. We take this to be the union of the 20-best lists of all messages sent to m.9 We modify those mes- sages so that strings in m’s support set have un- changed weight, but all other strings have weight 0. As a result, m’s outgoing messages and belief are also confined to its support set. Note that the support set is not hand-specified, but determined automatically by taking the best hypotheses under the probability model. Improved approaches with no pruning are pos- sible. After submitting this paper, we devel- oped a penalized expectation propagation method (Cotterell and Eisner, 2015). It approximates the messages using log-linear functions (based on variable-order n-gram features) whose support is the entire space Σ∗ . We also developed a dual 8This is standard—although the uniform distribution over the space of strings is actually an improper distribution. It is expressed by a single-state WFSM whose arcs have weight 1. 9In general, we should update this support set dynami- cally as inference and learning improve the messages. But in our present experiments, that appears unnecessary, since the initial support set always appears to contain the “correct” UR. decomposition method (Peng et al., 2015), which if it converges, exactly recovers the single most probable explanation of the data10 given φ and θ. 5 Parameter Learning We employ MAP-EM as the learning algorithm. The E-step is approximated by the loopy BP algo- rithm of section 4. 
The M-step takes the resulting beliefs, together with the prior of section 3.4, and uses them to reestimate the parameters θ and φ.

If we knew the true UR uk for each observed word type sk, we would just do supervised training of θ, using L-BFGS (Liu and Nocedal, 1989) to locally maximize θ's posterior log-probability

(∑k log Sθ(sk | uk)) + log pprior(θ)

Cotterell et al. (2014) give the natural dynamic programming algorithm to compute each summand and its gradient w.r.t. θ. The gradient is the difference between observed and expected feature vectors of the contextual edits (section 3.2), averaged over edit contexts in proportion to how many times those contexts were likely encountered. The latent alignment makes the objective non-concave.

In our EM setting, uk is not known. So our M-step replaces log Sθ(sk | uk) with its expectation, ∑uk bk(uk) log Sθ(sk | uk), where bk is the normalized belief about uk computed by the previous E-step. Since bk and Sθ are both represented by WFSMs (with 1 and 2 tapes respectively), it is possible to compute this quantity and its gradient exactly, using finite-state composition in a second-order expectation semiring (Li and Eisner, 2009). For speed, however, we currently prune bk back to the 5-best values of uk. This lets us use a simpler and faster approach: a weighted average over 5 runs of the Cotterell et al. (2014) algorithm.

Our asymptotic runtime benefits from the fact that our graphical model is directed (so our objective does not have to contrast with all other values of uk) and the fact that Sθ is locally normalized (so our objective does not have to contrast with all other values of sk for each uk). In practice we are far faster than Dreyer and Eisner (2009).

We initialized the parameter vector θ to ~0, except for setting the weight of the COPY feature (section 6) such that the probability of a COPY edit is 0.99 in every context other than end-of-string. This encourages URs to resemble their SRs.

10 That is, a lexicon of morphs together with contextual edit sequences that will produce the observed word SRs.

BIGRAM(strident, strident)   adjacent surface stridents
BIGRAM(ε, uvular)            surface uvular
EDIT([s], [z])               /s/ became [z]
EDIT(coronal, labial)        coronal became labial
EDIT(ε, phoneme)             phoneme was inserted
EDIT(consonant, ε)           consonant was deleted
Table 1: Examples of markedness and faithfulness features that fire in our model. They have a natural interpretation as Optimality-Theoretic constraints. ε denotes the empty string. The natural classes were adapted from (Riggle, 2005).

To reestimate φ, the M-step does not need to use L-BFGS, for section 3.3's simple model of Mφ and uniform prior over φ ∈ (0, 1]. It simply sets φ = 1/(ℓ + 1) where ℓ is the average expected length of a UR according to the previous E-step. The expected length of each uk is extracted from the WFSM for the belief bk, using dynamic programming (Li and Eisner, 2009). We initialized φ to 0.1; experiments on development data suggested that the choice of initializer had little effect.
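As a concrete picture of this M-step, the sketch below assembles the expected complete-data objective from pruned beliefs and hands it to an off-the-shelf L-BFGS optimizer, and shows the closed-form update φ = 1/(ℓ + 1). It treats log Sθ(s | u) and its gradient as a black box: log_S_theta_and_grad is a hypothetical stand-in for the dynamic program of Cotterell et al. (2014), and the data layout is an assumption for exposition, not our actual WFSM-based implementation.

import numpy as np
from scipy.optimize import minimize

def m_step_theta(theta0, data, sigma2, log_S_theta_and_grad):
    # data: list of (s_k, beliefs_k), where beliefs_k is a list of
    # (u, b_k(u)) pairs -- e.g., the 5-best URs with normalized belief weights.
    # log_S_theta_and_grad(theta, s, u) -> (log S_theta(s|u), gradient) is
    # assumed to be supplied.  Maximizes the expected log-likelihood plus
    # the Gaussian log-prior by L-BFGS.
    def neg_objective(theta):
        val = -np.dot(theta, theta) / (2.0 * sigma2)   # Gaussian log-prior
        grad = -theta / sigma2
        for s, beliefs in data:
            for u, b in beliefs:
                ll, g = log_S_theta_and_grad(theta, s, u)
                val += b * ll                          # expected log-likelihood
                grad += b * g
        return -val, -grad                             # minimize the negative
    res = minimize(neg_objective, theta0, jac=True, method="L-BFGS-B")
    return res.x

def m_step_phi(expected_ur_lengths):
    # Closed-form update for the 0-gram lexicon model:
    # phi = 1 / (mean expected UR length + 1).
    ell = sum(expected_ur_lengths) / len(expected_ur_lengths)
    return 1.0 / (ell + 1.0)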
6 Features of the Phonology Model

Our stochastic edit process Sθ(s | u) assigns a probability to each possible u-to-s edit sequence. This edit sequence corresponds to a character-wise alignment of u to s. Our features for modeling the contextual probability of each edit are loosely inspired by constraints from Harmonic Grammar and Optimality Theory (Smolensky and Legendre, 2006). Such constraints similarly evaluate a u-to-s alignment (or "correspondence"). They are traditionally divided into markedness constraints that encourage a well-formed s, and faithfulness constraints that encourage phonemes of s to resemble their aligned phonemes in u.

Our EDIT faithfulness features evaluate an edit's (input, output) phoneme pair. Our BIGRAM markedness features evaluate an edit that emits a new phoneme of s. They evaluate the surface bigram it forms with the previous output phoneme.11 Table 1 shows example features. Notice that these features back off to various natural classes of phonemes (Clements and Hume, 1995).

11 At beginning-of-string, the previous "phoneme" is the special symbol BOS. For the HALT edit at end-of-string, which copies the symbol EOS, the new "phoneme" is EOS.

These features of an edit need to examine at most (0,1,1) phonemes of (left input, right input, left output) context respectively (see Figure 2). So the PFST that implements Sθ should be able to use what Cotterell et al. (2014) calls a (0,1,1) topology. However, we actually used a (0,2,1) topology, to allow features that also look at the "upcoming" input phoneme that immediately follows the edit's input (/@/ in Figure 2). Specifically, for each natural class, we also included contextual versions of each EDIT or BIGRAM feature, which fired only if the "upcoming" input phoneme fell in that natural class. Contextual BIGRAM features are our approximation to surface trigram features that look at the edit's output phoneme together with the previous and next output phonemes. (A PFST cannot condition its edit probabilities on the next output phoneme because that has not been generated yet—see section 3.2—so we are using the upcoming input phoneme as a proxy.) Contextual EDIT features were cheap to add once we were using a (0,2,1) topology, and in fact they turned out to be helpful for capturing processes such as Catalan's deletion of the underlyingly final consonant.

Finally, we included a COPY feature that fires on any edit where surface and underlying phonemes are exactly equal. (This feature resembles Optimality Theory's IDENT-IO constraint, and ends up getting the strongest weight.) In total, our model has roughly 50,000 binary features.
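To illustrate how such features come together, here is a small sketch of scoring the 2|Σs| + 1 candidate edits available in one context with a log-linear distribution (sections 3.2 and 6). The edit inventory and the softmax follow the text; the tiny feature templates (COPY, an EDIT pair, a BIGRAM over the previous output phoneme, and a contextual "upcoming-input" class) are schematic stand-ins for the roughly 50,000 natural-class features, and natural_classes is a hypothetical lookup table, not the feature chart we actually used.

import math

SIGMA_S = ["s", "z", "I", "t", "d"]                 # toy surface alphabet (assumed)
natural_classes = {"s": {"strident", "voiceless"},  # hypothetical class lookup
                   "z": {"strident", "voiced"},
                   "I": {"vowel"},
                   "t": {"voiceless"}, "d": {"voiced"}, None: set()}

def features(edit, out_c, in_c, prev_out, next_in):
    # Active feature names for one candidate edit in its context.
    f = set()
    if edit == "SUBST" and out_c == in_c:
        f.add("COPY")
    if edit in ("SUBST", "DELETE"):
        f.add("EDIT(%s,%s)" % (in_c, out_c or "eps"))
    if edit in ("SUBST", "INSERT"):                 # edits that emit a surface phoneme
        for cls1 in natural_classes.get(prev_out, set()):
            for cls2 in natural_classes.get(out_c, set()):
                f.add("BIGRAM(%s,%s)" % (cls1, cls2))
        for cls in natural_classes.get(next_in, set()):
            f.add("NEXT(%s)" % cls)                 # contextual (upcoming input) version
    return f

def edit_distribution(theta, in_c, prev_out, next_in):
    # Log-linear distribution over the 2|Sigma_s| + 1 edits available
    # while input remains: SUBST(c) or INSERT(c) for each surface c, or DELETE.
    candidates = ([("SUBST", c) for c in SIGMA_S] +
                  [("INSERT", c) for c in SIGMA_S] + [("DELETE", None)])
    scores = []
    for edit, c in candidates:
        feats = features(edit, c, in_c, prev_out, next_in)
        scores.append(sum(theta.get(f, 0.0) for f in feats))
    z = max(scores)
    exps = [math.exp(s - z) for s in scores]        # stable softmax
    total = sum(exps)
    return {cand: e / total for cand, e in zip(candidates, exps)}

theta = {"COPY": 5.0, "BIGRAM(voiceless,voiced)": -2.0}   # toy weights (assumed)
print(edit_distribution(theta, in_c="z", prev_out="t", next_in=None))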
Many improvements to this basic feature set would be possible in future. We cannot currently express implications such as "adjacent obstruents must also agree in voicing," "a vowel that surfaces must preserve its height," or "successive vowels must also agree in height." We also have not yet designed features that are sensitive to surface prosodic boundaries or underlying morph boundaries. (Prosodic structure and autosegmental tiers are absent from our current representations, and we currently simplify the stochastic edit process's feature set by having Sθ erase the # morph boundaries before applying that process.)

Our standard prior over θ (section 3.4) resists overfitting in a generic way, by favoring phonologies that are "simple to describe." Linguistic improvements are possible here as well. The prior should arguably discourage positive weights more than negative ones, since our features detect constraint violations that ordinarily reduce probability. It should also be adjusted to mitigate the current structural bias against deletion edits, which arises because the single deletion possible in a context must compete on equal footing with |Σs| insertions and |Σs| − 1 substitutions. More ambitiously, a linguistically plausible prior should prefer phonologies that are conservative (s ≈ u) and have low conditional entropies H(s | u), H(u | s) to facilitate communication.

7 Experimental Design

We objectively evaluate our learner on its ability to predict held-out surface forms. This blind testing differs from traditional practice by linguists, who evaluate a manual or automatic analysis (= URs + phonology) on whether it describes the full dataset in a "natural" way that captures "appropriate" generalizations. We avoid such theory-internal evaluation by simply quantifying whether the learner's analysis does generalize (Eisner, 2015).

To avoid tailoring to our training/test data, we developed our method, code, features, and hyperparameters using only two development languages, English and German. Thus, our learner was not engineered to do well on the other 5 languages below: the graphs below show its first attempt to learn those languages. We do also evaluate our learners on English and German, using separate training/test data.

We provide all our data (including citations, development data, training-test splits, and natural classes) at http://hubal.cs.jhu.edu/tacl2015/, along with brief sketches of the phonological phenomena in the datasets, the "gold" stem URs we assumed for evaluation, and our learner's predictions and error patterns.

7.1 Evaluation methodology

Given a probability distribution p over surface word types of a language, we sample a training set of N types without replacement. This simulates reading text until we have seen N distinct types. For each of these frequent words, we observe the SR s and the morpheme sequence ~a.

After training our model, we evaluate its beliefs b about the SRs s on a disjoint set of test words whose ~a are observed. To improve interpretability of the results, we limit the test words to those whose morphemes have all appeared at least once in the training set. (Any method would presumably get other words badly wrong, just as it would get the training words right.)

To evaluate our belief b about the SR of a test word (~a, s∗), we use three measures for which "smaller is better." First, 0-1 loss asks whether s∗ ≠ argmaxs b(s). This could be compared with non-probabilistic predictors. Second, the surprisal − log2 b(s∗) is low if the model finds it plausible that s∗ realizes ~a. If so, this holds out promise for future work on analyzing or learning from unannotated tokens of s∗. Third, we evaluate the whole distribution b in terms of ∑s b(s) L(s∗, s) where L is unweighted Levenshtein distance.

[Figure 3 (bar charts): three panels—cross entropy (bits), expected edit distance, and 1-best error rate—on Maori, Catalan, Tangale, and Indonesian, comparing Noisy Concatenation, Our Method, and Oracle.]
Figure 3: Results on the small phonological exercise datasets (≈ 100 word types). Smaller numbers are better. Preliminary tests suggested that the variance of the prior (section 3.4) did not strongly affect the results, so we took σ2 = 5 for all experiments.

We take the average of each measure over test words, weighting those words according to p. This yields our three reported metrics: 1-best error rate, cross-entropy, and expected edit distance. Each metric is the expected value of some measure on a random test token.
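The three metrics can be read off directly once a belief b is available as an explicit distribution over candidate surface strings. The sketch below assumes b has already been converted from its WFSM form into a small dictionary of probabilities (in our implementation it remains a WFSM); the Levenshtein routine is the standard unweighted edit distance.

import math

def levenshtein(a, b):
    # Unweighted edit distance L(a, b) by the standard dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def evaluate_word(belief, s_star):
    # belief: dict mapping candidate surface strings to probabilities b(s).
    # Returns the three per-word measures of section 7.1.
    best = max(belief, key=belief.get)
    zero_one = 1.0 if best != s_star else 0.0              # 0-1 loss
    surprisal = -math.log2(belief.get(s_star, 1e-12))      # -log2 b(s*)
    exp_edit = sum(p * levenshtein(s, s_star) for s, p in belief.items())
    return zero_one, surprisal, exp_edit

def corpus_metrics(test_items, p):
    # Average each measure over test words, weighting by p(word).
    # test_items: list of (word_id, belief, s_star); p: dict word_id -> prob.
    total = sum(p[w] for w, _, _ in test_items)
    sums = [0.0, 0.0, 0.0]
    for w, belief, s_star in test_items:
        for i, v in enumerate(evaluate_word(belief, s_star)):
            sums[i] += p[w] * v
    return [s / total for s in sums]   # 1-best error, cross-entropy, exp. edit dist.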
These metrics are actually random variables, since they depend on the randomly sampled train- ing set and the resulting test distribution. We re- port the expectations of these random variables by running many training-test splits (see section 7.2). 7.2 Datasets To test discovery of interesting patterns from lim- ited data, we ran our learner on 5 “exercises” drawn from phonology textbooks (102 English nouns, 68 Maori verbs, 72 Catalan adjectives, 55 Tangale nouns, 44 Indonesian nouns), exhibiting a diverse range of phenomena. In each case we took p to be the uniform distribution over the pro- vided word types. We took N to be one less than the number of provided types. So to report our expected metrics, we ran all N + 1 experiments where we trained jointly on N forms and tested on the 1 remaining form. This is close to linguists’ practice of fitting an analysis on the entire dataset, yet it is a fair test. To test on larger, naturally occurring datasets, we ran our learner on subsets of the CELEX database (Baayen et al., 1995), which provides surface phonological forms and token counts for German, Dutch, and English words. For each language, we constructed a coherent subcorpus of 1000 nouns and verbs, focusing on inflections with common phonological phenomena. These turned out to involve mainly voicing: final obstru- ent devoicing (German 2nd-person present indica- tive verbs, German nominative singular nouns, Dutch infinitive verbs, Dutch singular nouns) and voicing assimilation (English past tense verbs, En- glish plural nouns). We were restricted to rela- tively simple phenomena because our current rep- resentations are simple segmental strings that lack prosodic and autosegmental structure. In future we plan to consider stress, vowel harmony, and templatic morphology. We constructed the distribution p in proportion to CELEX’s token counts. In each language, we trained on N = 200, 400, 600, or 800 forms sam- pled from p. To estimate the expectation of each metric over all training sets of size N, we report the sample mean and bootstrap standard error over 10 random training sets of size N. Except in Indonesian, every word happens to consist of at most two morphemes (one stem plus one possibly empty suffix). In all experiments, we take the phonological inventories Σu and Σs to be given as the set of all surface phonemes observed in training ∪ test. 7.3 Comparison systems There do not appear to be previous systems that perform our generalization task. Therefore, we compared our own system against variants. We performed an ablation study to determine whether the learned phonology was helpful. We substituted a simplified phonology model where Sθ(s | u) just decays exponentially with the edit distance between s and u; the decay rate was learned by EM as usual. That is, this model uses only the COPY feature of section 6. This baseline system treats phonology as “noisy concatenation” of learned URs, not trying to model its regularity. 
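For comparison, the ablated phonology is easy to state: the baseline scores a surface candidate only by how far it is from the concatenated UR. The sketch below gives this score in unnormalized form (our actual baseline is a PFST using only the COPY feature, whose single decay rate is fit by EM; the decay parameter here is just an illustrative placeholder).

import math
from functools import lru_cache

def edit_distance(a, b):
    # Unweighted Levenshtein distance (strings are short, so recursion is fine).
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0: return j
        if j == 0: return i
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def noisy_concat_score(u, s, decay=0.5, boundary="#"):
    # Unnormalized baseline score: the probability of s decays exponentially
    # with its edit distance from the UR u (morph boundaries removed).
    # 'decay' stands in for the single rate that EM would learn.
    return math.exp(-decay * edit_distance(u.replace(boundary, ""), s))

# e.g., the concatenated UR /kwIz#z/ against a few surface candidates
for cand in ["kwIzIz", "kwIzz", "kwIz"]:
    print(cand, round(noisy_concat_score("kwIz#z", cand), 3))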
[Figure 4 (plots): a 3×3 grid of curves—rows for 1-best error rate, cross-entropy (bits), and expected edit distance; columns for German, Dutch, and English—plotted against training set size N ∈ {200, 400, 600, 800}, comparing Noisy Concatenation, Our Method, and Oracle.]
Figure 4: Results on the CELEX datasets (1000 word types) at 4 different training set sizes N. The larger training sets were supersets of the smaller ones, obtained by continuing to sample with replacement from p. For each training set, the unconnected points evaluate all words /∈ training whose morphemes ∈ training. Meanwhile, the connected points permit comparison across the 4 values of N, by evaluating only on a common test set found by intersecting the 4 unconnected test sets. Each point estimates the metric's expectation over all ways of sampling the 4 training sets; specifically, we plot the sample mean from 10 such runs, with error bars showing a bootstrap estimate of the standard error of the mean. Non-overlapping error bars at a given N always happen to imply that the difference in the two methods' sample means is too extreme to be likely to have arisen by chance (paired permutation test, p < 0.05). Each time we evaluated some training-test split on some metric, we first tuned σ2 (section 3.4) by a coarse grid search where we trained on the first 90% of the training set and evaluated on the remaining 10%.

We considered an additional ablation study to determine whether the learned URs were helpful. However, we did not come up with a plausible heuristic for identifying URs in some simpler way. Thus, instead we asked whether the learned URs were as good as hand-constructed URs. Our "oracle" system was allowed to observe gold-standard URs for stems instead of inferring them. This system is still fallible: it must still infer the affix URs by belief propagation, and it must still use MAP-EM to estimate a phonology within our current model family Sθ. Even with supervision, this family will still struggle to model many types of phonology, e.g., ablaut patterns (in Germanic strong verbs) and many stress-related phenomena.

7.4 Results

We graph our results in Figures 3 and 4. When given enough evidence, our method works quite well across the 7 datasets. For 94–98% of held-out words on the CELEX languages (when N = 800), and 77–100% on the phonological exercises, our method's top pick is the correct surface form. Further, the other metrics show that it places most of its probability mass on that form,12 and the rest on highly similar forms. Notably, our method's predictions are nearly as good as if gold stem URs had been supplied (the "oracle" condition). Indeed, it does tend to recover those gold URs (Table 2).

Phon. Exercises        CELEX
Maori        95.5      German    99.9
Catalan      99.5      Dutch     86.3
Tangale      79.8      English   82.2
Indonesian  100
Table 2: Percent of training words, weighted by the distribution p, whose 1-best recovered UR (including the boundary #) exactly matches the manual "gold" analysis. Results are averages over all runs (with N = 800 for the CELEX datasets).

12 Cross-entropy < 1 bit means that the correct form has probability > 1/2 on average (using geometric mean).

Yet there are some residual errors in predicting the SRs. Our phonological learner cannot
perfectly learn the UR-to-SR mapping even from many well-supervised pairs (the oracle condition). In the CELEX and Tangale datasets, this is partly due to irregularity in the language itself. However, error analysis suggests we also miss some gener- alizations due to the imperfections of our current Sθ model (as discussed in sections 3.2 and 6). When given less evidence, our method’s perfor- mance is more sensitive to the training sample and is worse on average. This is expected: e.g., a stem’s final consonant cannot be reconstructed if it was devoiced (German) or deleted (Maori) in all the training SRs. However, a contributing factor may be the increased error rate of the phonolog- ical learner, visible even with oracle data. Thus, we suspect that a Sθ model with better gener- alization would improve our results at all train- ing sizes. Note that harming Sθ—allowing only “noisy concatenation”—clearly harms the method, proving the need for true phonological modeling. 8 Related Work Jarosz (2013, §2) and Tesar (2014, chapters 5– 6) review work on learning the phonology Sθ. Phonologists pioneered stochastic-gradient and passive-aggressive training methods—the Gradual Learning Algorithm (Boersma, 1998) and Error- Driven Constraint Demotion (Tesar and Smolen- sky, 1998)—for structured prediction of the sur- face word s from the underlying word u. If s is not fully observed during training (we illustrate why in layer 4 of Figure 1), then it can be imputed, a step known as Robust Interpretive Parsing. Recent papers consider our setting where u = m1#m2# · · · is not observed either. The contrast analysis method (Tesar, 2004; Merchant, 2008) in effect uses constraint propagation (Dechter, 2003). That is, it serially eliminates variable values (de- scribing aspects of the URs or the constraint rank- ing) that are provably incompatible with the data. Constraint propagation is an incomplete method that is not guaranteed to make all logical de- ductions. We use its probabilistic generalization, loopy belief propagation (Dechter et al., 2010)— which is still approximate but can deal with noise and stochastic irregularity. A further improvement is that we work with string-valued variables, repre- senting uncertainty using WFSMs; this lets us rea- son about URs of unknown length and unknown alignment to the SRs. (Tesar and Merchant in- stead used binary variables, one for each segmen- tal feature in each UR, requiring the simplifying assumption that the URs are known except for their segmental features. They assume that SRs are annotated with morph boundaries and that the phonology only changes segmental features, never inserting or deleting segments.) On the other hand, Tesar and Merchant reason globally about the con- straint ranking, whereas in this paper, we only lo- cally improve the phonology—we use EM, rather than the full Bayesian approach that treats the pa- rameters ~θ as variables within BP. Jarosz (2006) is closest to our work in that she uses EM, just as we do, to maximize the probabil- ity of observed surface forms whose constituent morphemes (but not morphs) are known.13 Her model is a probabilistic analogue of Apoussidou (2006), who uses a latent-variable structured per- ceptron. A non-standard aspect of this model (de- fended by Pater et al. (2012)) is that a morpheme a can stochastically choose different morphs M(a) when it appears in different words. To obtain a sin- gle shared morph, one could penalize this distribu- tion’s entropy, driving it toward 0 as learning pro- ceeds. 
Such an approach—which builds on a sug- gestion by Eisenstat (2009, §5.4)—would loosely resemble dual decomposition (Peng et al., 2015). Unlike our BP approach, it would maximize rather than marginalize over possible morphs. Our work has focused on scaling up inference. For the phonology S, the above papers learn the weights or rankings of just a few plausible con- straints (or Jarosz (2006) learns a discrete distribu- tion over all 5! = 120 rankings of 5 constraints), whereas we use Sθ with roughly 50,000 con- straints (features) to enable learning of unknown languages. Our S also allows exceptions. The above papers also consider only very restricted sets of morphs, either identifying a small set of plausible morphs or prohibiting segmental inser- tion/deletion. We use finite-state methods so that it is possible to consider the space Σ∗u of all strings. On the other hand, we are divided from pre- vious work by our inability to use an OT gram- mar (Prince and Smolensky, 2004), a stochastic OT grammar (Boersma, 1997), or even a maxi- mum entropy grammar (Goldwater and Johnson, 2003; Dreyer et al., 2008; Eisenstat, 2009). The reason is that our BP method inverts the phono- logical mapping Sθ to find possible word URs. 13She still assumes that word SRs are annotated with mor- pheme boundaries, and that a small set of possible morphs is given. These assumptions are relaxed by Eisenstat (2009). Given a word SR s, we construct a WFSM (mes- sage) that scores every possible UR u ∈ Σ∗u —the score of u is Sθ(s | u). To accomplish this step without approximation, our method needs Sθ itself to be represented as a WFSM (section 3.2). (The WFSM for a maximum entropy grammar unfortu- nately does not compute Sθ but only an unnormal- ized version. A different normalizing constant is needed for each u, akin to the “double intractabil- ity” problem in Bayesian learning.) In the NLP community, Elsner et al. (2013) re- sembles our work in many respects. Like us, they recover a latent underlying lexicon (using the same simple prior Mφ) and use EM to learn a phonol- ogy (rather similar to our Sθ, though less power- ful).14 Unlike us, they do not assume annotation of the (abstract) morpheme sequence, but jointly learn a nonparametric bigram model to discover the morphemes. Their evaluation is quite different, as their aim is actually to recover underlying words from phonemically transcribed child-directed En- glish utterances. However, nothing in their model distinguishes words from morphemes—indeed, sometimes they do find morphemes instead—so their model could be used in our task. For infer- ence, they invert the finite-state Sθ like us to recon- struct a lattice of possible UR strings. However, they do this not within BP but within a block Gibbs sampler that stochastically reanalyzes utterances one at a time. Whereas our BP tries to find a con- sensus UR for each given morpheme type, their sampler posits morph tokens while trying to reuse frequent morph types, which are interpreted as the morphemes. With observed morphemes (our set- ting), this sampler would fail to mix. Dreyer and Eisner (2009, 2011) like us used loopy BP and MAP-EM to predict morphologi- cal SRs. Their 2011 paper was also able to ex- ploit raw text without morphological supervision. However, they directly modeled pairwise finite- state relationships among the surface word forms without using URs. Their model is a joint distribu- tion over n variables: the word SRs of a single in- flectional paradigm. 
Since it requires a fixed n, it does not directly extend to derivational morphol- ogy: deriving new words would require adding new variables, which—for an undirected model like theirs—changes the partition function and re- 14Elsner et al. (2012) used an Sθ quite similar to ours though lacking bigram well-formedness features. Elsner et al. (2013) simplified this for efficiency, disallowing segmen- tal deletion and no longer modeling the context of changes. quires retraining. By contrast, our trained directed model is a productive phonological system that can generate unboundedly many new words (see section 4.1). By analogy, n samples from a Gaus- sian would be described with a directed model, and inferring the Gaussian parameters predicts any number of future samples n + 1,n + 2, . . .. Bouchard-Côté et al., in several papers from 2007 through 2013, have used directed graphi- cal models over strings, like ours though without loops, to model diachronic sound change. Some- times they use belief propagation for inference (Hall and Klein, 2010). Their goal is to recover la- tent historical forms (conceptually, surface forms) rather than latent underlying forms. The results are evaluated against manual reconstructions. None of this work has segmented words into morphs, although Dreyer et al. (2008) did seg- ment surface words into latent “regions.” Creutz and Lagus (2005) and Goldsmith (2006) segment an unannotated collection of words into reusable morphs, but without modeling contextual sound change, i.e., phonology. 9 Conclusions and Future Work We have laid out a probabilistic model for gener- ative phonology. This lets us infer likely expla- nations of a collection of morphologically related surface words, in terms of underlying morphs and productive phonological changes. We do this by combining well-motivated algorithms for in- ference in graphical models and MAP estimation from incomplete data, using weighted finite-state machines to encode uncertainty. Throughout our presentation, we were careful to point out various limitations of our setup. But in each case, we also outlined how future work could address these lim- itations within the framework we propose here. Finally, we proposed a detailed scheme for quantitative evaluation of phonological learners. Across 7 different languages, on both small and larger datasets, our learner was able to predict held-out surface forms with low error rates. Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. 1423276, and by a Fulbright grant to the first au- thor. We thank the anonymous reviewers and Reut Tsarfaty for useful discussion of presentation, ter- minology, and related work. References Diana Apoussidou. 2006. On-line learning of under- lying forms. Technical Report ROA-835, Rutgers Optimality Archive. R Harald Baayen, Richard Piepenbrock, and Leon Gu- likers. 1995. The CELEX lexical database on CD- ROM. Juliette Blevins. 1994. A phonological and morpho- logical reanalysis of the maori passive. Te Reo, 37:29–53. Paul Boersma and Bruce Hayes. 2001. Empirical tests of the gradual learning algorithm. Linguistic In- quiry, 32(1):45–86. Paul Boersma. 1997. How we learn variation, option- ality, and probability. In Proc. of the Institute of Phonetic Sciences of the University of Amsterdam, volume 21, pages 43–58. Paul Boersma. 1998. How we learn variation, op- tionality, and probability. 
In Functional Phonology: Formalizing the Interactions Between Articulatory and Perceptual Drives, chapter 15. Ph.D. Disserta- tion, University of Amsterdam. Previously appeared in IFA Proceedings (1997), pp. 43–58. Alexandre Bouchard-Côté, Percy Liang, Thomas L. Griffiths, and Dan Klein. 2007. A probabilistic ap- proach to language change. In Proc. of NIPS. Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated re- construction of ancient languages using probabilis- tic models of sound change. Proceedings of the Na- tional Academy of Sciences. Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row. George N Clements and Elizabeth V Hume. 1995. The internal organization of speech sounds. In John Goldsmith, editor, Handbook of Phonological The- ory. Oxford University Press, Oxford. Ryan Cotterell and Jason Eisner. 2015. Penalized expectation propagation for graphical models over strings. In Proceedings of NAACL-HLT, pages 932– 942, Denver, June. Supplementary material (11 pages) also available. Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proc. of ACL. Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proc. of the International and Interdisciplinary Conference on Adaptive Knowl- edge Representation and Reasoning (AKRR05), vol- ume 1. Rina Dechter, Bozhena Bidyuk, Robert Mateescu, and Emma Rollon. 2010. On the power of belief propagation: A constraint propagation per- spective. In Rina Dechter, Hector Geffner, and Joseph Y. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl. College Publications. Rina Dechter. 2003. Constraint Processing. Morgan Kaufmann. Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proc. of EMNLP, pages 101–110. Markus Dreyer and Jason Eisner. 2011. Discover- ing morphological paradigms from plain text us- ing a dirichlet process mixture model. In Proc. of EMNLP, EMNLP ’11, pages 616–627. Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008. Latent-variable modeling of string transduc- tions with finite-state methods. In Proc. of EMNLP, pages 1080–1089. Markus Dreyer. 2011. A Non-Parametric Model for the Discovery of Inflectional Paradigms from Plain Text Using Graphical Models over Strings. Ph.D. thesis, Johns Hopkins University, Baltimore, MD, April. Sarah Eisenstat. 2009. Learning underlying forms with maxent. Master’s thesis, Brown University, Providence, RI. Jason Eisner. 2002a. Comprehension and compilation in Optimality Theory. In Proc. of ACL, pages 56–63, Philadelphia, July. Jason Eisner. 2002b. Discovering syntactic deep structure via Bayesian statistics. Cognitive Science, 26(3):255–268, May-June. Jason Eisner. 2015. Should linguists evaluate gram- mars or grammar learners? In preparation. Micha Elsner, Sharon Goldwater, and Jacob Eisenstein. 2012. Bootstrapping a unified model of lexical and phonetic acquisition. In Proc. of ACL, pages 184– 193. Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic vari- ability. In Proc. of EMNLP, pages 42–54. Alexander M. Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. 2012. Modeling inflection and word- formation in SMT. In Proc. of EACL, pages 664– 674. J. Goldsmith. 2006. An algorithm for the unsupervised learning of morphology. 
Sharon Goldwater and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proc. of the Workshop on Variation within Optimality Theory, pages 113–122, Stockholm University.
David Hall and Dan Klein. 2010. Finding cognate groups using phylogenies. In Proc. of ACL.
Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.
Matt Hohensee and Emily M. Bender. 2012. Getting more from morphology in multilingual dependency parsing. In Proc. of NAACL-HLT, pages 315–326.
Mans Hulden. 2009. Revisiting multi-tape automata for Semitic morphological analysis and generation. In Proc. of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 19–26, March.
Sharon Inkelas and Cheryl Zoll. 2005. Reduplication: Doubling in Morphology. Number 106 in Cambridge Studies in Linguistics. Cambridge University Press.
Roman Jakobson. 1948. Russian conjugation. Word, 4:155–167.
Gaja Jarosz. 2006. Richness of the base and probabilistic unsupervised learning in optimality theory. In Proc. of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, pages 50–59.
Gaja Jarosz. 2013. Learning with hidden structure in optimality theory and harmonic grammar: Beyond robust interpretive parsing. Phonology, 30(1):27–71.
C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Mouton.
René Kager. 1999. Optimality Theory, volume 2. MIT Press.
Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.
Michael J. Kenstowicz and Charles W. Kisseberth. 1979. Generative Phonology. Academic Press, San Diego.
András Kornai. 1995. Formal Phonology. Garland Publishing, New York.
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. of EMNLP, pages 40–51, Singapore, August.
Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1–3):503–528.
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proc. of ICML, pages 591–598.
Navarré Merchant. 2008. Discovering Underlying Forms: Contrast Pairs and Ranking. Ph.D. thesis, Rutgers University. Available on the Rutgers Optimality Archive as ROA-964.
Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proc. of UAI, pages 467–475.
Joe Pater, Karen Jesney, Robert Staubs, and Brian Smith. 2012. Learning probabilities over underlying representations. In Proc. of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology, pages 62–71.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Dual decomposition inference for graphical models over strings. In Proc. of EMNLP, Lisbon, September. To appear.
Janet Pierrehumbert. 2003. Probabilistic phonology: Discrimination and robustness. In Probabilistic Linguistics, pages 177–228. MIT Press.
Alan Prince and Paul Smolensky. 2004. Optimality Theory: Constraint Interaction in Generative Grammar. Wiley-Blackwell.
Jason A. Riggle. 2004. Generation, Recognition, and Learning in Finite State Optimality Theory. Ph.D. thesis, University of California at Los Angeles.
Jason Riggle. 2005. Phonological features. Available online at http://www.mml.cam.ac.uk/dtal/courses/ugrad/paper_support/li8/riggle-feature-chart.pdf (retrieved 2014-09-28).
David Sankoff. 1978. Probability and linguistic variation. Synthese, 37(2):217–238.
Ferdinand de Saussure. 1916. Course in General Linguistics. Columbia University Press. English edition of June 2011, based on the 1959 translation by Wade Baskin.
Paul Smolensky and Géraldine Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Vol. 1: Cognitive architecture). MIT Press.
Bruce Tesar and Paul Smolensky. 1998. Learnability in Optimality Theory. Linguistic Inquiry, 29(2):229–268.
Bruce Tesar. 2004. Contrast analysis in phonological learning. Technical Report ROA-695, Rutgers Optimality Archive.
Bruce Tesar. 2014. Output-Driven Phonology: Theory and Learning. Cambridge University Press.
Reyyan Yeniterzi. 2011. Exploiting morphology in Turkish named entity recognition system. In Proc. of the ACL Student Session, pages 105–110.
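The Gaussian analogy from the related-work discussion can be made concrete with a minimal sketch, assuming Python with NumPy and arbitrary illustrative parameter values; it is only an illustration of the directed-model argument, not the string model described in this paper.

import numpy as np

# Directed generation from a Gaussian: the n observed samples are drawn
# i.i.d. given the parameters.
rng = np.random.default_rng(0)
observed = rng.normal(loc=2.0, scale=0.5, size=100)  # the n "training" samples

# "Training": infer the parameters (here, simple point estimates).
mu_hat = observed.mean()
sigma_hat = observed.std(ddof=1)

# "Generation": samples n+1, n+2, ... reuse the same conditional
# distribution, so nothing is retrained and no normalizing constant
# changes when new variables are added.
future = rng.normal(loc=mu_hat, scale=sigma_hat, size=5)
print(mu_hat, sigma_hat, future)

The point of the sketch is the contrast drawn above: in a directed model, adding a new observed variable simply adds one more conditional factor, whereas in an undirected model it changes the partition function and hence requires retraining.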