Unsupervised Discovery of Biographical Structure from Text

David Bamman
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
dbamman@cs.cmu.edu

Noah A. Smith
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
nasmith@cs.cmu.edu

Abstract

We present a method for discovering abstract
event classes in biographies, based on a prob-
abilistic latent-variable model. Taking as in-
put timestamped text, we exploit latent corre-
lations among events to learn a set of event
classes (such as BORN, GRADUATES HIGH
SCHOOL, and BECOMES CITIZEN), along
with the typical times in a person’s life when
those events occur. In a quantitative evalua-
tion at the task of predicting a person’s age
for a given event, we find that our genera-
tive model outperforms a strong linear regres-
sion baseline, along with simpler variants of
the model that ablate some features. The ab-
stract event classes that we learn allow us
to perform a large-scale analysis of 242,970
Wikipedia biographies. Though it is known
that women are greatly underrepresented on
Wikipedia—not only as editors (Wikipedia,
2011) but also as subjects of articles (Reagle
and Rhue, 2011)—we find that there is a bias
in their characterization as well, with biogra-
phies of women containing significantly more
emphasis on events of marriage and divorce
than biographies of men.

1 Introduction

The written text that we interact with on an everyday
basis—news articles, emails, social media, books—
contains a vast amount of information centered on
people: news (including common NLP corpora such
as the New York Times and the Wall Street Journal)
details the roles of actors in current events, social
media (including Twitter and Facebook) documents
the actions and attitudes of friends, and books chron-
icle the stories of fictional characters and real people

alike. This focus on people gives us an abundance of
information on how the lives of those portrayed un-
fold; for corpora that include historically deep bi-
ographical information (such as Wikipedia, book-
length biographies and autobiographies, and even
newspaper obituaries) this data includes the actors
involved in particular historical events and the times
and places in which they occur. The life events de-
scribed in these texts have natural structure: event
classes exhibit correlations with each other (e.g.,
those who DIVORCE must have been MARRIED),
can occur at roughly similar times in the lives of dif-
ferent individuals (MARRIAGE is more likely to oc-
cur earlier in one’s life than later), and can be bound
to historical moments as well (FIGHTS IN WORLD
WAR II peaks in the early 1940s).

Social scientists have long been interested in the
structure of these events in investigating the role
that individual agency and larger social forces play
in shaping the course of an individual’s life. Life
stages marking “transitions to adulthood” (such as
LEAVING SCHOOL, ENTERING THE WORKFORCE
and MARRIAGE) have important correlates with de-
mographic variables (Modell et al., 1976; Hogan
and Astone, 1986; Shanahan, 2000); and researchers
study the interactional effects that life events have
on each other, such as the relationship between di-
vorce and pre-marital cohabitation (Lillard et al.,
1995; Reinhold, 2010) or having children (Lillard
and Waite, 1993).

The data on which these studies draw, however,
has largely been restricted to categorical surveys
and observational data; we present here a latent-
variable model that exploits the correlations of event
descriptions in text to learn the structure of abstract
events, grounded in time, from text alone. While our

363

Transactions of the Association for Computational Linguistics, 2 (2014) 363–376. Action Editor: Jian Su.
Submitted 6/2014; Published 10/2014. c©2014 Association for Computational Linguistics.


model can be estimated on any set of texts where the
birth dates of a set of mentioned entities are known,
we illustrate our method on a large-scale dataset of
242,970 biographies extracted from Wikipedia.

This paper makes two contributions: first, we
present a general unsupervised model for learning
life event classes from biographical text, along with
the structure that binds them; second, in using this
method to learn event classes from Wikipedia, we
uncover evidence of systematic bias in the presen-
tation of male and female biographies (with biogra-
phies of women containing significantly dispropor-
tionate emphasis on the personal events of mar-
riage and divorce). In addition to these contribu-
tions, we also present a range of other analyses that
uncovering life events in text can make possible.
Data and code to support this work can be found at
http://www.ark.cs.cmu.edu/bio/.

2 Data

The data for this analysis originates in the January
2, 2014 dump of English-language Wikipedia.1 We
extract biographies by identifying all articles with
persondata metadata2 in which the DATE OF
BIRTH field is known. This results in a set of
927,403 biographies.

For each biography, we perform part-of-speech
tagging using the Stanford POS tagger (Toutanova
et al., 2003) and named entity recognition using
the Stanford named entity recognizer (Finkel et al.,
2005), cluster all mentions of co-referring proper
names (Davis et al., 2003; Elson et al., 2010) and
resolve pronominal co-reference (Bamman et al.,
2014), aided by gender inference for each entity as
the gender corresponding to the maximum number
of gendered pronouns (i.e., he and she) mentioned
in the article, as also used by Reagle and Rhue
(2011). In a random test set of 500 articles, this
method of gender inference is overwhelmingly ac-
curate, achieving 100% precision with 97.6% recall
(12 articles had no pronominal mentions and so gen-
der is not assigned).

1http://dumps.wikimedia.org/enwiki/
20140102/enwiki-20140102-pages-articles.
xml.bz2

2“Persondata is a special set of metadata that can and
should be added to biographical articles only” (http://en.
wikipedia.org/wiki/Wikipedia:Persondata).

As further preprocessing, we identify multiword
expressions in all texts as maximal sequences of
adjective + noun part of speech tags (yielding, for
example, New York, United States, early life and
high school), as first described in Justeson and Katz
(1995). For each biographical article, we then ex-
tract all sentences in which the subject of the ar-
ticle is mentioned along with a single date and re-
tain only the terms in each sentence that are among
the most frequent 10,000 unigrams and multiword
expressions in all documents, excluding stopwords
such as the and all numbers (including dates). An
“event” is the bag of these unigrams and multiword
expressions extracted from one such sentence, along
with a corresponding timestamp measured as the dif-
ference between the observed date in the sentence
and the date of birth of the entity.

Table 1 illustrates the actual form of the data with
a sample of extracted sentences from the biography
of Frank Lloyd Wright, along with the data as input
to the model. In the terminology of the model de-
scribed below, each sentence constitutes one “event”
in the subject’s life.

For the final dataset we retain all biographies
where the subject of the article is born after the
year 1800 and for which there exist at least 5 events
(242,970 people). The complete data consists of
2,313,867 events across these 242,970 people.

3 Model

The quantities of interest that we want to learn from
the data are: 1.) a broad set of major life events
recorded in Wikipedia biographies that people expe-
rience at similar stages in their lives (such as BEING
BORN, GRADUATING HIGH SCHOOL, SERVING IN
THE ARMY, GETTING MARRIED, and so on); 2.)
correlations among those life events (e.g., knowing
that if an individual WINS A NOBEL PRIZE that
they’re more likely to RECEIVE AN HONORARY
DOCTORATE); and 3.) an attribution of those classes
of events to particular moments in a specific indi-
vidual’s life (e.g., John Nash RECEIVED AN HON-
ORARY DOCTORATE in 1999).

We cast this problem as an unsupervised learning
one; given no labeled instances, can we infer these
quantities from text alone? One possible alterna-
tive approach would be to leverage the categorical

364


Original sentence
Data as input to model
Terms (w) Time (t)

He was admitted to the University of Wisconsin–
Madison as a special student in 1886.

admitted university wisconsin madison special
student

19

Wright first traveled to Japan in 1905, where he
bought hundreds of prints.

wright first traveled japan bought hundreds prints 38

After Wright’s return to the United States in Octo-
ber 1910, Wright persuaded his mother to buy land
for him in Spring Green, Wisconsin.

wright return united states wright persuaded
mother buy land spring green wisconsin

43

This philosophy was best exemplified by his design
for Fallingwater (1935), which has been called “the
best all-time work of American architecture”.

philosophy best design called best all-time work
american architecture

68

Already well known during his lifetime, Wright
was recognized in 1991 by the American Institute
of Architects as “ the greatest American architect
of all time.”

already well known lifetime wright recognized
american institute architects greatest american
architect time

124

Table 1: A sample of 5 of the 64 sentences (original and converted) that constitute the data for Frank Lloyd
Wright (born 1867). Each event is defined as one such temporally-scoped sentence.

information contained in Wikipedia biographies (or
its derivatives, such as Freebase; Google, 2014) as
a form of supervision (e.g., George Washington is a
member of the categories Presidents of the United
States and American cartographers, among others).
These manual categories, however, are often spo-
radically annotated and have a long tail (with most
categories appearing very few times); in learning
event structure directly from text, we avoid relying
on categories’ accuracy and being constrained by a
fixed ontology. One advantage of an unsupervised
approach is that we eliminate the need to define a
pre-determined set of event classes a priori, allow-
ing application across a variety of different domains
and time periods, such as full-text books from the
Internet Archive or Hathi Trust, or historical works
like the Oxford Dictionary of National Biography
(Matthew and Harrison, 2004).

Figure 1a illustrates the graphical form of our hi-
erarchical Bayesian model, which articulates the re-
lationship between an entity’s set of events (where
each event is an observation defined as the bag of
terms in text and the difference between the year
it was recorded as happening and the birth year),
an abstract set of event classes, correlations among
those abstract classes, and the distribution of vocab-
ulary terms that defines each one. To capture corre-
lations among different classes, we place a logistic
normal prior on each biography’s distribution over

event classes (Blei and Lafferty, 2006a; Blei and
Lafferty, 2007; Mimno et al., 2008); unlike a Dirich-
let, a logistic normal is able to capture arbitrary cor-
relations between elements through the structure of
the covariance matrix of its underlying multivariate
normal. We take a Bayesian approach to estimating
the mean µη and covariance Ση, drawing them from
a conjugate Normal-Inverse Wishart prior.

The generative story for the model runs as fol-
lows: let K be the number of latent event classes, P
be the number of biographies, and Ep be the number
of events in biography p.

• Draw event class means and covariances
µη ∈ RK, Ση ∈ RK×K ∼
Normal-Inverse Wishart(µ0,λ, Ψ,ν)
• For each event class i ∈{1, . . . ,K}:

- Draw event-term distribution φk ∼ Dir(γ)
• For each biography p:

- Draw ηp ∼N(µη, Ση)
- Convert ηp into biography-event

proportions βp through the softmax
function: βp,i =

exp(ηp,i)∑K
k=1 exp(ηp,k)

- For each event e in biography p:
- Draw event class index z ∼ Mult(βp)
- Draw timestamp t ∼N(µz,σ2z)
- For each token in event e:

- Draw term w ∼ Mult(φz)

365


z

w

t

η

µη Ση

NIW

φ

γ σ2

µ

W

E

P

(a) FULL.

z

w

t

η

α

φ

γ σ2

µ

W

E

P

(b) –CORRELATION. η ∼ Dir(α)

z

w

η

µη Ση

NIW

φ

γ

W

E

P

(c) –TIME

z

w

η

α

φ

γ

W

E

P

(d) –CORRELATION,
–TIME. η ∼ Dir(α)

Figure 1: Graphical form of the full model (described in §3) and models with ablations (described in §4).

Inference proceeds via stochastic EM: after ini-
tializing all variables to random values, we alter-
nate between collapsed Gibbs sampling for the la-
tent class indicators followed by maximization steps
over all other parameters:

1. Sample all z using collapsed Gibbs sampling
conditioned on current values for η and all
other z.

2. For each biography p, maximize likelihood
with respect to ηp via gradient ascent given the
current samples of z and priors µη and Ση.

3. Assign MAP estimates of µη and Ση given
current values of η and the Normal-Inverse
Wishart prior. Update µ and σ2 according to
its maximum likelihood estimate given z.

We describe the technical details of each step be-
low.

Sampling z. Given fixed biography-event class
proportions η, observed tokens w, timestamp t, and
current samples z− for all other events, the proba-
bility of a given event belonging to event class k is
as follows:

P(z = k | z−,w,t,η,γ,µ,σ2) ∝ exp(ηk)

×σ−1k exp
(
−(t−µk)

2

2σ2k

)

×
∏V
v=1

∏e(v)
i=1 (γ + c

−(k,v) + i− 1)
∏Ne
n=1 (V γ + c

−(k,?) + n− 1)

(1)

Here c−(k,v) is the count of the number of times
vocabulary term v shows up in all events whose cur-
rent sample z = k (excepting the current one being
sampled), c−(k,?) is the total count of all terms in
all events whose current z = k (again excepting the
current one), Ne is the number of terms in event e,
and e(v) is the count of vocabulary term v in the cur-
rent event. (Note the complexity of the last term is
due to drawing multiple observations from a single
collapsed multinomial; Carpenter, 2010.)

Maximizing η. Under our model, the terms in the
likelihood function that involve η include the likeli-
hood of the samples drawn from it and its own prob-
ability given the multivariate Normal prior:

L(η) ∝
N∏

n=1

exp(ηzn )∑K
k=1 exp(ηk)

×N(η | µη, Ση) (2)

The log likelihood is proportional to:

`(η) ∝
N∑

n=1

ηzn −
N∑

n=1

K∑

k=1

exp(ηk)

−1
2

(η −µη)> Σ−1η (η −µη)
(3)

Given samples of the latent event class z for all
events in biography p, we maximize the value of ηp
using gradient ascent. We can think of this as maxi-
mizing the likelihood of the observations z subject to
`2 (Gaussian) regularization, where the covariance

366


matrix in the regularizer encourages correlations in
η: if a document contains many examples of z = k
and zk is highly correlated with zj, then the optimal
η is encouraged to contain high weights at both ηk
and ηj rather than simply ηk alone.

Maximizing µη, Ση,µ,σ2. Given values for η, we
then find maximum a posteriori estimates of µη
and Ση conditioned on the Normal-Inverse Wishart
(NIW) prior. The NIW is a conjugate prior to a mul-
tivariate Gaussian, parameterized by dimensionality
K, initial mean µ0, positive-definite scale matrix Ψ,
and scalars ν > K − 1 and λ > 0. The prior
parameters Ψ and ν have an intuitive interpretation
as the scatter matrix

∑ν
i=1 (xi − x̄) (xi − x̄)

> for ν
pseudo-observations.

The expected value of the covariance matrix
drawn from a NIW distribution parameterized by Ψ
and ν is Ψ

ν−K−1 . To disprefer correlations among
topics in the absence of strong evidence, we fix
µ0 = 0 and set Ψ so that this prior expectation over
Ση is the product of a scalar value ρ and the iden-
tity matrix I: Ψ = (ν − K − 1)ρI; ρ defines the
expected variance, and the higher the value of ν, the
more strongly the prior dominates the posterior es-
timate of the covariance matrix (i.e., the more the
covariance matrix is shrunk toward ρI). λ likewise
has an intuitive understanding as a dampening pa-
rameter: the higher its value, the more the posterior
estimate of the mean µ̂ shrinks toward 0. For n data
points, we set λ = n/10, ν = K + 2, and ρ = 1.

Since the NIW is conjugate with the multivariate
normal, posterior updates to µη and Ση have closed-
form expressions given values of η (here, η̄ denotes
the mean value of η over all biographies).

µ̂η =
n

λ + n
η̄ (4)

Σ̂η =
Ψ +

∑N
i=1 (ηi − η̄) (ηi − η̄)

>
+ λn

λ+n
η̄η̄>

ν + n + K + 1
(5)

Since we have no meaningful prior information on
the values of µ and σ2, we calculate their maximum
likelihood estimate given current samples z.

4 Evaluation

While the goal of this work is to learn qualitative
categories of life events from text, we can quantita-

tively evaluate the performance of our model on the
empirical task of predicting the age in a person’s life
when an event occurs.

For this task, we compare the full model described
above with a strong baseline of `2-regularized linear
regression and also with comparable models with
feature ablations, in order to quantify the extent to
which various aspects of the full model are con-
tributing to its empirical performance. The compa-
rable ablated models include the following:

• –CORRELATION, figure 1b. Rather than a lo-
gistic normal prior on the entity-specific distri-
bution over event types (η), we draw η from a
symmetric Dirichlet distribution parameterized
by a global α. In a Dirichlet distribution, arbi-
trary correlations cannot be captured.
• –TIME, figure 1c. In the full model, the time-

stamps of the observed events influence the
event classes we learn by encouraging them to
be internally coherent and time-sensitive. To
test this design choice, we ablate time as a fea-
ture during inference.
• –CORRELATION,–TIME, figure 1d. We also

test a model that ablates both the correlation
structure in the prior and the influence of time;
this model corresponds to smoothed, unsuper-
vised naı̈ve Bayes.

As during inference, we define an event to be the
set of terms, excluding stopwords and numbers, that
are present in the vocabulary of the 10,000 most fre-
quent words and multiword expressions in the data
overall. Each event is accompanied by the year of its
occurrence, from which we calculate the gold target
prediction (the age of the person at the time of the
event) as the year minus the entity’s year of birth.
For all of the four models described above (the full
model and three ablations), we train the model on
4/5 of the biographies (194,376 entities, on average
1,851,094 events); we split the remaining 1/5 of the
biographies into development data (where t is ob-
served) and test data (where t is predicted). The de-
tails of inference for each model are as follows:

1. FULL. Inference as above for a burn-in period
of 100 iterations, using slice sampling (Neal,
2003) to optimize the value of the Dirichlet hy-
perparameter γ every 10 iterations; after infer-
ence, the parameters µη, Ση,µ,σ2 and φ are es-

367


timated from samples drawn at the final itera-
tion and held fixed. For test entities, we infer
the MAP value of η using development data,
and predict the age of each test event as the
mean time marginalizing over the event type in-
dicator z. t̂ = Ez[µz].

2. –CORRELATION. Here we perform collapsed
Gibbs sampling for 100 iterations, using slice
sampling to optimize the value of α and γ ev-
ery 10 iterations; after inference, the parame-
ters µ,σ2 and φ are estimated from single final
samples and held fixed. For development and
test data, we run Gibbs sampling on event indi-
cators z for 10 iterations and predict the age of
each test event as the mean time marginalizing
over the event type indicator z. t̂ = Ez[µz].

3. –TIME. Inference as above for 100 iterations,
using slice sampling to optimize the value of
γ every 10 iterations; after inference, the pa-
rameters µη, Ση and φ are estimated from sin-
gle final samples and held fixed. Since time is
not known to this model during inference, we
create post hoc estimates of µ̂z as the empirical
mean age of events sampled to event class z us-
ing single samples for each event in the training
data from the final sampling iteration. For test
entities, we infer the MAP value of η using de-
velopment data, and predict the age of each test
event as the average empirical age marginaliz-
ing over the event type indicator z. t̂ = Ez[µ̂z].

4. –CORRELATION,–TIME. We perform infer-
ence as above for the –CORRELATION model,
and time prediction as in the –TIME model.
t̂ = Ez[µ̂z].

To compare against a potentially more powerful
discriminative model, we also evaluate linear regres-
sion with `2 (ridge) regularization, using binary indi-
cators of the same unigrams and multiword expres-
sions available to the models above.

5. LINEAR REGRESSION. Train on training and
development data, optimizing the regulariza-
tion coefficient λ in three-fold cross-validation.

During training, linear regression learns that the
terms most indicative of events that take place later
in life are stamp, descendant, commemorated, died,
plaque, grandson, and lifetime achievement award,

while those that denote early events are born, bap-
tised, apprenticed, and acting debut.

We evaluate all models on identical splits using
5-fold cross validation. For an interpretable error
score, we use mean absolute error, which corre-
sponds to the number of years, on average, by which
each model is incorrect.

MAE =
1

N

N∑

i=1

∣∣t̂− ti
∣∣ (6)

Figure 2 presents the results of this evaluation for all
models and different choices of the number of latent
event classes K ∈ {10, 25, 50, 100, 250, 500}. Lin-
ear regression represents a powerful model, achiev-
ing a mean absolute error of 11.87 years across all
folds, but is eclipsed by the latent variable model at
K ≥ 50. The correlations captured by the logis-
tic normal prior make a clear difference, uniformly
yielding improvements over otherwise equivalent
Dirichlet models across all K. As expected, mod-
els trained without knowledge of time during infer-
ence perform less well than models that contain that
information.

12

13

14

10 25 50 100 250 500
Number of event classes (log scale)

M
ea

n 
ab

so
lu

te
 e

rr
or

 (
ye

ar
s)

Model

Linear Regression
Full Model
Ablation: -Time
Ablation: -Correlation
Ablation: -Correlation, -Time

Figure 2: Mean average error (in years) for time pre-
diction.

5 Analysis

To analyze the latent event classes in Wikipedia bi-
ographies, we train our full model (with a logis-
tic normal prior and time as an observable vari-
able) on the full dataset of 242,970 biographies with

368


Age µ Age σ % Fem. Most probable terms in class
18.00 0.67 15.6% high school, graduated, attended, graduating, school, born, early life, class, grew
21.89 1.83 0.2% drafted, nfl draft, round, professional career, draft, overall, selected
22.27 1.19 17.6% graduated, bachelor, degree, university, received, college, attended, earned, b. a.
22.67 4.33 3.6% joined, enlisted, army, served, world war ii, united states army, years, corps
25.81 3.47 11.1% law, university, graduated, received, school, law school, degree, law degree
32.32 8.19 12.0% thesis, received, university, phd, dissertation, doctorate, degree, ph. d., completed
38.24 15.29 17.0% citizen, became, citizenship, united states, american, u. s., british, granted, since
39.33 12.53 39.4% divorce, marriage, divorced, married, filed, wife, separated, years, ended, later
42.57 13.78 16.3% university, teaching, professor, college, taught, faculty, school, department, joined
43.79 15.54 13.8% trial, murder, case, court, charges, guilty, jury, judge, death, convicted
45.89 18.71 13.3% died, accident, killed, death, near, crash, car, involved, car accident, injured
46.22 16.30 11.2% prison, released, years, sentence, sentenced, months, parole, federal, serving
49.81 10.28 7.0% governor, candidate, unsuccessful candidate, congress, ran, reelection
51.41 11.23 1.2% bishop, appointed, archbishop, diocese, pope, consecrated, named, cathedral
54.91 12.04 7.9% chairman, board, president, ceo, became, company, directors, appointed, position
59.06 14.17 16.9% awarded, university, received, honorary doctorate, honorary degree, degree, doctor
62.81 24.16 11.1% fame, inducted, hall, sports hall, elected, national, football hall, international
72.52 13.69 12.4% died, hospital, age, death, complications, cancer, home, heart attack, washington
92.39 46.06 13.0% national, historic, park, state, house, named, memorial, home, honor, museum
95.29 42.65 12.1% statue, unveiled, memorial, plaque, anniversary, erected, monument, death, bronze

Table 2: Salient event classes learned from 242,970 Wikipedia biographies. All 500 event classes can be
viewed at http://www.ark.cs.cmu.edu/bio.

K = 500 event classes; as above, we run inference
for a burn-in period of 100 iterations and collect 50
samples from the posterior distributions for z (the
event class indicator for each event).

Table 2 illustrates a sample of 20 event classes
along with the mean time µ and standard deviation
σ, the gender distribution (calculated from the poste-
rior distribution over z for all entities whose gender
is known3) and the most probable terms in the class.

The latent classes that we learn span a mix of ma-
jor life events of Wikipedia notable figures (includ-
ing events that we might characterize as GRADU-
ATING HIGH SCHOOL, BECOMING A CITIZEN, DI-
VORCE, BEING CONVICTED OF A CRIME, and DY-
ING) and more fine-grained events (such as BE-
ING DRAFTED BY A SPORTS TEAM and BEING IN-
DUCTED INTO THE HALL OF FAME).

Emerging immediately from this summary is an
imbalance in the gender distribution for many of
these event classes. Among the 242,858 biographies
whose gender is known, 14.8% are of women; we
would therefore expect around 14.8% of the partic-

3Using our method of gender inference described in §2, we
are able to infer gender for 99.95% of biographies (242,858).

ipants in most event classes to be female. Figures
3 and 4 illustrate five of the most highly skewed
classes in both directions, ranked according to the z
score of a two-tailed binomial proportion test (H0 =
14.8).

While some of these classes reflect a biased world
in which more men are drafted into sports teams,
serve in the armed forces, and are ordained as
priests, one latent class that calls out for explanation
is that surrounding DIVORCE (divorce, marriage, di-
vorced, filed, married, wife, separated, years, ended,
later), whose female proportion of 39.4% is nearly
triple that of the data overall (and whose z-score re-
veals it to be strongly statistically different [p �
0.0001] from the H0 mean, even accounting for the
Bonferroni correction we must make when consider-
ing the K = 500 tests we implicitly perform when
ranking). While we did not approach this analy-
sis with any a priori hypotheses to test, our unsu-
pervised model reveals an interesting hypothesis to
pursue with confirmatory analysis: biographies of
women on Wikipedia disproportionately focus on
marriage and divorce compared to those of men.

To test this hypothesis with more traditional

369


z %Fem. Most frequent terms
60.46 76.9% miss, pageant, title, usa, miss universe, beauty, held, teen, crowned, competed
57.21 49.9% birth, gave, daughter, son, born, first child, named, wife, announced, baby
55.63 59.8% fashion, model, show, campaign, week, appeared, face, career, became, modeling
37.89 39.4% divorce, marriage, divorced, married, filed, wife, separated, years, ended, later
36.70 36.5% summer olympics, competed, olympics, team, finished, event, final, world championships

Table 3: Female-skewed event classes, ranked by z-score in a two-tailed binomial proportion test.

z %Fem. Most frequent terms
-31.64 0.2% drafted, nfl draft, round, professional career, draft, overall, selected, major league baseball
-23.81 2.1% promoted, rank, captain, retired, army, lieutenant, colonel, major, brigadier general
-20.93 3.7% bar, admitted, law, practice, called, commenced, studied, began, career, practiced
-20.48 1.0% infantry, civil war, regiment, army, enlisted, served, company, colonel, captain
-20.30 1.7% ordained, priest, seminary, priesthood, theology, theological, college, studies, rome

Table 4: Male-skewed event classes, ranked by z-score in a two-tailed binomial proportion test.

means, we estimated the empirical gender propor-
tions of biographies containing terms explicitly de-
noting divorce (divorced, divorce, divorces and di-
vorcing). The result of this analysis confirms that
of the model. Of the 4,608 biographies in which
at least one of these terms appears, 38.8% are
those of a woman, far more than the 14.8% we
would expect (in a two-tailed binomial proportion
test against H0 = 14.8, this difference is significant
at p < 0.0001); this corresponds to divorce being
mentioned in 5.0% of all 35,932 women’s biogra-
phies, and 1.4% of all 206,926 men’s; on average,
a woman’s biography is 3.66 times more likely to
mention divorce than a man’s.

We repeat the gender proportion experiment with
terms denoting marriage (married, marry, marries,
marrying and marriage) and find a similar trend: of
the 39,142 biographies where at least one of these
terms is mentioned, 23.6% belong to women; again,
in a two-tailed proportion test, this difference is sig-
nificant at p < 0.0001. This corresponds to marriage
appearing in 25.7% of all women’s biographies, and
14.5% of men’s; a woman’s biography is 1.78 times
more likely to mention marriage than a man’s.

6 Additional Analyses

The analysis above represents one substantive result
that mining life events from biographical data makes
possible. To illustrate the range of other analyses
that this method can occasion, we briefly present two

other directions that can be pursued: investigating
correlations among event classes and the distribution
of event classes over historical time.

6.1 Correlations among events

In our full model with a logistic normal prior over a
document’s set of events, correlations among latent
event classes are learned during inference. From the
covariance matrix Ση, we can directly read off cor-
relations among events; for other models (such as
those with a Dirichlet prior), we can infer correla-
tions using the posterior estimates for η.

Table 5 illustrates the event classes that have the
highest correlations to the event class defined by
family, boss, murder, crime, mafia, became, ar-
rested, john, gang, chicago. The structure that we
learn here neatly corresponds to a CRIMINAL AC-
TION frame, with common events for KILLING, BE-
ING SUBJECT TO FEDERAL INVESTIGATION, BE-
ING ARRESTED and BEING BROUGHT TO TRIAL.

6.2 Historical distribution of events

Figure 3 likewise illustrates the distribution over
time for a set of learned event classes. While the
only notion of time that our model has access to
during inference is that of time relative to a per-
son’s birth, we can estimate the empirical distribu-
tion of event classes in historical time by charting the
density plot of their observed absolute dates. Sev-
eral historically relevant event classes are legible,
including SERVING IN THE ARMY (with peaks dur-

370


0.00

0.01

0.02

0.03

0.04

1800 1850 1900 1950 2000

joined, enlisted, army, 
served, world war ii

0.000

0.003

0.006

0.009

0.012

1800 1850 1900 1950 2000

joined, became, member, 
party, communist party

0.000

0.003

0.006

0.009

1800 1850 1900 1950 2000

opera, debut, made, sang, 
la, role, di, metropolitan opera

0.000

0.005

0.010

1800 1850 1900 1950 2000

 river, expedition, fort, 
near, led, territory

0.000

0.005

0.010

0.015

0.020

1800 1850 1900 1950 2000

space, nasa, mission, 
flight, center, program

0.00

0.01

0.02

0.03

0.04

1800 1850 1900 1950 2000

band, guitar, bass, 
formed, album, drums, guitarist

Figure 3: Historical distributions of event classes.

r Event class
1.000 family, boss, murder, crime, mafia, became,

arrested, john, gang, chicago
0.031 killed, shot, police, home, two, car, ar-

rested, murder, death, -year-old
0.028 trial, murder, case, guilty, court, jury,

charges, convicted, death, judge
0.021 investigation, federal, charges, office, fraud,

campaign, state, commission, former, cor-
ruption

0.019 arrested, sentenced, years, prison, trial,
death, court, convicted, military, months

Table 5: Highest correlations between the family,
boss, murder, crime, mafia class and other events.

ing World War I and II, Vietnam and the later Iraq
wars), OPERA DEBUT (with peaks in the 1950s),
NASA (with peaks in 1960s and the turn of the mil-
lenium), JOINING THE COMMUNIST PARTY (with a
rise in the early 20th century), LEADING AN EXPE-
DITION (with a slow historical decline) and JOIN-
ING A BAND (with increasing historical presence).
Grounding specific life events in history has the po-
tential to enable analysis of how historical time af-
fects the life histories of individuals—including both
the influence of the general passage of time, as on
transitions to adulthood (Modell et al., 1976; Hogan,
1981; Modell, 1980), and the influence of specific
historical moments like the Great Depression (Elder,
1974) or World War II (Mayer, 1988; Elder, 1991).

7 Related Work

In learning general classes of events from text, our
work draws on a rich background spanning several
research traditions. By considering the structure that
exists between event classes, we draw on the origi-
nal work on procedural scripts and schemas (Min-
sky, 1974; Schank and Abelson, 1977) and narra-
tive chains (Chambers and Jurafsky, 2008; Cham-
bers and Jurafsky, 2009), including more recent ad-
vances in the unsupervised learning of frame seman-
tic representations (Modi et al., 2012; O’Connor,
2013; Cheung et al., 2013; Chambers, 2013).

In learning latent classes from text, our work is
also clearly related to research on topic modeling
(Blei et al., 2003; Griffiths and Steyvers, 2004).
This work differs from that tradition by scoping
our data only over text that we have reason to be-
lieve describes events (by including absolute dates).
While other topic models have leveraged temporal
information in the learning of latent topics, such as
the dynamic topic model (Blei and Lafferty, 2006b;
Wang et al., 2012) and “topics over time” (Wang
and McCallum, 2006), our model is the first to in-
fer classes of events whose contours are shaped by
the time in a person’s life that they take place.

While the information extraction tasks of template
filling (Hobbs et al., 1993) and relation detection
(Banko et al., 2007; Fader et al., 2011; Carlson et al.,
2010) generally fall into a paradigm of classifying

371


text segments into a predetermined ontology, they
too have been informed by unsupervised approaches
to learning relation classes (Yao et al., 2011) and
events (Ritter et al., 2012). Our work here differs
from this past work in leveraging explicit absolute
temporal information in the unsupervised learning
of event classes (and their structure). Reasoning
about the temporal ordering of events likewise has a
long tradition of its own, both in NLP (Pustejovsky
et al., 2003; Mani et al., 2006; Verhagen et al., 2007;
Chambers et al., 2007) and information extraction
(Talukdar et al., 2012). Rather than attempting to
model the ordering of events relative to each other,
we focus instead on their occurrence relative to the
beginning of a person’s life.

Wikipedia likewise has been used extensively in
NLP; Wikipedia biographies in particular have been
used for the task of training summarization models
(Biadsy et al., 2008), recognizing biographical sen-
tences (Conway, 2010), learning correlates of “suc-
cess” (Ng, 2012), and disambiguating named enti-
ties (Bunescu and Pasca, 2006; Cucerzan, 2007).
In our work in mining biographical structure from
it, we draw on previous research into automatically
uncovering latent structure in resumés (Mimno and
McCallum, 2007a) and approaches to learning life
path trajectories from categorical survey data (Mas-
soni et al., 2009; Ritschard et al., 2013).

In using Wikipedia as a dataset for analysis, we
must note that the subjects of biographies are not a
representative sample of the population, nor are their
contents unbiased representations. Nearly all ency-
clopedias necessarily prefer the historically notori-
ous (if due to nothing else than inherent biases in the
preservation of historical records); many, like Wiki-
pedia, also have disproportionately low coverage of
women, minorities, and other demographic groups,
in part because of biases in community member-
ship. Estimates of the percentage of female edi-
tors on Wikipedia, for example, ranges from 9% to
16.1% (Collier and Bear, 2012; Reagle and Rhue,
2011; Cassell, 2011; Hill and Shaw, 2013; Wiki-
pedia, 2011). Different language editions of Wiki-
pedia have a natural geographic bias in article se-
lection (Hecht and Gergle, 2009), with each empha-
sizing their own “local heroes” (Kolbitsch and Mau-
rer, 2006), and also differ in the kind of information
they present (Pfeil et al., 2006; Callahan and Her-

ring, 2011). This extends to selection of biographies
as well, with one study finding approximately 16%
of 1000 sampled biographies being those of women
(Reagle and Rhue, 2011), a figure very close to the
14.8% we observe in our analysis here.

8 Conclusion

We present a method for mining life events from
biographies, leveraging the correlation structure of
event descriptions. Unlike prior work that has fo-
cused on inferring “life trajectories” from categor-
ical survey data, we learn relevant structure in an
unsupervised manner directly from text, opening the
door to applying this method to a broad set of biogra-
phies beyond Wikipedia (including full-text books
from the Internet Archive or Hathi Trust, and other
encyclopedic biographies as well). In a quantitative
analysis, the model we present outperforms a strong
baseline at the task of event time prediction, and sur-
faces a substantive qualitative distinction in the con-
tent of the biographies of men and women on Wiki-
pedia: in contrast to previous work that uses com-
putational methods to measure a difference in cov-
erage, we show that such methods are able to tease
apart differences in characterization as well.

While the task of event time prediction provides
a quantitative means to compare different models,
we expect the real application of this work will lie
in the latent event classes themselves, and the in-
formation they provide both about the subjects and
authors of biographies. Latent topics have provided
one way of organizing large document collections
in the past (Mimno and McCallum, 2007b); in ad-
dition to occasioning data analysis of the kind we
describe here, we expect that personal event classes
can have a practical application in helping to orga-
nize data describing people as well. Data and code
to support this work, including an interface to ex-
plore event classes in Wikipedia, can be found at
http://www.ark.cs.cmu.edu/bio/.

9 Acknowledgments

We thank the anonymous reviewers, along with
Dallas Card, Brendan O’Connor, Bryan Routledge,
Yanchuan Sim and Ted Underwood, for their help-
ful comments. The research reported in this arti-
cle was supported by U.S. National Science Foun-

372


dation grant CAREER IIS-1054319 to N.A.S. and
Google’s support of the Reading is Believing project
at CMU. This work was made possible through the
use of computing resources made available by the
Open Science Data Cloud (OSDC), an Open Cloud
Consortium (OCC)-sponsored project.

References
David Bamman, Ted Underwood, and Noah A. Smith.

2014. A Bayesian mixed effects model of literary
character. In ACL.

Michele Banko, Michael J Cafarella, Stephen Soderland,
Matthew Broadhead, and Oren Etzioni. 2007. Open
information extraction for the web. In IJCAI, vol-
ume 7, pages 2670–2676.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008.
An unsupervised approach to biography production
using Wikipedia. In ACL ’08, pages 807–815.

David M. Blei and John D. Lafferty. 2006a. Correlated
topic models. In NIPS ’06.

David M. Blei and John D. Lafferty. 2006b. Dynamic
topic models. In ICML ’06, pages 113–120.

David M. Blei and John D. Lafferty. 2007. A correlated
topic model of Science. AAS, 1(1):17–35.

David M. Blei, Andrew Ng, and Michael Jordan. 2003.
Latent dirichlet allocation. JMLR, 3:993–1022.

Razvan Bunescu and Marius Pasca. 2006. Using ency-
clopedic knowledge for named entity disambiguation.
In EACL ’06, pages 9–16, Trento, Italy.

Ewa S. Callahan and Susan C. Herring. 2011. Cultural
bias in Wikipedia content on famous persons. J. Am.
Soc. Inf. Sci. Technol., 62(10):1899–1915, October.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr
Settles, Estevam R. Hruschka Jr., and Tom M.
Mitchell. 2010. Toward an architecture for never-
ending language learning. In AAAI ’10.

Bob Carpenter. 2010. Integrating Out Multinomial
Parameters in Latent Dirichlet Allocation and Naive
Bayes for Collapsed Gibbs Sampling. Technical re-
port, LingPipe.

Justine Cassell. 2011. Editing wars behind the scenes.
New York Times, February 4.

Nathanael Chambers and Dan Jurafsky. 2008. Unsuper-
vised learning of narrative event chains. In ACL ’08.

Nathanael Chambers and Dan Jurafsky. 2009. Unsuper-
vised learning of narrative schemas and their partici-
pants. In ACL ’09, pages 602–610.

Nathanael Chambers, Shan Wang, and Dan Jurafsky.
2007. Classifying temporal relations between events.
In Proceedings of the 45th Annual Meeting of the ACL
on Interactive Poster and Demonstration Sessions,

ACL ’07, pages 173–176, Stroudsburg, PA, USA. As-
sociation for Computational Linguistics.

Nathanael Chambers. 2013. Event schema induction
with a probabilistic entity-driven model. In EMNLP
’13, pages 1797–1807, Seattle, Washington, USA, Oc-
tober. Association for Computational Linguistics.

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Van-
derwende. 2013. Probabilistic frame induction. In
NAACL ’13, pages 837–846, Atlanta, Georgia, June.
Association for Computational Linguistics.

Benjamin Collier and Julia Bear. 2012. Conflict, criti-
cism, or confidence: An empirical examination of the
gender gap in Wikipedia contributions. In CSCW ’12.

Mike Conway. 2010. Mining a corpus of biographical
texts using keywords. Literary and Linguistic Com-
puting, 25(1):23–35.

Silviu Cucerzan. 2007. Large-scale named entity dis-
ambiguation based on wikipedia data. In EMNLP-
CoNLL, volume 7, pages 708–716.

Peter T. Davis, David K. Elson, and Judith L. Klavans.
2003. Methods for precise named entity matching in
digital collections. In JCDL ’03.

Glen Elder. 1974. Children of the Great Depression.
University of Chicago Press.

Glen Elder. 1991. Talent, history, and the fulfillment of
promise. Psychiatry, 54(3):251–267.

David K. Elson, Nicholas Dames, and Kathleen R. McK-
eown. 2010. Extracting social networks from literary
fiction. In ACL ’10, pages 138–147.

Anthony Fader, Stephen Soderland, and Oren Etzioni.
2011. Identifying relations for open information ex-
traction. In EMNLP ’11, EMNLP ’11, pages 1535–
1545, Stroudsburg, PA, USA. Association for Compu-
tational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher
Manning. 2005. Incorporating non-local informa-
tion into information extraction systems by Gibbs sam-
pling. In ACL ’05, pages 363–370.

Google. 2014. Freebase data dumps. https://
developers.google.com/freebase/data.

Thomas L Griffiths and Mark Steyvers. 2004. Find-
ing scientific topics. Proceedings of the National
Academy of Sciences of the United States of America,
101(Suppl 1):5228–5235.

Brent Hecht and Darren Gergle. 2009. Measuring
self-focus bias in community-maintained knowledge
repositories. In C&T ’09, pages 11–20.

Benjamin Mako Hill and Aaron Shaw. 2013. The Wiki-
pedia gender gap revisited: Characterizing survey re-
sponse bias with propensity score estimation. PLoS
ONE, 8(6).

Jerry R Hobbs, Douglas Appelt, John Bear, David Israel,
Megumi Kameyama, and Mabry Tyson. 1993. Fas-
tus: A system for extracting information from text.

373


In Proceedings of the workshop on Human Language
Technology, pages 133–137. Association for Compu-
tational Linguistics.

Dennis P. Hogan and Nan Marie Astone. 1986. The
transition to adulthood. Annual Review of Sociology,
12(1):109–130.

Dennis Hogan. 1981. Transitions and Social Change:
The Early Lives of American Men. Academic, New
York.

John S Justeson and Slava M Katz. 1995. Technical ter-
minology: some linguistic properties and an algorithm
for identification in text. Natural Language Engineer-
ing, 1(1):9–27.

Josef Kolbitsch and Hermann A. Maurer. 2006. The
transformation of the web: How emerging commu-
nities shape the information we consume. J. UCS,
12(2):187–213.

Lee A. Lillard and Linda J. Waite. 1993. A joint model of
marital childbearing and marital disruption. Demogra-
phy, 30(4):pp. 653–681.

Lee A. Lillard, Michael J. Brien, and Linda J. Waite.
1995. Premarital cohabitation and subsequent marital
dissolution: A matter of self-selection? Demography,
32(3):pp. 437–457.

Inderjeet Mani, Marc Verhagen, Ben Wellner, Chong Min
Lee, and James Pustejovsky. 2006. Machine learning
of temporal relations. In Proceedings of the 21st In-
ternational Conference on Computational Linguistics
and the 44th Annual Meeting of the Association for
Computational Linguistics, ACL-44, pages 753–760,
Stroudsburg, PA, USA. Association for Computational
Linguistics.

Sébastien Massoni, Madalina Olteanu, and Patrick Rous-
set. 2009. Career-path analysis using optimal match-
ing and self-organizing maps. In WSOM ’09.

Henry Colin Gray Matthew and Brian Harrison. 2004.
The Oxford dictionary of national biography. Oxford
University Press.

Karl Ulrich Mayer. 1988. German survivors of World
War II: The impact on the life course of the collective
experience of birth cohorts. In Social Structure and
Human Lives, Newbury Park. Sage.

David Mimno and Andrew McCallum. 2007a. Model-
ing career path trajectories. Technical Report 2007-69,
University of Massachusetts, Amherst.

David Mimno and Andrew McCallum. 2007b. Organiz-
ing the OCA: Learning faceted subjects from a library
of digital books. In JCDL ’07, pages 376–385, New
York, NY, USA. ACM.

David Mimno, Hanna M. Wallach, and Andrew McCal-
lum. 2008. Gibbs sampling for logistic normal topic
models with graph-based priors. In NIPS Workshop on
Analyzing Graphs.

Marvin Minsky. 1974. A framework for representing
knowledge. Technical report, MIT-AI Laboratory.

John Modell, Frank F. Furstenberg Jr., and Theodore Her-
shberg. 1976. Social change and transitions to adult-
hood in historical perspective. Journal of Family His-
tory, 1(1):7–32.

John Modell. 1980. Normative aspects of american mar-
riage timing since World War II. Journal of Family
History, 5(2):210–234.

Ashutosh Modi, Ivan Titov, and Alexandre Klemen-
tiev. 2012. Unsupervised induction of frame-semantic
representations. In Proceedings of the NAACL-HLT
Workshop on the Induction of Linguistic Structure,
WILS ’12, pages 1–7, Stroudsburg, PA, USA. Asso-
ciation for Computational Linguistics.

Radford M Neal. 2003. Slice sampling. Annals of Statis-
tics, pages 705–741.

Pauline Ng. 2012. What Kobe Bryant and Britney Spears
have in common: Mining Wikipedia for characteristics
of notable individuals. In ICWSM ’12.

Brendan O’Connor. 2013. Learning frames from text
with an unsupervised latent variable model. ArXiv,
abs/1307.7382.

Ulrike Pfeil, Panayiotis Zaphiris, and Chee Siang Ang.
2006. Cultural differences in collaborative authoring
of Wikipedia. Journal of Computer-Mediated Com-
munication, 12(1):88–113.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew
See, Robert Gaizauskas, Andrea Setzer, Dragomir
Radev, Beth Sundheim, David Day, Lisa Ferro, et al.
2003. The timebank corpus. In Corpus linguistics,
volume 2003, page 40.

Joseph Reagle and Lauren Rhue. 2011. Gender bias
in Wikipedia and Britannica. International Journal of
Communication, 5(0).

Steffen Reinhold. 2010. Reassessing the link between
premarital cohabitation and marital instability. De-
mography, 47(3):719–733.

Gilbert Ritschard, Reto Bürgin, and Matthias Studer.
2013. Exploratory mining of life event histories. In
J. J. McArdle and G. Ritschard, editors, Contemporary
Issues in Exploratory Data Mining in the Behavioral
Sciences, pages 221–253. Routledge, New York.

Alan Ritter, Mausam, Oren Etzioni, and Sam Clark.
2012. Open domain event extraction from Twitter.
In KDD ’12, pages 1104–1112, New York, NY, USA.
ACM.

Roger C. Schank and Robert P. Abelson. 1977. Scripts,
plans, goals, and understanding: An inquiry into hu-
man knowledge structures. Lawrence Erlbaum, Hills-
dale, NJ.

Michael J. Shanahan. 2000. Pathways to adulthood
in changing societies: Variability and mechanisms in

374


life course perspective. Annual Review of Sociology,
26(1):667–692.

Partha Pratim Talukdar, Derry Wijaya, and Tom Mitchell.
2012. Acquiring temporal constraints between rela-
tions. In CIKM ’12, pages 992–1001, New York, NY,
USA. ACM.

Kristina Toutanova, Dan Klein, Christopher D. Manning,
and Yoram Singer. 2003. Feature-rich part-of-speech
tagging with a cyclic dependency network. In NAACL
’03, pages 173–180.

Marc Verhagen, Robert Gaizauskas, Frank Schilder,
Mark Hepple, Graham Katz, and James Pustejovsky.
2007. Semeval-2007 task 15: Tempeval temporal re-
lation identification. In Proceedings of the 4th Interna-
tional Workshop on Semantic Evaluations, pages 75–
80. Association for Computational Linguistics.

Xuerui Wang and Andrew McCallum. 2006. Topics over
time: a non-markov continuous-time model of topical
trends. In KDD ’06, pages 424–433.

Chong Wang, David M. Blei, and David Heckerman.
2012. Continuous time dynamic topic models. ArXiv.

Wikipedia. 2011. Wikipedia editors study: Results from
the editor survey.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew
McCallum. 2011. Structured relation discovery using
generative models. In EMNLP ’11, pages 1456–1466,
Stroudsburg, PA, USA. Association for Computational
Linguistics.

375


376