UC San Diego
UC San Diego Previously Published Works

Title
Temporal Annotation in the Clinical Domain.

Permalink
https://escholarship.org/uc/item/9sk3m36q

Journal
Transactions of the Association for Computational Linguistics, 2

ISSN
2307-387X

Authors
Styler, William F
Bethard, Steven
Finan, Sean
et al.

Publication Date
2014-04-01

DOI
10.1162/tacl_a_00172
 
Peer reviewed

eScholarship.org Powered by the California Digital Library
University of California

https://escholarship.org/uc/item/9sk3m36q
https://escholarship.org/uc/item/9sk3m36q#author
https://escholarship.org
http://www.cdlib.org/


Temporal Annotation in the Clinical Domain
William F. Styler IV1, Steven Bethard2, Sean Finan3, Martha Palmer1,
Sameer Pradhan3, Piet C de Groen4, Brad Erickson4, Timothy Miller3,

Chen Lin3, Guergana Savova3 and James Pustejovsky5
1 Department of Linguistics, University of Colorado at Boulder

2 Department of Computer and Information Sciences, University of Alabama at Birmingham
3 Children’s Hospital Boston Informatics Program and Harvard Medical School

4 Mayo Clinic College of Medicine, Mayo Clinic, Rochester, MN
5 Department of Computer Science, Brandeis University

Abstract

This article discusses the requirements of
a formal specification for the annotation of
temporal information in clinical narratives.
We discuss the implementation and extension
of ISO-TimeML for annotating a corpus of
clinical notes, known as the THYME cor-
pus. To reflect the information task and the
heavily inference-based reasoning demands
in the domain, a new annotation guideline
has been developed, “the THYME Guidelines
to ISO-TimeML (THYME-TimeML)”. To
clarify what relations merit annotation, we
distinguish between linguistically-derived and
inferentially-derived temporal orderings in the
text. We also apply a top performing Temp-
Eval 2013 system against this new resource to
measure the difficulty of adapting systems to
the clinical domain. The corpus is available to
the community and has been proposed for use
in a SemEval 2015 task.

1 Introduction

There is a long-standing interest in temporal reason-
ing within the biomedical community (Savova et al.,
2009; Hripcsak et al., 2009; Meystre et al., 2008;
Bramsen et al., 2006; Combi et al., 1997; Keravnou,
1997; Dolin, 1995; Irvine et al., 2008; Sullivan et
al., 2008). This interest extends to the automatic ex-
traction and interpretation of temporal information
from medical texts, such as electronic discharge sum-
maries and patient case summaries. Making effective
use of temporal information from such narratives is
a crucial step in the intelligent analysis of informat-
ics for medical researchers, while an awareness of
temporal information (both implicit and explicit) in a
text is also necessary for many data mining tasks.

It has also been demonstrated that the temporal in-
formation in clinical narratives can be usefully mined

to provide information for some higher-level tempo-
ral reasoning (Zhao et al., 2005). Robust temporal
understanding of such narratives, however, has been
difficult to achieve, due to the complexity of deter-
mining temporal relations among events, the diver-
sity of temporal expressions, and the interaction with
broader computational linguistic issues.

Recent work on Electronic Health Records (EHRs)
points to new ways to exploit and mine the informa-
tion contained therein (Savova et al., 2009; Roberts
et al., 2009; Zheng et al., 2011; Turchin et al., 2009).
We target two main use cases for extracted data. First,
we hope to enable interactive displays and summaries
of the patient’s records to the physician at the time of
visit, making a comprehensive review of the patient’s
history both faster and less prone to oversights. Sec-
ond, we hope to enable temporally-aware secondary
research across large databases of medical records
(e.g., “What percentage of patients who undergo pro-
cedure X develop side-effect Y within Z months?”).
Both of these applications require the extraction of
time and date associations for critical events and the
relative ordering of events during the patient’s period
of care, all from the various records which make up a
patient’s EHR. Although we have these two specific
applications in mind, the schema we have developed
is generalizable and could potentially be embedded
in a wide variety of biomedical use cases.

Narrative texts in EHRs are temporally rich doc-
uments that frequently contain assertions about the
timing of medical events, such as visits, laboratory
values, symptoms, signs, diagnoses, and procedures
(Bramsen et al., 2006; Hripcsak et al., 2009; Zhou
et al., 2008). Temporal representation and reason-
ing in the medical record are difficult due to: (1) the
diversity of time expressions; (2) the complexity of
determining temporal relations among events (which
are often left to inference); (3) the difficulty of han-
dling the temporal granularity of an event; and (4)

143

Transactions of the Association for Computational Linguistics, 2 (2014) 143–154. Action Editor: Ellen Riloff.
Submitted 9/2013; Revised 2/2014; Published 4/2014. c©2014 Association for Computational Linguistics.


general issues in natural language processing (e.g.,
ambiguity, anaphora, ellipsis, conjunction). As a re-
sult, the signals used for reconstructing a timeline can
be both domain-specific and complex, and are often
left implicit, requiring significant domain knowledge
to accurately detect and interpret.

In this paper, we discuss the demands on accurately
annotating such temporal information in clinical
notes. We describe an implementation and extension
of ISO-TimeML (Pustejovsky et al., 2010), devel-
oped specifically for the clinical domain, which we
refer to as the “THYME Guidelines to ISO-TimeML”
(“THYME-TimeML”), where THYME stands for
“Temporal Histories of Your Medical Events”. A sim-
plified version of these guidelines formed the basis
for the 2012 i2b2 medical-domain temporal relation
challenge (Sun et al., 2013a).

This is being developed in the context of the
THYME project, whose goal is to both create ro-
bust gold standards for semantic information in clini-
cal notes, as well as to develop state-of-the-art algo-
rithms to train and test on this dataset.

Deriving timelines from news text requires the con-
crete realization of context-dependent assumptions
about temporal intervals, orderings and organization,
underlying the explicit signals marked in the text
(Pustejovsky and Stubbs, 2011). Deriving patient
history timelines from clinical notes also involves
these types of assumptions, but there are special de-
mands imposed by the characteristics of the clinical
narrative. Due to both medical shorthand practices
and general domain knowledge, many event-event
relations are not signaled in the text at all, and rely
on a shared understanding and common conceptual
models of the progressions of medical procedures
available only to readers familiar with language use
in the medical community.

Identifying these implicit relations and temporal
properties puts a heavy burden on the annotation
process. As such, in the THYME-TimeML guideline,
considerable effort has gone into both describing and
proscribing the annotation of temporal orderings that
are inferable only through domain-specific temporal
knowledge.

Although the THYME guidelines describe a num-
ber of departures from the ISO-TimeML standard for
expediency and ease of annotation, this paper will
focus on those differences specifically motivated by
the needs of the clinical domain, and on the conse-
quences for systems built to extract temporal data in

both the clinical and general domain.

2 The Nature of Clinical Documents

In the THYME corpus, we have been examining
1,254 de-identified1 notes from a large healthcare
practice (the Mayo Clinic), representing two distinct
fields within oncology: brain cancer, and colon can-
cer. To date, we have principally examined two dif-
ferent general types of clinical narrative in our EHRs:
clinical notes and pathology reports.

Clinical notes are records of physician interactions
with a patient, and often include multiple, clearly
delineated sections detailing different aspects of the
patient’s care and present illness. These notes are
fairly generic across institutions and specialities, and
although some terms and inferences may be specific
to a particular type of practice (such as oncology),
they share a uniform structure and pattern. The ‘His-
tory of Present Illness’, for example, summarizes the
course of the patient’s chief complaint, as well as the
interventions and diagnostics which have been thus
far attempted. In other sections, the doctor may out-
line her current plan for the patient’s treatment, then
later describe the patient’s specific medical history,
allergies, care directives, and so forth.

Most critically for temporal reasoning, each clin-
ical note reflects a single time in the patient’s treat-
ment history at which all of the doctor’s statements
are accurate (the DOCTIME), and each section tends
to describe events of a particular timeframe. For
example, ‘History of Present illness’ predominantly
describes events occuring before DOCTIME, whereas
‘Medications’ provides a snapshot at DOCTIME and
‘Ongoing Care Orders’ discusses events which have
not yet occurred.2

Clinical notes contain rich temporal information
and background, moving fluidly from prior treat-
ments and symptoms to present conditions to future
interventions. They are also often rich with hypo-
thetical statements (“if the tumor recurs, we can...”),
each of which can form its own separate timeline.

By constrast, pathology notes are quite different.
Such notes are generated by a medical pathologist

1Although most patient information was removed, dates
and temporal information were not modified according to this
project’s specific data use agreement.

2One complication is the propensity of doctors and automated
systems to later update sections in a note without changing the
timestamp or metadata. We have added a SECTIONTIME to keep
these updated sections from affecting our overall timeline.

144


upon receipt and analysis of specimens (ranging from
tissue samples from biopsy to excised portions of
tumor or organs). Pathology notes provide crucial
information to the patient’s doctor confirming the
malignancy (cancer) in samples, describing surgi-
cal margins (which indicate whether a tumor was
completely excised), and classifying and ‘staging’ a
tumor, describing the severity and spread of the can-
cer. Because the information in such notes pertains
to samples taken at a single moment in time, they are
temporally sparse, seldom referring to events before
or after the examination of the specimen. However,
they contain critical information about the state of
the patient’s illness and about the cancer itself, and
must be interpreted to understand the history of the
patient’s illness.

Most importantly, in all EHRs, we must contend
with the results of a fundamental tension in mod-
ern medical records: hyper-detailed records provide
a crucial defense against malpractice litigation, but
including such detail takes enormous time, which
doctors seldom have. Given that these notes are writ-
ten by and for medical professionals (who form a
relatively insular speech community), a great many
non-standard expressions, abbreviations, and assump-
tions of shared knowledge are used, which are simul-
taneously concise and detail-rich for others who have
similar backgrounds.

These time-saving devices can range from tempo-
rally loaded acronyms (e.g., ‘qid’, Latin for quater in
die, ‘four times daily’), to assumed orderings (a diag-
nostic test for a disorder is assumed to come before
the procedure which treats it), and even to completely
implicit events and temporal details. For example,
consider the sentence in (1).
(1) Colonoscopy 3/12/10, nodule biopsies negative

We must understand that during the colonoscopy,
the doctor obtained biopsies of nodules, which were
packaged and sent to a pathologist, who reviewed
them and determined them to be ‘negative’ (non-
cancerous).

In such documents, we must recover as much tem-
poral detail as possible, even though it may be ex-
pressed in a way which is not easily understood out-
side of the medical community, let alone by linguists
or automated systems. We must also be aware of the
legal relevance of some events (e.g., “We discussed
the possible side effects”), even when they may not
seem relevant to the patient’s actual care.

Finally, each specialty and note type has separate

conventions. Within colon cancer notes, the Amer-
ican Joint Committee on Cancer (AJCC) Staging
Codes (e.g., T4N1, indicating the nature of the tumor,
lymph node and metastasis involvement) are metic-
ulously recorded, but are largely absent in the brain
cancer notes which make up the second corpus in
our project. So, although clinical notes share many
similarities, annotators without sufficient domain ex-
pertise may require additional training to adapt to the
inferences and nuances of a new clinical subdomain.

3 Interpreting ‘Event’ and Temporal
Expressions in the Clinical Domain

Much prior work has been done on standardizing
the annotation of events and temporal expressions
in text. The most widely used approach is the ISO-
TimeML specification (Pustejovsky et al., 2010), an
ISO standard that provides a common framework for
annotating and analyzing time, events, and event rela-
tions. As defined by ISO-TimeML, an EVENT refers
to anything that can be said “to obtain or hold true, to
happen or to occur”. This is a broad notion of event,
consistent with Bach’s use of the term “eventuality”
(Bach, 1986) as well as the notion of fluents in AI
(McCarthy, 2002).

Because the goals of the THYME project involve
automatically identifying the clinical timeline for
a patient from clincal records, the scope of what
should be admitted into the domain of events is inter-
preted more broadly than in ISO-TimeML3. Within
the THYME-TimeML guideline, an EVENT is any-
thing relevant to the clinical timeline, i.e., anything
that would show up on a detailed timeline of the pa-
tient’s care or life. The best single-word syntactic
head for the EVENT is then used as its span. For
example, a diagnosis would certainly appear on such
a timeline, as would a tumor, illness, or procedure.
On the other hand, entities that persist throughout
the relevant temporal period of the clinical timeline
(endurants in ontological circles) would not be con-
sidered as event-like. This includes the patient, other
humans mentioned (the patient’s mother-in-law or
the doctor), organizations (the emergency room),
non-anatomical objects (the patient’s car), or indi-
vidual parts of the patient’s anatomy (an arm is not
an EVENT unless missing or otherwise notable).

To meet our explicit goals, the THYME-TimeML
guideline introduces two additional levels of interpre-

3Our use of the term ‘EVENT’ corresponds with the less
specific ISO-TimeML term ‘Eventuality’

145


tation beyond that specified by ISO-TimeML: (i) a
well-defined task; and (ii) a clearly identified domain.
By focusing on the creation of a clinical timeline
from clinical narrative, the guideline imposes con-
straints that cannot be assumed for a broadly defined
and domain independent annotation schema.

Some EVENTs annotated under our guideline are
considered meaningful and eventive mostly by virtue
of a specific clinical or legal value. For example,
AJCC Staging Codes (discussed in Section 2) are
eventive only in the sense of the code being assigned
to a tumor at a given moment in the patient’s care.
However, they are of such critical importance and
informative value to doctors that we have chosen to
annotate them specifically so that they will show up
on the patient’s timeline in a clinical setting.

Similarly, because of legal pressures to establish in-
formed consent and patient knowledge of risk, entire
paragraphs of clinical notes are dedicated to docu-
menting the doctor’s discussion of risks, plans, and
alternative strategies. As such, we annotate verbs of
discussion (“We talked about the risks of this drug”),
consent (“She agreed with the current plan”), and
comprehension (“Mrs. Larsen repeated the potential
side effects back to me”), even though they are more
relevant to legal defense than medical treatment.

It is also because of this grounding in clinical lan-
guage that entities and other non-events are often
interpreted in terms of their associated eventive prop-
erties. There are two major types for which this is a
significant shift in semantic interpretation:

(2) a Medication as Event:
Orders: Lariam twice daily.

b Disorder as Event:
Tumor of the left lung.

In both these cases, entities which are not typically
marked as events are identified as such, because they
contribute significant information to the clinical time-
line being constructed. In (2a), for example, the
TIMEX3 “twice daily” is interpreted as scoping over
the eventuality of the patient taking the medication,
not the prescription event. In sentence (2b), the “tu-
mor” is interpreted as a stative eventuality of the
patient having a tumor located within an anatomical
region, rather than an entity within an entity.

Within the medical domain, these eventive inter-
pretations of medications, growths and status codes
are unambiguous and consistent. Doctors in clini-
cal notes (unlike in biomedical research texts) do

not discuss medications without an associated (im-
plicit) administering EVENT (though some mentions
may be hypothetical, generic or negated). Similarly,
mentions of symptoms or disorders reflect occur-
rences in a patient’s life, rather than abstract entities.
With these interpretations in mind, we can safely in-
fer, for instance, that all UMLS (Unified Medical
Language System, (Bodenreider, 2004)) entities of
the types Disorder, Chemical/Drug, Procedure and
Sign/Symptom will be EVENTs.

In general, in the medical domain, it is essential to
read “between the lines” of the shorthand expressions
used by the doctors, and recognize implicit events
that are being referred to by specific anatomical sites
or medications.

4 Modifications to ISO-TimeML for the
Clinical Domain

Overall, we have found that the specification required
for temporal annotation in the clinical domain does
not require substantial modification from existing
specifications for the general domain. The clinical
domain includes no shortage of inferences, short-
hands, and unusual use of language, but the structure
of the underlying timeline is not unique.

As a result of this, we have been able to adopt most
of the framework from ISO-TimeML, adapting the
guidelines where needed, as well as reframing the
focus of what gets annotated. This is reflected in a
comprehensive guideline, incorporating the specific
patterns and uses of events and temporal expressions
as seen in clinical data. This approach allows the
resulting annotations to be interoperable with exist-
ing solutions, while still accommodating the major
differences in the nature of the texts. Our guide-
lines, as well as the annotated data, are available at
http://thyme.healthnlp.org4

Our extensions of the ISO-TimeML specification
to the clinical domain are intended to address specific
constructions, meanings, and phenomena in medical
texts. Our schema differs from ISO-TimeML in a
few notable ways.

EVENT Properties We have both simplified the
ISO-TimeML coding of EVENTs, and extended it to
meet the needs of the clinical domain and the specific
language goals of the clinical narrative.

4Access to the corpus will require a data use agreement.
More information about this process is available from the corpus
website.

146


Consider, for example, how modal subordination is
handled in ISO-TimeML. This involves the semantic
characterization of an event as “likely”, “possible”, or
as presented by observation, evidence, or hearsay. All
of these are accounted for compositionally in ISO-
TimeML within the SLINK (Subordinating Link)
relation (Pustejovsky et al., 2005). While accept-
ing ISO-TimeML’s definition of event modality, we
have simplified the annotation task within the cur-
rent guideline, so that EVENTs now carry attributes
for “contextual modality”, “contextual aspect” and
“permanence”.

Contextual modality allows the values ACTUAL,
HYPOTHETICAL, HEDGED, and GENERIC. ACTUAL
covers EVENTs which have actually happened, e.g.,
“We’ve noted a tumor”. HYPOTHETICAL covers con-
ditionals and possibilities, e.g., “If she develops a
tumor”. HEDGED is for situations where doctors
proffer a diagnosis, but do so cautiously, to avoid
legal liability for an incorrect diagnosis or for over-
looking a correct one. For example:

(3) a. The signal in the MRI is not inconsistent
with a tumor in the spleen.

b. The rash appears to be measles, awaiting
antibody test to confirm.

These HEDGED EVENTs are more real than a hypo-
thetical diagnosis, and likely merit inclusion on a
timeline as part of the diagnostic history, but must
not be conflated with confirmed fact. These (and
other forms of uncertainty in the medical domain)
are discussed extensively in (Vincze et al., 2008). In
contrast, GENERIC EVENTs do not refer to the pa-
tient’s illness or treatment, but instead discuss illness
or treatment in general (often in the patient’s specific
demographic). For example:

(4) In other patients without significant comor-
bidity that can tolerate adjuvant chemother-
apy, there is a benefit to systemic adjuvant
chemotherapy.

These sections would be true if pasted into any pa-
tient’s note, and are often identical chunks of text
repeatedly used to justify a course of action or treat-
ment as well as to defend against liability.

Contextual Aspect (to distinguish from grammati-
cal aspect), allows the clinically-necessary category,
INTERMITTENT. This serves to distinguish intermit-
tent EVENTs (such as vomiting or seizures) from
constant, more stative EVENTs (such as fever or sore-
ness). For example, the bolded EVENT in (5a) would

be marked as INTERMITTENT, while that in (5b)
would not:
(5) a She has been vomiting since June.

b She has had swelling since June.
In the first case, we assume that her vomiting has
been intermittent, i.e., there were several points since
June in which she was not actively vomiting. In the
second case, unless made otherwise explicit (“she has
had occasional swelling”), we assume that swelling
was a constant state. This property is also used when
a particular instance of an EVENT is intermittent,
even though it generally would not be:
(6) Since starting her new regime, she has had occa-

sional bouts of fever, but is feeling much better.
The permanence attribute has two values, FINITE

and PERMANENT. Permanence is a property of dis-
eases themselves, roughly corresponding to the med-
ical concept of “chronic” vs. “acute” disease, which
marks whether a disease is persistent following diag-
nosis. For example, a (currently) uncurable disease
like Multiple Sclerosis would be classed as PERMA-
NENT, and thus, once mentioned in a patient’s note,
will be assumed to persist through the end of the
patient’s timeline. This is compared with FINITE
disorders like “Influenza” or “fever”, which, if not
mentioned in subsequent notes, should be considered
cured and no longer belongs on the patient’s time-
line. Because it requires domain-specific knowledge,
although present in the specification, Permanence
is not currently annotated. However, annotators are
trained on the basic idea and told about subsequent
axiomatic assignment. The addition of this property
to our schema is designed to relieve annotators of any
feeling of obligation to express this inferred informa-
tion in some other way.

TIMEX3 Types Temporal expressions (TIMEX3s)
in the clinical domain function the same as in the gen-
eral linguistic community, with two notable excep-
tions. ISO-TimeML SETs (statements of frequency)
occur quite frequently in the medical domain, par-
ticularly with regard to medications and treatments.
Medication sections within notes often contain long
lists of medications, each with a particular associated
set (“Claritin 30mg twice daily”), and further tempo-
ral specification is not uncommon (e.g., “three times
per day at meals”, “once a week at bedtime”).

The second major change for the medical domain
is a new type of TIMEX3 which we call PREPOS-
TEXP. This covers temporally complex terms like

147


“preoperative”, “postoperative”, and “intraoperative”.
These temporal expressions designate a span of time
bordered, usually only on one side, by the incorpo-
rated event (an operation, in the previous EVENTs).
In many cases, the referent is clear:
(7) She underwent hemicolectomy last week, and

had some postoperative bleeding.
Here we understand that “postoperative” refers to
“the period of time following the hemicolectomy”. In
these cases, the PREPOSTEXP makes explicit a tempo-
ral link between the bleeding and the hemicolectomy.
In other cases, no clear referent is present:
(8) Patient shows some post-procedure scarring.

In these situations, where no procedure is mentioned
(or the reference is never explicitly resolved), we
treat the PREPOSTEXP as a narrative container (see
Section 5), covering the span of time following the
unnamed procedure.

Finally, it is worth noting that the process of nor-
malizing those TIMEX3s is significantly more com-
plex relative to the general domain, because many
temporal expressions are anchored not to dates or
times, but to other EVENTs (whose dates are often
not mentioned or not known by the physician). As
we move towards a complete system, we are working
to expand the ISO-TimeML system for TIMEX3 nor-
malization to allow some value to be assigned to a
phrase like “in the months after her hemicolectomy”
when no referent date is present. ISO-TimeML, in
discussion with ISO TC 37SC 4, plans to reference
to such TIMEX3s in a future release of the standard.

5 Temporal Ordering and Narrative
Containers

The semantic content and informational impact of
a timeline is encoded in the ordering relations that
are identified between the temporal and event expres-
sions present in clinical notes. ISO-TimeML speci-
fies the standard thirteen “Allen relations” from the
interval calculus (Allen, 1983), which it refers to as
TLINK values. For unguided, general-purpose annota-
tion, the number of relations that could be annotated
grows quadratically with the number of events and
times, and the task quickly becomes unmanageable.
There are, however, strategies that we can adopt to
make this labeling task more tractable. Temporal
ordering relations in text are of three kinds:
1. Relations between two events
2. Relations between two times

3. Relations between a time and an event.

ISO-TimeML, as a formal specification of the tem-
poral information conveyed in language, makes no
distinction between these ordering types. Humans,
however, do make distinctions, based on local tempo-
ral markers and the discourse relations established in
a narrative (Miltsakaki et al., 2004; Poesio, 2004).

Because of the difficulty of humans capturing ev-
ery relationship present in the note (and the disagree-
ment which arises when annotators attempt to do so),
it is vital that the annotation guidelines describe an
approach that reduces the number of relations that
must be considered, but still results in maximally in-
formative temporal links. We have found that many
of the weaknesses in prior annotation approaches
stem from interaction between two competing goals:

• The guideline should specify certain types of an-
notations that should be performed;

• The guideline should not force annotations to be
performed when they need not be.

Failing in the first goal will result in under-annotation
and the neglect of relations which provide necessary
information for inference and analysis. Failure in the
second goal results in over-annotation, creating com-
plex webs of temporal relations which yield mostly
inferable information, but which complicate annota-
tion and adjudication considerably.

Our method of addressing both goals in tempo-
ral relations annotation is that of the narrative con-
tainer, discussed in Pustejovsky and Stubbs (2011).
A narrative container can be thought of as a temporal
bucket into which an EVENT or series of EVENTs
may fall, or a natural cluster of EVENTs around a
given time or situation. These narrative containers
are often represented (or “anchored”) by dates or
other temporal expressions (within which a variety
of different EVENTs occur), although they can also
be anchored to more abstract concepts (“recovery”
which might involve a variety of EVENTs) or even
durative EVENTs (many other EVENTs can occur dur-
ing a surgery). Rather than marking every possible
TLINK between each EVENT, we instead try to link
all EVENTs to their narrative containers, and then
link those containers so that the contained EVENTs
can be linked by inference.

First, annotators assign each event to one of four
broad narrative containers: before the DOCTIME, be-
fore and overlapping the DOCTIME, just overlapping
the DOCTIME or after the DOCTIME. This narrative

148


container is identified by the EVENT attribute Doc-
TimeRel. After the assignment of DocTimeRel, the
remainder of the narrative container relations must
be specified using temporal links (TLINKs). There
are five different temporal relations used for such
TLINKs: BEFORE, OVERLAP, BEGINS-ON, ENDS-ON
and CONTAINS5. Due to our narrative container ap-
proach, CONTAINS is the most frequent relation by a
large margin.

EVENTs serving as narrative container anchors are
not tagged as containers per-se. Instead, annotators
use the narrative container idea to help them visu-
alize the temporal relations within a document, and
then make a series of CONTAINS TLINK annotations
which establish EVENTs and TIMEX3s as anchors,
and specify their contents. If the annotators do their
jobs correctly, properly implementing DocTimeRel
and creating accurate TLINKs, a good understanding
of the narrative containers present in a document will
naturally emerge from the annotated text.

The major advantage introduced with narrative
containers is this: a narrative event is placed within a
bounding temporal interval which is explicitly men-
tioned in the text. This allows EVENTs within sep-
arate containers to be linked by post-hoc inference,
temporal reasoning, and domain knowledge, rather
than by explicit (and time-consuming) one-by-one
temporal relations annotation.

A secondary advantage is that this approach works
nicely with the general structure of story-telling in
both the general and clinical domains, and provides a
compelling and useful metaphor for interpreting time-
lines. Often, especially in clinical histories, doctors
will cluster discussions of symptoms, interventions
and diagnoses around a given date (e.g. a whole para-
graph starting “June 2009:”), a specific hospitaliza-
tion (“During her January stay at Mercy”), or a given
illness or treatment (“While she underwent Chemo”).
Even when specific EVENTs are not explicitly or-
dered within a cluster (often because the order can be
easily inferred with domain knowledge), it is often
quite easy to place the EVENTs into containers, and
just a few TLINKs can order the containers relative to
one another with enough detail to create a clinically
useful understanding of the overall timeline.

Narrative containers also allow the inference of re-
lations between sub-events within nested containers:

5This is a subset of the ISO-TimeML TLINK types, excluding
those seldom occurring in medical records, like ‘simultaneous’
as well as inverse relations like ‘during’ or ‘after’.

(9) December 19th: The patient underwent an MRI
and EKG as well as emergency surgery. Dur-
ing the surgery, the patient experienced mild
tachycardia, and she also bled significantly
during the initial incision.

1. December 19th CONTAINS MRI
2. December 19th CONTAINS EKG
3. December 19th CONTAINS surgery
a. surgery CONTAINS tachycardia
b. surgery CONTAINS incision
c. incision CONTAINS bled

Through our container nesting, we can automatically
infer that ‘bled’ occurred on December 19th (because
‘19th’ CONTAINS ‘surgery’ which CONTAINS ‘inci-
sion’ which CONTAINS ‘bled’). This also allows the
capture of EVENT/sub-event relations, and the rapid
expression of complex temporal interactions.

6 Explicit vs. Inferable Annotation

Given a specification language, there are essentially
two ways of introducing the elements into the docu-
ment (data source) being annotated:6

• Manual annotation: Elements are introduced into
the document directly by the human annotator fol-
lowing the guideline.

• Automatic (inferred) annotation: Elements are cre-
ated by applying an automated procedure that in-
troduces new elements that are derivable from the
human annotations.

As such, there is a complex interaction between spec-
ification and guideline, and we focus on how the
clinical annotation task has helped shape and refine
the annotation guidelines. It is important to note that
an annotation guideline does not necessarily force
the markup of certain elements in a text, even though
the specification language (and the eventual goal of
the project) might require those annotations to exist.

In some cases, these added annotations are derived
logically from human annotations. Explicitly marked
temporal relations can be used to infer others that are
not marked but exist implicitly through closure. For
instance, given EVENTs A, B and C and TLINKs ‘A
BEFORE B’ and ‘B BEFORE C’, the TLINK ‘A BE-
FORE C’ can be automatically inferred. Repeatedly
applying such inference rules allows all inferable

6We ignore the application of automatic techniques, such as
classifiers trained on external datasets, as our focus here is on
the preparation of the gold standard used for such classifiers.

149


TLINKs to be generated (Verhagen, 2005). We can
use this idea of closure to show our annotators which
annotations need not be marked explicitly, saving
time and effort. We have also incorporated these clo-
sure rules into our inter-annotator agreement (IAA)
calculation for temporal relations, described further
in Section 7.2.

The automatic application of rules following the
annotation of the text is not limited to the marking
of logically inferable relations or EVENTs. In the
clinical domain, the combination of within-group
shared knowledge and pressure towards concise writ-
ing leads to a number of common, inferred relations.
Take, for example, the sentence:
(10) Jan 2013: Colonoscopy, biopsies. Pathology

showed adenocarcinoma, resected at Mercy.
Diagnosis T3N1 Adenocarcinoma.

In this sentence, only the CONTAINS relations be-
tween “Jan 2013” and the EVENTs (in bold) are
explicitly stated. However, based on the known
progression-of-care for colon cancer, we can infer
that the colonoscopy occurs first, biopsies occur dur-
ing the colonoscopy, pathology happens afterwards,
a diagnosis (here, adenocarcinoma) is returned after
pathology, and resection of the tumor occurs after
diagnosis. The presence of the AJCC staging infor-
mation in the final sentence (along with the confir-
mation of the adenocarcinoma diagnosis) implies a
post-surgical pathology exam of the resected spec-
imen, as the AJCC staging information cannot be
determined without this additional examination.

These inferences come naturally to domain ex-
perts but are largely inaccessible to people outside
the medical community without considerable anno-
tator training. Making explicit our understanding of
these “understood orderings” is crucial; although they
are not marked by human annotators in our schema,
the annotators often found it initially frustrating to
leave these (purely inferential) relations unstated. Al-
though many of our (primarily linguistically trained)
annotators learned to see these patterns, we chose to
exclude them from the manual task since newer an-
notators with varying degrees of domain knowledge
may struggle if asked to manually annotate them.

Similar unspoken-but-understood orderings are
found throughout the clinical domain. As mentioned
in Section 3, both Permanence and Contextual As-
pect:Intermittent are properties of symptoms and dis-
eases themselves, rather than of the patient’s particu-
lar situation. As such, these properties could easily

Annotation Type Raw Count
EVENT 15,769
TIMEX3 1,426
LINK 7935
Total 25,130

Table 1: Raw Frequency of Annotation Types

TLINK Type Raw Count % of TLINKs
CONTAINS 5,112 64.42%
OVERLAP 1,205 15.19%
BEFORE 1,004 12.65%
BEGINS-ON 488 6.15%
ENDS-ON 126 1.59%
Total 7,935 100.00%

Table 2: Relative Frequency of TLINK types

be identified and marked across a medical ontology,
and then be automatically assigned to EVENTs rec-
ognized as specific medical named entities.

Finally, due to the peculiarities of EHR systems,
some annotations must be done programatically. Ex-
act dates of patient visit (or of pathology/radiology
consult) are often recorded as metadata on the EHR
itself, rather than within the text, making the canoni-
cal DOCTIME (or time of automatic section modifi-
cations) difficult to access in de-identified plaintext
data, but easy to find automatically.

7 Results

We report results on the annotations from the here-
released subset of the THYME colon cancer corpus,
which includes clinical notes and pathology reports
for 35 patients diagnosed with colon cancer for a
total of 107 documents. Each note was annotated
by a pair of graduate or undergraduate students in
Linguistics at the University of Colorado, then adju-
dicated by a domain expert. These clinical narratives
were sampled from the EHRs of a major healthcare
center (the Mayo Clinic). They were deidentified for
all patient-sensitive information; however, original
dates were retained.

7.1 Descriptive Statistics
Table 1 presents the raw counts for events, temporal
expressions and links in the adjudicated gold anno-
tations. Table 2 presents the number and percentage
of TLINKs by type in the adjudicated relations gold
annotations.

150


Annotation Type F1-Score Alpha
EVENT 0.8038 0.7899
TIMEX3 0.8047 0.6705
LINK: Participants only 0.5012 0.4999
LINK: Participants+type 0.4506 0.4503
LINK: CONTAINS 0.5630 0.5626

Table 3: IAA (F1-Score and Alpha) by annotation type

EVENT Property F1-Score Alpha
DocTimeRel 0.7189 0.6889
Cont.Aspect 0.9947 0.9930
Cont.Modality 0.9547 0.9420

Table 4: IAA (F1-Score and Alpha) for EVENT properties

7.2 Inter-annotator Agreement
We report inter-annotator agreement (IAA) results
on the THYME corpus. Each note was annotated by
two independent annotators. The final gold standard
was produced after disagreement adjudication by a
third annotator was performed.

We computed the IAA as F1-score and Krippen-
dorff’s Alpha (Krippendorff, 2012) by applying clo-
sure, using explicitly marked temporal relations to
identify others that are not marked but exist implicitly.
In the computation of the IAA, inferred-only TLINKs
do not contribute to the score, matched or unmatched.
For instance, if both annotators mark A BEFORE B
and B BEFORE C, to prevent artificially inflating the
agreement score, the inferred A BEFORE C is ignored.
Likewise, if one annotator marked A BEFORE B and
B BEFORE C and the other annotator did not, the
inferred A BEFORE C is not counted. However, if
one annotator did explicitly mark A BEFORE C, then
an equivalent inferred TLINK would be used to match
it. EVENT and TIMEX3 IAA was generated based
on exact and overlapping spans, respectively. These
results are reported in Table 3.

The THYME corpus also differs from ISO-
TimeML in terms of EVENT properties, with the
addition of DocTimeRel, ContextualModality and
ContextualAspect. IAA for these properties is in
Table 4.

7.3 Baseline Systems
To get an idea of how much work will be neces-
sary to adapt existing temporal information extrac-
tion systems to the clinical domain, we took the freely
available ClearTK-TimeML system (Bethard, 2013),

TempEval 2013 THYME Corpus
P R F1 P R F1

TIMEX3 83.2 71.7 77.0 59.3 42.8 49.7
EVENT 81.4 76.4 78.8 78.9 23.9 36.6
DocTimeRel - - - 47.4 47.4 47.4
LINK7 28.6 30.9 26.6 22.7 18.6 20.4

EVENT-TIMEX3 - - - 32.3 60.7 42.1
EVENT-EVENT - - - 7.0 3.0 4.2

Table 5: Performance of ClearTK-TimeML models, as
reported in the TempEval 2013 competition, and as applied
to the THYME Corpus development set.

which was among the top performing systems in
TempEval 2013 (UzZaman et al., 2013), and eval-
uated its performance on the THYME corpus.

ClearTK-TimeML uses support vector machine
classifiers trained on the TempEval 2013 training
data, employing a small set of features including
character patterns, tokens, stems, part-of-speech tags,
nearby nodes in the constituency tree, and a small
time word gazetteer. For EVENTs and TIMEX3s,
the ClearTK-TimeML system could be applied di-
rectly to the THYME corpus. For DocTimeRels, the
relation for an EVENT was taken from the TLINK
between that EVENT and the document creation time,
after mapping INCLUDES to OVERLAP. EVENTs
with no such TLINK were assumed to have a Doc-
TimeRel of OVERLAP. For other temporal relations,
INCLUDES was mapped to CONTAINS.

Results of this system on TempEval 2013 and the
THYME corpus are shown in Table 5. For time ex-
pressions, performance when moving to the clinical
data degrades about 25%, from F1 of 77.0 to 49.7.
For events, the degradation is much larger, about
40%, from 78.8 to 36.6, most likely because of the
large number of clinical symptoms, diseases, disor-
ders, etc. which have never been observed by the
system during training. Temporal relations are a bit
more difficult to compare because TempEval lumped
DocTimeRel and other temporal relations together
and had several differences in their evaluation met-
ric7. However, we at least can see that performance
of the ClearTK-TimeML system on temporal rela-
tions is low on clinical text, achieving only F1 of
20.4.

These results suggest that clinical narratives do

7The TempEval 2013 evaluation metric penalized systems
for parts of the text that were not examined by annotators, and
used different variants of closure-based precision and recall.

151


indeed present new challenges for temporal informa-
tion extraction systems, and that having access to
domain specific training data will be crucial for ac-
curate extraction in the clinical domain. At the same
time, it is encouraging that we were able to apply
existing ISO-TimeML-based systems to our corpus,
despite the several extensions to ISO-TimeML that
were necessary for clinical narratives.

8 Discussion

CONTAINS plays a large role in the THYME cor-
pus, representing 66% of TLINK annotations made,
compared with only 14.6% for OVERLAP, the second
most frequent type. We also see that BEFORE links
are relatively less common than OVERLAP and CON-
TAINS, illustrating that much of the temporal ordering
on the timeline is accomplished by using many ver-
tical links (CONTAINS, OVERLAP) to build contain-
ers, and few horizontal links (BEFORE, BEGINS-ON,
ENDS-ON) to order them.

IAA on EVENTs and Temporal Expressions is
strong, although differentiating implicit EVENTs
(which should not be marked) from explicit, mark-
able EVENTs remains one of the biggest sources of
disagreement. When compared to the data from the
2012 i2b2 challenge (Sun et al., 2013b), our IAA
figures are quite similar. Even with our more com-
plex schema, we achieved an F1-score of 0.8038 for
EVENTs (compared to the i2b2 score of 0.87 for par-
tial match). For TIMEX3s, our F1-score was 0.8047,
compared to an F1-score of 0.89 for i2b2.

TLINKing medical EVENTs remains a very diffi-
cult task. By using our narrative container approach
to constrain the number of necessary annotations and
by eliminating often-confusing inverse relations (like
‘after’ and ‘during’) (neither of which were done for
the i2b2 data), we were able to significantly improve
on the i2b2 TLINK span agreement F1-score of 0.39,
achieving an agreement score of 0.5012 for all LINKs
across our corpus. The majority of remaining an-
notator disagreement comes from different opinions
about whether any two EVENTs require an explicit
TLINK between them or an inferred one, rather than
what type of TLINK it would be (e.g. BEFORE vs.
CONTAINS). Although our results are still signifi-
cantly higher than the results reported for i2b2, and
in line with previously reported general news figures,
we are not satisfied. Improving IAA is an important
goal for future work, and with further training, speci-
fication, experience, and standardization, we hope to

clarify contexts for explicit TLINKS.
News-trained temporal information extraction sys-

tems see a significant drop in performance when ap-
plied to the clinical texts of the THYME corpus. But
as the corpus is an extension of ISO-TimeML, future
work will be able to train ISO-TimeML compliant
systems on the annotations of the THYME corpus to
reduce or eliminate this performance gap.

Some applications that our work may enable in-
clude (1) better understanding of event semantics,
such as whether a disease is chronic or acute and
its usual natural history, (2) typical event duration
for these events, (3) the interaction of general and
domain-specific events and their importance in the fi-
nal timeline, and, more generally, (4) the importance
of rough temporality and narrative containers as a
step towards finer-grained timelines.

We have several avenues of ongoing and future
work. First, we are working to demonstrate the utility
of the THYME corpus for training machine learning
models. We have designed support vector machine
models with constituency tree kernels that were able
to reach an F1-score of 0.737 on an EVENT-TIMEX3
narrative container identification task (Miller et al.,
2013), and we are working on training models to
identify events, times and the remaining types of
temporal relations. Second, as per our motivating
use cases, we are working to integrate this annotation
data with timeline visualization tools and to use these
annotations in quality-of-care research. For example,
we are using temporal reasoning built on this work to
investigate the liver toxicity of methotrexate across
a large corpus of EHRs (Lin et al., under review)].
Finally, we plan to explore the application of our
notion of an event (anything that should be visible on
a domain-appropriate timeline) to other domains. It
should transfer naturally to clinical notes about other
(non-cancer) conditions, and even to other types of
clinical notes, as certain basic events should always
be included in a patient’s timeline. Applying our
notion of event to more distant domains, such as legal
opinions, would require first identifying a consensus
within the domain about which events must appear
on a timeline.

9 Conclusion

Much of the information in clinical notes critical to
the construction of a detailed timeline is left implicit
by the concise shorthand used by doctors. Many
events are referred to only by a term such as “tu-

152


mor”, while properties of the event itself, such as
“intermittent”, may not be specified. In addition, the
ordering of events on a timeline is often left to the
reader to infer, based on domain-specific knowledge.
It is incumbent upon the annotation guideline to in-
dicate that only informative event orderings should
be annotated, while leaving domain-specific order-
ings to post-annotation inference. This document
has detailed our approach to adapting the existing
ISO-TimeML standard to this recovery of implicit
information, and defining guidelines that support an-
notation within this complex domain. Our guide-
lines, as well as the annotated data, are available at
http://thyme.healthnlp.org, and the full
corpus has been proposed for use in a SemEval 2015
shared task.

Acknowledgments

The project described is supported by Grant Num-
ber R01LM010090 and U54LM008748 from the Na-
tional Library Of Medicine. The content is solely the
responsibility of the authors and does not necessarily
represent the official views of the National Library
Of Medicine or the National Institutes of Health.

We would also like to thank Dr. Piet C. de Groen
and Dr. Brad Erickson at the Mayo Clinic, as well as
Dr. William F. Styler III, for their contributions to the
schema and to our understanding of the intricacies of
clinical language.

References

James F Allen. 1983. Maintaining knowledge about
temporal intervals. Communications of the ACM,
26(11):832–843.

Emmon Bach. 1986. The algebra of events. Linguistics
and philosophy, 9(1):5–16.

Steven Bethard. 2013. Cleartk-timeml: A minimalist ap-
proach to tempeval 2013. In Second Joint Conference
on Lexical and Computational Semantics (*SEM), Vol-
ume 2: Proceedings of the Seventh International Work-
shop on Semantic Evaluation (SemEval 2013), pages
10–14, Atlanta, Georgia, USA, June. Association for
Computational Linguistics.

Olivier Bodenreider. 2004. The Unified Medical
Language System (UMLS): integrating biomedical
terminology. Nucleic acids research, 32(Database
issue):D267–D270, January.

Philip Bramsen, Pawan Deshpande, Yoong Keok Lee,
and Regina Barzilay. 2006. Finding temporal order
in discharge summaries. In AMIA Annual Symposium
Proceedings, volume 2006, page 81. American Medical
Informatics Association.

Carlo Combi, Yuval Shahar, et al. 1997. Temporal reason-
ing and temporal data maintenance in medicine: issues
and challenges. Computers in biology and medicine,
27(5):353–368.

Robert H Dolin. 1995. Modeling the temporal complex-
ities of symptoms. Journal of the American Medical
Informatics Association, 2(5):323–331.

George Hripcsak, Nicholas D Soulakis, Li Li, Frances P
Morrison, Albert M Lai, Carol Friedman, Neil S Cal-
man, and Farzad Mostashari. 2009. Syndromic surveil-
lance using ambulatory electronic health records. Jour-
nal of the American Medical Informatics Association,
16(3):354–361.

Ann K Irvine, Stephanie W Haas, and Tessa Sullivan.
2008. Tn-ties: A system for extracting temporal infor-
mation from emergency department triage notes. In
AMIA Annual Symposium proceedings, volume 2008,
page 328. American Medical Informatics Association.

Elpida T Keravnou. 1997. Temporal abstraction of med-
ical data: Deriving periodicity. In Intelligent Data
Analysis in Medicine and Pharmacology, pages 61–79.
Springer.

Klaus H. Krippendorff. 2012. Content Analysis: An
Introduction to Its Methodology. SAGE Publications,
Inc, third edition edition, April.

Chen Lin, Elizabeth Karlson, Dmitriy Dligach, Mon-
ica Ramirez, Timothy Miller, Huan Mo, Natalie
Braggs, Andrew Cagan, Joshua Denny, and Guer-
gana. Savova. under review. Automatic identification
of methotrexade-induced liver toxicity in rheumatoid
arthritis patients from the electronic medical records.
Journal of the Medical Informatics Association.

John McCarthy. 2002. Actions and other events in sit-
uation calculus. In Proceedings of the International
conference on Principles of Knowledge Representation
and Reasoning, pages 615–628. Morgan Kaufmann
Publishers; 1998.

Stéphane M Meystre, Guergana K Savova, Karin C Kipper-
Schuler, John F Hurdle, et al. 2008. Extracting infor-
mation from textual documents in the electronic health
record: a review of recent research. Yearb Med Inform,
35:128–44.

Timothy Miller, Steven Bethard, Dmitriy Dligach, Sameer
Pradhan, Chen Lin, and Guergana Savova. 2013. Dis-
covering temporal narrative containers in clinical text.
In Proceedings of the 2013 Workshop on Biomedical
Natural Langua ge Processing, pages 18–26, Sofia,
Bulgaria, August. Association for Computational Lin-
guistics.

153


Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bon-
nie Webber. 2004. The penn discourse treebank. In In
Proceedings of LREC 2004.

Massimo Poesio. 2004. Discourse annotation and seman-
tic annotation in the gnome corpus. In In Proceedings
of the ACL Workshop on Discourse Annotation.

James Pustejovsky and Amber Stubbs. 2011. Increasing
informativeness in temporal annotation. In Proceedings
of the 5th Linguistic Annotation Workshop, pages 152–
160. Association for Computational Linguistics.

James Pustejovsky, Robert Knippen, Jessica Littman, and
Roser Sauri. 2005. Temporal and event information in
natural language text. Language Resources and Evalu-
ation, 39(2-3):123–164.

James Pustejovsky, Kiyong Lee, Harry Bunt, and Laurent
Romary. 2010. Iso-timeml: An international standard
for semantic annotation. In Proceedings of the Seventh
International Conference on Language Resources and
Evaluation (LREC 2010), Valletta, Malta.

Angus Roberts, Robert Gaizauskas, Mark Hepple, George
Demetriou, Yikun Guo, and Ian Roberts. 2009. Build-
ing a semantically annotated corpus of clinical texts.
Journal of biomedical informatics, 42(5):950–966.

Guergana Savova, Steven Bethard, Will Styler, James Mar-
tin, Martha Palmer, James Masanz, and Wayne Ward.
2009. Towards temporal relation discovery from the
clinical narrative. In AMIA Annual Symposium Pro-
ceedings, volume 2009, page 568. American Medical
Informatics Association.

Tessa Sullivan, Ann Irvine, and Stephanie W Haas. 2008.
It’s all relative: usage of relative temporal expressions
in triage notes. Proceedings of the American Society
for Information Science and Technology, 45(1):1–8.

Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013a.
Evaluating temporal relations in clinical text: 2012 i2b2
challenge. Journal of the American Medical Informat-
ics Association.

Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013b.
Evaluating temporal relations in clinical text: 2012 i2b2
challenge. Journal of the American Medical Informat-
ics Association, 20(5):806–813.

Alexander Turchin, Maria Shubina, Eugene Breydo,
Merri L Pendergrass, and Jonathan S Einbinder. 2009.
Comparison of information content of structured and
narrative text data sources on the example of medica-
tion intensification. Journal of the American Medical
Informatics Association, 16(3):362–370.

Naushad UzZaman, Hector Llorens, Leon Derczynski,
James Allen, Marc Verhagen, and James Pustejovsky.
2013. Semeval-2013 task 1: Tempeval-3: Evaluating
time expressions, events, and temporal relations. In Sec-
ond Joint Conference on Lexical and Computational
Semantics (*SEM), Volume 2: Proceedings of the Sev-
enth International Workshop on Semantic Evaluation

(SemEval 2013), pages 1–9, Atlanta, Georgia, USA,
June. Association for Computational Linguistics.

Marc Verhagen. 2005. Temporal Closure in an Annota-
tion Environment. Language Resources and Evalua-
tion, 39(2):211–241.

Veronika Vincze, Gyrgy Szarvas, Richrd Farkas, Gyrgy
Mra, and Jnos Csirik. 2008. The bioscope corpus:
biomedical texts annotated for uncertainty, negation
and their scopes. BMC Bioinformatics, 9(Suppl 11):1–
9.

Ying Zhao, George Karypis, and Usama M. Fayyad.
2005. Hierarchical clustering algorithms for docu-
ment datasets. Data Mining and Knowledge Discovery,
10:141–168.

Jiaping Zheng, Wendy W Chapman, Rebecca S Crowley,
and Guergana K Savova. 2011. Coreference resolution:
A review of general methodologies and applications in
the clinical domain. Journal of biomedical informatics,
44(6):1113–1122.

Li Zhou, Simon Parsons, and George Hripcsak. 2008. The
evaluation of a temporal reasoning system in processing
clinical discharge summaries. Journal of the American
Medical Informatics Association, 15(1):99–106.

154