Unsupervised Declarative Knowledge Induction for Constraint-Based Learning of Information Structure in Scientific Documents Yufan Guo DTAL University of Cambridge, UK yg244@cam.ac.uk Roi Reichart Technion - IIT Haifa, Israel roiri@ie.technion.ac.il Anna Korhonen DTAL University of Cambridge, UK alk23@cam.ac.uk Abstract Inferring the information structure of scien- tific documents is useful for many NLP appli- cations. Existing approaches to this task re- quire substantial human effort. We propose a framework for constraint learning that re- duces human involvement considerably. Our model uses topic models to identify latent top- ics and their key linguistic features in input documents, induces constraints from this in- formation and maps sentences to their domi- nant information structure categories through a constrained unsupervised model. When the induced constraints are combined with a fully unsupervised model, the resulting model challenges existing lightly supervised feature- based models as well as unsupervised mod- els that use manually constructed declarative knowledge. Our results demonstrate that use- ful declarative knowledge can be learned from data with very limited human involvement. 1 Introduction Automatic analysis of scientific text can help scien- tists find information from literature faster, saving valuable research time. In this paper we focus on the analysis of the information structure (IS) of sci- entific articles where the aim is to assign each unit of an article (typically a sentence) into a category that represents the information type it conveys. By infor- mation structure we refer to a particular type of dis- course structure that focuses on the functional role of a unit in the discourse (Webber et al., 2011). For instance, in the scientific literature, the functional role of a sentence could be the background or moti- vation of the research, the methods used, the experi- ments carried out, the observations on the results, or the author’s conclusions. Readers of scientific literature find information in IS-annotated articles much faster than in unanno- tated articles (Guo et al., 2011b). Argumentative Zoning (AZ) – an information structure scheme that has been applied successfully to many scientific do- mains (Teufel et al., 2009) – has improved tasks such as summarization and information extraction and retrieval (Teufel and Moens, 2002; Tbahriti et al., 2006; Ruch et al., 2007; Liakata et al., 2012; Contractor et al., 2012). Existing approaches to information structure anal- ysis require substantial human effort. Most use feature-based machine learning, such as SVMs and CRFs (e.g. (Teufel and Moens, 2002; Lin et al., 2006; Hirohata et al., 2008; Shatkay et al., 2008; Guo et al., 2010; Liakata et al., 2012)) which rely on thousands of manually annotated training sen- tences. Also the performance of such methods is rather limited: Liakata et al. (2012) reported per- class F-scores ranging from .53 to .76 in the bio- chemistry and chemistry domains and Guo et al. (2013a) reported substantially lower numbers for the challenging Introduction and Discussion sections in biomedical domain. Guo et al. (2013a) recently applied the General- ized Expectation (GE) criterion (Mann and McCal- lum, 2007) to information structure analysis using expert knowledge in the form of discourse and lexi- cal constraints. Their model produces promising re- sults, especially for sections and categories where 131 Transactions of the Association for Computational Linguistics, vol. 3, pp. 
131–143, 2015. Action Editor: Masaaki Nagata. Submission batch: 10/2014; Revision batch 1/2015; Published 3/2015. c©2015 Association for Computational Linguistics. feature-based models perform poorly. Even the unsupervised version which uses constraints under a maximum-entropy criterion without any feature- based model, outperforms fully-supervised feature- based models in detecting challenging low fre- quency categories across sections. However, this ap- proach still requires substantial human effort in con- straint generation. Particularly, lexical constraints were constructed by creating a detailed word list for each information structure category. For example, words such as “assay” were carefully selected and used as a strong indicator of the “Method” category: p(Method|assay) was constrained to be high (above 0.9). Such a constraint (developed for the biomedi- cal domain) may not be applicable to a new domain (e.g. computer science) with a different vocabulary and writing style. In fact, most existing works on learning with declarative knowledge rely on manually constructed constraints. Little work exists on automatic declar- ative knowledge induction. A notable exception is (McClosky and Manning, 2012) that proposed a constraint learning model for timeline extraction. This approach, however, requires human supervi- sion in several forms including task specific con- straint templates (see Section 2). We present a novel framework for learning declar- ative knowledge which requires very limited human involvement. We apply it to information structure analysis, based on two key observations: 1) Each information structure category defines a distribution over a section-specific and an article-level set of lin- guistic features. 2) Each sentence in a scientific doc- ument, while having a dominant category, may con- sist of features mostly related to other categories. This flexible view enables us to make use of topic models which have not proved useful in previous re- lated works (Varga et al., 2012; Reichart and Korho- nen, 2012). We construct topic models at both the individual section and article level and apply these models to data, identifying latent topics and their key linguis- tic features. This information is used to constrain or bias unsupervised models for the task in a straight- forward way: we automatically generate constraints for a GE model and a bias term for a graph clus- tering objective, such that the resulting models as- sign each of the input sentences to one information Zone Definition Background (BKG) the background of the study Problem (PROB) the research problem Method (METH) the methods used Result (RES) the results achieved Conclusion (CON) the authors’ conclusions Connection (CN) work consistent with the current work Difference (DIFF) work inconsistent with the current work Future work (FUT) the potential future direction of the research Table 1: The AZ categorization scheme of this paper structure category. Both models provide high qual- ity sentence-based classification, demonstrating the generality of our approach. We experiment with the AZ scheme for the anal- ysis of the logical structure, scientific argumenta- tion and intellectual attribution of scientific papers (Teufel and Moens, 2002), using an eight-category version of this scheme for biomedicine ((Mizuta et al., 2006; Guo et al., 2013b), Table 1). In evalu- ation against gold standard annotations, our model rivals the model of Guo et al. 
(2013a) which relies on manually constructed constraints, as well as a strong supervised feature-based model trained with up to 2000 sentences. In task-based evaluation we measure the usefulness of the induced categories for customized summarization (Contractor et al., 2012) from specific types of information in an article. The AZ categories induced by our model prove more valuable than those of (Guo et al., 2013a) and those in the gold standard. Our work demonstrates the great potential of automatically induced declarative knowledge in both improving the performance of in- formation structure analysis and reducing reliance of human supervision. 2 Previous Work Automatic Declarative Knowledge Induction Learning with declarative knowledge offers effective means of reducing human supervision and improv- ing performance. This framework augments feature- based models with domain and expert knowledge in the form of, e.g., linear constraints, posterior probabilities and logical formulas (e.g. (Chang et al., 2007; Mann and McCallum, 2007; Mann and McCallum, 2008; Ganchev et al., 2010)). It has proven useful for many NLP tasks including unsu- pervised and semi-supervised POS tagging, parsing (Druck et al., 2008; Ganchev et al., 2010; Rush et al., 2012) and information extraction (Chang et al., 132 2007; Mann and McCallum, 2008; Reichart and Ko- rhonen, 2012; Reichart and Barzilay, 2012). However, declarative knowledge is still created in a costly manual process. We propose inducing such knowledge directly from text with minimal human involvement. This idea could be applied to almost any NLP task. We apply it here to information struc- ture analysis of scientific documents. Little prior work exists on automatic constraint learning. Recently, (McClosky and Manning, 2012) investigated the approach for timeline extraction. They used a set of gold relations and their temporal spans and applied distant learning to find approxi- mate instances for classifier training. A set of con- straint templates specific to temporal learning were also specified. In contrast, we do not use manually specified guidance in constraint learning. Particu- larly, we construct constraints from latent variables (topics in topic modeling) estimated from raw text rather than applying maximum likelihood estimation over observed variables (fluents and temporal ex- pressions) in labeled data. Our method is therefore less dependent on human supervision. Even more recently, (Anzaroot et al., 2014) presented a super- vised dual-decomposition based method, in the con- text of citation field extraction, which automatically generates large families of constraints and learn their costs with a convex optimization objective during training. Our work is unsupervised, as opposed to their model which requires a manually annotated training corpus for constraint learning. Information Structure Analysis Various schemes have been proposed for analysing the information structure of scientific documents, in particular the patterns of topics, functions and re- lations at sentence level. Existing schemes include argumentative zones (Teufel and Moens, 2002; Mizuta et al., 2006; Teufel et al., 2009), discourse structure (Burstein et al., 2003; Webber et al., 2011), qualitative dimensions (Shatkay et al., 2008), scientific claims (Blake, 2009), scientific concepts (Liakata et al., 2010), and information status (Mark- ert et al., 2012), among others. 
Most previous work on automatic analysis of information structure relies on supervised learning (Teufel and Moens, 2002; Burstein et al., 2003; Mizuta et al., 2006; Shatkay et al., 2008; Guo et al., 2010; Liakata et al., 2012; Markert et al., 2012). Given the prohibitive cost of manual annotation, unsupervised and minimally supervised techniques such as clustering (Kiela et al., 2014) and topic modeling (Varga et al., 2012; Ó Séaghdha and Teufel, 2014) are highly important. However, the performance of such approaches shows a large room for improvement. Our work is specifically aimed at addressing this problem. Information Structure Learning with Declar- ative Knowledge Recently, Reichart and Korhonen (2012) and Guo et al. (2013a) developed constrained models that integrate rich linguistic knowledge (e.g. discourse patterns, syntactic features and sentence similarity information) for more reliable unsuper- vised or transductive learning of information cate- gories in scientific abstracts and articles. Guo et al. (2013a) used detailed lexical constraints developed via human supervision. Whether automatically in- duced declarative knowledge can rival such manual constraints is a question we address in this work. While Reichart and Korhonen (2012) used more general constraints, their most effective discourse constraints were tailored to scientific abstracts and are less relevant to full papers. 3 Model We introduce a topic-model based approach to declarative knowledge (DK) acquisition and describe how this knowledge can be applied to two unsuper- vised models for our task. Section 3.1 describes how topic models are used to induce topics that serve as the main building blocks of our DK. Section 3.2 ex- plains how the resulting topics and their key features are transformed into DK – constraints in the general- ized expectation (GE) model and bias functions in a graph clustering algorithm. 3.1 Inducing Information Structure Categories with Latent Dirichlet Allocation Latent Dirichlet Allocation (LDA) LDA is a gener- ative process widely used for discovering latent top- ics in text documents (Blei et al., 2003). It assumes the following generative process for each document: 1. Choose θi ∼ Dirichlet(α), i ∈{1, ...,M} 2. Choose φk ∼ Dirichlet(β),k ∈{1, ...,K} 3. For each word wij,j ∈{1, ...,Ni} (a) Choose a topic zij ∼ Multinomial(θi) (b) Choose a word wij ∼ Multinomial(φzij), 133 where θi is the distribution of topics in document i, φk is the distribution of observed features (usually words) for topic k, zij is the topic of the j-th word in document i, and wij is the j-th word in document i. A number of inference techniques have been pro- posed for the parameter estimation of this process, e.g. variational Bayes (Blei et al., 2003) and Gibbs sampling (Griffiths and Steyvers, 2004) which we use in this work. Topics and Information Structure Categories A key challenge in the application of LDA to in- formation structure analysis is defining the observed features generated by the model. Topics are usually defined to be distributions over all the words in a document, but in our task this can lead to undesired topics. Consider, for example, the following sen- tences from the Introduction section of an article: - First, exposure to BD-diol via inhalation causes an increase in Hprt mutation frequency in both mice and rats (25). - Third, BD-diol is a precursor to MI, an important urinary metabolite in humans exposed to BD (19). 
In a word-based topic model we can expect that most of the content words in these sentences will be gen- erated by a single topic that can be titled as “BD- diol”, or by two different topics related to “mice rat” and “human”. However, information structure categories should reflect the role of the sentence in e.g. the discourse or argument structure of the pa- per. For example, given the AZ scheme both sen- tences should belong to the background (BKG) cate- gory (Table 1). The same requirement applies to the topics induced by the topic models. Features In applying LDA to AZ, we define top- ics as distributions over: (a) words of particular syn- tactic categories; (b) syntactic (POS tag) patterns; and (c) discourse markers (citations, tables and fig- ures). Below we list our features, among which Pro- noun, Conjunction, Adjective and Adverb are novel and the rest are adapted from (Guo et al., 2013a): Citation A single feature that aggregates together the various citation formats in scientific articles (e.g. [20] or (Tudek 2007)). Table, Figure A single feature representing any ref- erences to tables or figures in a sentence. Verb Verbs are central to the meaning of a sentence. Each of the base forms of the verbs in the corpus is a unique feature. Pronoun Personal (e.g. “we”) and possessive pro- nouns (e.g. “our”) and the following adjectives (as in e.g. “our recent” or “our previous”) may indicate the ownership of the work (e.g. the author’s own vs. other people’s work), which is important for our task. Each of the above words or word combinations is a unique feature. Conjunction Conjunctions indicate the relationship between different sentences in text. We consider two types of conjunctions: (1) coordinating conjunctions (indicated by the POS tag “CC” in the output of the C&C POS tagger); and (2) saturated clausal modi- fiers (indicated by the POS tag “IN” and the corre- sponding grammatical relation “cmod” in the output of the C&C parser). Each word that forms a con- junction according to this definition is a unique fea- ture. Adjective and Adverb Adjectives provide descrip- tive information about objects, while adverbs may change or qualify the meaning of verbs or adjectives. Each adverb and adjective that appears in more than 5 articles in the corpus is a unique feature.1 Modal, Tense, Voice Previous work has demon- strated a strong correlation between tense, voice, modals and information categories (e.g. (Guo et al., 2011a; Liakata et al., 2012)). These features are in- dicated by the part-of-speech (POS) tag of verbs. For example, the phrase “may have been investi- gated” is represented as “may-MD have-VBZ be- VBN verb-VBN”. As a pre-processing step, each sentence in the in- put corpus was represented with the list of features it consists of. Consider, for example, the following sentence from a Discussion section in our data-set: - In a previous preliminary study we reported that the results of a limited proof of concept human clinical trial using sulin- dac (1-5%) and hydrogen peroxide (25%) gels applied daily for three weeks on actinic keratoses (AK) involving the upper ex- tremities [27]. Before running the Discussion section topic model (see below for the features considered by this model), this sentence is converted to the fol- lowing representation: [cite] previous preliminary we limited The topic models we construct are assumed to gen- 1We collapsed adverbs ending with -ly into the correspond- ing adjectives to reduce data sparsity. 
Verbs were spared the frequency cut-off because rarely occurring verbs are likely to correspond to domain-specific actions that are probably indica- tive of the METH category. 134 Model Features Article Verb, Table, Figure, Modal, Tense, Voice Introduction Citation, Pronoun, Verb, Modal, Tense, Voice Discussion Citation, Pronoun, Conjunction, Adjective, Adverb Table 2: The features used in the article-level and the section-specific topic models in this paper erate these features rather than bag-of-words. Topic Models Construction Looking at the cate- gories in Table 1 it is easy to see that different com- binations of the features in topic model generation will be relevant for different category distinctions. For example, personal pronouns are particularly rel- evant for distinguishing between categories related to current vs. previous works. Some distinctions between categories are, in turn, more relevant for some sections than for others. For example, the distinction between the background (BKG) and the definition of the research problem (PROB) is important for the Introduction section, but less important for the results section. Similarly the distinction between conclusions (CON) and differ- ence from previous work (DIFF) is more relevant for the Discussion section than other sections. We therefore constructed two types of topic mod- els: section-specific and article-level models, rea- soning that some distinctions apply globally at the article level while some apply more locally at the section level. Section-specific models were con- structed for the Introduction section and for the Dis- cussion section.2 Table 2 presents the features that are used with each topic model. A key issue in the application of topic models to our task is the definition of the unit of text for which θi, the distribution over topics, is drawn from the Dirichlet distribution (step 1 of the algorithm). This choice is data dependent, and the standard choice is the document level. However, for scientific arti- cles the paragraph level is a better choice, because a paragraph contains only a small subset of informa- tion structure categories while in a full article cat- egories are more evenly distributed. We therefore adopted the paragraph as our basic unit of text. The section-level and the article-level models are applied 2The Methods section is less suitable for a section-level topic model as 97.5% of its sentences belong to its dominant category (METH) (Table 3). Preliminary experiments with section-level topic models for the Methods and Results sections did not lead to improved performance. to the collection of paragraphs in the specific section across the test set articles or in the entire set of test articles, respectively. 3.2 Declarative Knowledge Induction Most sentence-based information structure analysis approaches associate each sentence with a unique category. However, since the MAP assignment of topics to features associates each sentence with mul- tiple topics, we cannot directly interpret the resulting topics as categories of input sentences.3 In this section we present two methods for in- corporating the information conveyed by the topic models (see Section 3.1) in unsupervised models. The first method biases a graph clustering algorithm while the second generates constraints that can be used with a GE criterion. Graph Clustering We use the graph clustering objective of Dhillon et al. 
(2007) which can be opti- mized efficiently, without eigenvalues calculations: max Ỹ trace(Ỹ T W −1/2 AW −1/2 Ỹ ) where A is a similarity matrix, W is a diagonal matrix of the weight of each cluster, and Ỹ is an orthonormal matrix, indicating cluster membership, which is proportional to the square root of W . To make use of topics to bias the graph clustering towards the desired solution, we define the similarity matrix A, whose (i,j)−th entry corresponds to the i-th and j-th test set sentences as follows: A(i,j) = f(Si,Sj) + γg(Si,Sj,T), where Si = {All the features extracted from sentence i } T = {Tk|Tk = {top N features associated with topic k}} f(Si,Sj) = |Si ∩Sj| g(Si,Sj,T) = { 1 ∃x ∈ Si∃y ∈ Sj∃k x ∈ Tk ∧y ∈ Tk 0 Otherwise where Tk consists of the N features that are as- signed the maximum probability according to the k- th topic. Under this formulation, the topic model term g(·) is defined to be the indicator of whether two sentences share features associated with the same topic. If this is true, the algorithm is encour- aged to assign these sentences to the same cluster. Generalized Expectation A generalized expecta- tion (GE) criterion is a term in an objective function 3Our preliminary experiments demonstrated that assigning the learned topics to the test sentences performs poorly. 135 that assigns a score to model expectations (Mann and McCallum, 2008; Druck et al., 2008; Bellare et al., 2009). Given a score function g(·), a discrim- inative model pλ(y|x), a vector of feature functions f∗(·), and an empirical distribution p̃(x), the value of a GE criterion is: g(Ep̃(x)[Epλ(y|x)[f ∗ (x,y)]]) A popular choice of g(·) is a measure of distance (e.g. L2 norm) between model and reference expec- tations. The feature functions f∗(·) and the refer- ence expectations of f∗(·) are traditionally specified by experts, which provides a way to integrate declar- ative knowledge into machine learning. Consider a Maximum Entropy (MaxEnt) model pλ(y|x) = 1Zλexp(λ · f(x,y)), where f(·) is a vec- tor of feature functions, λ the feature weights, and Zλ the partition function. The following objective function can be used for training MaxEnt with GE criteria on unlabeled data: max λ −g(Ep̃(x)[Epλ(y|x)[f ∗ (x,y)]])− ∑ j λ2j 2σ2 where the second term is a zero-mean σ2-variance Gaussian prior on parameters. Let the k-th feature function f∗k(·) be an indicator function: f ∗ k(x,y) = 1{xik=1∧y=yk} (x,y) where xik is the ik-th element/feature in the feature vector x. The model expectation of f∗k(·) becomes: Ep̃(x)[Epλ(y|x)[f ∗ k(x,y)]] = p̃(xik = 1)pλ(yk|xik = 1) To calculate g(·), a reference expectation of f∗k(·) can be obtained after specifying (the upper and lower limits of) p(yk|xik = 1): lk ≤ p(yk|xik = 1) ≤ uk This type of constraints, for example, 0.9 ≤ p(CON|suggest) ≤ 1, have been successfully ap- plied to GE-based information structure analysis by Guo et al. (2013a). Here we build on their frame- work and our contribution is the automatic induction of such constraints by topic modeling. The association between features and topics can be transformed into constraints as follows. Let Wz be a set of top N key features associated with topic z – the N features that are assigned the maximum probability according to the topic. We compute the following topic-specific feature sets: Az = {w|w ∈ Wz ∧∀t 6= z w 6∈ Wt} – the set of features associated with topic z but not with any of the other topics; Bz = ⋃ t6=z Wt – the set of features associated with at least one topic other than z. 
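To make this construction concrete, the following is a minimal sketch (Python with NumPy; variable and function names are illustrative, not taken from our implementation) of how the topic-derived sets and the biased similarity matrix could be computed. It assumes the fitted topic-feature distributions are available as a K×V array (e.g. the φ estimates from Gibbs sampling) together with the feature vocabulary, and that each test sentence has already been converted to its set of features; the hyperparameters N (top features per topic) and γ are set in Section 4.

```python
import numpy as np

def top_features(phi, vocab, n_top):
    """W_z: the n_top features with the highest probability under each topic z."""
    return [{vocab[i] for i in np.argsort(-phi[z])[:n_top]}
            for z in range(phi.shape[0])]

def topic_specific_sets(W):
    """A_z: key features unique to topic z; B_z: key features of any other topic."""
    A, B = [], []
    for z, Wz in enumerate(W):
        others = set().union(*(W[t] for t in range(len(W)) if t != z))
        A.append(Wz - others)
        B.append(others)
    return A, B

def similarity_matrix(sent_features, W, gamma):
    """A(i,j) = f(S_i, S_j) + gamma * g(S_i, S_j, T): raw feature overlap plus a
    bias that fires when both sentences contain key features of a common topic."""
    n = len(sent_features)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Si, Sj = sent_features[i], sent_features[j]
            f = len(Si & Sj)
            g = int(any(Si & Wz and Sj & Wz for Wz in W))
            A[i, j] = f + gamma * g
    return A
```

The biased similarity matrix is what the clustering algorithm operates on, while the A_z and B_z sets are exactly the ones turned into probability constraints in the next paragraph.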
For every topic-feature pair (zk,wk) we therefore write the following constraint: lk ≤ p(zk|wk = 1) ≤ uk We set the probability range for the k-th pair as fol- lows: If wk ∈ Azk then lk = 0.9,uk = 1, If wk ∈ Bzk then lk = 0,uk = 0.1, In any other case lk = 0,uk = 1. The values of lk and uk were selected such that they reflect the strong association between the key fea- tures and their topics. Our basic reasoning is that if a sentence is represented by one of the key unique features of a given topic, it is highly likely to be as- sociated with that topic. Likewise, a sentence is un- likely to be associated with the topic of interest if it has a key feature for any other topics. 3.3 Summary of Contribution Learning with declarative knowledge is an active re- cent research avenue in the NLP community. In this framework feature-based models are augmented with domain and expert knowledge encoded most often by constraints of various types. The human effort involved with this framework is the manual specification of the declarative knowledge. This re- quires deep understanding of the domain and task in question. The resulting constraints typically spec- ify detailed associations between lexical, grammat- ical and discourse elements and the information to be learned (see, e.g., tables 2 and 3 of (Guo et al., 2013a) and table 1 of (Chang et al., 2007)). Our key contribution is the automatic induction of declarative knowledge that can be easily integrated into unsupervised models in the form of constraints and bias functions. Our model requires minimal do- main and task knowledge. We do not specify lists of words or discourse markers (as in (Guo et al., 2013a)) but, instead, our model automatically asso- ciates latent variables both with linguistic features, taken from a very broad and general feature set (e.g. 136 BKG PROB METH RES CON CN DIFF FUT Article 16.9 2.8 34.8 17.9 22.3 4.3 0.8 0.2 (8171) Introduction 74.8 13.2 5.4 0.6 5.9 0.1 - - (1160) Methods 0.5 0.2 97.5 1.4 0.2 0.2 0.1 - (2557) Results 4.0 2.1 11.7 68.9 12.1 1.1 0.1 - (2054) Discussion 16.9 1.1 0.7 1.5 63.5 13.3 2.4 0.7 (2400) Table 3: Distribution of sentences (shown in percentages) in articles and individual sections in the AZ-annotated corpus. The total number of sentences in each section appears in parentheses below the section name. all the words that belong to a given set of POS tags), and with sentences in the input text. In the next sec- tion we present our experiments which demonstrate the usefulness of this declarative knowledge. 4 Experiments Data and Models We used the full paper cor- pus earlier employed in (Guo et al., 2013a) which includes 8171 annotated sentences (with reported inter-annotator agreement: κ = .83) from 50 biomedical journal articles from the cancer risk as- sessment domain. One third of this corpus was saved for a development set on which our model was de- signed and its hyperparameters were tuned (see be- low). The corpus is annotated according to the Argu- mentative Zoning (AZ) scheme (Teufel and Moens, 2002; Mizuta et al., 2006) described in Table 1. Ta- ble 3 shows the distribution of AZ categories and the total number of sentences in each individual section. Since section names vary across articles, we grouped similar sections before calculating the statistics (e.g. Materials and Methods sections were grouped under Method). The table demonstrates that although there is a dominant category in each section (e.g. 
BKG in Introduction), up to 36.5% of the sentences in each section fall into other categories. Feature Extraction We used the C&C POS tag- ger and parser trained on biomedical literature (Cur- ran et al., 2007; Rimell and Clark, 2009) in the fea- ture extraction process. Lemmatization was done with Morpha (Minnen et al., 2001). Baselines We compared our models (TopicGC and TopicGE) against the following baselines: (a) an unconstrained unsupervised model – the unbiased version of the graph clustering we use for TopicGC (i.e. where g(·) is omitted, GC); (b) the unsuper- vised constrained GE method of (Guo et al., 2013a) where the constraints were created by experts (Ex- pertGE); (c) supervised unconstrained Maximum Entropy models, each trained to predict categories in a particular section using 150 sentences from that section, as in the lightly supervised case in (Guo et al., 2013a) (MaxEnt); and (d) a baseline that assigns all the sentences in a given section to the most fre- quent gold-standard category of that section (Table 3). This baseline emulates the use of section names for information structure classification. Our constraints, which we use in the TopicGE and TopicGC models, are based on topics that are learned on the test corpus. While having access to the raw test text at training time is a standard as- sumption in many unsupervised NLP works (e.g. (Klein and Manning, 2004; Goldwater and Grif- fiths, 2007; Lang and Lapata, 2014)), it is impor- tant to quantify the extent to which our method de- pends on its access to the test set. We therefore con- structed the TopicGE* model which is identical to TopicGE except that the topics are learned from an- other collection of 47 biomedical articles contain- ing 9352 sentences. Like our test set, these articles are from the cancer risk assessment domain - all of them were published in the Toxicol. Sci. journal in the years 2009-2012 and were retrieved using the PubMed search engine with the key words “cancer risk assessment”. There is no overlap between this new dataset and our test set (Guo et al., 2013a). Models and Parameters For graph clustering, we used the Graclus software (Dhillon et al., 2007). For GE and MaxEnt, we used the Mallet software (McCallum, 2002). The γ parameter in the graph clustering was set to 10 using the development data. Several values of this parameter in the range of [10,1000] yielded very similar performance. The number of key features considered for each topic, N, was set to 40, 20 and 15 for the article, Introduc- tion section, and Discussion section topic models, respectively. This difference reflects the number of feature types (Table 2) and the text volume (Table 3) of the respective models. Evaluation We evaluated the overall accuracy as well as the category-level precision, recall and F- score for each section. 
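As an illustration of this evaluation protocol, the sketch below (Python, assuming scikit-learn; the helper names are ours) computes the overall accuracy and the per-category precision, recall and F-score for one section. Because the unsupervised models output unlabeled clusters, it first applies the greedy many-to-one mapping described in the next paragraph, naming each induced cluster after the gold category with which it shares the most sentences.

```python
from collections import Counter
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def map_clusters_to_gold(clusters, gold):
    """Greedy many-to-one mapping: name each induced cluster after the gold
    category it shares the most sentences with."""
    naming = {c: Counter(g for ci, g in zip(clusters, gold) if ci == c)
                 .most_common(1)[0][0]
              for c in set(clusters)}
    return [naming[c] for c in clusters]

def evaluate_section(clusters, gold, categories):
    """Overall accuracy and per-category precision/recall/F1 for one section."""
    pred = map_clusters_to_gold(clusters, gold)
    acc = accuracy_score(gold, pred)
    p, r, f, _ = precision_recall_fscore_support(gold, pred, labels=categories,
                                                 zero_division=0)
    return acc, {c: (p[i], r[i], f[i]) for i, c in enumerate(categories)}
```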
TopicGC, TopicGE, Top- icGE* and the baseline GC methods are unsuper- 137 Introduction Method Result Discussion GC TGC TGE TGE* EGE MFC GC TGC TGE TGE* EGE MFC GC TGC TGE TGE* EGE MFC GC TGC TGE TGE* EGE MFC F1 BKG .78 .83 .89 .86 .87 .86 - - - - .07 - - - - - .46 - .47 .47 .45 .49 .46 - PROB .34 .16 .31 .19 .24 - - - - - .33 - - - - - .04 - - - - - .32 - METH - .16 .12 .16 .35 - .98 .98 .98 .98 .93 .99 .29 - .25 .32 .29 - - - - - .14 - RES - - - - .07 - - - - - .27 - .67 .82 .81 .77 .80 .82 - - - - .14 - CON - .10 .26 .03 .28 - - - - - - - .39 .28 .27 .29 .42 - .82 .83 .82 .82 .71 .78 CN - - - - - - - - - - - - - - - - .25 - - .21 .23 .11 .20 - DIFF - - - - - - - - - - - - - - - - - - - - - - .12 - FUT - - - - - - - - - - - - - - - - - - - - - - .36 - Acc. .61 .68 .77 .74 .72 .75 .97 .97 .97 .97 .87 .97 .51 .68 .67 .62 .64 .69 .66 .67 .67 .67 .56 .63 Table 4: Performance (class based F1-score and overall accuracy (Acc.)) of unbiased Graph Clustering (GC), Graph Clustering with declarative knowledge learned from topic modeling (TopicGC model, TGC column), Generalized Expectation using constraints learned from topic modeling (TopicGE, TGE) and the same model where constraints are learned using an external set of articles (TopicGE*, TGE*), GE with constraints created by experts (ExpertGE, EGE - a replication of (Guo et al., 2013a)) and the most frequent gold standard category of the section (MFC) vised and therefore induce unlabeled categories. To evaluate their output against the gold standard AZ annotation we first apply a standard greedy many- to-one mapping (naming) scheme in which each in- duced category is mapped to the gold category that shares the highest number of elements (sentence) with it (Reichart and Rappoport, 2009). The to- tal number of induced topics was 9 with each topic model inducing three topics.4 For light supervision, a ten-fold cross-validation scheme was applied. In addition, we compare the quality of the auto- matically induced and manually constructed declar- ative knowledge in the context of customized sum- marization (Contractor et al., 2012) where sum- maries of specific types of information in an article are to be generated (we focused on the article’s con- clusions). While an intuitive solution would be to summarize the Discussion section of a paper, only 63.5% of its sentences belong to the gold standard Conclusion category (Table 3). For our experiment, we first generated five sets of sentences. The first four sets consist of the ar- ticle sentences annotated with the CON category ac- cording to: TopicGE or TopicGC or ExpertGE or the gold standard annotation. The fifth set is the Discus- sion section. We then used Microsoft AutoSumma- rize (Microsoft, 2007) to select sentences from each of the five sets such that the number of words in each summary amounts for 10% of the words in the input. 4The number of gold standard AZ categories is 8. However, we wanted each of our topic models to induce the same number of topics in order to reduce the number of parameters to the required minimum. For evaluation, we asked an expert to summarize the conclusions of each article in the corpus. We then evaluated the five summaries against the gold- standard summaries written by the expert in terms of various ROUGE scores (Lin, 2004). 5 Results We report here the results for our constrained unsu- pervised models compared to the baselines. 
We start with quantitative evaluation and continue with qual- itative demonstration of the topics learned by the topic models and their key features which provide the substance for the constraints and bias functions used in our information structure models. Unsupervised Learning Results Table 4 presents the performance of the four main unsupervised learning models discussed in this paper: GC, Top- icGC, TopicGE, and ExpertGE of (Guo et al., 2013a). Our models (TopicGC and TopicGE) out- perform the ExpertGE when considering category based F-score for the dominant categories of each section. ExpertGE is most useful in identifying the less frequent categories of each section (Table 3), which is in line with (Guo et al., 2013a). The overall sentence-based accuracy of TopicGE is significantly higher than that of ExpertGE for all four sections (bottom line of the table). Furthermore, for all four sections it is one of our models (TopicGC or Top- icGE) that provides the best result under this mea- sure, among the unsupervised models. The table further provides a comparison of the un- supervised models to the MFC baseline which as- signs all the sentences of a section to its most fre- 138 Introduction Method Result Discussion TopicGE Light TopicGE Light TopicGE Light TopicGE Light P R F P R F P R F P R F P R F P R F P R F P R F BKG .84 .95 .89 .78 .99 .87 - - - - - - - - - - - - .41 .51 .45 .38 .19 .25 PROB .33 .30 .31 .57 .11 .18 - - - - - - - - - .25 .02 .04 - - - - - - METH .40 .07 .12 .50 .21 .30 .97 1 .98 .97 1 .98 .34 .20 .25 .62 .14 .23 - - - - - - RES - - - - - - - - - - - - .74 .90 .81 .71 .98 .82 - - - - - - CON .44 .18 .26 .80 .06 .11 - - - - - - .30 .25 .27 .57 .16 .25 .78 .87 .82 .69 .96 .80 CN - - - - - - - - - - - - - - - - - - .32 .18 .23 .35 .06 .10 DIFF - - - - - - - - - - - - - - - - - - - - - - - - FUT - - - - - - - - - - - - - - - - - - - - - - - - Acc. 0.77 0.77 0.97 0.97 0.67 0.70 0.67 0.66 Table 5: Performance (class based Precision, Recall and F-score as well as overall accuracy (Acc.)) of the TopicGE model and of an unconstrained MaxEnt model trained with Light supervision (total of 600 sentences - 150 training sentences for each section-level model). The same pattern of results holds when the MaxEnt is trained with up to 2000 sentences (500 sentences for each section-level model). TopicGE TopicGC ExpertGE Section Gold R P F R P F R P F R P F R P F ROUGE-1 45.2 54.0 46.8 43.5 55.1 46.1 43.7 49.1 43.8 46.7 43.8 42.6 43.3 55.4 46.2 ROUGE-2 30.0 35.8 30.8 28.4 35.7 29.8 25.5 28.2 25.2 28.6 26.3 25.8 27.8 35.1 29.3 ROUGE-L 43.3 51.6 44.8 41.6 52.6 44.1 41.3 46.2 41.3 44.2 41.3 40.3 41,1 52.3 43.7 Table 6: ROUGE scores of zone (TopicGE, TopicGC, ExpertGE or gold standard) and Discussion section based sum- maries. TopicGE provides the best summaries. TopicGC outperforms ExpertGE and the Discussion section systems and in two measures the gold categorization based system as well. Result patterns with ROUGE(3,4,W-1.2, S* and SU*) are very similar to those of the table. The differences between TopicGE and ExpertGE are statistically significant using t-test with p < 0.05. The differences between TopicGE and gold, as well as between ExpertGE and gold are not statistically significant. quent category according to the gold standard. This baseline sheds light on the usefulness of section names for our task. As is evident from the table, while this baseline is competitive with the unsuper- vised models in terms of accuracy, its class-based F-score performance is quite poor. 
Not only does it lag behind the unsupervised models in terms of the F-score of the most frequent classes of the Introduc- tion and Discussion sections, but it does not iden- tify any of the classes except from the most frequent ones in any of the sections - a task the unsupervised models often perform with reasonable quality. Finally, the table also presents the performance of the TopicGE* model for which constraints are leaned from an external data set - different from the test set. The results show that there is no substantial difference between the performance of the TopicGE and TopicGE* models. While TopicGE achieves better F-scores in five of the cases in the table, Top- icGE* is better in four cases and the performance is identical in two cases. Section level accuracies are better for TopicGE in two of the four sections, but the difference is only 3-5%. Comparison with Supervised Learning Table 5 compares the quality of unsupervised constrained- based learning with that of lightly supervised feature-based learning. Since our models, TopicGC and TopicGE, perform quite similarly, we included only TopicGE in this evaluation. The lightly su- pervised models (MaxEnt classifiers) were trained with a total of 600 sentences - 150 for each section- specific classifier. The table demonstrates that Top- icGE outperforms MaxEnt with light supervision in terms of class based F-scores in the Introduction and Discussion sections. In the Methods section, where 97.5% of the sentences belong to the most frequent category, and in the Results section, the models per- form quite similarly. Overall accuracy numbers are quite similar for both models with MaxEnt doing better for the Results section and TopicGE for the Discussion section. These results further demon- strate that unsupervised constrained learning pro- vides a practical solution to information structure analysis of scientific articles. Extractive Summarization Evaluation Table 6 presents the average ROUGE scores for zone- based (TopicGE, TopicGC, ExpertGE and gold) and section-based summaries across our test set articles. 
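As a rough illustration of how such scores can be obtained for a single article, the sketch below compares a zone-based summary against the expert-written reference using the rouge_score package. This is an assumption about tooling (the experiments used ROUGE as described by Lin (2004), not necessarily this package), and the variable names are hypothetical.

```python
from rouge_score import rouge_scorer

def score_summary(candidate_summary, expert_summary):
    """ROUGE-1, ROUGE-2 and ROUGE-L recall/precision/F for one article."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    # score() takes the reference first, then the candidate summary.
    scores = scorer.score(expert_summary, candidate_summary)
    return {name: (s.recall, s.precision, s.fmeasure)
            for name, s in scores.items()}

# Hypothetical usage: evaluate the summary built from sentences that TopicGE
# labeled CON against the expert reference, then average over the test articles.
# rouge = score_summary(topicge_con_summary, expert_reference)
```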
139 Topic Features 1 {do} be {done}{doing}{be done}{have been done} induce {may do}{to do} show have {have done} increase {did} suggest indicate report cause include inhibit find observe involve associate activate demonstrate result use lead play {could do} know {do do} form contribute {can do}{would do} promote reduce 2 {were done} {done} {doing} {did} use be describe contain perform incubate {do} determine analyze follow add isolate purchase wash accord {to do} treat collect remove prepare obtain measure store stain centrifuge transfer detect purify assess supplement carry dissolve plate receive kill 3 {did}{done} be {doing}{were done} [tab fig] {do} show increase observe compare {to do} expose use have find {did do} treat {be done} report follow drink reduce result administer decrease determine measure include evaluate affect detect induce indicate associate provide reveal suggest occur Table 7: Topics and key features extracted by the article-level model (including modal, tense and voice marked in curly brackets, reference to tables or figures marked in square brackets, and verbs in the base form) Topic Features 1 [no cite] {did} (we) {done}{do}{doing} use {were done} (present) {to do} investigate be (mammary) determine provide (our) treat compare examine 2 {did}{done} [cite] {doing}{were done} be expose find [no cite] drink increase report (recent) (previous) admin- ister {do} contain evaluate (early) 3 {do} [cite] be {done} [no cite] {doing} {be done} {have been done} induce {have done} (it) show {may do} have {to do} include increase (their) associate Table 8: Topics and key features extracted by the section-specific topic model of the Introduction section (including citations marked in square brackets, pronouns and the follow-up adjective modifiers marked in parentheses, modal, tense and voice marked in curly brackets, and verbs in their base form) Topic Features 1 (we) [no cite] (our) higher (mammary) as because (first) significant possible high (early) (positive) most 2 [cite] present (present) (previous) similar different (its) although consistent furthermore greater due most whereas 3 [no cite] not also (it) but however more (their) both therefore only thus significant lower Table 9: Topics and key features extracted by the section-specific topic model of the Discussion section (including citations marked in square brackets, pronouns and the follow-up adjective modifiers marked in parentheses, and con- junctions, adjectives and adverbs) TopicGE and TopicGC based summaries outperform the other systems, even the one that uses gold stan- dard information structure categorization. A poten- tial explanation for the better performance of our models compared to ExpertGE is that the relative strength of our models is in identifying the major category of each section while ExpertGE is better at identifying low or medium frequency categories. Qualitative Analysis We next provide a qualita- tive analysis of the topics induced by our topic mod- els — the article-level model as well as the section- level models — and their key features. Note that both our models, TopicGE and TopicGC, assume that the induced topics provide a good approxima- tion of the information structure categories and build their constraints (expert knowledge) from these top- ics accordingly. Below we examine this assumption. Table 7 presents the topics and key features ob- tained from global topic modeling applied to full ar- ticles. 
The table reveals a strong correlation between present/future tense and topic 1, and between past tense and topics 2 and 3 (Modal, Tense and Voice features). The table further demonstrates that top- ics 1 and 3 are linked to verbs that describe research findings, such as “show” and “demonstrate” in topic 1, and “report” and “indicate” in topic 3, whereas topic 2 seems related to verbs that describe methods and experiments such as “use” and “prepare“. The feature corresponding to tables and figures [tab fig] is only seen in topic 3. Based on these observations, topics 1, 2 and 3 seem to be related to AZ categories CON, METH and RES respectively. Tables 8 and 9 present the topics and the key fea- tures obtained from the section-specific topic mod- 140 eling for the Introduction and Discussion sections. Due to space limitations we cannot provide a de- tailed analysis of the information included in these tables, but it is easy to see that they provide evi- dence for the correlation between topics in the sec- tion specific models and AZ categories. Table 8 demonstrates that for the Introduction section topic 1 correlates with the author’s work and topics 2 and 3 with previous work. Table 9 shows that for the Discussion section topics 1 and 3 well correlate with the AZ CON category and topic 2 with the BKG, CN and DIFF categories. Our analysis therefore demon- strates that the induced topics are well aligned with the actual categories of the AZ classification scheme or with distinctions (e.g. the author’s own work vs. works of others) that are very relevant for this scheme. Note that we have not seeded our models with word-lists and the induced topics are therefore purely data-driven. 6 Discussion We presented a new framework for automatic in- duction of declarative knowledge and applied it to constraint-based modeling of the information struc- ture analysis of scientific documents. Our main con- tribution is a topic-model based method for unsuper- vised acquisition of lexical, syntactic and discourse knowledge guided by the notion of topics and their key features. We demonstrated that the induced top- ics and key features can be used with two differ- ent unsupervised learning methods – a constrained unsupervised generalized expectation model and a graph clustering formulation. Our results show that this novel framework rivals more supervised alterna- tives. Our work therefore contributes to the impor- tant challenge of automatically inducing declarative knowledge that can reduce the dependence of ML algorithms on manually annotated data. The next natural step in this research is generaliz- ing our framework and make it applicable to more applications, domains and machine learning mod- els. We are currently investigating a number of ideas which will hopefully lead to better natural language learning with reduced human supervision. References Sam Anzaroot, Alexandre Passos, David Belanger, and Andrew McCallum. 2014. Learning soft linear con- straints with application to citation field extraction. In ACL, pages 593–602. Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expec- tation constraints. In Proceedings of the 25th Con- ference on Uncertainty in Artificial Intelligence, pages 43–50. Catherine Blake. 2009. Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles. Journal of Biomedical Informat- ics, 43(2):173–189. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. 
Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. Jill Burstein, Daniel Marcu, and Kevin Knight. 2003. Finding the write stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1):32–39. Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint- driven learning. In ACL, pages 280–287. Danish Contractor, Yufan Guo, and Anna Korhonen. 2012. Using argumentative zones for extractive sum- marization of scientific articles. In COLING, pages 663–678. James Curran, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale nlp with c&c and boxer. In Proceedings of the ACL 2007 Demo and Poster Sessions, pages 33–36. Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. 2007. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 29(11):1944– 1957. Gregory Druck, Gideon Mann, and Andrew McCallum. 2008. Learning from labeled features using gener- alized expectation criteria. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 595–602. Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for struc- tured latent variable models. Journal of Machine Learning Research, 11:2001–2049. Sharon Goldwater and Tom Griffiths. 2007. A fully bayesian approach to unsupervised part-of-speech tag- ging. In ACL, pages 744–751. Thomas L Griffiths and Mark Steyvers. 2004. Find- ing scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235. 141 Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins Karolinska, Lin Sun, and Ulla Stenius. 2010. Identify- ing the information structure of scientific abstracts: an investigation of three different schemes. In BioNLP, pages 99–107. Yufan Guo, Anna Korhonen, and Thierry Poibeau. 2011a. A weakly-supervised approach to argumenta- tive zoning of scientific documents. In EMNLP, pages 273–283. Yufan Guo, Anna Korhonen, Ilona Silins, and Ulla Ste- nius. 2011b. Weakly-supervised learning of infor- mation structure of scientific abstracts–is it accurate enough to benefit real-world tasks in biomedicine? Bioinformatics, 27(22):3179–3185. Yufan Guo, Roi Reichart, and Anna Korhonen. 2013a. Improved information structure analysis of scientific documents through discourse and lexical constraints. In NAACL HLT, pages 928–937. Yufan Guo, Ilona Silins, Ulla Stenius, and Anna Korho- nen. 2013b. Active learning-based information struc- ture analysis of full scientific articles and two applica- tions for biomedical literature review. Bioinformatics, 29(11):1440–1447. Kenji Hirohata, Naoaki Okazaki, Sophia Ananiadou, and Mitsuru Ishizuka. 2008. Identifying sections in sci- entific abstracts using conditional random fields. In IJCNLP, pages 381–388. Douwe Kiela, Yufan Guo, Ulla Stenius, and Anna Ko- rhonen. 2014. Unsupervised discovery of informa- tion structure in biomedical documents. Bioinformat- ics, page btu758. Dan Klein and Christopher Manning. 2004. Corpus- based induction of syntactic structure: Models of de- pendency and constituency. In ACL, pages 478–485. Joel Lang and Mirella Lapata. 2014. Similarity-driven semantic role induction via graph partitioning. Com- putational Linguistics, 40(3):633–669. Maria Liakata, Simone Teufel, Advaith Siddharthan, and Colin Batchelor. 2010. Corpora for conceptualization and zoning of scientific papers. 
In LREC, pages 2054– 2061. Maria Liakata, Shyamasree Saha, Simon Dobnik, Colin Batchelor, and Dietrich Rebholz-Schuhmann. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7):991–1000. Jimmy Lin, Damianos Karakos, Dina Demner-Fushman, and Sanjeev Khudanpur. 2006. Generative content models for structural analysis of medical abstracts. In BioNLP, pages 65–72. Chin-Yew Lin. 2004. ROUGE: A package for auto- matic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Gideon S Mann and Andrew McCallum. 2007. Simple, robust, scalable semi-supervised learning via expecta- tion regularization. In ICML, pages 593–600. Gideon S. Mann and Andrew McCallum. 2008. General- ized expectation criteria for semi-supervised learning of conditional random fields. In ACL, pages 870–878. Katja Markert, Yufang Hou, and Michael Strube. 2012. Collective classification for fine-grained information status. In ACL, pages 795–804. Andrew Kachites McCallum. 2002. MAL- LET: A machine learning for language toolkit. http://mallet.cs.umass.edu. David McClosky and Christopher D. Manning. 2012. Learning constraints for consistent timeline extraction. In EMNLP-CoNLL, pages 873–882. Microsoft. 2007. AutoSummarize: Automatically summarize a document. https://support.office.com/en- us/article/Automatically-summarize-a-document- b43f20ae-ec4b-41cc-b40a-753eed6d7424. Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of english. Natural Language Engineering, 7(3):207–223. Yoko Mizuta, Anna Korhonen, Tony Mullen, and Nigel Collier. 2006. Zone analysis in biology articles as a basis for information extraction. International Journal of Medical Informatics on Natural Language Process- ing in Biomedicine and Its Applications, 75(6):468– 487. Diarmuid Ó Séaghdha and Simone Teufel. 2014. Unsu- pervised learning of rhetorical structure with un-topic models. In Proceedings of COLING 2014: Technical Papers, pages 2–13. Roi Reichart and Regina Barzilay. 2012. Multi-event ex- traction guided by global constraints. In NAACL HLT, pages 70–79. Roi Reichart and Anna Korhonen. 2012. Document and corpus level inference for unsupervised and transduc- tive learning of information structure of scientific doc- uments. In COLING, pages 995–1006. Roi Reichart and Ari Rappoport. 2009. The nvi cluster- ing evaluation measure. In CoNLL, pages 165–173. Laura Rimell and Stephen Clark. 2009. Porting a lexicalized-grammar parser to the biomedical domain. Journal of Biomedical Informatics, 42(5):852–865. Patrick Ruch, Clia Boyer, Christine Chichester, Imad Tbahriti, Antoine Geissbhler, Paul Fabry, Julien Gob- eill, Violaine Pillet, Dietrich Rebholz-Schuhmann, Christian Lovis, and Anne-Lise Veuthey. 2007. Using argumentation to extract key sentences from biomedi- cal abstracts. International Journal of Medical Infor- matics, 76(2-3):195–200. 142 Alexander Rush, Roi Reichart, Michael Collins, and Amir Globerson. 2012. Improved parsing and pos tag- ging using inter-sentence consistency constraints. In EMNLP-CoNLL, pages 1434–1444. Hagit Shatkay, Fengxia Pan, Andrey Rzhetsky, and W John Wilbur. 2008. Multi-dimensional classifica- tion of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioin- formatics, 24(18):2086–2093. Imad Tbahriti, Christine Chichester, Frédérique Lisacek, and Patrick Ruch. 2006. 
Using argumentation to retrieve articles with similar citations. International Journal of Medical Informatics, 75(6):488–495. Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445. Simone Teufel, Advaith Siddharthan, and Colin Batchelor. 2009. Towards domain-independent argumentative zoning: Evidence from chemistry and computational linguistics. In EMNLP, pages 1493–1502. Andrea Varga, Daniel Preotiuc-Pietro, and Fabio Ciravegna. 2012. Unsupervised document zone identification using probabilistic graphical models. In LREC, pages 1610–1617. Bonnie Webber, Markus Egg, and Valia Kordoni. 2011. Discourse structure and language technology. Natural Language Engineering, 18(4):437–490.