Learning Structured Text Representations

Yang Liu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
yang.liu2@ed.ac.uk, mlap@inf.ed.ac.uk

Abstract

In this paper, we focus on learning structure-aware document representations from data without recourse to a discourse parser or additional annotations. Drawing inspiration from recent efforts to empower neural networks with a structural bias (Cheng et al., 2016; Kim et al., 2017), we propose a model that can encode a document while automatically inducing rich structural dependencies. Specifically, we embed a differentiable non-projective parsing algorithm into a neural model and use attention mechanisms to incorporate the structural biases. Experimental evaluations across different tasks and datasets show that the proposed model achieves state-of-the-art results on document modeling tasks while inducing intermediate structures which are both interpretable and meaningful.

1 Introduction

Document modeling is a fundamental task in Natural Language Processing, useful to various downstream applications including topic labeling (Xie and Xing, 2013), summarization (Chen et al., 2016; Wolf and Gibson, 2006), sentiment analysis (Bhatia et al., 2015), question answering (Verberne et al., 2007), and machine translation (Meyer and Webber, 2013). Recent work provides strong evidence that better document representations can be obtained by incorporating structural knowledge (Bhatia et al., 2015; Ji and Smith, 2017; Yang et al., 2016). Inspired by existing theories of discourse, representations of document structure have assumed several guises in the literature, such as trees in the style of Rhetorical Structure Theory (RST; Mann and Thompson, 1988), graphs (Lin et al., 2011; Wolf and Gibson, 2006), entity transitions (Barzilay and Lapata, 2008), or combinations thereof (Lin et al., 2011; Mesgar and Strube, 2015).
The availability of discourse annotated corpora (Carlson et al., 2001; Prasad et al., 2008) has led to the development of off-the-shelf discourse parsers (e.g., Feng and Hirst, 2012; Liu and Lapata, 2017), and to the common use of trees as representations of document structure. For example, Bhatia et al. (2015) improve document-level sentiment analysis by reweighting discourse units based on the depth of RST trees, whereas Ji and Smith (2017) show that a recursive neural network built on the output of an RST parser benefits text categorization by learning representations that focus on salient content.

Linguistically motivated representations of document structure rely on the availability of annotated corpora as well as a wider range of standard NLP tools (e.g., tokenizers, POS taggers, syntactic parsers). Unfortunately, the reliance on labeled data, which is both difficult and highly expensive to produce, presents a major obstacle to the widespread use of discourse structure for document modeling. Moreover, despite recent advances in discourse processing, the use of an external parser often leads to pipeline-style architectures where errors propagate to later processing stages, affecting model performance.

It is therefore not surprising that there have been attempts to induce document representations directly from data without recourse to a discourse parser or additional annotations. The main idea is to obtain hierarchical representations by first building representations of sentences, and then aggregating those into a document representation (Tang et al., 2015a,b). Yang et al. (2016) further demonstrate how to implicitly inject structural knowledge into the representation using an attention mechanism (Bahdanau et al., 2015) which acknowledges that sentences are differentially important in different contexts. Their model learns to pay more or less attention to individual sentences when constructing the representation of the document.

Our work focuses on learning deeper structure-aware document representations, drawing inspiration from recent efforts to empower neural networks with a structural bias (Cheng et al., 2016). Kim et al. (2017) introduce structured attention networks, generalizations of the basic attention procedure which allow the model to learn sentential representations while attending to partial segmentations or subtrees. Specifically, they take into account the dependency structure of a sentence by viewing the attention mechanism as a graphical model over latent variables. They first calculate unnormalized pairwise attention scores for all tokens in a sentence and then use the inside-outside algorithm to normalize the scores with the marginal probabilities of a dependency tree. Without recourse to an external parser, their model learns meaningful task-specific dependency structures, achieving competitive results on several sentence-level tasks. However, for document modeling, this approach has two drawbacks. Firstly, it does not consider non-projective dependency structures, which are common in document-level discourse analysis (Hayashi et al., 2016; Lee et al., 2006). As illustrated in Figure 1, the tree structure of a document can be flexible and the dependency edges may cross.
Secondly, the inside-outside algorithm involves a dynamic programming process which is difficult to parallelize, making it impractical for modeling long documents.[1]

[1] In our experiments, adding the inside-outside pass increases training time by a factor of 10.

Figure 1: The document is analyzed in the style of Rhetorical Structure Theory (Mann and Thompson, 1988), and represented as a dependency tree following the conversion algorithm of Hayashi et al. (2016). The four units of the example are: (1) The next time you hear a Member of Congress moan about the deficit, consider what Congress did Friday. (2) The Senate, 84-6, voted to increase to $124,000 the ceiling on insured mortgages from the FHA, which lost $4.2 billion in loan defaults last year. (3) Then, by voice vote, the Senate voted a porkbarrel bill, approved Thursday by the House, for domestic military construction. (4) The Bush request to what the Senators gave themselves.

In this paper, we propose a new model for representing documents while automatically learning richer structural dependencies. Using a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984), our model implicitly considers non-projective dependency tree structures. We keep each step of the learning process differentiable, so the model can be trained in an end-to-end fashion and induce discourse information that is helpful to specific tasks without an external parser. The inside-outside model of Kim et al. (2017) and our model both have $O(n^3)$ worst-case complexity. However, the major operations in our approach can be parallelized efficiently on GPU computing hardware. Although our primary focus is on document modeling, there is nothing inherent in our model that prevents its application to individual sentences. Advantageously, it can induce non-projective structures, which are required for representing languages with free or flexible word order (McDonald and Satta, 2007).

Our contributions in this work are threefold: a model for learning document representations whilst taking structural information into account; an efficient training procedure which allows us to compute representations for documents of arbitrary length; and a large-scale evaluation study showing that the proposed model performs competitively against strong baselines while inducing intermediate structures which are both interpretable and meaningful.

2 Background

In this section, we describe how previous work uses the attention mechanism for representing individual sentences. The key idea is to capture the interaction between tokens within a sentence, generating a context representation for each word with weak structural information. This type of intra-sentence attention encodes relationships between words within each sentence and differs from inter-sentence attention, which has been widely applied to sequence transduction tasks like machine translation (Bahdanau et al., 2015) and learns the latent alignment between source and target sequences.

Figure 2: Intra-sentential attention mechanism; $a_{ij}$ denotes the normalized attention score between tokens $u_i$ and $u_j$.

Figure 2 provides a schematic view of the intra-sentential attention mechanism.
Given a sentence represented as a sequence of n word vectors $[u_1, u_2, \cdots, u_n]$, for each word pair $\langle u_i, u_j \rangle$ the attention score $a_{ij}$ is estimated as:

$f_{ij} = F(u_i, u_j)$    (1)

$a_{ij} = \frac{\exp(f_{ij})}{\sum_{k=1}^{n} \exp(f_{ik})}$    (2)

where $F(\cdot)$ is a function computing the unnormalized score $f_{ij}$, which is then normalized into a probability distribution $a_{ij}$. Individual words collect information from their context based on $a_{ij}$ and obtain a context representation:

$r_i = \sum_{j=1}^{n} a_{ij} u_j$    (3)

where attention score $a_{ij}$ indicates the (dependency) relation between the i-th and the j-th words and how information from $u_j$ should be fed into $u_i$.

Despite successful applications of the above attention mechanism in sentiment analysis (Cheng et al., 2016) and entailment recognition (Parikh et al., 2016), the structural information under consideration is shallow, limited to word-word dependencies. Since attention is computed as a simple probability distribution, it cannot capture more elaborate structural dependencies such as trees (or graphs). Kim et al. (2017) induce richer internal structure by imposing structural constraints on the probability distribution computed by the attention mechanism. Specifically, they normalize $f_{ij}$ with a projective dependency tree using the inside-outside algorithm (Baker, 1979):

$f_{ij} = F(u_i, u_j)$    (4)

$a = \text{inside-outside}(f)$    (5)

$r_i = \sum_{j=1}^{n} a_{ij} u_j$    (6)

This process is differentiable, so the model can be trained end-to-end and learn structural information without relying on a parser. However, efficiency is a major issue, since the inside-outside algorithm has time complexity $O(n^3)$ (where n represents the number of tokens) and does not lend itself to easy parallelization. This high-order complexity renders the approach impractical for real-world applications.

3 Encoding Text Representations

In this section we present our document representation model. We follow previous work (Tang et al., 2015a; Yang et al., 2016) in modeling documents hierarchically by first obtaining representations for sentences and then composing those into a document representation. Structural information is taken into account while learning representations for both sentences and documents, and an attention mechanism is applied both to words within a sentence and to sentences within a document. The general idea is to force pair-wise attention between text units to form a non-projective dependency tree, and to automatically induce this tree for different natural language processing tasks in a differentiable way. In the following, we first describe how the attention mechanism is applied to sentences, and then move on to present our document-level model.

3.1 Sentence Model

Let $T = [u_1, u_2, \cdots, u_n]$ denote a sentence containing a sequence of words, each represented by a vector $u$, which can be pre-trained on a large corpus. Long Short-Term Memory Neural Networks (LSTMs; Hochreiter and Schmidhuber, 1997) have been successfully applied to various sequence modeling tasks ranging from machine translation (Bahdanau et al., 2015) to speech recognition (Graves et al., 2013) and image caption generation (Xu et al., 2015).
In this paper we use bidirectional LSTMs as a way of representing elements in a sequence (i.e., words or sentences) together with their contexts, capturing the element and an "infinite" window around it. Specifically, we run a bidirectional LSTM over sentence $T$ and take the output vectors $[h_1, h_2, \cdots, h_n]$ as the representations of the words in $T$, where $h_t \in \mathbb{R}^{k}$ is the output vector for word $u_t$ based on its context.

We then exploit the structure of $T$, which we induce based on an attention mechanism detailed below, to obtain more precise representations. Inspired by recent work (Daniluk et al., 2017; Miller et al., 2016), which shows that the conventional way of using LSTM output vectors for calculating both attention and word semantics is overloaded and likely to cause performance deficiencies, we decompose the LSTM output vector into two parts:

$[e_t, d_t] = h_t$    (7)

where $e_t \in \mathbb{R}^{k_e}$, the semantic vector, encodes semantic information for specific tasks, and $d_t \in \mathbb{R}^{k_s}$, the structure vector, is used to calculate structured attention.

We use a series of operations based on the Matrix-Tree Theorem (Tutte, 1984) to incorporate the structural bias of non-projective dependency trees into the attention weights. We constrain the probability distributions $a_{ij}$ (see Equation (2)) to be the posterior marginals of a dependency tree structure. We then use the normalized structured attention to build a context vector for updating the semantic vector of each word, obtaining new representations $[r_1, r_2, \cdots, r_n]$. An overview of the model is presented in Figure 3. We describe the attention mechanism in detail in the following section.

Figure 3: Sentence representation model: $u_t$ is the input vector for the t-th word; $e_t$ and $d_t$ are the semantic and structure vectors, respectively.

3.2 Structured Attention Mechanism

Dependency representations of natural language are a simple yet flexible mechanism for encoding words and their syntactic relations through directed graphs. Much work in descriptive linguistics (Melčuk, 1988; Tesnière, 1959) has advocated their suitability for representing syntactic structure across languages. A primary advantage of dependency representations is that they have a natural mechanism for representing discontinuous constructions, arising from long distance dependencies or free word order, through non-projective dependency edges.

More formally, building a dependency tree amounts to finding latent variables $z_{ij}$ for all $i \neq j$, where word $i$ is the parent node of word $j$, under some global constraints, amongst which the single-head constraint is the most important, since it forces the structure to be a rooted tree. We use a variant of Kirchhoff's Matrix-Tree Theorem (Koo et al., 2007; Tutte, 1984) to calculate the marginal probability $P(z_{ij} = 1)$ of each dependency edge of a non-projective dependency tree, and this probability is used as the attention weight that decides how much information is collected from child unit $j$ by parent unit $i$.

We first calculate unnormalized attention scores $f_{ij}$ from the structure vectors $d$ (see Equation (7)) via a bilinear function:

$t_p = \tanh(W_p d_i)$    (8)

$t_c = \tanh(W_c d_j)$    (9)

$f_{ij} = t_p^{T} W_a t_c$    (10)

where $W_p \in \mathbb{R}^{k_s \times k_s}$ and $W_c \in \mathbb{R}^{k_s \times k_s}$ are the weights for building the representations of parent and child nodes, and $W_a \in \mathbb{R}^{k_s \times k_s}$ is the weight of the bilinear transformation. The matrix $f \in \mathbb{R}^{n \times n}$ can be viewed as a weighted adjacency matrix for a graph $G$ with $n$ nodes, where each node corresponds to a word in the sentence. We also calculate a root score $f_i^r$, indicating the unnormalized possibility of a node being the root:

$f_i^r = W_r d_i$    (11)

where $W_r \in \mathbb{R}^{1 \times k_s}$.
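For concreteness, the score computation in Equations (8)-(11) amounts to two affine maps, a tanh nonlinearity, and a bilinear product. The following is a minimal NumPy sketch (ours, not the released implementation); the weight matrices are assumed to be given, although in the model they are learned parameters. Note that replacing the structured normalization introduced next with a plain row-wise softmax over f would recover the simple attention of Equations (1)-(3).

```python
import numpy as np

def attention_scores(D, Wp, Wc, Wa, wr):
    """Unnormalized edge and root scores (Equations (8)-(11)).
    D:  (n, ks) matrix of structure vectors d_1..d_n.
    Wp, Wc, Wa: (ks, ks) learned weights; wr: (ks,) root weight vector.
    Returns f (n, n), where f[i, j] scores the edge i -> j,
    and f_root (n,), the score of word i being the root."""
    t_parent = np.tanh(D @ Wp.T)   # row i is t_p = tanh(Wp d_i)
    t_child = np.tanh(D @ Wc.T)    # row j is t_c = tanh(Wc d_j)
    f = t_parent @ Wa @ t_child.T  # f[i, j] = t_p^T Wa t_c
    f_root = D @ wr                # f_i^r = W_r d_i
    return f, f_root
```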
We calculate $P(z_{ij} = 1)$, the marginal probability of the dependency edge, following Koo et al. (2007):

$A_{ij} = \begin{cases} 0 & \text{if } i = j \\ \exp(f_{ij}) & \text{otherwise} \end{cases}$    (12)

$L_{ij} = \begin{cases} \sum_{i'=1}^{n} A_{i'j} & \text{if } i = j \\ -A_{ij} & \text{otherwise} \end{cases}$    (13)

$\bar{L}_{ij} = \begin{cases} \exp(f_j^r) & \text{if } i = 1 \\ L_{ij} & \text{if } i > 1 \end{cases}$    (14)

$P(z_{ij} = 1) = (1 - \delta_{1,j}) A_{ij} [\bar{L}^{-1}]_{jj} - (1 - \delta_{i,1}) A_{ij} [\bar{L}^{-1}]_{ji}$    (15)

$P(\text{root}(i)) = \exp(f_i^r) [\bar{L}^{-1}]_{i1}$

where $1 \leq i \leq n$ and $1 \leq j \leq n$, $L \in \mathbb{R}^{n \times n}$ is the Laplacian matrix of graph $G$, $\bar{L} \in \mathbb{R}^{n \times n}$ is a variant of $L$ that takes the root node into consideration, and $\delta$ is the Kronecker delta. The key for the calculation to hold is for $L_{ii}$, the minor of the Laplacian matrix $L$ with respect to row $i$ and column $i$, to be equal to the sum of the weights of all directed spanning trees of $G$ rooted at $i$. $P(z_{ij} = 1)$ is the marginal probability of the dependency edge between the i-th and j-th words, and $P(\text{root}(i))$ is the marginal probability of the i-th word being headed by the root of the tree. Details of the proof can be found in Koo et al. (2007).

We denote the marginal probabilities $P(z_{ij} = 1)$ as $a_{ij}$ and $P(\text{root}(i))$ as $a_i^r$. These can be interpreted as attention scores which are constrained to converge to a structured object, in our case a non-projective dependency tree. We update the semantic vector $e_i$ of each word with structured attention:

$p_i = \sum_{k=1}^{n} a_{ki} e_k + a_i^r e_{root}$    (16)

$c_i = \sum_{k=1}^{n} a_{ik} e_k$    (17)

$r_i = \tanh(W_r [e_i, p_i, c_i])$    (18)

where $p_i \in \mathbb{R}^{k_e}$ is the context vector gathered from possible parents of $u_i$, $c_i \in \mathbb{R}^{k_e}$ is the context vector gathered from possible children, and $e_{root}$ is a special embedding for the root node. The context vectors are concatenated with $e_i$ and transformed with weights $W_r \in \mathbb{R}^{k_e \times 3k_e}$ to obtain the updated semantic vector $r_i \in \mathbb{R}^{k_e}$ with rich structural information (see Figure 3).

3.3 Document Model

We build document representations hierarchically: sentences are composed of words and documents are composed of sentences. Composition at the document level also makes use of structured attention in the form of a dependency graph. Dependency-based representations have previously been used for developing discourse parsers (Hayashi et al., 2016; Li et al., 2014) and in applications such as summarization (Hirao et al., 2013).

As illustrated in Figure 4, given a document with $n$ sentences $[s_1, s_2, \cdots, s_n]$, the input for each sentence $s_i$ is a sequence of word embeddings $[u_{i1}, u_{i2}, \cdots, u_{im}]$, where $m$ is the number of tokens in $s_i$. By feeding the embeddings into a sentence-level bi-LSTM and applying the proposed structured attention mechanism, we obtain the updated semantic vectors $[r_{i1}, r_{i2}, \cdots, r_{im}]$. A pooling operation then produces a fixed-length vector $v_i$ for each sentence. Analogously, we view the document as a sequence of sentence vectors $[v_1, v_2, \cdots, v_n]$ which are fed to a document-level bi-LSTM. Application of the structured attention mechanism creates new semantic vectors $[q_1, q_2, \cdots, q_n]$, and another pooling operation yields the final document representation $y$.

Figure 4: Document representation model.

3.4 End-to-End Training

Our model can be trained in an end-to-end fashion since all operations required for computing structured attention and using it to update the semantic vectors are differentiable. In contrast to Kim et al. (2017), training can be done efficiently. The major complexity of our model lies in the computation of the gradients of the inverse matrix.
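Before turning to the gradients, it is worth making the forward pass explicit: Equations (12)-(15) reduce to element-wise operations plus a single matrix inversion. The sketch below is a minimal NumPy illustration under our own conventions (0-based indexing, with index 0 playing the role of the first word); it is not the released implementation, which performs the same operations on the GPU.

```python
import numpy as np

def matrix_tree_marginals(f, f_root):
    """Edge and root marginals of a non-projective dependency tree
    (Equations (12)-(15)), computed via the Matrix-Tree Theorem.
    f:      (n, n) unnormalized scores, f[i, j] for the edge i -> j.
    f_root: (n,)   unnormalized root scores.
    Returns P (n, n) with P[i, j] = P(z_ij = 1) and P_root (n,)."""
    n = f.shape[0]
    A = np.exp(f)
    np.fill_diagonal(A, 0.0)                 # A_ij = 0 when i = j      (Eq. 12)
    L = np.diag(A.sum(axis=0)) - A           # graph Laplacian          (Eq. 13)
    L_bar = L.copy()
    L_bar[0, :] = np.exp(f_root)             # first row <- root scores (Eq. 14)
    L_inv = np.linalg.inv(L_bar)             # the single matrix inverse
    # Kronecker-delta masks (1 - delta_{1,j}) and (1 - delta_{i,1}) of Eq. 15,
    # with index 0 standing in for "the first word".
    not_first_col = np.ones((1, n)); not_first_col[0, 0] = 0.0
    not_first_row = np.ones((n, 1)); not_first_row[0, 0] = 0.0
    P = (not_first_col * A * np.diag(L_inv)[None, :]
         - not_first_row * A * L_inv.T)      # P(z_ij = 1)
    P_root = np.exp(f_root) * L_inv[:, 0]    # P(root(i))
    return P, P_root
```

The attention weights $a_{ij}$ and $a_i^r$ used in Equations (16)-(18) are exactly these marginals, so each sentence (or document) costs one $n \times n$ inversion plus element-wise work.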
Let $A$ denote a matrix depending on a real parameter $x$. Assuming all component functions of $A$ are differentiable and $A$ is invertible for all possible values, the gradient of $A^{-1}$ with respect to $x$ is:

$\frac{dA^{-1}}{dx} = -A^{-1} \frac{dA}{dx} A^{-1}$    (19)

Multiplication of the three matrices and matrix inversion can be computed efficiently on modern parallel hardware architectures such as GPUs. In our experiments, the computation of structured attention takes only 1/10 of the training time.

4 Experiments

In this section we present our experiments for evaluating the performance of our model. Since sentence representations constitute the basic building blocks of our document model, we first evaluate the performance of structured attention on a sentence-level task, namely natural language inference. We then assess the document-level representations obtained by our model on a variety of classification tasks representing documents of different length, subject matter, and language. Our code is available at https://github.com/nlpyang/structured.

4.1 Natural Language Inference

The ability to reason about the semantic relationship between two sentences is an integral part of text understanding. We therefore evaluate our model on recognizing textual entailment, i.e., whether a premise-hypothesis pair is entailing, contradictory, or neutral. For this task we used the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), which contains premise-hypothesis pairs and target labels indicating their relation. After removing sentences with unknown labels, we obtained 549,367 pairs for training, 9,842 for development, and 9,824 for testing.

Sentence-level representations obtained by our model (with structured attention) were used to encode the premise and hypothesis by modifying the model of Parikh et al. (2016) as follows. Let $[x_1^p, \cdots, x_n^p]$ and $[x_1^h, \cdots, x_m^h]$ be the input vectors for the premise and hypothesis, respectively. Application of structured attention yields new vector representations $[r_1^p, \cdots, r_n^p]$ and $[r_1^h, \cdots, r_m^h]$. We then combine the two sentences with inter-sentential attention and apply an average pooling operation:

$o_{ij} = \text{MLP}(r_i^p)^{T} \text{MLP}(r_j^h)$    (20)

$\bar{r}_i^p = \Big[ r_i^p,\ \sum_{j=1}^{m} \frac{\exp(o_{ij})}{\sum_{k=1}^{m} \exp(o_{ik})}\, r_j^h \Big]$    (21)

$\bar{r}_j^h = \Big[ r_j^h,\ \sum_{i=1}^{n} \frac{\exp(o_{ij})}{\sum_{k=1}^{n} \exp(o_{kj})}\, r_i^p \Big]$    (22)

$r^p = \sum_{i=1}^{n} g(\bar{r}_i^p), \qquad r^h = \sum_{j=1}^{m} g(\bar{r}_j^h)$    (23)

where $\text{MLP}(\cdot)$ is a two-layer perceptron with a ReLU activation function. The new representations $r^p$ and $r^h$ are then concatenated and fed into another two-layer perceptron with a softmax layer to obtain the predicted distribution over the labels.

The hidden size of the LSTM was set to 150. The dimension of the semantic vector was 100 and the dimension of the structure vector was 50. We used pretrained 300-D GloVe 840B (Pennington et al., 2014) vectors to initialize the word embeddings.
All parameters (including word embeddings) were updated with Adagrad (Duchi et al., 2011), and the learning rate was set to 0.05. The hidden size of the two-layer perceptron was set to 200, and dropout was used with ratio 0.2. The mini-batch size was 32.

We compared our model (and variants thereof) against several related systems. Results (in terms of 3-class accuracy) are shown in Table 1. Most previous systems employ LSTMs and do not incorporate a structured attention component. Exceptions include Cheng et al. (2016) and Parikh et al. (2016), whose models include intra-attention encoding relationships between words within each sentence (see Equation (2)). It is also worth noting that some models take structural information into account in the form of parse trees (Bowman et al., 2016; Chen et al., 2017). The second block of Table 1 presents a version of our model without an intra-sentential attention mechanism as well as three variants with attention, assuming the structure of word-to-word relations and dependency trees. In the latter case we compare our matrix inversion based model against Kim et al.'s (2017) inside-outside attention model.

Models | Acc | θ
Classifier with handcrafted features (Bowman et al., 2015) | 78.2 | —
300D LSTM encoders (Bowman et al., 2015) | 80.6 | 3.0M
300D Stack-Augmented Parser-Interpreter Neural Net (Bowman et al., 2016) | 83.2 | 3.7M
100D LSTM with inter-attention (Rocktäschel et al., 2016) | 83.5 | 252K
200D Matching LSTMs (Wang and Jiang, 2016) | 86.1 | 1.9M
450D LSTMN with deep attention fusion (Cheng et al., 2016) | 86.3 | 3.4M
Decomposable Attention over word embeddings (Parikh et al., 2016) | 86.8 | 582K
Enhanced BiLSTM Inference Model (Chen et al., 2017) | 88.0 | 4.3M
175D No Attention | 85.3 | 600K
175D Simple intra-sentence attention | 86.2 | 1.1M
100D Structured intra-sentence attention with Inside-Outside | 86.8 | 1.2M
175D Structured intra-sentence attention with Matrix Inversion | 86.9 | 1.1M
Table 1: Test accuracy on the SNLI dataset and number of parameters θ (excluding embeddings). Wherever available we also provide the size of the recurrent unit.

Consistent with previous work (Cheng et al., 2016; Parikh et al., 2016), we observe that simple attention brings performance improvements over no attention. Structured attention further enhances performance. Our own model with tree matrix inversion slightly outperforms the inside-outside model of Kim et al. (2017), overall achieving results in the same ballpark as related LSTM-based models (Chen et al., 2017; Cheng et al., 2016; Parikh et al., 2016).

Models | Speed (Max) | Speed (Avg)
No Attention | 0.0050 | 0.0033
Simple Attention | 0.0057 | 0.0042
Matrix Inversion | 0.0070 | 0.0045
Inside-Outside | 0.1200 | 0.0380
Table 2: Comparison of the speed of different models on the SNLI test set, in seconds per instance. All results were obtained on a GeForce GTX TITAN X (Pascal) GPU.

Table 2 compares the running speed of the models shown in the second block of Table 1. As can be seen, matrix inversion barely increases running time over the simpler attention mechanism and is considerably faster than inside-outside. The latter is 10-20 times slower than our model on the same platform.
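As a closing note on the SNLI setup, the pair-combination step of Equations (20)-(23) is compact enough to state in code. The sketch below is ours rather than the released implementation: mlp and g stand for the two feed-forward networks described above, and the two softmax directions implement the normalizations in Equations (21) and (22).

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combine_pair(Rp, Rh, mlp, g):
    """Inter-sentential attention and pooling (Equations (20)-(23)).
    Rp: (n, k) premise vectors r^p_i; Rh: (m, k) hypothesis vectors r^h_j.
    mlp, g: callables mapping a matrix of row vectors to a matrix of row vectors."""
    O = mlp(Rp) @ mlp(Rh).T                                           # o_ij (Eq. 20)
    Rp_bar = np.concatenate([Rp, softmax(O, axis=1) @ Rh], axis=1)    # Eq. 21
    Rh_bar = np.concatenate([Rh, softmax(O, axis=0).T @ Rp], axis=1)  # Eq. 22
    rp = g(Rp_bar).sum(axis=0)                                        # Eq. 23
    rh = g(Rh_bar).sum(axis=0)
    return np.concatenate([rp, rh])  # passed to the final softmax classifier
```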
Models | Yelp | IMDB | CZ Movies | Debates | θ
Feature-based classifiers | 59.8 | 40.9 | 78.5 | 74.0 | —
Paragraph vector (Tang et al., 2015a) | 57.7 | 34.1 | — | — | —
Convolutional neural network (Tang et al., 2015a) | 59.7 | — | — | — | —
Convolutional gated RNN (Tang et al., 2015a) | 63.7 | 42.5 | — | — | —
LSTM gated RNN (Tang et al., 2015a) | 65.1 | 45.3 | — | — | —
RST-based recursive neural network (Ji and Smith, 2017) | — | — | — | 75.7 | —
75D Hierarchical attention networks (Yang et al., 2016) | 68.2 | 49.4 | 80.8 | 74.0 | 273K
75D No Attention | 66.7 | 47.5 | 80.5 | 73.7 | 330K
100D Simple Attention | 67.7 | 48.2 | 81.4 | 75.3 | 860K
100D Structured Attention (sentence-level) | 68.0 | 48.8 | 81.5 | 74.6 | 842K
100D Structured Attention (document-level) | 67.8 | 48.6 | 81.1 | 75.2 | 842K
100D Structured Attention (both levels) | 68.6 | 49.2 | 82.1 | 76.5 | 860K
Table 4: Test accuracy on four datasets and number of parameters θ (excluding embeddings). For the feature-based classifiers, results on Yelp and IMDB are taken from Tang et al. (2015a), on CZ Movies from Brychcín and Habernal (2013), and on Debates from Yogatama and Smith (2014). Wherever available we also provide the size of the recurrent unit (LSTM or GRU).

4.2 Document Classification

In this section, we evaluate our document-level model on a variety of classification tasks. We selected four datasets which we describe below. Table 3 summarizes some statistics for each dataset.

Dataset | #class | #docs | #s/d | #w/d
Yelp | 5 | 335K | 8.9 | 151.6
IMDB | 10 | 348K | 14.0 | 325.6
CZ Movies | 3 | 92K | 3.5 | 51.2
Debates | 2 | 1.6K | 22.7 | 519.2
Table 3: Dataset statistics; #class is the number of classes per dataset, #docs denotes the number of documents, and #s/d and #w/d represent the average number of sentences and words per document.

Yelp reviews were obtained from the 2013 Yelp Dataset Challenge. This dataset contains restaurant reviews, each associated with human ratings on a scale from 1 (negative) to 5 (positive), which we used as gold labels for sentiment classification. We followed the preprocessing introduced in Tang et al. (2015a) and report experiments on their training, development, and testing partitions (80/10/10).

IMDB reviews were obtained from Diao et al. (2014), who randomly crawled reviews for 50K movies. Each review is associated with user ratings ranging from 1 to 10.

Czech reviews were obtained from Brychcín and Habernal (2013). The dataset contains reviews from the Czech Movie Database,[2] each labeled as positive, neutral, or negative. We include Czech in our experiments since it has more flexible word order compared to English, with non-projective dependency structures being more frequent. Experiments on this dataset use 10-fold cross-validation, following previous work (Brychcín and Habernal, 2013).

[2] http://www.csfd.cz/

Congressional floor debates were obtained from a corpus originally created by Thomas et al. (2006), which contains transcripts of U.S. floor debates in the House of Representatives for the year 2005. Each debate consists of a series of speech segments, each labeled by the vote ("yea" or "nay") cast for the proposed bill by the speaker of that segment. We used the pre-processed corpus of Yogatama and Smith (2014).[3]

[3] http://www.cs.cornell.edu/~ainur/data.html

Following previous work (Yang et al., 2016), we only retained words appearing more than five times when building the vocabulary and replaced words with lower frequencies with a special UNK token. Word embeddings were initialized by training word2vec (Mikolov et al., 2013) on the training and validation splits of each dataset. In our experiments, we set the word embedding dimension to 200 and the hidden size of the sentence-level and document-level LSTMs to 100 (the dimensions of the semantic and structure vectors were set to 75 and 25, respectively). We used a mini-batch size of 32 during training, and documents of similar length were grouped in one batch.
Parameters were optimized with Adagrad (Duchi et al., 2011) and the learning rate was set to 0.05. We used L2 regularization for all parameters except word embeddings, with the regularization constant set to 1e-4. Dropout was applied on the input and output layers with a dropout rate of 0.3.

Our results are summarized in Table 4. We compared our model against several related models covering a wide spectrum of representations, including word-based ones (e.g., paragraph vector and CNN models) as well as hierarchically composed ones (e.g., a CNN or LSTM provides a sentence vector and then a recurrent neural network combines the sentence vectors to form a document-level representation for classification). Previous state-of-the-art results on the three review datasets were achieved by the hierarchical attention network of Yang et al. (2016), which models the document hierarchically with two GRUs and uses an attention mechanism to weigh the importance of each word and sentence. On the debates corpus, Ji and Smith (2017) obtained the best results with a recursive neural network model operating on the output of an RST parser. Table 4 presents three variants[4] of our model: one with structured attention on the sentence level, another with structured attention on the document level, and a third which employs attention on both levels. As can be seen, the combination is beneficial, achieving the best results on three out of four datasets. Furthermore, structured attention is superior to the simpler word-to-word attention mechanism, and both types of attention bring improvements over no attention. The structured attention approach is also very efficient, taking only 20 minutes for one training epoch on the largest dataset.

[4] We do not report comparisons with the inside-outside approach on document classification tasks due to its prohibitive computation cost, which leads to 5 hours of training for one epoch.

4.3 Analysis of Induced Structures

To gain further insight into structured attention, we inspected the dependency trees it produces. Specifically, we used the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to extract the maximum spanning tree from the attention scores. We report various statistics on the characteristics of the induced trees across different tasks and datasets. We also provide examples of tree output, in an attempt to explain how our model uses dependency structures to model text.
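The extraction step itself is standard maximum-spanning-arborescence decoding. The sketch below is one possible implementation, not the code we released: it builds a weighted directed graph over text units plus an artificial root node and runs Edmonds' algorithm as provided by NetworkX; the wrapper function and the graph construction are our own.

```python
import networkx as nx

def extract_tree(attn, root_attn):
    """Chu-Liu-Edmonds decoding of a dependency tree from attention marginals.
    attn[i][j]   -- marginal of the edge i -> j (i is the head);
    root_attn[i] -- marginal of unit i being the root."""
    n = len(root_attn)
    G = nx.DiGraph()
    for i in range(n):
        G.add_edge("ROOT", i, weight=float(root_attn[i]))
        for j in range(n):
            if i != j:
                G.add_edge(i, j, weight=float(attn[i][j]))
    # "ROOT" has no incoming edges, so the maximum spanning arborescence
    # is necessarily rooted there.
    tree = nx.algorithms.tree.branchings.maximum_spanning_arborescence(G, attr="weight")
    return list(tree.edges())
```

Using the marginal probabilities directly as edge weights maximizes their sum; one could equally use log-marginals as weights to maximize the probability of the tree as a whole.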
Sentence Trees  We compared the dependency trees obtained from our model with those produced by a state-of-the-art dependency parser trained on the English Penn Treebank. Table 5 presents various statistics on the depth of the trees produced by our model on the SNLI test set and by the Stanford dependency parser (Manning et al., 2014).

 | Parser | Attention
Projective | — | 51.4%
Height | 8.99 | 5.78
Nodes at depth 1 | 9.8% | 8.4%
Nodes at depth 2 | 15.0% | 19.7%
Nodes at depth 3 | 12.8% | 22.4%
Nodes at depth 4 | 12.5% | 23.4%
Nodes at depth 5 | 12.0% | 14.4%
Nodes at depth 6 | 10.3% | 4.5%
Same Edges | 38.7%
Table 5: Descriptive statistics for dependency trees produced by our model and the Stanford parser (Manning et al., 2014) on the SNLI test set.

As can be seen, the induced dependency structures are simpler than those obtained from the Stanford parser. The trees are generally less deep (their height is 5.78 compared to 8.99 for the Stanford parser), with the majority being of depth 2-4. Almost half of the induced trees have a projective structure, although there is nothing in the model to enforce this constraint. We also calculated the percentage of head-dependency edges that are identical between the two sets of trees. Although our model is not exposed to annotated trees during training, a large number of edges agree with the output of the Stanford parser.

Figure 5: Dependency trees induced by our model on the SNLI test set, for the premise-hypothesis pairs "Three men drink at a reflective bar" / "Three men are socializing during happy hour" and "Workers at Basking Robbins are filling orders" / "Workers filling orders at Basking Robbins".

Figure 5 shows examples of dependency trees induced on the SNLI dataset. Although the model is trained without ever being exposed to a parse tree, it is able to learn plausible dependency structures via the attention mechanism. Overall we observe that the induced trees differ from linguistically motivated ones in the types of dependencies they create, which tend to be of shorter length. The dependencies obtained from structured attention are more direct, as shown in the first premise sentence in Figure 5, where the words "at" and "bar" are directly connected to the verb "drink". This is perhaps to be expected since the attention mechanism uses the dependency structures to collect information from other words, and direct links are more effective for this purpose.

Document Trees  We also used the Chu-Liu-Edmonds algorithm to obtain document-level dependency trees. Table 6 summarizes various characteristics of these trees.

 | Yelp | IMDB | CZ Movies | Debates
Projective | 79.6% | 74.9% | 82.8% | 62.4%
Height | 2.81 | 3.34 | 1.50 | 3.58
Nodes at depth 2 | 15.1% | 13.6% | 25.7% | 12.8%
Nodes at depth 3 | 55.6% | 46.8% | 57.1% | 30.2%
Nodes at depth 4 | 22.3% | 32.5% | 11.3% | 40.8%
Nodes at depth 5 | 3.2% | 4.1% | 5.8% | 14.8%
Table 6: Descriptive statistics for induced document-level dependency trees across datasets.

For most datasets, document-level trees are not very deep: they mostly contain nodes of depth up to 3. This is not surprising as the documents are relatively short (see Table 3), with the exception of the debates, which are longer and whose induced trees are more complex. The fact that most documents exhibit simple discourse structures is further corroborated by the large number (over 70%) of projective trees induced on the Yelp, IMDB, and CZ Movies datasets. Unfortunately, our trees cannot be directly compared with the output of a discourse parser, which typically involves a segmentation process splitting sentences into smaller units. Our trees are constructed over entire sentences, and there is currently no mechanism in the model to split sentences into discourse units.
To make a parody so that it ends up being even more embarrassing than the original movie is not exactly trivial, this I have been convinced of several times already (Bullshit, Scary Movie...). 2 Jen?e Top Secret? But Top Secret? 3 Nevím, jestli to v?bec m??u ?íct, ale mo?ná tenhle film p?ekonal i skv?lé ?havé výst?ely! I don't know if I can actually say it, but maybe this movie has scored even better than the fantastic Hot Shots! 4 Bo?e, to jsem se nasmála! God, I laughed a lot! 5 Nutno uznat, ?e je to docela síla, kdy? si Ameri?ané d?lali p?ed revolucí takovou srandu z N?mc?. I must admit, that it's pretty cool, when Americans were making so much fun of Germans before the revolution. 6 Ty nará?ky byly vá?n? skv?lé... Jedna z nejlep?ích parodií, co jsem kdy vid?la! The innuendos were really great... One of the best parodies I have ever seen! 3 4 5 61 2 Figure 6: Induced dependency trees for three docu- ments taken from Yelp (a,b) and the Czech Movies dataset (c). English translations are in italics. into discourse units. Figure 6 shows examples of document-level trees taken from Yelp and the Czech Movie dataset. In the first tree, most edges are examples of the “elab- oration” discourse relation, i.e., the child presents 72 additional information about the parent. The sec- ond tree is non-projective, the edges connecting sen- tences 1 and 4 and 3 and 5 cross. The third review, perhaps due to its colloquial nature, is not entirely coherent. However, the model manages to link sen- tences 1 and 3 to sentence 2, i.e., the movie being discussed; it also relates sentence 6 to 4, both of which express highly positive sentiment. 5 Conclusions In this paper we proposed a new model for rep- resenting documents while automatically learning rich structural dependencies. Our model normalizes intra-attention scores with the marginal probabilities of a non-projective dependency tree based on a ma- trix inversion process. Each operation in this pro- cess is differentiable and the model can be trained efficiently end-to-end, while inducing structural in- formation. We applied this approach to model doc- uments hierarchically, incorporating both sentence- and document-level structure. Experiments on sen- tence and document modeling tasks show that the representations learned by our model achieve com- petitive performance against strong comparison sys- tems. Analysis of the induced tree structures re- vealed that they are meaningful, albeit different from linguistics ones, without ever exposing the model to linguistic annotations or an external parser. Directions for future work are many and varied. Given appropriate training objectives (Linzen et al., 2016), it should be possible to induce linguistically meaningful dependency trees using the proposed at- tention mechanism. We also plan to explore how document-level trees can be usefully employed in summarization, e.g., as a means to represent or even extract important content. Acknowledgments The authors gratefully ac- knowledge the support of the European Research Council (award number 681760). We also thank the anonymous TACL reviewers and the action editor whose feedback helped improve the present paper, members of EdinburghNLP for helpful discussions and suggestions, and Barbora Skarabela for translat- ing the Czech document for us. References Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceed- ings of the ICLR Conference. James K. Baker. 1979. 
Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics 34(1):1-34.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the EMNLP Conference. pages 2212-2218.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the EMNLP Conference. pages 632-642.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the ACL Conference. pages 1466-1477.

Tomáš Brychcín and Ivan Habernal. 2013. Unsupervised improving of sentiment analysis using global target context. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. pages 122-128.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the ACL Conference. pages 1657-1668.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In Proceedings of the IJCAI Conference. pages 2754-2760.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the EMNLP Conference. pages 551-561.

Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On shortest arborescence of a directed graph. Scientia Sinica 14(10):1396.

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. In Proceedings of the ICLR Conference.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the ACM SIGKDD Conference. pages 193-202.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121-2159.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B 71(4):233-240.

Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the ACL Conference. pages 60-68.

Alex Graves, Abdel-Rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE ICASSP Conference. pages 6645-6649.

Katsuhiko Hayashi, Tsutomu Hirao, and Masaaki Nagata. 2016. Empirical comparison of dependency conversions for RST discourse trees. In Proceedings of the Annual Meeting of SIGDIAL. page 128.

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, and Masaaki Nagata. 2013. Single-document summarization as a tree knapsack problem. In Proceedings of the EMNLP Conference. pages 1515-1520.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
Yangfeng Ji and Noah Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the ACL Conference.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of the ICLR Conference.

Terry Koo, Amir Globerson, Xavier Carreras Pérez, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of the EMNLP Conference. pages 141-150.

Alan Lee, Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, and Bonnie Webber. 2006. Complexity of dependencies in discourse: Are dependencies in discourse more complex than in syntax? In Proceedings of the International Workshop on Treebanks and Linguistic Theories. page 12.

Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie Li. 2014. Text-level discourse dependency parsing. In Proceedings of the ACL Conference. pages 25-35.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. In Proceedings of the ACL Conference. pages 997-1006.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521-535.

Yang Liu and Mirella Lapata. 2017. Learning contextually informed representations for linear-time discourse parsing. In Proceedings of the EMNLP Conference. pages 1300-1309.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse 8(3):243-281.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the ACL Conference (System Demonstrations). pages 55-60.

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies. pages 121-132.

Igor A. Melčuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

Mohsen Mesgar and Michael Strube. 2015. Graph-based coherence modeling for assessing readability. In Proceedings of the 4th Joint Conference on Lexical and Computational Semantics. pages 309-318.

Thomas Meyer and Bonnie Webber. 2013. Implicitation of discourse connectives in (machine) translation. In Proceedings of the Workshop on Discourse in Machine Translation. pages 19-26.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the NIPS Conference. pages 3111-3119.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the EMNLP Conference. pages 1400-1409.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the EMNLP Conference. pages 2249-2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the EMNLP Conference. pages 1532-1543.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of the ICLR Conference.

Duyu Tang, Bing Qin, and Ting Liu. 2015a. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the EMNLP Conference. pages 1422-1432.

Duyu Tang, Bing Qin, and Ting Liu. 2015b. Learning semantic representations of users and products for document level sentiment classification. In Proceedings of the ACL Conference. pages 1014-1023.

Louis Tesnière. 1959. Éléments de Syntaxe Structurale. Editions Klincksieck.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the EMNLP Conference. pages 327-335.

William Thomas Tutte. 1984. Graph theory.

Suzan Verberne, Lou Boves, Nelleke Oostdijk, and Peter-Arno Coppen. 2007. Discourse-based answering of why-questions. Traitement Automatique des Langues, Discours et Document: Traitements Automatiques 47(2):21-41.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the NAACL Conference. pages 1442-1451.

Florian Wolf and Edward Gibson. 2006. Coherence in Natural Language: Data Structures and Applications. The MIT Press.

Pengtao Xie and Eric P. Xing. 2013. Integrating document clustering and topic modeling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. pages 694-703.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. pages 2048-2057.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the NAACL Conference. pages 1480-1489.

Dani Yogatama and Noah A. Smith. 2014. Linguistic structured sparsity in text categorization. In Proceedings of the ACL Conference. pages 786-796.