Hierarchical Viewpoint Discovery from Tweets Using Bayesian Modelling

Lixing Zhu (a), Yulan He (b), Deyu Zhou (a,*)

(a) School of Computer Science and Engineering, Southeast University, China
(b) Department of Computer Science, University of Warwick, UK

Abstract

When users express their stances towards a topic in social media, they might elaborate their viewpoints or reasoning. Oftentimes, viewpoints expressed by different users exhibit a hierarchical structure. Detecting such hierarchical viewpoints therefore offers better insight into public opinion. In this paper, we propose a novel Bayesian model for hierarchical viewpoint discovery from tweets. Driven by the observation that a viewpoint expressed in a tweet can be regarded as a path from the root to a leaf of a hierarchical viewpoint tree, the assignment of the relevant viewpoint topics is assumed to follow two nested Chinese restaurant processes. Moreover, opinions in text are often expressed in semantically non-decomposable multi-word terms or phrases, such as ‘economic recession’. Hence, a hierarchical Pitman-Yor process is employed as a prior for modelling the generation of phrases of arbitrary length. Experimental results on two Twitter corpora demonstrate the effectiveness of the proposed Bayesian model for hierarchical viewpoint discovery.

Keywords: Natural language processing, Opinion mining, Bayesian modelling

1. Introduction

Stance classification aims to predict one’s stance in a two-sided debate or on a controversial hot topic, and has been intensively studied in recent years (Hasan and Ng, 2013; Elfardy et al., 2015). However, beyond detecting one’s stance, we are more interested in figuring out the reasons or key viewpoints behind a person’s support for or opposition to an issue of interest. Moreover, viewpoints expressed by different users can be related and exhibit a hierarchical structure. Figure 1 illustrates an example hierarchical viewpoint tree, in which both User A and User B are supporters of Trump. However, the former simply expressed his support for Trump without mentioning any reasons, while the latter stated the reason that Trump is a charismatic leader. We can also see that both User C and User D support Trump due to his economic policy, but for different reasons (‘higher employment rate’ vs. ‘trade protection’).
Such a hierarchical viewpoint tree enables a better understanding of user opinions and allows a quick glimpse of the reasons behind users’ stances.

[Figure 1: An example of a hierarchical viewpoint tree on the topic “Trump run for election” from Twitter. The root splits into ‘Support’ and ‘Oppose’ stances; ‘Support’ branches into viewpoints such as ‘Personal charisma’ and ‘Economic policy’, the latter splitting further into ‘Higher employment rate’ and ‘Trade protection’, while ‘Oppose’ includes ‘Racism’.]

Mining hierarchical viewpoints from tweets is challenging for the reasons below: (1) unlike a stance, which is simply either ‘Support’ or ‘Oppose’, the hierarchical structure of viewpoints is unknown a priori; (2) people tend to express their opinions in many different ways, often in informal or ungrammatical language; (3) opinion expressions often contain multi-word phrases, for example, ‘economic recession’ and ‘economic growth’. Simply decomposing such phrases into unigrams may lose their original semantic meaning; applying a bag-of-words topic model might then wrongly group them under the same topic due to the shared word ‘economic’.

(*) Corresponding author. Fax: 8602552090861. Email addresses: zhulixing@seu.edu.cn (Lixing Zhu), y.he@cantab.net (Yulan He), d.zhou@seu.edu.cn (Deyu Zhou).

To tackle these challenges, in this paper we propose a Bayesian model, called the Hierarchical Opinion Phrase (HOP) model, for hierarchical viewpoint discovery from text. In this model, the root node (level 1) contains the topic of interest (e.g., ‘Trump run for president’) and the level-2 topics indicate stance (either ‘Support’ or ‘Oppose’), while topics at level 3 and below contain viewpoints under the different stances. Assuming that the viewpoints in each tweet are generated from a path from the root to a leaf of a hierarchical viewpoint tree, the assignment of viewpoint topics can be regarded as following two nested Chinese Restaurant Processes (nCRPs). Furthermore, a hierarchical Pitman-Yor process is employed as a prior to model the generation of phrases of arbitrary length. We have also explored various approaches for incorporating prior information, such as sentiment lexicons and hashtags, in order to improve stance classification accuracy. To the best of our knowledge, our work is the first attempt at hierarchical viewpoint detection. The proposed approach has been evaluated on two Twitter corpora. Experimental results demonstrate the effectiveness of our approach in comparison with existing approaches for hierarchical topic detection or viewpoint discovery. Our source code is made available at https://github.com/somethingx86/HOP.

2. Related Work

In this section, we give a brief review of four related lines of research: opinion mining based on topic models, hierarchical topic extraction, topical phrase models, and deep learning for sentiment analysis.

2.1. Opinion Mining Based on Topic Models

Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have proven effective for opinion mining. Lin and He (2009) proposed a Joint Sentiment Topic (JST) model that extends the standard LDA by adding a sentiment layer on top of the topic layer to allow the extraction of positive and negative topics.
A variant of JST called reverse-JST was studied by Lin et al. (2010), in which the generation of sentiments and topics is reversed. Kawamae (2012) separated words into aspect words, sentiment words and other words; aspect words were generated conditioned on latent aspects, which were in turn sampled from sentiment-associated topics. Kim et al. (2013) modified this model by placing a recurrent Chinese Restaurant Process (rCRP) prior on the aspect variable. They mined a hierarchy of aspects from product reviews and associated each aspect with a positive or negative sentiment.

In addition to online reviews, LDA-based models have also been applied to debate forums and Twitter. Lim and Buntine (2014) proposed a Twitter opinion topic model which makes use of target-opinion pairs. Trabelsi and Zaïane (2014) designed a joint topic-viewpoint model by assuming that the distribution over viewpoints is associated with latent topics. Thonet et al. (2016) treated nouns as topic words while adjectives, verbs and adverbs were treated as opinion words for topic-specific opinion discovery. Vilares and He (2017) focused on detecting perspectives in political debates. They modelled topics and their associated perspectives as latent variables and generated words associated with topics or perspectives by following different routes. However, none of the aforementioned models is able to generate a hierarchy of opinions.

2.2. Hierarchical Topic Extraction

In general, there are two types of approaches to extracting topical hierarchies. The first is based on probabilistic graphical models. For example, the Hierarchical LDA (HLDA) model was proposed for discovering topical hierarchies from the abstracts of scientific papers (Blei et al., 2010). In this model, each document is assumed to be attached to a path in which each level is a topic. The path induces a set of topics, and words are generated in the same way as in LDA. The allocation of paths follows an nCRP prior. Kim et al. (2012) argued that topics should be distributed over all the nodes of the hierarchy, which they achieved by placing an rCRP prior on the tree. Paisley et al. (2015) proposed a nonparametric model called the nested hierarchical Dirichlet process to allow groups to be shared among clusters; their work extended the nCRP by incorporating a hierarchical Dirichlet process.

The second type of approach to hierarchical topic extraction is based on frequent pattern mining. An early example is the frequent itemset-based hierarchical clustering model (Fung et al., 2003), which simply clustered documents according to their shared items. Wang et al. (2013) developed a phrase mining framework called CATHY (Constructing A Topical HierarchY). It first builds a term co-occurrence network using a frequent pattern mining method of the kind commonly used in association rule mining. The initial network corresponds to the root topic. The network is then clustered into subtopic networks in a probabilistic way, by assuming that each term co-occurrence was generated by a topic. The process is repeated until no further subtopics can be found.

2.3. Topical Phrase Models

In order to address the widespread occurrence of semantically non-decomposable phrases, various topical phrase models have been proposed in the literature.
Wallach (2006) made the first attempt to extend LDA with a hierarchical Dirichlet language model, yielding the Bigram Topic Model (BTM), in which each word is generated from a distribution over the vocabulary following a two-level hierarchical Dirichlet process. Wang et al. (2007) extended BTM by adding a switch variable at each word position to decide whether to begin a new n-gram or to continue the previously identified n-gram. El-Kishky et al. (2014) developed a pipeline approach called TopMine. It first extracts frequent phrases using a frequent pattern mining method; an LDA-based model is then learned in which words in the same phrase are generated from the same topic. He (2016) proposed a topical phrase model which extends TopMine by using the hierarchical Pitman-Yor Process (HPYP) to model the generation of words in a phrase.

2.4. Deep Learning for Sentiment Analysis

Recent years have seen a surge of interest in developing deep learning approaches for sentiment analysis. Many of them have been applied to sentiment classification on product reviews (Chen et al., 2016; Gui et al., 2017), news articles (Lai et al., 2015; Nguyen et al., 2017) as well as tweets (Ghiassi et al., 2013; Dong et al., 2014). Different neural network architectures have been explored, including Convolutional Neural Network (CNN) models (dos Santos and Gatti, 2014; Severyn and Moschitti, 2015a,b), Long Short-Term Memory (LSTM) networks (Tang et al., 2015; Wang et al., 2016) and models with attention mechanisms (Ren et al., 2016; Yang et al., 2016). However, these models rely on annotated datasets in which each document is either labelled with its sentiment class or annotated with more fine-grained opinion targets/words for training. As such, they cannot be used for hierarchical opinion discovery in the absence of annotated data.

[Figure 2: Graphical models of (a) HLDA and (b) the proposed Hierarchical Opinion Phrase (HOP) model. Boxes are plate notation representing replicates.]

Table 1: Notations used in the article.

  Symbol      Description
  β_k         Topic k, which is a distribution over the vocabulary
  c_{d,l}     The l-th level, whose value indexes a topic and follows a CRP
  c_d         The path in the tree for tweet d, which follows the nCRP
  θ_d         Distribution over the levels for tweet d
  α           Parameter of the Dirichlet distribution
  γ           Concentration parameter of the CRPs
  η           Parameter of the symmetric Dirichlet distribution
  λ           Parameter of the Bernoulli distribution
  w_{d,n}     The token in tweet d at position n (HLDA)
  z_{d,n}     Level allocation for token w_{d,n} (HLDA)
  w_{d,n,u}   The token in tweet d, phrase n, position u (HOP)
  z_{d,n,u}   Level allocation for token w_{d,n,u} (HOP)
  G_0         Base distribution for the HPYPs
  G^k         Topic k, which is an HPYP
  G^k_u       The PYP that generates the u-th token in phrases of topic k

2.5. Summary

Our model is partly inspired by HLDA. While the root-level topics in HLDA mostly contain background words, our model is able to extract the key topic of interest from tweets at the root level. Also, the level-2 topics in our model are constrained to be stance-related topics, allowing the incorporation of prior information, as will be shown in the experiments section.
Furthermore, we incorporate the hierarchical Pitman-Yor Process (HPYP) as the prior governing the generation of multi-word phrases, and hence the proposed model can generate hierarchical viewpoints with better interpretability.

3. Hierarchical Opinion Phrase (HOP) Model

In this section, we propose the Hierarchical Opinion Phrase (HOP) model, which learns a viewpoint hierarchy from text while modelling the generation of phrases at the same time. Before presenting the details of HOP, we first describe the HLDA model. The notations used in our model and HLDA are summarized in Table 1.

HLDA (Blei et al., 2010), illustrated in Figure 2(a), assumes that each topic is tied to a node in a tree and that documents are generated by first selecting a path in the tree, then choosing a topic at each level of the path, and finally drawing words from the assigned topics. The variables c in Figure 2(a) are random variables indexing topics β. The per-tweet path c_d = {c_{d,1}, c_{d,2}, ..., c_{d,L}} follows the nCRP, whose values are determined by the seating process of the associated tweet, described below when we introduce the nCRP. The arrows linking the c variables indicate the constraint that a lower-level node on the path can only take the index of a topic in the restaurant pointed to by the upper level; that is, in the nCRP of the generative process a tweet only counts those tweets whose upper level takes the same value.

One problem with applying HLDA to viewpoint discovery is its unconstrained number of topics and the mixing of topics under different stances. The problem is aggravated when the corpus is noisy. We modify HLDA by placing a root topic shared by all documents to generate common words. Since the number of stances is fixed in our data (‘Support’ or ‘Oppose’), a stance latent variable with only two possible values is placed at level 2. The stance level will therefore hopefully separate two sets of opposing viewpoints. Since the number of level-2 topics is fixed at 2, it also becomes possible to incorporate side information into the model, such as hashtags indicating stances, as will be shown in the Experiments section. Another problem with the original HLDA is that it operates under the bag-of-words assumption. As such, its topic results are less interpretable, since many phrases are not semantically decomposable. We therefore propose to generate phrases from the modified HLDA model by incorporating the HPYP into the generative process. HPYP has previously been explored for single-level topical phrase extraction in newswire stories and clinical documents (Lindsey et al., 2012; He, 2016), but it has never been explored for hierarchical topical phrase extraction.

For comparison, the HOP model is illustrated in Figure 2(b), in which the root topic (level-1 topic) contains the topic of interest shared across all documents, and level 2 is a stance layer with only two possible topics, either ‘Support’ or ‘Oppose’. Topics at level 3 and below capture viewpoints under the different stances.

The proposed approach assumes that phrases have been identified prior to model learning. Phrase extraction can be done by many different approaches. In this paper, we extract phrases from data using an open-source toolkit, gensim.models.phrases (http://radimrehurek.com/gensim/models/phrases.html#id1). It first discovers candidate phrases based on word collocation patterns, then transforms the phrases into distributed representations, and finally filters out irrelevant phrases (Mikolov et al., 2013).
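To make this preprocessing step concrete, the following is a minimal sketch of collocation-based phrase detection with Gensim's Phrases model. The example tweets and the min_count/threshold settings are illustrative assumptions; the paper does not report the exact configuration used.

```python
# Minimal sketch of collocation-based phrase detection with Gensim.
# The toy corpus and parameter values are assumptions for illustration.
from gensim.models.phrases import Phrases

tokenised_tweets = [
    ["brexit", "would", "cause", "economic", "recession"],
    ["vote", "remain", "to", "avoid", "economic", "recession"],
    ["brexit", "would", "hurt", "workers", "rights"],
]  # in practice, the full pre-processed tweet collection

# Phrases scores adjacent word pairs by their collocation statistics; pairs
# above the threshold are merged into tokens such as 'economic_recession'.
bigram = Phrases(tokenised_tweets, min_count=2, threshold=1.0)
phrased = [bigram[tweet] for tweet in tokenised_tweets]
print(phrased[0])  # e.g. ['brexit', 'would', 'cause', 'economic_recession']
```

The merged tokens (e.g. 'economic_recession') are then treated as single phrase words by the HOP model, so that phrase detection stays decoupled from topic inference.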
In the following subsections we discuss the HOP model in more detail.

3.1. Generative Process

Suppose there is an L-level hierarchical opinion tree T, as shown in Figure 1, where L is fixed, and each tweet contains L latent topics corresponding to a path from the root node to a leaf node. For example, for the tweet “I am impressed with your patriotism and honesty, there needs to be more like you on capitol hill.”, the hierarchical topics are [Trump run for election, Support, Personal charisma]. The root node c_1 of T is shared by all tweets in the collection. The second level of T is limited to two topics corresponding to the two stances, ‘Support’ or ‘Oppose’, and is assumed to follow a Bernoulli distribution parameterized by λ. The values of the lower levels follow an nCRP prior, which captures the unbounded nature of viewpoints. We use c_d = {c_{d,1}, c_{d,2}, ..., c_{d,L}} to denote the path assigned to the d-th tweet. The prior for each level can therefore be expressed as

c_{d,2} ~ Bernoulli(λ),
c_{d,l} ~ CRP(γ, c_{d,l−1}, ..., c_{d,2}).

nCRP. We briefly describe the nCRP (Blei et al., 2010) here. The nCRP is employed as the prior for the assignment of topics, organizing the topics into a tree topology. As illustrated in Figure 3, tweets are analogous to customers and topics are analogous to tables. Assume the n-th customer enters the root restaurant. He will choose an existing table β_i with probability proportional to the number of customers already sitting there, or a new table with probability proportional to the concentration parameter γ, that is,

p(occupied table i | other customers) = n_i / (n − 1 + γ),
p(new table | other customers) = γ / (n − 1 + γ),

where n_i is the number of customers seated at table β_i, n is the total number of customers including the present one, and γ is the concentration parameter, normally set to 0.5, which controls how likely the customer is to sit at a new table. After choosing table β_i, the customer proceeds to choose a table in the lower-level restaurant pointed to by table β_i. This process continues until the customer reaches depth L. As a result, a path composed of tables (i.e., topics) from L different levels is induced, which forms the L hierarchical topics.
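As an illustration of the seating probabilities above, here is a small sketch (not the authors' implementation) that samples a table assignment for a new customer in a single restaurant.

```python
import random

def sample_table(table_counts, gamma=0.5):
    """Sample a table for the n-th customer of a single CRP restaurant.

    table_counts[i] holds the number of customers already seated at table i;
    a return value of len(table_counts) means a new table is opened.
    """
    n = sum(table_counts) + 1  # total customers, including the new one
    weights = [n_i / (n - 1 + gamma) for n_i in table_counts]
    weights.append(gamma / (n - 1 + gamma))  # probability of a new table
    return random.choices(range(len(weights)), weights=weights)[0]

# Example: three occupied tables seating 3, 1 and 2 customers.
table = sample_table([3, 1, 2], gamma=0.5)
```

In the nCRP this step is applied recursively: the table chosen at level l points to the restaurant used at level l+1, so repeating it down the tree yields a full path.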
HPYP. We detect phrases based on word collocations, so that phrase detection is separated from topic inference. In HOP, phrases are assigned the same topic if they are allocated the same level and their tweets happen to share the same table at that level. We use an HPYP to model the generation of phrases assigned the same topic. The HPYP was first proposed by Teh (2006) and has proven to be among the best smoothing methods for n-gram language models. In an HPYP, the distribution of the first word is generated from a PYP defined as

G_1 ~ PYP(a_0, b_0, G_0),

where G_0 is a uniform distribution over a fixed vocabulary, a_0 is the discount parameter governing the power-law behaviour of the generated distribution, and b_0 is the concentration parameter controlling the amount of variability of G_1 around G_0. The distribution of the second word is generated using G_1 as the base distribution, G_2 ~ PYP(a_1, b_1, G_1). The process continues until the end of the phrase. The generative process of drawing words from this prior can be simulated by the generalised CRP (Pitman et al., 2002). A restaurant corresponds to each G_u, where each table is served a dish whose value is chosen from the base distribution G_{u−1}. The first customer sits at the first table; the (n+1)-th customer either chooses an occupied table with probability proportional to the number of customers already sitting there and takes the value of the dish on that table, or chooses a new table with probability proportional to a constant parameter and orders a dish from the base distribution. This process continues until the proxy customer sits at an existing table or there is no parent restaurant. As such, the probability of the u-th word given the seating arrangement is

p(w | Λ_u) = (C_{uw} − a_{u−1} T_{uw}) / (C_{u·} + b_{u−1}) + ((a_{u−1} T_{u·} + b_{u−1}) / (C_{u·} + b_{u−1})) × p(w | Λ_{u−1}),

where C_{uw} is the number of customers having dish w in restaurant u, T_{uw} denotes the number of tables serving dish w in restaurant u, C_{u·} = Σ_w C_{uw} and T_{u·} = Σ_w T_{uw}.

It is worth noting that the restaurant setup here is different from that of the nCRP. In the nCRP, each tweet is a customer and each topic is a table; a tweet is assigned L topics (or tables) from the root to a leaf. In the HPYP for phrase generation, each word w is a customer, and a restaurant u is the context of the word. For example, for an n-gram phrase, the context of the n-th word is its preceding n − 1 words.

[Figure 3: Illustration of the nested Chinese Restaurant Process (nCRP). Each circle represents a table which points to a unique restaurant, denoted as a rectangle. Each customer (or tweet) first chooses a table at the upper level, then follows the link pointed to by that table to reach a lower-level restaurant and chooses another table there.]

Based on the above description, the generative process of HOP is given below.

• For each topic k ∈ {1, 2, 3, ..., ∞}:
  – G^k_1 ~ PYP(a_0, b_0, G_0)
  – G^k_2 ~ PYP(a_1, b_1, G^k_1)
    ...
  – G^k_U ~ PYP(a_{U−1}, b_{U−1}, G^k_{U−1})
• Set c_1 to be the root restaurant
• For each tweet d ∈ {1, 2, 3, ..., D}:
  – Select the level-2 topic c_{d,2} ~ Bernoulli(λ)
  – For each level l ∈ {3, ..., L}, select its corresponding topic c_{d,l} ~ CRP(γ, c_{d,l−1}, ..., c_{d,2})
  – Draw a distribution over levels θ_d ~ Dirichlet(α)
  – For each phrase n ∈ {1, 2, ..., N_d}:
    * For each word u ∈ {1, 2, ..., U_{d,n}}:
      · If it is the first word in the phrase (u = 1):
        assign a level z_{d,n,u} | θ_d ~ Discrete(θ_d), and
        draw a word w_{d,n,u} | {z_{d,n,u}, c_d, G} ~ Discrete(G^{c_d[z_{d,n,u}]}_1)
      · Else:
        set z_{d,n,u} = z_{d,n,u−1}, and
        draw a word w_{d,n,u} | {z_{d,n,u}, c_d, G} ~ Discrete(G^{c_d[z_{d,n,u}]}_u)

Here, c_d[z_{d,n,u}] denotes the z_{d,n,u}-th component of the vector c_d, and G_0 is a uniform distribution over a fixed vocabulary W of V words: for all w ∈ W, G_0(w) = 1/V. (Note that we slightly abuse notation in using n to index phrases.) Since phrases have already been identified prior to hierarchical opinion extraction, the boundaries of phrases are observed and need not be sampled from data.
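Before turning to inference, the HPYP predictive probability p(w | Λ_u) above can be sketched in code. The sketch below is a simplified illustration under an assumed representation of the seating statistics; a full sampler would also have to track and update the table assignments themselves.

```python
def hpyp_word_prob(w, restaurant_chain, a, b, V):
    """Recursively compute p(w | seating arrangement) under an HPYP.

    restaurant_chain lists the seating statistics of restaurants 1..u for a
    word at position u: each entry holds 'C' and 'T' dicts (customers and
    tables per dish) plus totals 'C_total' and 'T_total'. a[i] and b[i] are
    the discount/concentration parameters of restaurant i+1; V is the
    vocabulary size, so the root base distribution G0 assigns 1/V to w.
    """
    prob = 1.0 / V  # base case: the parent of restaurant 1 is the uniform G0
    for i, stats in enumerate(restaurant_chain):
        c_w = stats["C"].get(w, 0)
        t_w = stats["T"].get(w, 0)
        c_tot, t_tot = stats["C_total"], stats["T_total"]
        prob = ((c_w - a[i] * t_w) / (c_tot + b[i])
                + (a[i] * t_tot + b[i]) / (c_tot + b[i]) * prob)
    return prob
```

The loop unrolls the recursion from the root towards restaurant u, exactly mirroring the back-off structure of the equation above.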
3.2. Inference and Parameter Estimation

Given the observed variables w = {w_1, w_2, ..., w_D}, our goal is to infer the hidden variables c and z using the posterior distribution p(c, z | w, λ, γ, α, G_0, Ω), where Ω = {a_0, b_0, ..., a_U, b_U} denotes the hyper-parameters of the HPYPs. Since exact inference of the posterior distribution is intractable, Gibbs sampling (Griffiths and Steyvers, 2004) is employed to approximate it; the sampler sequentially draws each variable of interest from its conditional probability given the current values of all other variables and the data. With sufficient iterations, the sampling process eventually reaches a state in which all samples can be seen as generated from a stationary distribution. The variables used in the sampling process are: (1) z_{d,n,u}, the level allocation of the u-th word in the n-th phrase of the d-th tweet; and (2) c_d, the path of the d-th tweet. The objective of the Gibbs sampler is to approximate p(c, z | w, λ, γ, α, G_0, Ω); we omit Ω below for clarity. In Gibbs sampling, we focus on the per-document posterior p(c_d, z_d | c_{−d}, z_{−d}, w, λ, γ, α, G_0). For d ∈ {1, 2, ..., D}, we first sample the word-wise level allocations p(z_{d,n,u} | c, z_{−d,−n,−u}, w, α, G_0), where z_{−d,−n,−u} is the vector of level allocations leaving out z_{d,n,u}. We then sample the path p(c_d | c_{−d}, z, w, λ, γ, G_0).

3.2.1. Level Allocation Sampling

Given a path, we need to sample level allocations for tweet d. By Bayes’ rule and conditional independence, we obtain

p(z_{d,n,u} = l | c, z_{−d,−n,−u}, w, α, G_0) ∝ p(z_{d,n,u} = l | z_{d,−n,−u}, α) × p(w_{d,n} | c, z, w_{−d,−n}, G_0).   (1)

The first term in Eq. 1 is a conditional probability marginalizing out θ_d. Since the Dirichlet distribution p(θ_d | α) and the discrete distribution p(z_{d,n,u} | θ_d) form a Dirichlet-multinomial conjugate pair, we have

p(z_{d,n,u} = l | z_{d,−n,−u}, α) = (C^l_{d,−n,−u} + α[l]) / (C^·_{d,−n,−u} + Σ^L_{l=1} α[l]).

Here, C^l_{d,−n,−u} is the number of times level label l has been assigned to word tokens in tweet d leaving out z_{d,n,u}, C^·_{d,−n,−u} = Σ^L_{l=1} C^l_{d,−n,−u}, and L is the total depth of the hierarchy.

The second term in Eq. 1 follows the HPYP specified by z_{d,n,u}, with w_{d,n} as a new random variable given the particular state of that HPYP. We use the generalised CRP to perform sampling. Let Λ^{c_d[l]} denote the current seating arrangement of topic c_d[l]; the second term can then be rewritten as p(w_{d,n} | Λ^{c_d[l]}). The random process can be simulated by letting the first word w_{d,n,1} enter the restaurant Λ^{c_d[l]}_1 as a customer. The probability of the next word w from G^{c_d[l]}_u can be calculated recursively as

p(w | Λ^{c_d[l]}_u) = (C^{c_d[l]}_{uw} − a_{u−1} T^{c_d[l]}_{uw}) / (C^{c_d[l]}_{u·} + b_{u−1}) + ((a_{u−1} T^{c_d[l]}_{u·} + b_{u−1}) / (C^{c_d[l]}_{u·} + b_{u−1})) × p(w | Λ^{c_d[l]}_{u−1}).

Here, C^{c_d[l]}_{uw} denotes the number of customers eating dish w in restaurant u owned by topic c_d[l], T^{c_d[l]}_{uw} denotes the number of tables serving dish w in restaurant u owned by topic c_d[l], C^{c_d[l]}_{u·} = Σ_w C^{c_d[l]}_{uw} and T^{c_d[l]}_{u·} = Σ_w T^{c_d[l]}_{uw}. The recursion ends at u = 1, where

p(w | Λ^{c_d[l]}_1) = (C^{c_d[l]}_{1w} − a_0 T^{c_d[l]}_{1w}) / (C^{c_d[l]}_{1·} + b_0) + ((a_0 T^{c_d[l]}_{1·} + b_0) / (C^{c_d[l]}_{1·} + b_0)) × 1/V.

If w_{d,n,u} is not the first word of a multi-word phrase, we simply set z_{d,n,u} = z_{d,n,u−1} and do not sample a new level allocation.
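Putting the two factors of Eq. 1 together, one Gibbs step for a phrase's level allocation can be sketched as follows. This assumes the hpyp_word_prob helper from the earlier sketch; the count bookkeeping (removing the phrase's counts before sampling and re-adding them afterwards) is deliberately omitted.

```python
import random

def sample_level(phrase, level_counts, alpha, path_topic_chains, a, b, V):
    """One Gibbs step for the level allocation of phrase n in tweet d (Eq. 1).

    level_counts[l] is the number of the tweet's tokens currently assigned
    to level l (excluding this phrase); alpha[l] is the Dirichlet parameter
    of level l; path_topic_chains[l] is the chain of HPYP restaurant
    statistics of the topic sitting at level l of the tweet's current path.
    """
    total = sum(level_counts)
    weights = []
    for l in range(len(level_counts)):
        # First factor of Eq. 1: collapsed Dirichlet-multinomial level prior.
        prior = (level_counts[l] + alpha[l]) / (total + sum(alpha))
        # Second factor of Eq. 1: HPYP likelihood of the whole phrase, since
        # non-initial words inherit the first word's level.
        likelihood = 1.0
        for u, word in enumerate(phrase):
            likelihood *= hpyp_word_prob(word, path_topic_chains[l][:u + 1],
                                         a, b, V)
        weights.append(prior * likelihood)
    return random.choices(range(len(level_counts)), weights=weights)[0]
```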
3.2.2. Path Sampling

Given the paths of the other tweets and the level allocations, we sample the path for tweet d. Applying Bayes’ rule, we have

p(c_d | c_{−d}, z, w, λ, γ, G_0) ∝ p(c_d | c_{−d}, λ, γ) × p(w_d | c, z, w_{−d}, G_0).   (2)

The first term in Eq. 2 is a prior over paths. It can be computed by first evaluating the Bernoulli distribution and then each level’s seating distribution in the corresponding restaurant. The second term in Eq. 2 is the probability of the given tweet under a possible seating arrangement in the nCRP tree. It can be decomposed into the probabilities of the phrases/words occurring in the tweet:

p(w_d | c, z, w_{−d}, G_0) = p(w_{d,1} | c, z, w_{−d}, G_0) × p(w_{d,2} | c, z, w_{−d}, w_{d,1}, G_0) × ··· × p(w_{d,N_d} | c, z, w_{−d}, w_{d,1}, ..., w_{d,N_d−1}, G_0).   (3)

Each term in Eq. 3 is a posterior distribution over phrase w_{d,n} conditioned on the other phrases allocated to the same topic. The distribution follows an HPYP, and can thus be computed in the same way as described for the second term of Eq. 1.

3.2.3. Complete Sampling Procedure

Algorithm 1: Gibbs sampling for HOP.
1. Initialize the model by arbitrarily assigning a path to each tweet. Randomly assign a level number on the path to each word/phrase in the tweet. Initialize the HPYP configuration within each topic for each associated word/phrase.
2. For each tweet d ∈ {1, 2, ..., D}:
   (a) Sample c^{(t+1)}_d using Eq. 2.
   (b) For each phrase n in tweet d, n ∈ {1, 2, ..., N_d}:
       i. For each word u in phrase n, u ∈ {1, 2, ..., U_{d,n}}:
          A. Sample z^{(t+1)}_{d,n,u} using Eq. 1.
3. Repeat step 2 until the global log-likelihood converges or a fixed number of iterations is reached.
4. Output the final sample {c, z}.

Given the conditional distributions defined above, we are able to perform full Gibbs sampling. Let {c^{(t)}, z^{(t)}} denote the current state; the sampling process is described in Algorithm 1.

4. Experiments

In this section, we first describe the datasets and baselines used in our experiments. We then evaluate HOP against the baselines quantitatively and qualitatively.

4.1. Experimental Setup

Algorithm 2: Opinion tweet retrieval.
1. Define the seed patterns “[brexit|leaving the EU|staying in the EU]+[would|can|might|won’t|can’t]” for Dataset I (Brexit), and “[Trump|Hillary|Donald Trump|Hillary Clinton]+[would|can|might|won’t|can’t]” for Dataset II (US General Election). These seed patterns are used for retrieving seed tweets such as “Top economic think tanks agree that brexit would harm economy. Vote Remain in the EU” or “Trump would start WW3! There is no one he hasn’t offended”. We denote the set of seed patterns as P and the set of opinion tweets as T.
2. (a) Perform Part-of-Speech (POS) tagging on the seed tweets T.
   (b) Extract keywords which are tagged as the latter noun in the POS tag pattern “NN+MD+VB+NN”.
   (c) Enlarge the seed pattern set P by adding new rules based on the newly extracted keywords, of the form “keyword+[would|can|might|won’t|can’t]”.
   (d) Retrieve tweets based on the enlarged seed pattern set, and add the retrieved tweets to T.
3. Repeat step 2 until no more tweets are found.
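A minimal sketch of the pattern-matching and bootstrapping loop of Algorithm 2 is shown below, with the seed patterns written as regular expressions. The POS-based keyword extraction of steps 2(a)-(b) is stubbed out, and the simplified pattern strings are assumptions on top of the rules stated above.

```python
import re

MODALS = r"(?:would|can|might|won't|can't)"
SEED_TERMS = r"(?:brexit|leaving the eu|staying in the eu)"  # Dataset I
patterns = [re.compile(rf"\b{SEED_TERMS}\s+{MODALS}\b", re.IGNORECASE)]

def extract_keywords(tweets):
    """Stub for steps 2(a)-(b): POS-tag the tweets (e.g. with NLTK) and pull
    out the trailing noun of the 'NN MD VB NN' pattern. Returns none here."""
    return []

def bootstrap(tweets, patterns):
    """Iterate Algorithm 2: retrieve matching tweets, mine new keywords,
    grow the pattern set, and stop when no new tweets are found."""
    matched = set()
    while True:
        new = [t for t in tweets
               if t not in matched and any(p.search(t) for p in patterns)]
        if not new:
            return matched
        matched.update(new)
        for kw in extract_keywords(new):
            patterns.append(
                re.compile(rf"\b{re.escape(kw)}\s+{MODALS}\b", re.IGNORECASE))
```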
It usually takes 3https://github.com/somethingx86/EclipseTwitterStreamer 4https://developer.twitter.com/en/docs 5http://radimrehurek.com/gensim/models/phrases.html October 8, 2018 https://github.com/somethingx86/EclipseTwitterStreamer https://developer.twitter.com/en/docs http://radimrehurek.com/gensim/models/phrases.html Table 2: Statistics of the two datasets. Dataset Property Value I #tweets 68,672 #unigram tokens 1,219,484 vocabulary size 31,149 II #tweets 36,013 #unigram tokens 591,620 vocabulary size 21,595 around 1,000 iterations to reach a stationary status when the topic topology and tweet-topic associations no longer change, which can be visualized using a tree structure. 4.2. Methods for Comparison We compare our model with the following approaches which can generate topic hierarchies: CATHY (Constructing A Topical HierarchY) (Wang et al., 2013) builds a topical hierarchy where each topic is represented as a set of phrases. A term co-occurrence network is first con- structed using the Frequent Pattern (FP)-growth algorithm. The network edges are clustered according to their associated topics to obtain a hierarchy of topics. In our experiments, the hierar- chy depth is set to 3, the cluster number is set to 2 for level 2 and 4 for level 3. Other parameter settings follow (Wang et al., 2013). HLDA (Hierarchical LDA) (Blei et al., 2010) assumes that each document is generated by drawing an infinite L-level path ac- cording to the nCRP prior, and drawing a topic distribution over levels in the path according to a stick-breaking process. Words are drawn from the L topics which are associated with the restaurants along that path. HASM (Hierarchical Aspect-Sentiment Model) (Kim et al., 2013) produces hierarchical aspects in which each aspect is as- sociated with positive, negative or neutral polarities. The aspect hidden variable follows an rCRP. We use the default parame- ter settings in our experiments. Note that the prior sentiment knowledge from some common sentiment seed words is used to set the asymmetric Dirichlet prior for aspect-sentiment-word distributions in HASM. In our experiments, its default senti- ment lexicon is replaced by Sentiment140Lex6 (Mohammad et al., 2013) which was specifically constructed for Twitter sen- timent analysis. For HOP, the parameter settings are: λ = 0.5, γ = 0.5, α = [0.75, 0.15, 0.075, 0.025], au = 0.8, bu = 1 for HPYP, the maximum phrase length U = 3. We only keep topics which are associated with at least 1,000 tweets. 4.3. Topic Coherence Various measures have been proposed to evaluate the qual- ity of the topics discovered. Newman et al. (2010) found that the pointwise mutual information (PMI) of all word pairs in a 6http://www.saifmohammad.com/Lexicons/Sentiment140-Lexicon-v0.1. zip topic’s top ten words coincides well with human judgements. PMI is defined as follows: PMI(wi, w j) = log p(wi, w j) p(wi) p(w j) , (4) where p(wi, w j) is the co-occurrence likelihood of two words. It can be estimated by counting the co-occurrence of the word pair in sliding windows in an external large meta-document. Röder et. al (2015) studied the known coherence measures and pro- posed a new measure which was a combination of some exist- ing ones. Particularly, this metric first retrieves co-occurrence counts for the given words using a sliding window of size 110 in Wikipedia. For each top word a vector is built whose compo- nents are the normalized Point-wise Mutual Information (PMI) between the word and every other top words. 
[Figure 4: Topic coherence on the two datasets (Dataset I / Dataset II): CATHY 0.3796 / 0.4033, HLDA 0.3979 / 0.4269, HASM 0.3946 / 0.4151, HOP 0.4207 / 0.4383.]

Following the measure proposed by Röder et al. (2015), we report the average topic coherence computed over the top 10 words/phrases of each topic, shown in Figure 4. It can be observed that CATHY scores lowest of the four methods on both datasets, and HLDA gives better results than HASM. HOP outperforms all the other methods, and its improvement over the second best performing model, HLDA, is more prominent on Dataset I.

4.4. Stance Classification

Some previous studies have used topic models to perform sentiment/stance classification with comparable results (Trabelsi and Zaïane, 2014; Thonet et al., 2016). In addition to HLDA and HASM, we select two more baselines that can output document-level stance labels:

VODUM (Viewpoint and Opinion Discovery Unification Model) (Thonet et al., 2016) assumes that nouns are topical words while adjectives, verbs and adverbs are opinion words, and uses different generative routes for topical and opinion words. The model associates a viewpoint label (equivalent to a stance label here) with each document.

sLDA (Supervised LDA) (Mcauliffe and Blei, 2008) modifies LDA by adding an observed response variable to each document, which follows a Gaussian distribution whose mean is the weighted average of the topic assignments of all the document’s tokens.

To obtain the ground truth for the evaluation of stance classification, we hired three senior undergraduate students, who were working on NLP-related final-year projects, to manually annotate randomly selected tweets from Dataset II. Tweets were discarded if the annotators disagreed; in total, 80 tweets were discarded. In the end, we kept 1,000 tweets, consisting of 748 positive and 252 negative tweets, and we compare HOP with the baselines on this dataset for stance classification. For HOP, we use the level-2 topics for stance classification. For HLDA, there is no restriction on the number of level-2 topics, so we manually inspect all the level-2 topics to identify the likely ‘Support’ and ‘Oppose’ stances, after which each document’s stance can be identified accordingly. For VODUM, the number of viewpoints (stances) is set to 2 and all the hyperparameters take their default settings. We run each approach 20 times and average the classification results over these runs.

HASM uses prior sentiment knowledge to set the asymmetric Dirichlet prior for the aspect-sentiment-word distributions; more specifically, it places higher probabilities on polarity seed words from a given sentiment lexicon. The incorporation of this kind of supervised information gives it an edge over the unsupervised alternatives. In order to compare our model with HASM fairly, we therefore also experimented with a variant of HOP (called ‘pHOP’) that incorporates prior information about stance and sentiment into the model. This prior information can be derived from either hashtags or existing sentiment lexicons.
For hashtags, we use ‘#VoteTrump’ and ‘#NeverHillary’ as proxy labels for the ‘Support’ stance, and ‘#VoteHillary’ and ‘#NeverTrump’ as proxy labels for the ‘Oppose’ stance. In the experiments, this prior information was only utilized during model initialization, where the ‘Support’ and ‘Oppose’ tweets were assigned to the left and right level-2 paths, respectively. Furthermore, we also utilize word-level sentiment prior information obtained from Sentiment140Lex: words with sentiment intensity larger than 1.0 are considered strong sentiment words, and we restrict these strong sentiment words to be generated only from the level-2 topics. During model initialization, if a tweet has a proxy stance label or a word token is found in the sentiment lexicon, the corresponding path or level allocation is kept; otherwise, a path or level allocation is randomly initialized. About 10% of tweets carry proxy labels, and 1.15% of word tokens are found in the sentiment lexicon in our data. For sLDA, the proxy stance labels are treated as the observed response variables used to train the model.
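The hashtag-based part of this initialization can be sketched as follows. The hashtag sets are the proxies named above; pinning a tweet to a 'support' or 'oppose' level-2 branch is a simplified stand-in for the model's actual path initialization.

```python
import random

SUPPORT_TAGS = {"#votetrump", "#neverhillary"}  # proxies for 'Support'
OPPOSE_TAGS = {"#votehillary", "#nevertrump"}   # proxies for 'Oppose'

def stance_proxy(tokens):
    """Return a proxy stance from hashtags, or None when no proxy applies
    (including the ambiguous case of hashtags from both sides)."""
    tags = {t.lower() for t in tokens if t.startswith("#")}
    if tags & SUPPORT_TAGS and not tags & OPPOSE_TAGS:
        return "support"
    if tags & OPPOSE_TAGS and not tags & SUPPORT_TAGS:
        return "oppose"
    return None

def init_level2_paths(tweets):
    """Pin proxy-labelled tweets to the corresponding level-2 branch and
    start the remaining tweets on a random branch."""
    return [stance_proxy(t) or random.choice(["support", "oppose"])
            for t in tweets]
```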
Stance classification results are presented in Figure 5. HLDA, VODUM and HOP are all unsupervised approaches without prior information. It can be observed that HLDA gives better results than VODUM, though VODUM tends to be more stable. HOP outperforms both HLDA and VODUM. Although the best accuracy achieved is only about 58%, this is still notable given that HOP is totally unsupervised. By incorporating prior information into the model, we observe a significant boost in stance classification accuracy: pHOP achieves an accuracy of 71%, outperforming sLDA by 2%. HASM only achieves an accuracy of 60%, since it only utilizes the sentiment lexicon and cannot benefit from the proxy stance labels, owing to its unbounded aspect node arrangement for each sentence, which has to follow an rCRP.

[Figure 5: Stance classification accuracy of VODUM, HLDA, HOP, sLDA, HASM and pHOP.]

4.5. Qualitative Results

To evaluate the quality of the opinion hierarchies discovered, we compare HOP with CATHY and HLDA. Figure 6 illustrates the opinion hierarchies generated by the three models on Dataset I. It can be observed that CATHY can distinguish the two opposing stances at level 2, but it generates less coherent topics at level 3. Oftentimes, CATHY tends to group phrases sharing common constituent words into the same topic, e.g., Topics 2 and 4 under the right branch of the level-2 topic. HLDA generates more sensible topics at level 3. However, it is not able to distinguish the two stances at level 2, since its number of topics is unconstrained. Also, the top topic words listed at level 3 for HLDA are dominated by unigrams and are less interpretable. This is not surprising, since HLDA operates under the bag-of-phrases assumption: phrases such as ‘immigration policy’ and ‘immigration control’ are treated as two distinct tokens even though they share the constituent word ‘immigration’. Hence, running HLDA on data with phrases identified suffers from more severe data sparsity, as the vocabulary size is significantly enlarged.

On the contrary, we can observe that HOP generates more distinct opinions, especially at level 2 (in Figure 6 we manually add the topic labels in bold face for easier understanding of the results). HOP also has more phrases appearing in its level-3 topics compared to HLDA. Overall, HOP offers better interpretability of the topics discovered. Moreover, when examining the hierarchical topic assignment of each individual tweet, we notice that in many cases tweets are assigned sensible hierarchical opinions. For example, “Briefing: Trade, investment and jobs will benefit if we Vote Leave” is assigned [Support, Economy]. It can easily be inferred from this topic result that the tweet supports Brexit for an economy-related reason.

[Figure 6: Opinion hierarchies discovered by CATHY, HLDA and HOP on Dataset I.]

On Dataset II, only the hierarchy generated by HOP is presented (Figure 7), due to the space limit.
Although the level-2 topics do not clearly represent two opposing stances, since the HOP model is totally unsupervised, it can be inferred from the level-3 topics (‘Email scandal of Hillary’ and ‘Trump vowed to drain the swamp’) that the left level-2 topic is about ‘Supporting Trump’. (‘Drain the swamp’ is the metaphor Trump used to describe his plan to fix problems in the federal government.) Similarly, we can infer from the level-3 topics that the right level-2 topic is about ‘Supporting Hillary’. Also, under the ‘Drain the swamp’ topic, the fourth level gives more fine-grained topics on ‘Controlling illegal immigrants’ and ‘Trump will lead a unified Republican government’.

[Figure 7: Opinion hierarchy discovered by HOP on Dataset II.]

5. Conclusion

In this paper, we have proposed an unsupervised Hierarchical Opinion Phrase (HOP) model in which each document is associated with a path in a topic tree and with a document-specific distribution over the levels of the tree. Phrases are drawn from their associated topic-specific HPYPs. Experimental results on two Twitter datasets show that the proposed HOP model is able to reveal hierarchical opinions from social media, and that HOP significantly outperforms existing approaches to hierarchical topic extraction in both topic coherence and stance classification. In our current work, all paths in the generated hierarchical opinion tree have the same depth; in future work, we will explore modelling hierarchical opinion trees with paths of varying depth.

Acknowledgements

We would like to thank the reviewers for their valuable comments and helpful suggestions. This work was funded by the National Natural Science Foundation of China (61528302, 61772132), the Natural Science Foundation of Jiangsu Province of China (BK20161430) and Innovate UK (grant no. 103652).

References

Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):7.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016). Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659.

Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., and Xu, K. (2014). Adaptive recursive neural network for target-dependent Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49–54.

dos Santos, C. and Gatti, M. (2014).
Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78.

El-Kishky, A., Song, Y., Wang, C., Voss, C. R., and Han, J. (2014). Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3):305–316.

Elfardy, H., Diab, M., and Callison-Burch, C. (2015). Ideological perspective detection using semantic features. In Lexical and Computational Semantics (*SEM 2015), page 137.

Fung, B. C., Wang, K., and Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 59–70. SIAM.

Ghiassi, M., Skinner, J., and Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with Applications, 40(16):6266–6282.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

Gui, L., Zhou, Y., Xu, R., He, Y., and Lu, Q. (2017). Learning representations from heterogeneous network for sentiment classification of product reviews. Knowledge-Based Systems, 124:34–45.

Handler, A., Denny, M. J., Wallach, H., and O'Connor, B. (2016). Bag of what? Simple noun phrase extraction for text analysis. NLP+CSS 2016, 114.

Hasan, K. S. and Ng, V. (2013). Frame semantics for stance classification. In CoNLL, pages 124–132.

He, Y. (2016). Extracting topical phrases from clinical documents. In AAAI, pages 2957–2963.

Kawamae, N. (2012). Hierarchical approach to sentiment analysis. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, pages 138–145. IEEE.

Kim, J. H., Kim, D., Kim, S., and Oh, A. (2012). Modeling topic hierarchies with the recursive Chinese restaurant process. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 783–792. ACM.

Kim, S., Zhang, J., Chen, Z., Oh, A. H., and Liu, S. (2013). A hierarchical aspect-sentiment model for online reviews. In AAAI.

Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.

Lim, K. W. and Buntine, W. (2014). Twitter opinion topic model: Extracting product opinions from tweets by leveraging hashtags and sentiment lexicon. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1319–1328. ACM.

Lin, C. and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375–384. ACM.

Lin, C., He, Y., and Everson, R. (2010). A comparative study of Bayesian models for unsupervised sentiment detection. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 144–152. Association for Computational Linguistics.

Lindsey, R. V., Headden III, W. P., and Stipicevic, M. J. (2012). A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 214–222.

Mcauliffe, J. D. and Blei, D. M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013).
Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119.

Mohammad, S., Kiritchenko, S., and Zhu, X. (2013). NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA.

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics.

Nguyen, D., Vo, K., Pham, D., Nguyen, M., and Quan, T. (2017). A deep architecture for sentiment analysis of news articles. In International Conference on Computer Science, Applied Mathematics and Applications, pages 129–140. Springer.

Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. (2015). Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):256–270.

Pitman, J. et al. (2002). Combinatorial stochastic processes.

Ren, Y., Zhang, Y., Zhang, M., and Ji, D. (2016). Context-sensitive Twitter sentiment classification using neural network. In AAAI, pages 215–221.

Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408. ACM.

Severyn, A. and Moschitti, A. (2015a). Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 959–962. ACM.

Severyn, A. and Moschitti, A. (2015b). UNITN: Training deep convolutional neural network for Twitter sentiment classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469.

Tang, D., Qin, B., and Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.

Thonet, T., Cabanac, G., Boughanem, M., and Pinel-Sauvagnat, K. (2016). VODUM: a topic model unifying viewpoint, topic and opinion discovery. In European Conference on Information Retrieval, pages 533–545. Springer.

Trabelsi, A. and Zaïane, O. R. (2014). Finding arguing expressions of divergent viewpoints in online debates. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL, pages 35–43.

Vilares, D. and He, Y. (2017). Detecting perspectives in political debates. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1573–1582.

Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM.

Wang, C., Danilevsky, M., Desai, N., Zhang, Y., Nguyen, P., Taula, T., and Han, J. (2013). A phrase mining framework for recursive construction of a topical hierarchy.
In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 437–445. ACM.

Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 697–702. IEEE.

Wang, Y., Huang, M., Zhao, L., et al. (2016). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.