Hierarchical Viewpoint Discovery from Tweets Using Bayesian Modelling

Lixing Zhu (a), Yulan He (b), Deyu Zhou (a,*)

(a) School of Computer Science and Engineering, Southeast University, China
(b) Department of Computer Science, University of Warwick, UK

Abstract

When users express their stances towards a topic in social media, they might elaborate their viewpoints or reasoning. Oftentimes, viewpoints expressed by different users exhibit a hierarchical structure. Detecting such hierarchical viewpoints therefore offers better insight into public opinion. In this paper, we propose a novel Bayesian model for hierarchical viewpoint discovery from tweets. Driven by the observation that a viewpoint expressed in a tweet can be regarded as a path from the root to a leaf of a hierarchical viewpoint tree, the assignment of the relevant viewpoint topics is assumed to follow two nested Chinese restaurant processes. Moreover, opinions in text are often expressed in semantically non-decomposable multi-word terms or phrases, such as ‘economic recession’. Hence, a hierarchical Pitman-Yor process is employed as a prior for modelling the generation of phrases of arbitrary length. Experimental results on two Twitter corpora demonstrate the effectiveness of the proposed Bayesian model for hierarchical viewpoint discovery.

Keywords: Natural language processing, Opinion mining, Bayesian modelling

1. Introduction

Stance classification aims to predict one’s stance in a two-sided debate or on a controversial hot topic, and has been intensively studied in recent years (Hasan and Ng, 2013; Elfardy et al., 2015). However, beyond detecting one’s stance, we are more interested in figuring out the reasons or key viewpoints behind a person’s support for or opposition to an issue of interest. Moreover, viewpoints expressed by different users can be related and exhibit a hierarchical structure. Figure 1 illustrates an example hierarchical viewpoint tree, in which both User A and User B are supporters of Trump. However, the former simply expressed his support for Trump without mentioning any reasons, while the latter stated the reason that Trump is a charismatic leader. We can also see that both User C and User D support Trump due to his economic policy, but for different reasons (‘higher employment rate’ vs. ‘trade protection’).
Such a hierarchical viewpoint tree enables a better understanding of user opinions and allows a quick glimpse of the reasons behind users’ stances.

[Figure 1: An example of a hierarchical viewpoint tree on the topic “Trump run for election” from Twitter. The root splits into ‘Support’ and ‘Oppose’ stances; ‘Support’ branches into viewpoints such as ‘Personal charisma’ and ‘Economic policy’, the latter splitting further into ‘Higher employment rate’ and ‘Trade protection’, while ‘Oppose’ includes ‘Racism’.]

Mining hierarchical viewpoints from tweets is challenging for the reasons below: (1) unlike a stance, which is simply either ‘Support’ or ‘Oppose’, the hierarchical structure of viewpoints is unknown a priori; (2) people tend to express their opinions in many different ways, often in informal or ungrammatical language; (3) opinion expressions often contain multi-word phrases, for example, ‘economic recession’ and ‘economic growth’. Simply decomposing such phrases into unigrams may lose their original semantic meaning; applying a bag-of-words topic model might then wrongly group them under the same topic due to the shared word ‘economic’.

(*) Corresponding author. Fax: 8602552090861. Email addresses: zhulixing@seu.edu.cn (Lixing Zhu), y.he@cantab.net (Yulan He), d.zhou@seu.edu.cn (Deyu Zhou).

To tackle these challenges, in this paper we propose a Bayesian model, called the Hierarchical Opinion Phrase (HOP) model, for hierarchical viewpoint discovery from text. In this model, the root node (level 1) contains the topic of interest (e.g., ‘Trump run for president’) and the level-2 topics indicate stance (either ‘Support’ or ‘Oppose’), while topics at level 3 and below contain viewpoints under the different stances. Assuming that the viewpoints in each tweet are generated from a path from the root to a leaf of a hierarchical viewpoint tree, the assignment of viewpoint topics can be regarded as following two nested Chinese Restaurant Processes (nCRPs). Furthermore, a hierarchical Pitman-Yor process is employed as a prior to model the generation of phrases of arbitrary length. We have also explored various approaches for incorporating prior information, such as sentiment lexicons and hashtags, in order to improve stance classification accuracy. To the best of our knowledge, our work is the first attempt at hierarchical viewpoint detection. The proposed approach has been evaluated on two Twitter corpora. Experimental results demonstrate the effectiveness of our approach in comparison with existing approaches for hierarchical topic detection or viewpoint discovery. Our source code is made available at https://github.com/somethingx86/HOP.

2. Related Work

In this section, we give a brief review of four related lines of research: opinion mining based on topic models, hierarchical topic extraction, topical phrase models, and deep learning for sentiment analysis.

2.1. Opinion Mining Based on Topic Models

Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have proven effective for opinion mining. Lin and He (2009) proposed a Joint Sentiment Topic (JST) model that extends the standard LDA by adding a sentiment layer on top of the topic layer to allow the extraction of positive and negative topics.
A variant of JST called reverse-JST was studied by Lin et al. (2010), in which the generation of sentiments and topics is reversed. Kawamae (2012) separated words into aspect words, sentiment words and other words; aspect words were generated conditioned on latent aspects, which were in turn sampled from sentiment-associated topics. Kim et al. (2013) modified this model by placing a recurrent Chinese Restaurant Process (rCRP) prior on the aspect variable. They mined a hierarchy of aspects from product reviews and associated each aspect with a positive or negative sentiment.

In addition to online reviews, LDA-based models have also been applied to debate forums and Twitter. Lim and Buntine (2014) proposed a Twitter opinion topic model which makes use of target-opinion pairs. Trabelsi and Zaïane (2014) designed a joint topic-viewpoint model by assuming that the distribution over viewpoints is associated with latent topics. Thonet et al. (2016) treated nouns as topic words while adjectives, verbs and adverbs were treated as opinion words for topic-specific opinion discovery. Vilares and He (2017) focused on detecting perspectives in political debates. They modelled topics and their associated perspectives as latent variables and generated words associated with topics or perspectives by following different routes. However, none of the aforementioned models is able to generate a hierarchy of opinions.

2.2. Hierarchical Topic Extraction

In general, there are two types of approaches to extracting topical hierarchies. The first is based on probabilistic graphical models. For example, the Hierarchical LDA (HLDA) model was proposed for discovering topical hierarchies from the abstracts of scientific papers (Blei et al., 2010). In this model, each document is assumed to be attached to a path in which each level is a topic. The path induces a set of topics, and words are generated in the same way as in LDA. The allocation of paths follows an nCRP prior. Kim et al. (2012) argued that topics should be distributed over all the nodes of the hierarchy, which they achieved by placing an rCRP prior on the tree. Paisley et al. (2015) proposed a nonparametric model called the nested hierarchical Dirichlet process to allow groups to be shared among clusters; their work extended the nCRP by incorporating a hierarchical Dirichlet process.

The second type of approach to hierarchical topic extraction is based on frequent pattern mining. An early example is the frequent itemset-based hierarchical clustering model (Fung et al., 2003), which simply clustered documents according to their shared items. Wang et al. (2013) developed a phrase mining framework called CATHY (Constructing A Topical HierarchY). It first builds a term co-occurrence network using a frequent pattern mining method of the kind commonly used in association rule mining. The initial network corresponds to the root topic. The network is then clustered into subtopic networks in a probabilistic way, by assuming that each term co-occurrence was generated by a topic. The process is repeated until no further subtopics can be found.

2.3. Topical Phrase Models

In order to address the widespread occurrence of semantically non-decomposable phrases, various topical phrase models have been proposed in the literature.
Wallach (2006) made the first attempt to extend LDA with a hierarchical Dirichlet language model, yielding the Bigram Topic Model (BTM), in which each word is generated from a distribution over the vocabulary following a two-level hierarchical Dirichlet process. Wang et al. (2007) extended BTM by adding a switch variable at each word position to decide whether to begin a new n-gram or to continue the previously identified n-gram. El-Kishky et al. (2014) developed a pipeline approach called TopMine. It first extracts frequent phrases using a frequent pattern mining method; an LDA-based model is then learned in which words in the same phrase are generated from the same topic. He (2016) proposed a topical phrase model which extends TopMine by using the hierarchical Pitman-Yor Process (HPYP) to model the generation of words in a phrase.

2.4. Deep Learning for Sentiment Analysis

Recent years have seen a surge of interest in developing deep learning approaches for sentiment analysis. Many of them have been applied to sentiment classification on product reviews (Chen et al., 2016; Gui et al., 2017), news articles (Lai et al., 2015; Nguyen et al., 2017) as well as tweets (Ghiassi et al., 2013; Dong et al., 2014). Different neural network architectures have been explored, including Convolutional Neural Network (CNN) models (dos Santos and Gatti, 2014; Severyn and Moschitti, 2015a,b), Long Short-Term Memory (LSTM) networks (Tang et al., 2015; Wang et al., 2016) and models with attention mechanisms (Ren et al., 2016; Yang et al., 2016). However, these models rely on annotated datasets in which each document is either labelled with its sentiment class or annotated with more fine-grained opinion targets/words for training. As such, they cannot be used for hierarchical opinion discovery in the absence of annotated data.

[Figure 2: Graphical models of (a) HLDA and (b) the proposed Hierarchical Opinion Phrase (HOP) model. Boxes are plate notation representing replicates.]

Table 1: Notations used in the article.

  Symbol      Description
  β_k         Topic k, which is a distribution over the vocabulary
  c_{d,l}     The l-th level, whose value indexes a topic and follows a CRP
  c_d         The path in the tree for tweet d, which follows the nCRP
  θ_d         Distribution over the levels for tweet d
  α           Parameter of the Dirichlet distribution
  γ           Concentration parameter of the CRPs
  η           Parameter of the symmetric Dirichlet distribution
  λ           Parameter of the Bernoulli distribution
  w_{d,n}     The token in tweet d at position n (HLDA)
  z_{d,n}     Level allocation for token w_{d,n} (HLDA)
  w_{d,n,u}   The token in tweet d, phrase n, position u (HOP)
  z_{d,n,u}   Level allocation for token w_{d,n,u} (HOP)
  G_0         Base distribution for the HPYPs
  G^k         Topic k, which is an HPYP
  G^k_u       The PYP that generates the u-th token in phrases of topic k

2.5. Summary

Our model is partly inspired by HLDA. While the root-level topics in HLDA mostly contain background words, our model is able to extract the key topic of interest from tweets at the root level. Also, the level-2 topics in our model are constrained to be stance-related topics, allowing the incorporation of prior information, as will be shown in the experiments section.
Furthermore, we incorporate the hierarchical Pitman-Yor Process (HPYP) as the prior governing the generation of multi-word phrases, and hence the proposed model can generate hierarchical viewpoints with better interpretability.

3. Hierarchical Opinion Phrase (HOP) Model

In this section, we propose the Hierarchical Opinion Phrase (HOP) model, which learns a viewpoint hierarchy from text while modelling the generation of phrases at the same time. Before presenting the details of HOP, we first describe the HLDA model. The notations used in our model and HLDA are summarized in Table 1.

HLDA (Blei et al., 2010), illustrated in Figure 2(a), assumes that each topic is tied to a node in a tree and that documents are generated by first selecting a path in the tree, then choosing a topic at each level of the path, and finally drawing words from the assigned topics. The variables c in Figure 2(a) are random variables indexing topics β. The per-tweet path c_d = {c_{d,1}, c_{d,2}, ..., c_{d,L}} follows the nCRP, whose values are determined by the seating process of the associated tweet, described below when we introduce the nCRP. The arrows linking the c variables indicate the constraint that a lower-level node on the path can only take the index of a topic in the restaurant pointed to by the upper level; that is, in the nCRP of the generative process a tweet only counts those tweets whose upper level takes the same value.

One problem with applying HLDA to viewpoint discovery is its unconstrained number of topics and the mixing of topics under different stances. The problem is aggravated when the corpus is noisy. We modify HLDA by placing a root topic shared by all documents to generate common words. Since the number of stances is fixed in our data (‘Support’ or ‘Oppose’), a stance latent variable with only two possible values is placed at level 2. The stance level will therefore hopefully separate two sets of opposing viewpoints. Since the number of level-2 topics is fixed at 2, it also becomes possible to incorporate side information into the model, such as hashtags indicating stances, as will be shown in the Experiments section. Another problem with the original HLDA is that it operates under the bag-of-words assumption. As such, its topic results are less interpretable, since many phrases are not semantically decomposable. We therefore propose to generate phrases from the modified HLDA model by incorporating the HPYP into the generative process. HPYP has previously been explored for single-level topical phrase extraction in newswire stories and clinical documents (Lindsey et al., 2012; He, 2016), but it has never been explored for hierarchical topical phrase extraction.

For comparison, the HOP model is illustrated in Figure 2(b), in which the root topic (level-1 topic) contains the topic of interest shared across all documents, and level 2 is a stance layer with only two possible topics, either ‘Support’ or ‘Oppose’. Topics at level 3 and below capture viewpoints under the different stances.

The proposed approach assumes that phrases have been identified prior to model learning. Phrase extraction can be done by many different approaches. In this paper, we extract phrases from data using an open-source toolkit, gensim.models.phrases (http://radimrehurek.com/gensim/models/phrases.html#id1). It first discovers candidate phrases based on word collocation patterns, then transforms the phrases into distributed representations, and finally filters out irrelevant phrases (Mikolov et al., 2013).
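To make this preprocessing step concrete, the following is a minimal sketch of collocation-based phrase detection with Gensim's Phrases model. The example tweets and the min_count/threshold settings are illustrative assumptions; the paper does not report the exact configuration used.

```python
# Minimal sketch of collocation-based phrase detection with Gensim.
# The toy corpus and parameter values are assumptions for illustration.
from gensim.models.phrases import Phrases

tokenised_tweets = [
    ["brexit", "would", "cause", "economic", "recession"],
    ["vote", "remain", "to", "avoid", "economic", "recession"],
    ["brexit", "would", "hurt", "workers", "rights"],
]  # in practice, the full pre-processed tweet collection

# Phrases scores adjacent word pairs by their collocation statistics; pairs
# above the threshold are merged into tokens such as 'economic_recession'.
bigram = Phrases(tokenised_tweets, min_count=2, threshold=1.0)
phrased = [bigram[tweet] for tweet in tokenised_tweets]
print(phrased[0])  # e.g. ['brexit', 'would', 'cause', 'economic_recession']
```

The merged tokens (e.g. 'economic_recession') are then treated as single phrase words by the HOP model, so that phrase detection stays decoupled from topic inference.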
In the following subsections we discuss the HOP model in more detail.

3.1. Generative Process

Suppose there is an L-level hierarchical opinion tree T, as shown in Figure 1, where L is fixed, and each tweet contains L latent topics corresponding to a path from the root node to a leaf node. For example, for the tweet “I am impressed with your patriotism and honesty, there needs to be more like you on capitol hill.”, the hierarchical topics are [Trump run for election, Support, Personal charisma]. The root node c_1 of T is shared by all tweets in the collection. The second level of T is limited to two topics corresponding to the two stances, ‘Support’ or ‘Oppose’, and is assumed to follow a Bernoulli distribution parameterized by λ. The values of the lower levels follow an nCRP prior, which captures the unbounded nature of viewpoints. We use c_d = {c_{d,1}, c_{d,2}, ..., c_{d,L}} to denote the path assigned to the d-th tweet. The prior for each level can therefore be expressed as

c_{d,2} ~ Bernoulli(λ),
c_{d,l} ~ CRP(γ, c_{d,l−1}, ..., c_{d,2}).

nCRP. We briefly describe the nCRP (Blei et al., 2010) here. The nCRP is employed as the prior for the assignment of topics, organizing the topics into a tree topology. As illustrated in Figure 3, tweets are analogous to customers and topics are analogous to tables. Assume the n-th customer enters the root restaurant. He will choose an existing table β_i with probability proportional to the number of customers already sitting there, or a new table with probability proportional to the concentration parameter γ, that is,

p(occupied table i | other customers) = n_i / (n − 1 + γ),
p(new table | other customers) = γ / (n − 1 + γ),

where n_i is the number of customers seated at table β_i, n is the total number of customers including the present one, and γ is the concentration parameter, normally set to 0.5, which controls how likely the customer is to sit at a new table. After choosing table β_i, the customer proceeds to choose a table in the lower-level restaurant pointed to by table β_i. This process continues until the customer reaches depth L. As a result, a path composed of tables (i.e., topics) from L different levels is induced, which forms the L hierarchical topics.
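As an illustration of the seating probabilities above, here is a small sketch (not the authors' implementation) that samples a table assignment for a new customer in a single restaurant.

```python
import random

def sample_table(table_counts, gamma=0.5):
    """Sample a table for the n-th customer of a single CRP restaurant.

    table_counts[i] holds the number of customers already seated at table i;
    a return value of len(table_counts) means a new table is opened.
    """
    n = sum(table_counts) + 1  # total customers, including the new one
    weights = [n_i / (n - 1 + gamma) for n_i in table_counts]
    weights.append(gamma / (n - 1 + gamma))  # probability of a new table
    return random.choices(range(len(weights)), weights=weights)[0]

# Example: three occupied tables seating 3, 1 and 2 customers.
table = sample_table([3, 1, 2], gamma=0.5)
```

In the nCRP this step is applied recursively: the table chosen at level l points to the restaurant used at level l+1, so repeating it down the tree yields a full path.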
HPYP. We detect phrases based on word collocations, so that phrase detection is separated from topic inference. In HOP, phrases are assigned the same topic if they are allocated the same level and their tweets happen to share the same table at that level. We use an HPYP to model the generation of phrases assigned the same topic. The HPYP was first proposed by Teh (2006) and has proven to be among the best smoothing methods for n-gram language models. In an HPYP, the distribution of the first word is generated from a PYP defined as

G_1 ~ PYP(a_0, b_0, G_0),

where G_0 is a uniform distribution over a fixed vocabulary, a_0 is the discount parameter governing the power-law behaviour of the generated distribution, and b_0 is the concentration parameter controlling the amount of variability of G_1 around G_0. The distribution of the second word is generated using G_1 as the base distribution, G_2 ~ PYP(a_1, b_1, G_1). The process continues until the end of the phrase. The generative process of drawing words from this prior can be simulated by the generalised CRP (Pitman et al., 2002). A restaurant corresponds to each G_u, where each table is served a dish whose value is chosen from the base distribution G_{u−1}. The first customer sits at the first table; the (n+1)-th customer either chooses an occupied table with probability proportional to the number of customers already sitting there and takes the value of the dish on that table, or chooses a new table with probability proportional to a constant parameter and orders a dish from the base distribution. This process continues until the proxy customer sits at an existing table or there is no parent restaurant. As such, the probability of the u-th word given the seating arrangement is

p(w | Λ_u) = (C_{uw} − a_{u−1} T_{uw}) / (C_{u·} + b_{u−1}) + ((a_{u−1} T_{u·} + b_{u−1}) / (C_{u·} + b_{u−1})) × p(w | Λ_{u−1}),

where C_{uw} is the number of customers having dish w in restaurant u, T_{uw} denotes the number of tables serving dish w in restaurant u, C_{u·} = Σ_w C_{uw} and T_{u·} = Σ_w T_{uw}.

It is worth noting that the restaurant setup here is different from that of the nCRP. In the nCRP, each tweet is a customer and each topic is a table; a tweet is assigned L topics (or tables) from the root to a leaf. In the HPYP for phrase generation, each word w is a customer, and a restaurant u is the context of the word. For example, for an n-gram phrase, the context of the n-th word is its preceding n − 1 words.

[Figure 3: Illustration of the nested Chinese Restaurant Process (nCRP). Each circle represents a table which points to a unique restaurant, denoted as a rectangle. Each customer (or tweet) first chooses a table at the upper level, then follows the link pointed to by that table to reach a lower-level restaurant and chooses another table there.]

Based on the above description, the generative process of HOP is given below.

• For each topic k ∈ {1, 2, 3, ..., ∞}:
  – G^k_1 ~ PYP(a_0, b_0, G_0)
  – G^k_2 ~ PYP(a_1, b_1, G^k_1)
    ...
  – G^k_U ~ PYP(a_{U−1}, b_{U−1}, G^k_{U−1})
• Set c_1 to be the root restaurant
• For each tweet d ∈ {1, 2, 3, ..., D}:
  – Select the level-2 topic c_{d,2} ~ Bernoulli(λ)
  – For each level l ∈ {3, ..., L}, select its corresponding topic c_{d,l} ~ CRP(γ, c_{d,l−1}, ..., c_{d,2})
  – Draw a distribution over levels θ_d ~ Dirichlet(α)
  – For each phrase n ∈ {1, 2, ..., N_d}:
    * For each word u ∈ {1, 2, ..., U_{d,n}}:
      · If it is the first word in the phrase (u = 1):
        assign a level z_{d,n,u} | θ_d ~ Discrete(θ_d), and
        draw a word w_{d,n,u} | {z_{d,n,u}, c_d, G} ~ Discrete(G^{c_d[z_{d,n,u}]}_1)
      · Else:
        set z_{d,n,u} = z_{d,n,u−1}, and
        draw a word w_{d,n,u} | {z_{d,n,u}, c_d, G} ~ Discrete(G^{c_d[z_{d,n,u}]}_u)

Here, c_d[z_{d,n,u}] denotes the z_{d,n,u}-th component of the vector c_d, and G_0 is a uniform distribution over a fixed vocabulary W of V words: for all w ∈ W, G_0(w) = 1/V. (Note that we slightly abuse notation in using n to index phrases.) Since phrases have already been identified prior to hierarchical opinion extraction, the boundaries of phrases are observed and need not be sampled from data.
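Before turning to inference, the HPYP predictive probability p(w | Λ_u) above can be sketched in code. The sketch below is a simplified illustration under an assumed representation of the seating statistics; a full sampler would also have to track and update the table assignments themselves.

```python
def hpyp_word_prob(w, restaurant_chain, a, b, V):
    """Recursively compute p(w | seating arrangement) under an HPYP.

    restaurant_chain lists the seating statistics of restaurants 1..u for a
    word at position u: each entry holds 'C' and 'T' dicts (customers and
    tables per dish) plus totals 'C_total' and 'T_total'. a[i] and b[i] are
    the discount/concentration parameters of restaurant i+1; V is the
    vocabulary size, so the root base distribution G0 assigns 1/V to w.
    """
    prob = 1.0 / V  # base case: the parent of restaurant 1 is the uniform G0
    for i, stats in enumerate(restaurant_chain):
        c_w = stats["C"].get(w, 0)
        t_w = stats["T"].get(w, 0)
        c_tot, t_tot = stats["C_total"], stats["T_total"]
        prob = ((c_w - a[i] * t_w) / (c_tot + b[i])
                + (a[i] * t_tot + b[i]) / (c_tot + b[i]) * prob)
    return prob
```

The loop unrolls the recursion from the root towards restaurant u, exactly mirroring the back-off structure of the equation above.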
3.2. Inference and Parameter Estimation

Given the observed variables w = {w_1, w_2, ..., w_D}, our goal is to infer the hidden variables c and z using the posterior distribution p(c, z | w, λ, γ, α, G_0, Ω), where Ω = {a_0, b_0, ..., a_U, b_U} denotes the hyper-parameters of the HPYPs. Since exact inference of the posterior distribution is intractable, Gibbs sampling (Griffiths and Steyvers, 2004) is employed to approximate it; the sampler sequentially draws each variable of interest from its conditional probability given the current values of all other variables and the data. With sufficient iterations, the sampling process eventually reaches a state in which all samples can be seen as generated from a stationary distribution. The variables used in the sampling process are: (1) z_{d,n,u}, the level allocation of the u-th word in the n-th phrase of the d-th tweet; and (2) c_d, the path of the d-th tweet. The objective of the Gibbs sampler is to approximate p(c, z | w, λ, γ, α, G_0, Ω); we omit Ω below for clarity. In Gibbs sampling, we focus on the per-document posterior p(c_d, z_d | c_{−d}, z_{−d}, w, λ, γ, α, G_0). For d ∈ {1, 2, ..., D}, we first sample the word-wise level allocations p(z_{d,n,u} | c, z_{−d,−n,−u}, w, α, G_0), where z_{−d,−n,−u} is the vector of level allocations leaving out z_{d,n,u}. We then sample the path p(c_d | c_{−d}, z, w, λ, γ, G_0).

3.2.1. Level Allocation Sampling

Given a path, we need to sample level allocations for tweet d. By Bayes’ rule and conditional independence, we obtain

p(z_{d,n,u} = l | c, z_{−d,−n,−u}, w, α, G_0) ∝ p(z_{d,n,u} = l | z_{d,−n,−u}, α) × p(w_{d,n} | c, z, w_{−d,−n}, G_0).   (1)

The first term in Eq. 1 is a conditional probability marginalizing out θ_d. Since the Dirichlet distribution p(θ_d | α) and the discrete distribution p(z_{d,n,u} | θ_d) form a Dirichlet-multinomial conjugate pair, we have

p(z_{d,n,u} = l | z_{d,−n,−u}, α) = (C^l_{d,−n,−u} + α[l]) / (C^·_{d,−n,−u} + Σ^L_{l=1} α[l]).

Here, C^l_{d,−n,−u} is the number of times level label l has been assigned to word tokens in tweet d leaving out z_{d,n,u}, C^·_{d,−n,−u} = Σ^L_{l=1} C^l_{d,−n,−u}, and L is the total depth of the hierarchy.

The second term in Eq. 1 follows the HPYP specified by z_{d,n,u}, with w_{d,n} as a new random variable given the particular state of that HPYP. We use the generalised CRP to perform sampling. Let Λ^{c_d[l]} denote the current seating arrangement of topic c_d[l]; the second term can then be rewritten as p(w_{d,n} | Λ^{c_d[l]}). The random process can be simulated by letting the first word w_{d,n,1} enter the restaurant Λ^{c_d[l]}_1 as a customer. The probability of the next word w from G^{c_d[l]}_u can be calculated recursively as

p(w | Λ^{c_d[l]}_u) = (C^{c_d[l]}_{uw} − a_{u−1} T^{c_d[l]}_{uw}) / (C^{c_d[l]}_{u·} + b_{u−1}) + ((a_{u−1} T^{c_d[l]}_{u·} + b_{u−1}) / (C^{c_d[l]}_{u·} + b_{u−1})) × p(w | Λ^{c_d[l]}_{u−1}).

Here, C^{c_d[l]}_{uw} denotes the number of customers eating dish w in restaurant u owned by topic c_d[l], T^{c_d[l]}_{uw} denotes the number of tables serving dish w in restaurant u owned by topic c_d[l], C^{c_d[l]}_{u·} = Σ_w C^{c_d[l]}_{uw} and T^{c_d[l]}_{u·} = Σ_w T^{c_d[l]}_{uw}. The recursion ends at u = 1, where

p(w | Λ^{c_d[l]}_1) = (C^{c_d[l]}_{1w} − a_0 T^{c_d[l]}_{1w}) / (C^{c_d[l]}_{1·} + b_0) + ((a_0 T^{c_d[l]}_{1·} + b_0) / (C^{c_d[l]}_{1·} + b_0)) × 1/V.

If w_{d,n,u} is not the first word of a multi-word phrase, we simply set z_{d,n,u} = z_{d,n,u−1} and do not sample a new level allocation.
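Putting the two factors of Eq. 1 together, one Gibbs step for a phrase's level allocation can be sketched as follows. This assumes the hpyp_word_prob helper from the earlier sketch; the count bookkeeping (removing the phrase's counts before sampling and re-adding them afterwards) is deliberately omitted.

```python
import random

def sample_level(phrase, level_counts, alpha, path_topic_chains, a, b, V):
    """One Gibbs step for the level allocation of phrase n in tweet d (Eq. 1).

    level_counts[l] is the number of the tweet's tokens currently assigned
    to level l (excluding this phrase); alpha[l] is the Dirichlet parameter
    of level l; path_topic_chains[l] is the chain of HPYP restaurant
    statistics of the topic sitting at level l of the tweet's current path.
    """
    total = sum(level_counts)
    weights = []
    for l in range(len(level_counts)):
        # First factor of Eq. 1: collapsed Dirichlet-multinomial level prior.
        prior = (level_counts[l] + alpha[l]) / (total + sum(alpha))
        # Second factor of Eq. 1: HPYP likelihood of the whole phrase, since
        # non-initial words inherit the first word's level.
        likelihood = 1.0
        for u, word in enumerate(phrase):
            likelihood *= hpyp_word_prob(word, path_topic_chains[l][:u + 1],
                                         a, b, V)
        weights.append(prior * likelihood)
    return random.choices(range(len(level_counts)), weights=weights)[0]
```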
3.2.2. Path Sampling

Given the paths of the other tweets and the level allocations, we sample the path for tweet d. Applying Bayes’ rule, we have

p(c_d | c_{−d}, z, w, λ, γ, G_0) ∝ p(c_d | c_{−d}, λ, γ) × p(w_d | c, z, w_{−d}, G_0).   (2)

The first term in Eq. 2 is a prior over paths. It can be computed by first evaluating the Bernoulli distribution and then each level’s seating distribution in the corresponding restaurant. The second term in Eq. 2 is the probability of the given tweet under a possible seating arrangement in the nCRP tree. It can be decomposed into the probabilities of the phrases/words occurring in the tweet:

p(w_d | c, z, w_{−d}, G_0) = p(w_{d,1} | c, z, w_{−d}, G_0) × p(w_{d,2} | c, z, w_{−d}, w_{d,1}, G_0) × ··· × p(w_{d,N_d} | c, z, w_{−d}, w_{d,1}, ..., w_{d,N_d−1}, G_0).   (3)

Each term in Eq. 3 is a posterior distribution over phrase w_{d,n} conditioned on the other phrases allocated to the same topic. The distribution follows an HPYP, and can thus be computed in the same way as described for the second term of Eq. 1.

3.2.3. Complete Sampling Procedure

Algorithm 1: Gibbs sampling for HOP.
1. Initialize the model by arbitrarily assigning a path to each tweet. Randomly assign a level number on the path to each word/phrase in the tweet. Initialize the HPYP configuration within each topic for each associated word/phrase.
2. For each tweet d ∈ {1, 2, ..., D}:
   (a) Sample c^{(t+1)}_d using Eq. 2.
   (b) For each phrase n in tweet d, n ∈ {1, 2, ..., N_d}:
       i. For each word u in phrase n, u ∈ {1, 2, ..., U_{d,n}}:
          A. Sample z^{(t+1)}_{d,n,u} using Eq. 1.
3. Repeat step 2 until the global log-likelihood converges or a fixed number of iterations is reached.
4. Output the final sample {c, z}.

Given the conditional distributions defined above, we are able to perform full Gibbs sampling. Let {c^{(t)}, z^{(t)}} denote the current state; the sampling process is described in Algorithm 1.

4. Experiments

In this section, we first describe the datasets and baselines used in our experiments. We then evaluate HOP against the baselines quantitatively and qualitatively.

4.1. Experimental Setup

Algorithm 2: Opinion tweet retrieval.
1. Define the seed patterns “[brexit|leaving the EU|staying in the EU]+[would|can|might|won’t|can’t]” for Dataset I (Brexit), and “[Trump|Hillary|Donald Trump|Hillary Clinton]+[would|can|might|won’t|can’t]” for Dataset II (US General Election). These seed patterns are used for retrieving seed tweets such as “Top economic think tanks agree that brexit would harm economy. Vote Remain in the EU” or “Trump would start WW3! There is no one he hasn’t offended”. We denote the set of seed patterns as P and the set of opinion tweets as T.
2. (a) Perform Part-of-Speech (POS) tagging on the seed tweets T.
   (b) Extract keywords which are tagged as the latter noun in the POS tag pattern “NN+MD+VB+NN”.
   (c) Enlarge the seed pattern set P by adding new rules based on the newly extracted keywords, of the form “keyword+[would|can|might|won’t|can’t]”.
   (d) Retrieve tweets based on the enlarged seed pattern set, and add the retrieved tweets to T.
3. Repeat step 2 until no more tweets are found.
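A minimal sketch of the pattern-matching and bootstrapping loop of Algorithm 2 is shown below, with the seed patterns written as regular expressions. The POS-based keyword extraction of steps 2(a)-(b) is stubbed out, and the simplified pattern strings are assumptions on top of the rules stated above.

```python
import re

MODALS = r"(?:would|can|might|won't|can't)"
SEED_TERMS = r"(?:brexit|leaving the eu|staying in the eu)"  # Dataset I
patterns = [re.compile(rf"\b{SEED_TERMS}\s+{MODALS}\b", re.IGNORECASE)]

def extract_keywords(tweets):
    """Stub for steps 2(a)-(b): POS-tag the tweets (e.g. with NLTK) and pull
    out the trailing noun of the 'NN MD VB NN' pattern. Returns none here."""
    return []

def bootstrap(tweets, patterns):
    """Iterate Algorithm 2: retrieve matching tweets, mine new keywords,
    grow the pattern set, and stop when no new tweets are found."""
    matched = set()
    while True:
        new = [t for t in tweets
               if t not in matched and any(p.search(t) for p in patterns)]
        if not new:
            return matched
        matched.update(new)
        for kw in extract_keywords(new):
            patterns.append(
                re.compile(rf"\b{re.escape(kw)}\s+{MODALS}\b", re.IGNORECASE))
```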
It usually takes 3https://github.com/somethingx86/EclipseTwitterStreamer 4https://developer.twitter.com/en/docs 5http://radimrehurek.com/gensim/models/phrases.html October 8, 2018 https://github.com/somethingx86/EclipseTwitterStreamer https://developer.twitter.com/en/docs http://radimrehurek.com/gensim/models/phrases.html Table 2: Statistics of the two datasets. Dataset Property Value I #tweets 68,672 #unigram tokens 1,219,484 vocabulary size 31,149 II #tweets 36,013 #unigram tokens 591,620 vocabulary size 21,595 around 1,000 iterations to reach a stationary status when the topic topology and tweet-topic associations no longer change, which can be visualized using a tree structure. 4.2. Methods for Comparison We compare our model with the following approaches which can generate topic hierarchies: CATHY (Constructing A Topical HierarchY) (Wang et al., 2013) builds a topical hierarchy where each topic is represented as a set of phrases. A term co-occurrence network is first con- structed using the Frequent Pattern (FP)-growth algorithm. The network edges are clustered according to their associated topics to obtain a hierarchy of topics. In our experiments, the hierar- chy depth is set to 3, the cluster number is set to 2 for level 2 and 4 for level 3. Other parameter settings follow (Wang et al., 2013). HLDA (Hierarchical LDA) (Blei et al., 2010) assumes that each document is generated by drawing an infinite L-level path ac- cording to the nCRP prior, and drawing a topic distribution over levels in the path according to a stick-breaking process. Words are drawn from the L topics which are associated with the restaurants along that path. HASM (Hierarchical Aspect-Sentiment Model) (Kim et al., 2013) produces hierarchical aspects in which each aspect is as- sociated with positive, negative or neutral polarities. The aspect hidden variable follows an rCRP. We use the default parame- ter settings in our experiments. Note that the prior sentiment knowledge from some common sentiment seed words is used to set the asymmetric Dirichlet prior for aspect-sentiment-word distributions in HASM. In our experiments, its default senti- ment lexicon is replaced by Sentiment140Lex6 (Mohammad et al., 2013) which was specifically constructed for Twitter sen- timent analysis. For HOP, the parameter settings are: λ = 0.5, γ = 0.5, α = [0.75, 0.15, 0.075, 0.025], au = 0.8, bu = 1 for HPYP, the maximum phrase length U = 3. We only keep topics which are associated with at least 1,000 tweets. 4.3. Topic Coherence Various measures have been proposed to evaluate the qual- ity of the topics discovered. Newman et al. (2010) found that the pointwise mutual information (PMI) of all word pairs in a 6http://www.saifmohammad.com/Lexicons/Sentiment140-Lexicon-v0.1. zip topic’s top ten words coincides well with human judgements. PMI is defined as follows: PMI(wi, w j) = log p(wi, w j) p(wi) p(w j) , (4) where p(wi, w j) is the co-occurrence likelihood of two words. It can be estimated by counting the co-occurrence of the word pair in sliding windows in an external large meta-document. Röder et. al (2015) studied the known coherence measures and pro- posed a new measure which was a combination of some exist- ing ones. Particularly, this metric first retrieves co-occurrence counts for the given words using a sliding window of size 110 in Wikipedia. For each top word a vector is built whose compo- nents are the normalized Point-wise Mutual Information (PMI) between the word and every other top words. 
[Figure 4: Topic coherence on the two datasets (Dataset I / Dataset II): CATHY 0.3796 / 0.4033, HLDA 0.3979 / 0.4269, HASM 0.3946 / 0.4151, HOP 0.4207 / 0.4383.]

Following the measure proposed by Röder et al. (2015), we report the average topic coherence computed over the top 10 words/phrases of each topic, shown in Figure 4. It can be observed that CATHY scores lowest of the four methods on both datasets, and HLDA gives better results than HASM. HOP outperforms all the other methods, and its improvement over the second best performing model, HLDA, is more prominent on Dataset I.

4.4. Stance Classification

Some previous studies have used topic models to perform sentiment/stance classification with comparable results (Trabelsi and Zaïane, 2014; Thonet et al., 2016). In addition to HLDA and HASM, we select two more baselines that can output document-level stance labels:

VODUM (Viewpoint and Opinion Discovery Unification Model) (Thonet et al., 2016) assumes that nouns are topical words while adjectives, verbs and adverbs are opinion words, and uses different generative routes for topical and opinion words. The model associates a viewpoint label (equivalent to a stance label here) with each document.

sLDA (Supervised LDA) (Mcauliffe and Blei, 2008) modifies LDA by adding an observed response variable to each document, which follows a Gaussian distribution whose mean is the weighted average of the topic assignments of all the document’s tokens.

To obtain the ground truth for the evaluation of stance classification, we hired three senior undergraduate students, who were working on NLP-related final-year projects, to manually annotate randomly selected tweets from Dataset II. Tweets were discarded if the annotators disagreed; in total, 80 tweets were discarded. In the end, we kept 1,000 tweets, consisting of 748 positive and 252 negative tweets, and we compare HOP with the baselines on this dataset for stance classification. For HOP, we use the level-2 topics for stance classification. For HLDA, there is no restriction on the number of level-2 topics, so we manually inspect all the level-2 topics to identify the likely ‘Support’ and ‘Oppose’ stances, after which each document’s stance can be identified accordingly. For VODUM, the number of viewpoints (stances) is set to 2 and all the hyperparameters take their default settings. We run each approach 20 times and average the classification results over these runs.

HASM uses prior sentiment knowledge to set the asymmetric Dirichlet prior for the aspect-sentiment-word distributions; more specifically, it places higher probabilities on polarity seed words from a given sentiment lexicon. The incorporation of this kind of supervised information gives it an edge over the unsupervised alternatives. In order to compare our model with HASM fairly, we therefore also experimented with a variant of HOP (called ‘pHOP’) that incorporates prior information about stance and sentiment into the model. This prior information can be derived from either hashtags or existing sentiment lexicons.
For hashtags, we use ‘#VoteTrump’ and ‘#NeverHillary’ as proxy labels for the ‘Support’ stance, and ‘#VoteHillary’ and ‘#NeverTrump’ as proxy labels for the ‘Oppose’ stance. In the experiments, this prior information was only utilized during model initialization, where the ‘Support’ and ‘Oppose’ tweets were assigned to the left and right level-2 paths, respectively. Furthermore, we also utilize word-level sentiment prior information obtained from Sentiment140Lex: words with sentiment intensity larger than 1.0 are considered strong sentiment words, and we restrict these strong sentiment words to be generated only from the level-2 topics. During model initialization, if a tweet has a proxy stance label or a word token is found in the sentiment lexicon, the corresponding path or level allocation is kept; otherwise, a path or level allocation is randomly initialized. About 10% of tweets carry proxy labels, and 1.15% of word tokens are found in the sentiment lexicon in our data. For sLDA, the proxy stance labels are treated as the observed response variables used to train the model.
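The hashtag-based part of this initialization can be sketched as follows. The hashtag sets are the proxies named above; pinning a tweet to a 'support' or 'oppose' level-2 branch is a simplified stand-in for the model's actual path initialization.

```python
import random

SUPPORT_TAGS = {"#votetrump", "#neverhillary"}  # proxies for 'Support'
OPPOSE_TAGS = {"#votehillary", "#nevertrump"}   # proxies for 'Oppose'

def stance_proxy(tokens):
    """Return a proxy stance from hashtags, or None when no proxy applies
    (including the ambiguous case of hashtags from both sides)."""
    tags = {t.lower() for t in tokens if t.startswith("#")}
    if tags & SUPPORT_TAGS and not tags & OPPOSE_TAGS:
        return "support"
    if tags & OPPOSE_TAGS and not tags & SUPPORT_TAGS:
        return "oppose"
    return None

def init_level2_paths(tweets):
    """Pin proxy-labelled tweets to the corresponding level-2 branch and
    start the remaining tweets on a random branch."""
    return [stance_proxy(t) or random.choice(["support", "oppose"])
            for t in tweets]
```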
Stance classification results are presented in Figure 5. HLDA, VODUM and HOP are all unsupervised approaches without prior information. It can be observed that HLDA gives better results than VODUM, though VODUM tends to be more stable. HOP outperforms both HLDA and VODUM. Although the best accuracy achieved is only about 58%, this is still notable given that HOP is totally unsupervised. By incorporating prior information into the model, we observe a significant boost in stance classification accuracy: pHOP achieves an accuracy of 71%, outperforming sLDA by 2%. HASM only achieves an accuracy of 60%, since it only utilizes the sentiment lexicon and cannot benefit from the proxy stance labels, owing to its unbounded aspect node arrangement for each sentence, which has to follow an rCRP.

[Figure 5: Stance classification accuracy of VODUM, HLDA, HOP, sLDA, HASM and pHOP.]

4.5. Qualitative Results

To evaluate the quality of the opinion hierarchies discovered, we compare HOP with CATHY and HLDA. Figure 6 illustrates the opinion hierarchies generated by the three models on Dataset I. It can be observed that CATHY can distinguish the two opposing stances at level 2, but it generates less coherent topics at level 3. Oftentimes, CATHY tends to group phrases sharing common constituent words into the same topic, e.g., Topics 2 and 4 under the right branch of the level-2 topic. HLDA generates more sensible topics at level 3. However, it is not able to distinguish the two stances at level 2, since its number of topics is unconstrained. Also, the top topic words listed at level 3 for HLDA are dominated by unigrams and are less interpretable. This is not surprising, since HLDA operates under the bag-of-phrases assumption: phrases such as ‘immigration policy’ and ‘immigration control’ are treated as two distinct tokens even though they share the constituent word ‘immigration’. Hence, running HLDA on data with phrases identified suffers from more severe data sparsity, as the vocabulary size is significantly enlarged.

On the contrary, we can observe that HOP generates more distinct opinions, especially at level 2 (in Figure 6 we manually add the topic labels in bold face for easier understanding of the results). HOP also has more phrases appearing in its level-3 topics compared to HLDA. Overall, HOP offers better interpretability of the topics discovered. Moreover, when examining the hierarchical topic assignment of each individual tweet, we notice that in many cases tweets are assigned sensible hierarchical opinions. For example, “Briefing: Trade, investment and jobs will benefit if we Vote Leave” is assigned [Support, Economy]. It can easily be inferred from this topic result that the tweet supports Brexit for an economy-related reason.

[Figure 6: Opinion hierarchies discovered by CATHY, HLDA and HOP on Dataset I.]

On Dataset II, only the hierarchy generated by HOP is presented (Figure 7), due to the space limit.
Although the level-2 topics do not clearly represent two opposing stances, since the HOP model is totally unsupervised, it can be inferred from the level-3 topics (‘Email scandal of Hillary’ and ‘Trump vowed to drain the swamp’) that the left level-2 topic is about ‘Supporting Trump’. (‘Drain the swamp’ is the metaphor Trump used to describe his plan to fix problems in the federal government.) Similarly, we can infer from the level-3 topics that the right level-2 topic is about ‘Supporting Hillary’. Also, under the ‘Drain the swamp’ topic, the fourth level gives more fine-grained topics on ‘Controlling illegal immigrants’ and ‘Trump will lead a unified Republican government’.

[Figure 7: Opinion hierarchy discovered by HOP on Dataset II.]

5. Conclusion

In this paper, we have proposed an unsupervised Hierarchical Opinion Phrase (HOP) model in which each document is associated with a path in a topic tree and with a document-specific distribution over the levels of the tree. Phrases are drawn from their associated topic-specific HPYPs. Experimental results on two Twitter datasets show that the proposed HOP model is able to reveal hierarchical opinions from social media, and that HOP significantly outperforms existing approaches to hierarchical topic extraction in both topic coherence and stance classification. In our current work, all paths in the generated hierarchical opinion tree have the same depth; in future work, we will explore modelling hierarchical opinion trees with paths of varying depth.

Acknowledgements

We would like to thank the reviewers for their valuable comments and helpful suggestions. This work was funded by the National Natural Science Foundation of China (61528302, 61772132), the Natural Science Foundation of Jiangsu Province of China (BK20161430) and Innovate UK (grant no. 103652).

References

Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):7.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016). Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659.

Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., and Xu, K. (2014). Adaptive recursive neural network for target-dependent Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49–54.

dos Santos, C. and Gatti, M. (2014).
Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78.

El-Kishky, A., Song, Y., Wang, C., Voss, C. R., and Han, J. (2014). Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3):305–316.

Elfardy, H., Diab, M., and Callison-Burch, C. (2015). Ideological perspective detection using semantic features. In Lexical and Computational Semantics (*SEM 2015), page 137.

Fung, B. C., Wang, K., and Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 59–70. SIAM.

Ghiassi, M., Skinner, J., and Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with Applications, 40(16):6266–6282.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

Gui, L., Zhou, Y., Xu, R., He, Y., and Lu, Q. (2017). Learning representations from heterogeneous network for sentiment classification of product reviews. Knowledge-Based Systems, 124:34–45.

Handler, A., Denny, M. J., Wallach, H., and O'Connor, B. (2016). Bag of what? Simple noun phrase extraction for text analysis. NLP+CSS 2016, 114.

Hasan, K. S. and Ng, V. (2013). Frame semantics for stance classification. In CoNLL, pages 124–132.

He, Y. (2016). Extracting topical phrases from clinical documents. In AAAI, pages 2957–2963.

Kawamae, N. (2012). Hierarchical approach to sentiment analysis. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, pages 138–145. IEEE.

Kim, J. H., Kim, D., Kim, S., and Oh, A. (2012). Modeling topic hierarchies with the recursive Chinese restaurant process. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 783–792. ACM.

Kim, S., Zhang, J., Chen, Z., Oh, A. H., and Liu, S. (2013). A hierarchical aspect-sentiment model for online reviews. In AAAI.

Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.

Lim, K. W. and Buntine, W. (2014). Twitter opinion topic model: Extracting product opinions from tweets by leveraging hashtags and sentiment lexicon. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1319–1328. ACM.

Lin, C. and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375–384. ACM.

Lin, C., He, Y., and Everson, R. (2010). A comparative study of Bayesian models for unsupervised sentiment detection. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 144–152. Association for Computational Linguistics.

Lindsey, R. V., Headden III, W. P., and Stipicevic, M. J. (2012). A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 214–222.

Mcauliffe, J. D. and Blei, D. M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013).
Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119.

Mohammad, S., Kiritchenko, S., and Zhu, X. (2013). NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA.

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics.

Nguyen, D., Vo, K., Pham, D., Nguyen, M., and Quan, T. (2017). A deep architecture for sentiment analysis of news articles. In International Conference on Computer Science, Applied Mathematics and Applications, pages 129–140. Springer.

Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. (2015). Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):256–270.

Pitman, J. et al. (2002). Combinatorial stochastic processes.

Ren, Y., Zhang, Y., Zhang, M., and Ji, D. (2016). Context-sensitive Twitter sentiment classification using neural network. In AAAI, pages 215–221.

Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408. ACM.

Severyn, A. and Moschitti, A. (2015a). Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 959–962. ACM.

Severyn, A. and Moschitti, A. (2015b). UNITN: Training deep convolutional neural network for Twitter sentiment classification. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 464–469.

Tang, D., Qin, B., and Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.

Thonet, T., Cabanac, G., Boughanem, M., and Pinel-Sauvagnat, K. (2016). VODUM: a topic model unifying viewpoint, topic and opinion discovery. In European Conference on Information Retrieval, pages 533–545. Springer.

Trabelsi, A. and Zaïane, O. R. (2014). Finding arguing expressions of divergent viewpoints in online debates. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL, pages 35–43.

Vilares, D. and He, Y. (2017). Detecting perspectives in political debates. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1573–1582.

Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM.

Wang, C., Danilevsky, M., Desai, N., Zhang, Y., Nguyen, P., Taula, T., and Han, J. (2013). A phrase mining framework for recursive construction of a topical hierarchy.
In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 437–445. ACM.

Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 697–702. IEEE.

Wang, Y., Huang, M., Zhao, L., et al. (2016). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.