Senti-LSSVM: Sentiment-Oriented Multi-Relation Extraction with Latent Structural SVM

Lizhen Qu, Max Planck Institute for Informatics, lqu@mpi-inf.mpg.de
Yi Zhang, Nuance Communications, yi.zhang@nuance.com
Rui Wang, DFKI GmbH, mars198356@hotmail.com
Lili Jiang, Max Planck Institute for Informatics, ljiang@mpi-inf.mpg.de
Rainer Gemulla, Max Planck Institute for Informatics, rgemulla@mpi-inf.mpg.de
Gerhard Weikum, Max Planck Institute for Informatics, weikum@mpi-inf.mpg.de

Abstract

Extracting instances of sentiment-oriented relations from user-generated web documents is important for online marketing analysis. Unlike previous work, we formulate this extraction task as a structured prediction problem and design the corresponding inference as an integer linear program. Our latent structural SVM based model can learn from training corpora that do not contain explicit annotations of sentiment-bearing expressions, and it can simultaneously recognize instances of both binary (polarity) and ternary (comparative) relations with regard to entity mentions of interest. The empirical evaluation shows that our approach significantly outperforms state-of-the-art systems across domains (cameras and movies) and across genres (reviews and forum posts). The gold standard corpus that we built will also be a valuable resource for the community.

1 Introduction

Sentiment-oriented relation extraction (Choi et al., 2006) is concerned with recognizing sentiment polarities and comparative relations between entities from natural language text. Identifying such relations often requires syntactic and semantic analysis at both the sentence and the phrase level. Most prior work on sentiment analysis considers either i) subjective sentence detection (Yu and Kübler, 2011), ii) polarity classification (Johansson and Moschitti, 2011; Wilson et al., 2005), or iii) comparative relation identification (Jindal and Liu, 2006; Ganapathibhotla and Liu, 2008). In practice, however, different types of sentiment-oriented relations frequently coexist in documents. In particular, we found that more than 38% of the sentences in our test corpus contain more than one type of relation. The isolated analysis approach is inappropriate because i) it sacrifices accuracy by ignoring the intricate interplay among different types of relations, and ii) it can lead to conflicting predictions, such as estimating a relation candidate as both negative and comparative. Therefore, in this paper, we identify instances of both sentiment polarities and comparative relations for entities of interest simultaneously. We assume that all mentions of entities and attributes are given and that entities are disambiguated; it is a widely used assumption when evaluating a module in a pipeline system that the outputs of the preceding modules are error-free.

To the best of our knowledge, the only existing system capable of extracting both comparisons and sentiment polarities is a rule-based system proposed by Ding et al. (2009). We argue that it is better to tackle the task with a unified model with structured outputs. It allows us to consider a set of correlated relation instances jointly and characterize their interaction through a set of soft and hard constraints. For example, we can encode constraints to discourage an attribute from participating in a polarity relation and a comparative relation at the same time. As a result, the system extracts a set of correlated instances of sentiment-oriented relations from a given sentence.
For example, for the sentence about the camera Canon 7D, "The sensor is great, but the price is higher than Nikon D7000.", the expected output is positive(Canon 7D, sensor) and preferred(Nikon D7000, Canon 7D, price).

However, constructing a fully annotated training corpus for this task is labor-intensive and requires a strong linguistic background. We minimize this overhead by applying a simplified annotation scheme, in which annotators mark mentions of entities and attributes, disambiguate the entities, and label instances of relations for each sentence. Based on the new scheme, we have created a small Sentiment Relation Graph (SRG) corpus for the domains of cameras and movies, which differs significantly from the corpora used in prior work (Wei and Gulla, 2010; Kessler et al., 2010; Toprak et al., 2010; Wiebe et al., 2005; Hu and Liu, 2004) in the following ways: i) both sentiment polarities and comparative relations are annotated; ii) all mentioned entities are disambiguated; and iii) no subjective expressions are annotated unless they are part of entity mentions.

The new annotation scheme raises a new challenge for learning algorithms in that they need to automatically find textual evidence for each annotated relation during training. For example, for the sentence "I like the Rebel a little better, but that is another price jump", simply assigning a sentiment-bearing expression to the nearest relation candidate is insufficient, especially when the sentiment is not explicitly expressed. In this paper, we propose SENTI-LSSVM, a latent structural SVM based model for sentiment-oriented relation extraction. SENTI-LSSVM finds the most likely set of relation instances expressed in a given sentence, where latent variables are used to assign the most appropriate textual evidence to the respective instances.

In summary, the contributions of this paper are the following:

• We propose SENTI-LSSVM: the first unified statistical model capable of extracting instances of both binary and ternary sentiment-oriented relations.
• We design a task-specific integer linear programming (ILP) formulation for inference.
• We construct a new SRG corpus as a valuable asset for the evaluation of sentiment relation extraction.
• We conduct extensive experiments with online reviews and forum posts, showing that the SENTI-LSSVM model can effectively learn from a training corpus without explicitly annotated subjective expressions and that it significantly outperforms state-of-the-art systems.

2 Related Work

There is ample work on analyzing sentiment polarities and entity comparisons, but the majority of it studies the two tasks in isolation.

Most prior approaches to fine-grained sentiment analysis focus on polarity classification. Supervised approaches to expression-level analysis require the annotation of sentiment-bearing expressions as training data (Jin et al., 2009; Choi and Cardie, 2010; Johansson and Moschitti, 2011; Yessenalina and Cardie, 2011; Wei and Gulla, 2010). However, the corresponding annotation process is time-consuming.
Although sentence-level annotations are easier to obtain, analysis at this level cannot cope with sentences conveying relations of multiple types (McDonald et al., 2007; Täckström and McDonald, 2011; Socher et al., 2012). Lexicon-based approaches require no training data (Ku et al., 2006; Kim and Hovy, 2006; Godbole et al., 2007; Ding et al., 2008; Popescu and Etzioni, 2005; Liu et al., 2005) but suffer from inferior performance (Wilson et al., 2005; Qu et al., 2012). In contrast, our method requires no annotation of sentiment-bearing expressions for training and can predict both sentiment polarities and comparative relations.

Sentiment-oriented comparative relations have been studied in the context of user-generated discourse (Jindal and Liu, 2006; Ganapathibhotla and Liu, 2008). These approaches rely on linguistically motivated rules and assume the existence of independent keywords in sentences which indicate comparative relations. Therefore, these methods fall short of extracting comparative relations that are based on domain-dependent information.

Both Johansson and Moschitti (2011) and Wu et al. (2011) formulate fine-grained sentiment analysis as a learning problem with structured outputs. However, they focus only on polarity classification of expressions and likewise require the annotation of sentiment-bearing expressions for training.

While ILP has previously been applied for inference in sentiment analysis (Choi and Cardie, 2009; Somasundaran and Wiebe, 2009; Wu et al., 2011), our task requires a complete ILP reformulation due to 1) the absence of annotated sentiment expressions and 2) the constraints imposed by the joint extraction of both sentiment polarity and comparative relations.

3 System Overview

This section gives an overview of the whole system for extracting sentiment-oriented relation instances. Prior to presenting the system architecture, we introduce the essential concepts and the definitions of the two kinds of directed hypergraphs that serve as the representation of correlated relation instances extracted from sentences.

3.1 Concepts and Definitions

Entity. An entity is an abstract or concrete thing, which need not be of material existence. An entity in this paper refers to either a product or a brand.

Attribute. An attribute is an object closely associated with or belonging to an entity, such as the lens of a digital camera.

Sentiment-Oriented Relation. A sentiment-oriented relation is either a sentiment polarity or a comparative relation, defined on tuples of entities and attributes. A sentiment polarity relation conveys either a positive or a negative attitude towards entities or their attributes, whereas a comparative relation indicates the preference of one entity over the other entity w.r.t. an attribute.

Relation Instance. An instance of sentiment polarity takes the form r(entity, attribute) with r ∈ {positive, negative}, such as positive(Canon 7D, sensor). Polarity instances expressed in the form of unary relations, such as "Nikon D7000 is excellent.", are denoted as binary relations r(entity, whole), where the attribute whole refers to the entity as a whole. In contrast, an instance of a comparative relation has the form preferred(entity, entity, attribute), e.g. preferred(Canon 7D, Nikon D7000, price). For brevity, we refer to an instance set of sentiment-oriented relations extracted from a sentence as an sSoR.
To represent instances of the remaining relations, we write other(entity, attribute), such as partOf(wheel, car). These relations include objective relations as well as subjective relations other than sentiment-oriented ones.

Mention-Based Relation Instances. A mention-based relation instance refers to a tuple of entity mentions with a certain relation. This concept is introduced to represent instances in a sentence by replacing entities with the corresponding entity mentions, such as positive("Canon SD880i", "wide angle view").

Figure 1: An example of MRG.

Mention-Based Relation Graph. A mention-based relation graph (or MRG) represents a collection of mention-based relation instances expressed in a sentence. As illustrated in Figure 1, an MRG is a directed hypergraph G = 〈M, E〉 with a vertex set M and an edge set E. A vertex mi ∈ M denotes a mention of an entity or an attribute occurring either within the sentence or in its context. We say that a mention is from the context if it is mentioned in the previous sentence or is an attribute implied in the current sentence. An instance of a binary relation in an MRG takes the form of a binary edge el = (mi, ma), where mi and ma denote an entity mention and an attribute mention respectively, and the type l ∈ {positive, negative, other}. A ternary edge el indicating a comparative relation is represented as el = (mi, mj, ma), where two entity mentions mi and mj are compared with respect to the attribute mention ma. We define the type l ∈ {better, worse} to indicate the two possible directions of the relation and assume that mi occurs before mj. As a result, we have a set L of five relation types: positive, negative, better, worse and other. According to these definitions, the annotations in the SRG corpus are in fact MRGs and disambiguated entities. If there are multiple mentions referring to the same entity, annotators are asked to choose the most obvious one, because this saves annotation time and is less demanding for the entity recognition and disambiguation modules.

Figure 2: An example of eMRG. The textual evidences are wrapped by green dashed boxes.

Evidentiary Mention-Based Relation Graph. An evidentiary mention-based relation graph, coined eMRG, extends an MRG by associating each edge with a textual evidence that supports the corresponding relation assertion (see Figure 2). Consequently, an edge in an eMRG is denoted by a pair (a, c), where a represents a mention-based relation instance and c is the associated textual evidence; such an edge is also referred to as an evidentiary edge.

3.2 System Architecture

Figure 3: System architecture.

As illustrated by Figure 3, at the core of our system is the SENTI-LSSVM model, which extracts sets of mention-based relation instances in the form of eMRGs from sentences. For a given sentence with known entity mentions, we select all possible mention sets as relation candidates, where each set includes at least one entity mention. Then we associate each relation candidate with a set of constituents or the whole sentence as the textual evidence candidates (cf. Section 6.1). Subsequently, the inference component finds the most likely eMRG among all possible combinations of mention-based relation instances and their textual evidences (cf. Section 6.2). The representation eMRG is chosen because it characterizes exactly the model outputs, by letting each edge correspond to a mention-based relation instance and its associated textual evidence. Finally, the parameters of the model are learned by an online algorithm (cf. Section 7).
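To make the graph definitions above concrete, the following is a minimal sketch of MRG and eMRG as Python data structures; the class and field names are our own illustration, not part of the original implementation.

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass(frozen=True)
class Mention:
    text: str             # surface form, e.g. "Canon 7D"
    kind: str             # "entity" or "attribute"
    from_context: bool    # True if mentioned in the previous sentence or implied

@dataclass(frozen=True)
class Edge:
    # (entity, attribute) for binary edges, (entity, entity, attribute) for ternary
    mentions: Tuple[Mention, ...]
    label: str            # positive | negative | better | worse | other

@dataclass
class MRG:
    vertices: list = field(default_factory=list)  # all Mention objects
    edges: list = field(default_factory=list)     # labeled Edge objects

@dataclass
class EMRG(MRG):
    # an eMRG additionally pairs every edge with its supporting constituent
    evidence: dict = field(default_factory=dict)  # Edge -> evidence string

# Example: the positive(Canon 7D, sensor) edge from Figure 2.
canon = Mention("Canon 7D", "entity", from_context=False)
sensor = Mention("sensor", "attribute", from_context=False)
g = EMRG(vertices=[canon, sensor])
e = Edge((canon, sensor), "positive")
g.edges.append(e)
g.evidence[e] = "The sensor is great"
```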
Since instance sets of sentiment-oriented relations (sSoRs) are the expected outputs, we obtain sSoRs from MRGs using a simple rule-based algorithm. The algorithm essentially maps the mentions from an MRG to entities and attributes in an sSoR and labels the corresponding tuples with the relation types of the edges of the MRG. For instances of comparative relations, the labels better and worse are mapped to the relation type preferred.

4 SENTI-LSSVM Model

The task of sentiment-oriented relation extraction is to determine the most likely sSoR in a sentence. Since sSoRs are derived from the corresponding MRGs as described in Section 3, the task reduces to finding the most likely MRG for each sentence. Since an MRG is created by assigning relation types to a subset of all relation candidates, which are possible tuples of mentions with unknown relation types, the number of possible MRGs can be extremely high.

One solution is to employ an edge-factored linear model in the framework of structural SVM (Martins et al., 2009; Tsochantaridis et al., 2004). In such a model, a bag of features is specified for each relation candidate, and the model predicts the most likely candidate sets along with their relation types to form the optimal MRGs. As we observed, for a relation candidate, the most informative features are the words near its entity mentions in the original text. However, if we represent a candidate by all of these words, it is very likely that instances of different relation types share overly similar features, because a mention is often involved in more than one relation candidate, as shown in Figure 2. As a consequence, instances of different relations represented by overly similar features can easily confuse the learning algorithm. Thus, it is critical to select proper constituents or sentences as textual evidence for each relation candidate in both training and testing.

Consequently, we divide the task of sentiment-oriented relation extraction into two subtasks: i) identifying the most likely MRGs; ii) assigning proper textual evidence to each edge of the MRGs to support its relation assertion. It is desirable to carry out the two subtasks jointly, as they can enhance each other. First, the identification of relation types requires proper textual evidence; second, the soft and hard constraints imposed by the correlated relation instances facilitate the recognition of the corresponding textual evidence. Since eMRGs are created by attaching to every MRG a set of textual evidences, tackling the two subtasks simultaneously is equivalent to selecting the most likely eMRG from a set of eMRG candidates. This is challenging because our SRG corpus does not contain any annotation of textual evidence.

Formally, let X denote the set of all available sentences. We define y ∈ Y(x) (x ∈ X) as the set of labeled edges of an MRG and Y = ∪_{x∈X} Y(x). Since the assignments of textual evidences are not observed, an assignment of evidences to y is denoted by a latent variable h ∈ H(x), with H = ∪_{x∈X} H(x). Then (y, h) corresponds to an eMRG, and (a, c) ∈ (y, h) is a labeled edge a attached to a textual evidence c.
Given a labeled dataset D = {(x1, y1), ..., (xn, yn)} ∈ (X × Y)^n, we aim to learn a discriminant function f : X → Y × H that outputs the optimal eMRG (y, h) ∈ Y(x) × H(x) for a given sentence x. Due to the introduction of latent variables, we adopt the latent structural SVM (Yu and Joachims, 2009) for structured classification. Our discriminant function is defined as

f(x) = argmax_{(y,h) ∈ Y(x)×H(x)} β^⊤ Φ(x, y, h)    (1)

where Φ(x, y, h) is the feature function of an eMRG (y, h) and β is the corresponding weight vector. To ensure tractability, we employ an edge-based factorization for our model. Let M_p denote the set of entity mentions and y_r(m_i) the set of edges labeled with sentiment-oriented relations incident to m_i. The factorization of Φ(x, y, h) is given as

Φ(x, y, h) = Σ_{(a,c) ∈ (y,h)} Φ_e(x, a, c) + Σ_{m_i ∈ M_p} Σ_{a,a' ∈ y_r(m_i), a ≠ a'} Φ_c(a, a')    (2)

where Φ_e(x, a, c) is a local edge feature function for a labeled edge a attached to a textual evidence c, and Φ_c(a, a') is a feature function capturing the co-occurrence of two labeled edges a and a' incident to an entity mention m_i.

5 Feature Space

The following features are used in the feature functions (Equation 2):

Unigrams: As mentioned before, a textual evidence attached to an edge of an MRG is either a word, a phrase or a sentence. We consider all lemmatized unigrams in the textual evidence as unigram features.

Context: Since web users usually express related sentiments about the same entity across sentence boundaries, we describe the sentiment flow using a set of contextual binary features. For example, if entity A is mentioned in both the previous and the current sentence, a set of contextual binary features indicates all possible combinations of the current and the previously mentioned sentiment-oriented relations regarding entity A.

Co-occurrence: We have mentioned the co-occurrence feature in Equation 2, indicated by Φ_c(a, a'). It captures the co-occurrence of two labeled edges incident to the same entity mention. Note that the co-occurrence feature function is considered only if there is a contrast conjunction such as "but" between the non-shared entity mentions incident to the two labeled edges.

Senti-predictors: Following the idea of Qu et al. (2012), we encode the prediction results of the rule-based phrase-level multi-relation predictor (Ding et al., 2009) and of the bag-of-opinions predictor (Qu et al., 2010) as features based on the textual evidence. The output of the first predictor is an integer value, while the output of the second predictor is a sentiment relation such as "positive", "negative", "better" or "worse". We map the relational outputs to integer values and then encode the outputs of both predictors as senti-predictor features.

Others: The commonly used part-of-speech tags are also included as features. Moreover, for an edge candidate, a set of binary features denotes the types of the edge and of its entity mentions. For instance, a binary feature indicates whether an edge is a binary edge related to an entity mentioned in the context. To characterize the syntactic dependencies between two adjacent entity mentions, we use as features the path in the dependency tree between the heads of the corresponding constituents, the number of words in between, and the number of other mentions in between. Additionally, if the textual evidence is a constituent, its feature w.r.t. an edge is the dependency path to the closest mention of the edge that does not overlap with this constituent.
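As an illustration of the local edge feature function Φ_e, here is a minimal sketch of how some of the feature groups above could be encoded as a sparse vector; the feature templates and names are our own simplification, not the exact feature set used in the experiments.

```python
from collections import Counter

def edge_features(label, evidence_tokens, evidence_pos, prev_relations):
    """Sparse Phi_e for one labeled edge and its textual evidence (a sketch)."""
    feats = Counter()
    # Unigrams: tokens of the evidence, conjoined with the relation label so
    # that each label learns its own lexical weights.
    for tok in evidence_tokens:
        feats[f"{label}|uni={tok.lower()}"] += 1
    # Part-of-speech tags of the evidence tokens.
    for tag in evidence_pos:
        feats[f"{label}|pos={tag}"] += 1
    # Context: combinations of the current label with relations expressed
    # about the same entity in the previous sentence.
    for prev in prev_relations:
        feats[f"ctx|prev={prev}|cur={label}"] = 1
    return feats

# Example: features for a positive edge supported by "the sensor is great".
print(edge_features("positive",
                    ["the", "sensor", "is", "great"],
                    ["DT", "NN", "VBZ", "JJ"],
                    prev_relations=["positive"]))
```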
6 Structural Inference

In order to find the best eMRG for a given sentence with a trained model, we need to determine the most likely relation type for each relation candidate and support the corresponding assertions with proper textual evidence. We formulate this task as an integer linear program (ILP). Instead of considering all constituents of a sentence, we empirically select a subset as textual evidence candidates for each relation candidate.

6.1 Textual Evidence Candidate Selection

Textual evidences are selected based on the constituent trees of sentences parsed by the Stanford parser (Klein and Manning, 2003). For each mention in a sentence, we first locate the constituent in the tree with the maximal overlap by Jaccard similarity. Starting from this constituent, we consider two types of candidates: type I candidates are constituents at the highest level which contain neither any word of another mention nor any contrast conjunction such as "but"; type II candidates are constituents at the highest level which cover exactly the two mentions of an edge and do not overlap with any other mentions. For a binary edge connecting an entity mention and an attribute mention, we consider a type I candidate starting from the attribute mention. For a binary edge connecting two entity mentions, we consider type I candidates starting from both mentions. Moreover, for a comparative ternary edge, we consider both type I and type II candidates starting from the attribute mention. This strategy is based on our observation that these candidates often cover the most important information w.r.t. the covered entity mentions. A simplified sketch of type I candidate selection is shown below.
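The following is a simplified sketch of type I candidate selection, assuming the constituent tree is available as an nltk.Tree (the paper uses the Stanford parser); the traversal is our own reading of the description above and omits type II candidates.

```python
from nltk.tree import Tree

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def type_one_candidate(root, mention_tokens, other_mention_tokens):
    # 1. Locate the constituent with maximal Jaccard overlap with the mention.
    best = max(root.subtrees(),
               key=lambda st: jaccard(st.leaves(), mention_tokens))
    target = set(best.leaves())
    # 2. subtrees() is a preorder (top-down) traversal, so the first hit is the
    #    highest-level constituent covering the mention that contains neither a
    #    word of another mention nor a contrast conjunction such as "but".
    for st in root.subtrees():
        leaves = set(st.leaves())
        if target <= leaves and "but" not in leaves \
                and not leaves & set(other_mention_tokens):
            return st
    return best

t = Tree.fromstring(
    "(S (NP (DT The) (NN sensor)) (VP (VBZ is) (ADJP (JJ great))))")
print(type_one_candidate(t, ["sensor"], []))
# returns the whole S clause, since it contains no contrast conjunction
```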
6.2 ILP Formulation

We formulate the inference problem of finding the best eMRG as an ILP, because ILPs allow a convenient integration of both soft and hard constraints. Given the model parameters β, we rewrite the score of an eMRG in the discriminant function (1) as

β^⊤ Φ(x, y, h) = Σ_{(a,c) ∈ (y,h)} s_{ac} z_{ac} + Σ_{m_i ∈ M_p} Σ_{a,a' ∈ y_r(m_i), a ≠ a'} s_{aa'} z_{aa'}

where s_{ac} = β^⊤ Φ_e(x, a, c) denotes the score of a labeled edge a attached to a textual evidence c, s_{aa'} = β^⊤ Φ_c(a, a') is the edge co-occurrence score, the binary variable z_{ac} indicates the presence or absence of the corresponding evidentiary edge, and z_{aa'} indicates whether two edges co-occur. As not every edge set forms an eMRG, we require that a valid eMRG satisfy a set of linear constraints, which form our constraint space. Function (1) is then equivalent to

max_z  s^⊤ z + μ Σ_c z_{dc}
s.t.   A [z; η; τ] ≤ d;  z, η, τ binary

where η and τ are auxiliary binary variables that help define the constraint space, and the z_{dc} are the evidence-selection indicators introduced below. This optimization problem is exactly an ILP, because both the constraints and the objective function are linear and all variables take only integer values.

In the following, we consider two types of constraint spaces: 1) for an eMRG with only binary edges and 2) for an eMRG with both binary and ternary edges.

eMRG with only Binary Edges: An eMRG has only binary edges if the sentence contains no attribute mention or at most one entity mention. We require that each edge have only one relation type and be supported by a single textual evidence. To facilitate the formulation of the constraints, we introduce η_{el} to denote the presence or absence of a labeled edge e^l, and η_{ec} to indicate whether a textual evidence c is assigned to an unlabeled edge e. The binary variable for the corresponding evidentiary edge is then z_{elc} = η_{ec} ∧ η_{el}, where the ILP formulation of the conjunction can be found in Martins et al. (2009).

Let C_e denote the set of textual evidence candidates of an unlabeled edge e. The constraint of at most one textual evidence per edge is formulated as

Σ_{c ∈ C_e} η_{ec} ≤ 1    (3)

Once a textual evidence is assigned to an edge, their relation labels should match, and the number of labeled edges must agree with the number of attached textual evidences. Further, we assume that a textual evidence c conveys at most one relation, so that an evidence will not be assigned to relations of different types, which is the main problem for the structural SVM based model. Let η_{cl} indicate that the textual evidence c is labeled with the relation type l. The corresponding constraints are

Σ_{l ∈ L_e} η_{el} = Σ_{c ∈ C_e} η_{ec};  z_{elc} ≤ η_{cl};  Σ_{l ∈ L} η_{cl} ≤ 1

where L_e denotes the set of all possible labels of an unlabeled edge e, and L is the set of all relation types of MRGs (cf. Section 3).

In order to avoid a textual evidence being overly reused by multiple relation candidates, we first penalize the assignment of a textual evidence c to a labeled edge a by associating the corresponding z_{ac} with a fixed negative cost −μ in the objective function. Then the selection of one textual evidence per edge is encouraged by associating μ with z_{dc} in the objective function, where z_{dc} = ∨_{e ∈ S_c} η_{ec} and S_c is the set of edges for which the textual evidence c serves as a candidate. The disjunction z_{dc} is expressed as

z_{dc} ≥ η_{ec}, e ∈ S_c;  z_{dc} ≤ Σ_{e ∈ S_c} η_{ec}

This soft constraint not only encourages one textual evidence per edge, but also keeps an evidence eligible for multiple assignments. For any two labeled edges a and a' incident to the same entity mention, the edge-to-edge co-occurrence is described by z_{aa'} = z_a ∧ z_{a'}.

Figure 4: Alternative structures associated with an attribute mention: (a) binary edge structure; (b) ternary edge structure.

eMRG with both Binary and Ternary Edges: If there is more than one entity mention and at least one attribute mention in a sentence, an eMRG can potentially have both binary and ternary edges. In this case, we assume that each attribute mention participates either in binary relations or in ternary relations. The assumption holds in more than 99.9% of the sentences in our SRG corpus, so we express it as a set of hard constraints. Geometrically, the assumption can be visualized as the selection between two alternative structures incident to the same attribute mention, as shown in Figure 4. Note that, in the binary edge structure, we include not only the edges incident to the attribute mention but also the edge between the two entity mentions.

Let S^b_{mi} be the set of all possible labeled edges in a binary edge structure of an attribute mention m_i. The variable τ^b_{mi} = ∨_{el ∈ S^b_{mi}} η_{el} indicates whether the attribute mention is associated with a binary edge structure. In the same manner, we use τ^t_{mi} = ∨_{el ∈ S^t_{mi}} η_{el} to indicate the association of an attribute mention m_i with a ternary edge structure from the set of all incident ternary edges S^t_{mi}. The selection between the two alternative structures is formulated as τ^b_{mi} + τ^t_{mi} = 1. As this influences only the edges incident to an attribute mention, we keep all the constraints introduced above unchanged except for constraint (3), which is modified to

Σ_{c ∈ C_e} η_{ec} ≤ τ^b_{mi};  Σ_{c ∈ C_e} η_{ec} ≤ τ^t_{mi}

Therefore, we can have either binary edges or ternary edges for an attribute mention. A toy instantiation of the binary-edge ILP is sketched below.
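To illustrate the shape of the resulting program, here is a toy instantiation of the binary-edge ILP using the generic solver PuLP; the paper does not specify which ILP solver was used, and the scores below are invented for illustration. The soft evidence-reuse terms are omitted for brevity.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

edges = ["e1", "e2"]
labels = ["positive", "negative", "other"]
evids = ["c1", "c2"]
score = {(e, l, c): 0.1 for e in edges for l in labels for c in evids}
score[("e1", "positive", "c1")] = 1.0  # fake model scores s_ac

prob = LpProblem("emrg_inference", LpMaximize)
eta_el = LpVariable.dicts("eta_el", [(e, l) for e in edges for l in labels], cat="Binary")
eta_ec = LpVariable.dicts("eta_ec", [(e, c) for e in edges for c in evids], cat="Binary")
eta_cl = LpVariable.dicts("eta_cl", [(c, l) for c in evids for l in labels], cat="Binary")
z = LpVariable.dicts("z", [(e, l, c) for e in edges for l in labels for c in evids],
                     cat="Binary")

prob += lpSum(score[k] * z[k] for k in z)  # objective: sum of s_ac * z_ac
for e in edges:
    # constraint (3): at most one evidence per edge
    prob += lpSum(eta_ec[(e, c)] for c in evids) <= 1
    # number of labels agrees with number of attached evidences
    prob += lpSum(eta_el[(e, l)] for l in labels) == \
            lpSum(eta_ec[(e, c)] for c in evids)
for c in evids:
    # each evidence conveys at most one relation type
    prob += lpSum(eta_cl[(c, l)] for l in labels) <= 1
for e in edges:
    for l in labels:
        for c in evids:
            # z = eta_ec AND eta_el, linearized as in Martins et al. (2009)
            prob += z[(e, l, c)] <= eta_ec[(e, c)]
            prob += z[(e, l, c)] <= eta_el[(e, l)]
            prob += z[(e, l, c)] >= eta_ec[(e, c)] + eta_el[(e, l)] - 1
            prob += z[(e, l, c)] <= eta_cl[(c, l)]  # labels must match
prob.solve()
print([k for k in z if z[k].value() == 1])  # e1 pairs "positive" with c1
```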
7 Learning Model Parameters

Given a set of training sentences D = {(x1, y1), ..., (xn, yn)}, the best weight vector β of the discriminant function (1) is found by solving the following optimization problem:

min_β (1/n) Σ_{i=1}^{n} [ max_{(ŷ,ĥ) ∈ Y(xi)×H(xi)} (β^⊤ Φ(xi, ŷ, ĥ) + δ(ĥ, ŷ, yi)) − max_{h̄ ∈ H(xi)} β^⊤ Φ(xi, yi, h̄) ] + ρ|β|    (4)

where δ(ĥ, ŷ, y) is a loss function measuring the discrepancies between an eMRG (y, h̄) with gold standard edge labels y and an eMRG (ŷ, ĥ) with inferred edge labels ŷ and textual evidences ĥ. Due to the sparse nature of the lexical features, we apply an L1 regularizer to the weight vector β; the degree of sparsity is controlled by the hyperparameter ρ.

Since the L1 norm in the above optimization problem is not differentiable at zero, we apply the online forward-backward splitting (FOBOS) algorithm (Duchi and Singer, 2009). It updates the weight vector β in two steps, using a single training sentence x on each iteration t:

β_{t+1/2} = β_t − ε_t Δ_t
β_{t+1} = argmin_β (1/2) ‖β − β_{t+1/2}‖² + ε_t ρ|β|

where Δ_t is the subgradient computed without considering the L1 norm and ε_t is the learning rate; the second step has a closed-form solution, component-wise soft-thresholding. For a labeled sentence x, Δ_t = Φ(x, ŷ*, ĥ*) − Φ(x, y, h̄*), where the feature functions of the corresponding eMRGs are obtained by solving

(ŷ*, ĥ*) = argmax_{(ŷ,ĥ) ∈ Y(x)×H(x)} [β^⊤ Φ(x, ŷ, ĥ) + δ(ĥ, ŷ, y)]

and h̄* = argmax_{h̄ ∈ H(x)} β^⊤ Φ(x, y, h̄), as indicated in the optimization problem (4).

The former inference problem is similar to the one considered in the previous section, except for the inclusion of the loss function. We incorporate the loss function into the ILP formulation by defining the loss between an inferred eMRG and the gold standard MRG as the sum of per-edge costs. In our experiments, we use a positive cost ϕ for each wrongly labeled edge a: if an edge a has a different label from the gold standard, we add ϕ to the coefficient s_{ac} of the corresponding variable z_{ac} in the objective function of the ILP formulation.

In addition, the non-positive weights of edge labels in the initial learning phase often lead to eMRGs with many unlabeled edges, which harms learning performance. We counter this by adding a constraint on the minimal number of labeled edges in an eMRG,

Σ_{a ∈ A} Σ_{c ∈ C_a} η_{ac} ≥ ζ    (5)

where A is the set of all labeled edge candidates and ζ denotes the minimal number of labeled edges. Empirically, the best way to determine ζ is to set it to the maximal number of labeled edges in an eMRG under the restriction that a textual evidence can be assigned to at most one edge. Consider the bipartite graph Ĝ = 〈V = (A, C), E〉 whose two vertex sets are the edge candidates A and the textual evidence candidates C, with edges in E indicating which textual evidence can be assigned to which edge candidate. Then ζ is exactly the size of a maximum matching of this bipartite graph, which we compute with the Hopcroft-Karp algorithm (Hopcroft and Karp, 1973).
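Determining ζ thus amounts to a standard maximum bipartite matching computation. A minimal sketch, assuming edge candidates and evidence candidates are represented by distinct hashable identifiers, and using the Hopcroft-Karp implementation shipped with networkx:

```python
import networkx as nx
from networkx.algorithms import bipartite

def compute_zeta(edge_candidates, evidence_candidates, can_support):
    """Size of a maximum matching between edge and evidence candidates.

    can_support(a, c) -> True if evidence c is a candidate for edge a.
    """
    g = nx.Graph()
    g.add_nodes_from(edge_candidates, bipartite=0)
    g.add_nodes_from(evidence_candidates, bipartite=1)
    g.add_edges_from((a, c) for a in edge_candidates
                     for c in evidence_candidates if can_support(a, c))
    matching = bipartite.hopcroft_karp_matching(g, top_nodes=edge_candidates)
    return len(matching) // 2  # the matching dict stores each pair twice

# Toy example: two edge candidates competing for one shared evidence.
print(compute_zeta(["a1", "a2"], ["c1"], lambda a, c: True))  # zeta = 1
```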
To find the optimal eMRG (y, h̄*), the labels of the edges are known for the training data, so for the gold label k of each edge we use the following set of constraints for inference:

Σ_{c ∈ C_e} η_{ec} ≤ 1;  η_{ec} ≤ l_{ck};  Σ_{k' ∈ L} l_{ck'} ≤ 1;  Σ_{e ∈ S_c} η_{ec} ≤ 1

We also include the soft constraints that prevent a textual evidence from being overly reused by multiple relations, as well as constraints similar to (5) that ensure a minimal number of labeled edges and a minimal number of sentiment-oriented relations.

8 SRG Corpus

For evaluation we constructed the SRG corpus, which in total consists of 1686 manually annotated online reviews and forum posts in the digital camera and movie domains: the 107 camera reviews are from bestbuy.com and Amazon.com; the 667 camera forum posts are from forum.digitalcamerareview.com; and the 138 movie reviews and 774 movie forum posts are from imdb.com and boards.ie respectively. For each domain, we maintain a set of attributes and a list of entity names.

The annotation scheme for the sentiment representation demands minimal linguistic knowledge from our annotators. By focusing on the meanings of the sentences, the annotators make decisions based on their language intuition, not restricted by specific syntactic structures. Taking the example in Figure 2, the annotators only need to mark the mentions of entities and attributes in the sentences and the context, disambiguate them, and label ("Canon 7D", "Nikon D7000", price) as worse and ("Canon 7D", "sensor") as positive, whereas in prior work, annotators additionally marked sentiment-bearing expressions such as "great" and linked them to the respective relation instances. This also enables them to annotate instances of both sentiment polarity and comparative relations which are conveyed not only by explicit sentiment-bearing expressions like "excellent performance", but also by factual expressions implying evaluations, such as "The 7V has 10x optical zoom and the 9V has 16x.".

Table 1: Distribution of relation instances in the SRG corpus.

                   Camera              Movie
              Reviews   Forums    Reviews   Forums
  positive        386     1539        879      905
  negative        165      363        529      331
  comparison       30      480         39       35

14 annotators participated in the annotation project. After a short training period, annotators worked on randomly assigned documents one at a time. For product reviews, the annotation system lists all relevant information about the entity and the predefined attributes. For forum posts, the system shows only the attribute list. For each sentence in a document, the annotator first determines whether it refers to an entity of interest. If not, the sentence is marked as off-topic. Otherwise, the annotator identifies the most obvious mentions, disambiguates them, and marks the MRGs. We evaluate the inter-annotator agreement on sSoRs in terms of Cohen's Kappa (κ) (Cohen, 1968). An average Kappa value of 0.698 was achieved on a randomly selected set of 412 sentences.

Table 1 shows the corpus distribution after normalizing the annotations into sSoRs. Camera forum posts contain the largest proportion of comparisons, because they are mainly about the recommendation of digital cameras. In contrast, web users are much less interested in comparing movies, in both reviews and forums. In all subsets, positive relations play a dominant role, since web users tend to express more positive attitudes online than negative ones (Pang and Lee, 2007).
9 Experiments

This section describes the empirical evaluation of SENTI-LSSVM together with two competitive baselines on the SRG corpus.

9.1 Experimental Setup

We implemented a rule-based baseline (DING-RULE) and a structural SVM (Tsochantaridis et al., 2004) baseline (SENTI-SSVM) for comparison. The former system extends the work of Ding et al. (2009), which designed several linguistically motivated rules based on a sentiment polarity lexicon for relation identification and assumes that there is only one type of sentiment relation per sentence. In our implementation, we keep all the rules of Ding et al. (2009) and add one phrase-level rule for the case that there is more than one mention in a sentence. The additional rule assigns sentiment-bearing words and negators to their nearest relation candidates based on the absolute surface distance between the words and the corresponding mentions; in this case, the phrase-level sentiment-oriented relations depend only on the assigned sentiment words and negators. The latter system is based on a structural SVM and does not consider the assignment of textual evidences to relation instances during inference. The textual features of a relation candidate are all lexical and sentiment predictor features within a surface distance of four words from the mentions of the candidate. Thus, this baseline does not need the inference constraints of SENTI-LSSVM for the selection of textual evidences. To gain more insight into the model, we also evaluate the contribution of the individual feature groups of SENTI-LSSVM. In addition, to determine whether identifying sentiment polarities and comparative relations jointly works better than tackling each task on its own, we train SENTI-LSSVM for each task separately and combine their predictions according to compatibility rules and the corresponding graph scores.

For each domain and text genre, we withhold 15% of the documents for development and use the remainder for cross-validation. The hyperparameters of all systems are tuned on the development datasets. For all experiments with SENTI-LSSVM, we use ρ = 0.0001 for the L1 regularizer in Eq. (4) and ϕ = 0.05 for the loss function; for SENTI-SSVM, ρ = 0.0001 and ϕ = 0.01. Since the relation type of off-topic sentences is certainly other, we evaluate all systems with 5-fold cross-validation only on the on-topic sentences of the evaluation dataset. Since the same sSoR can have several equivalent MRGs and the relation type other is not of interest, we evaluate the extracted sSoRs in terms of precision, recall and F-measure. All reported numbers are averages over the 5 folds; the micro-averaged metric is sketched below.
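For concreteness, here is a minimal sketch of the micro-averaged scores over extracted sSoR instances, under our reading of the protocol (instances are compared as exact tuples and the relation type other is ignored):

```python
def micro_prf(predicted, gold):
    """Micro-averaged precision/recall/F over sets of relation tuples."""
    pred = {t for t in predicted if t[0] != "other"}
    ref = {t for t in gold if t[0] != "other"}
    tp = len(pred & ref)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: tuples of (relation, entity, attribute).
pred = {("positive", "Canon 7D", "sensor"), ("negative", "Canon 7D", "price")}
gold = {("positive", "Canon 7D", "sensor")}
print(micro_prf(pred, gold))  # (0.5, 1.0, 0.666...)
```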
9.2 Results

Table 2 shows the complete results of all systems. Our model SENTI-LSSVM outperforms all baselines in terms of average F-measure and recall by a large margin; the F-measure on movie reviews is about 14% above the best baseline. The rule-based system has higher precision than recall in most cases. However, simply increasing the coverage of the domain-independent sentiment polarity lexicon might lead to worse performance (Taboada et al., 2011), because many sentiment-oriented relations are conveyed by domain-dependent expressions and by factual expressions implying evaluations, such as "This camera does not have manual control." Compared to DING-RULE, SENTI-SSVM performs better in the camera domain but worse for movies, due to many misclassifications of negative relation instances as other. It also wrongly predicted more positive instances as other than SENTI-LSSVM did. We found that the recall for these instances is low because they often have overly similar features to instances of the type other linked to the same mentions. The problem gets worse in the movie domain since i) many sentences contain no explicit sentiment-bearing words, and ii) the prior polarity of the sentiment-bearing words does not agree with their contextual polarity in the sentences. Consider the following example from a forum post about the movie "Superman Returns": "Have a look at Superman: the Animated Series or Justice League Unlimited . . . that is how the characters of Superman and Lex Luthor should be.". In contrast, our model minimizes the overlapping features by assigning them to the most likely relation candidates, which leads to significantly better performance. Although SENTI-SSVM has low recall for both positive and negative relations, it achieves the highest recall for the comparative relation among all systems in the movie domain and on camera reviews. Since less than 1% of all instances in these document sets are comparative relations and all models are trained to optimize overall accuracy, SENTI-LSSVM tends to trade off the minority class for overall better performance. This advantage disappears on the camera forum posts, where the number of comparative relation instances is 12 times higher than in the other data sets.

All systems perform better at predicting positive relations than negative ones. This corresponds well to the empirical finding of Wilson (2008) that people tend to use more complex expressions for negative sentiments than for their affirmative counterparts. It is also in accordance with the distribution of these relations in our SRG corpus, which is randomly sampled from online documents. For the learning systems, it can also be explained by the fact that there is considerably more training data for positive relations than for negative ones. The comparative relation is the hardest to process, since we found that many of the corresponding expressions do not contain explicit keywords for comparison.

To better understand the performance of the key feature groups in our model, we remove each group from the full SENTI-LSSVM system and evaluate the variants on movie reviews and camera forum posts, which have a relatively balanced distribution of relation types. As shown in Table 3, the features from the sentiment predictors make significant contributions on both datasets.
Table 2: Evaluation results for DING-RULE, SENTI-SSVM and SENTI-LSSVM. Boldface figures are statistically significantly better than all others in the same comparison group under a t-test with p = 0.05.

                            Positive           Negative           Comparison         Micro-average
                            P     R     F      P     R     F      P     R     F      P     R     F
Camera Forum
  DING-RULE                 56.4  39.0  46.1   46.2  24.0  31.6   42.6  14.0  21.0   53.4  30.8  39.0
  SENTI-SSVM                60.2  35.6  44.8   44.2  38.5  41.2   28.0  40.1  32.9   43.7  36.7  39.9
  SENTI-LSSVM               69.2  38.9  49.8   50.8  39.3  44.3   42.6  35.1  38.5   56.5  38.0  45.4
Camera Review
  DING-RULE                 83.6  69.0  75.6   68.6  38.8  49.6   30.0  16.9  21.6   81.1  58.6  68.1
  SENTI-SSVM                72.6  75.4  74.0   63.9  62.5  63.2   28.0  38.9  32.5   68.1  70.4  69.3
  SENTI-LSSVM               77.3  85.4  81.2   68.9  61.3  64.9   22.3  20.7  21.6   73.1  73.4  73.7
Movie Forum
  DING-RULE                 63.7  37.4  47.1   27.6  34.3  30.6    8.9   5.6   6.8   48.2  35.9  41.2
  SENTI-SSVM                66.2  30.1  41.3   25.6  17.3  20.7   44.2  56.7  49.7   53.3  27.9  36.6
  SENTI-LSSVM               63.3  44.2  52.1   29.7  45.6  36.0   40.1  45.0  42.4   49.7  44.6  47.0
Movie Review
  DING-RULE                 66.5  47.2  55.2   42.0  39.1  40.5   31.4  12.0  17.4   56.2  44.0  49.4
  SENTI-SSVM                61.3  54.0  57.4   45.2  13.7  21.1   24.5  63.3  35.3   54.6  39.2  45.7
  SENTI-LSSVM               59.0  79.1  67.6   53.3  51.4  52.3   28.3  34.0  30.9   57.9  68.8  62.9

Table 3: Micro-average F-measure of SENTI-LSSVM with different feature models.

Feature model         Movie Reviews    Camera Forums
full system           62.9             45.4
¬unigram              63.2 (+0.3)      41.2 (-4.2)
¬context              54.5 (-8.4)      46.0 (+0.6)
¬co-occurrence        62.6 (-0.3)      44.9 (-0.5)
¬senti-predictors     61.3 (-1.6)      34.3 (-11.1)

The different drops in performance indicate that the polarities predicted by the rules are more consistent in camera forum posts than in movie reviews. Due to the complexity of the expressions in movie reviews, our model cannot benefit from the unigram features, but these features are a good complement to the sentiment predictor features on camera forum posts. The sharp drop caused by removing the context features on movie reviews indicates that the sentiments in movie reviews depend strongly on the relations of the previous sentences. In contrast, the sentiment-oriented relations of the previous sentences can be a source of overfitting for the camera forum data. The edge co-occurrence features do not play an important role in our model, since the number of co-occurring sentiment-oriented relations in sentences with contrast conjunctions like "but" is small. However, we found that allowing the co-occurrence of arbitrary sentiment-oriented relations would harm the performance of the model.

In addition, our experiments showed that the separated approach, which trains one model for sentiment polarities and one for comparative relations, decreases the F-measure averaged over all four datasets by almost 1%. The largest drop in F-measure, 3%, occurs on camera forum posts, since this dataset contains the largest proportion of comparative relations. We found that errors increase when the separately trained models make conflicting predictions. The joint approach can take all factors into account and make more consistent decisions than the separated approach.

10 Conclusion

We proposed the SENTI-LSSVM model for extracting instances of both sentiment polarities and comparative relations. For evaluating and training the model, we created an SRG corpus using a lightweight annotation scheme.
We showed that our model can automatically find textual evidence to support its relation predictions and that it achieves significantly better F-measure scores than alternative state-of-the-art methods.

References

Yejin Choi and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP '09, pages 590–598, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yejin Choi and Claire Cardie. 2010. Hierarchical sequential learning for extracting opinions and their attributes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 269–274. Association for Computational Linguistics.

Yejin Choi, Eric Breck, and Claire Cardie. 2006. Joint extraction of entities and relations for opinion recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 431–439, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213.

Xiaowen Ding, Bing Liu, and Philip S. Yu. 2008. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 231–240, New York, NY, USA. ACM.

Xiaowen Ding, Bing Liu, and Lei Zhang. 2009. Entity discovery and assignment for opinion mining applications. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1125–1134.

John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research, 10:2899–2934.

Murthy Ganapathibhotla and Bing Liu. 2008. Mining opinions in comparative sentences. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 241–248, Stroudsburg, PA, USA. Association for Computational Linguistics.

Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. 2007. Large-scale sentiment analysis for news and blogs (system demonstration). In Proceedings of the International AAAI Conference on Weblogs and Social Media.

John E. Hopcroft and Richard M. Karp. 1973. An n^(5/2) algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, New York, NY, USA. ACM.

Wei Jin, Hung Hay Ho, and Rohini K. Srihari. 2009. OpinionMiner: a novel machine learning system for web opinion mining and extraction. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1204, New York, NY, USA. ACM.

Nitin Jindal and Bing Liu. 2006. Mining comparative sentences and relations. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI'06, pages 1331–1336. AAAI Press.

Richard Johansson and Alessandro Moschitti. 2011. Extracting opinion expressions and their polarities - exploration of pipelines and joint models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 101–106.
Jason S. Kessler, Miriam Eckert, Lyndsie Clark, and Nicolas Nicolov. 2010. The 2010 ICWSM JDPA sentiment corpus for the automotive domain. In 4th International AAAI Conference on Weblogs and Social Media Data Workshop Challenge (ICWSM-DWC 2010).

Soo-Min Kim and Eduard Hovy. 2006. Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of the Workshop on Sentiment and Subjectivity in Text, SST '06, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, ACL '03, pages 423–430, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. 2006. Opinion extraction, summarization and tracking in news and blog corpora. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 100–107.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion Observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342–351, New York, NY, USA. ACM.

André L. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 342–350.

Ryan T. McDonald, Kerry Hannan, Tyler Neylon, Mike Wells, and Jeffrey C. Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2007. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 339–346, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lizhen Qu, Georgiana Ifrim, and Gerhard Weikum. 2010. The bag-of-opinions method for review rating prediction from sparse text patterns. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 913–921, Beijing, China. Tsinghua University Press.

Lizhen Qu, Rainer Gemulla, and Gerhard Weikum. 2012. A weakly supervised model for sentence-level semantic orientation analysis with multiple experts. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 149–159, Jeju Island, Korea. Association for Computational Linguistics.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1201–1211.

Swapna Somasundaran and Janyce Wiebe. 2009. Recognizing stances in online debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 226–234.
Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly D. Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Oscar Täckström and Ryan McDonald. 2011. Discovering fine-grained sentiment with latent variable structured prediction models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR'11, pages 368–374, Berlin, Heidelberg. Springer-Verlag.

Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 575–584, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning, pages 104–112.

Wei Wei and Jon Atle Gulla. 2010. Sentiment learning on product reviews via sentiment ontology tree. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 404–413.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 347–354, Stroudsburg, PA, USA. Association for Computational Linguistics.

Theresa Ann Wilson. 2008. Fine-grained subjectivity and sentiment analysis: recognizing the intensity, polarity, and attitudes of private states. Ph.D. thesis, University of Pittsburgh.

Yuanbin Wu, Qi Zhang, Xuanjing Huang, and Lide Wu. 2011. Structural opinion mining for graph-based sentiment representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1332–1341.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 172–182.

Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the International Conference on Machine Learning, page 147.

Ning Yu and Sandra Kübler. 2011. Filling the gap: Semi-supervised learning for opinion detection across domains. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 200–209. Association for Computational Linguistics.