Cross-Sentence N-ary Relation Extraction with Graph LSTMs

Nanyun Peng1∗ Hoifung Poon2 Chris Quirk2 Kristina Toutanova3∗ Wen-tau Yih2
1 Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University, Baltimore, MD, USA
2 Microsoft Research, Redmond, WA, USA
3 Google Research, Seattle, WA, USA
npeng1@jhu.edu, kristout@google.com, {hoifung,chrisq,scottyih}@microsoft.com
∗ This research was conducted when the authors were at Microsoft Research.

Abstract

Past work in relation extraction has focused on binary relations in single sentences. Recent NLP inroads in high-value domains have sparked interest in the more general setting of extracting n-ary relations that span multiple sentences. In this paper, we explore a general relation extraction framework based on graph long short-term memory networks (graph LSTMs) that can be easily extended to cross-sentence n-ary relation extraction. The graph formulation provides a unified way of exploring different LSTM approaches and incorporating various intra-sentential and inter-sentential dependencies, such as sequential, syntactic, and discourse relations. A robust contextual representation is learned for the entities, which serves as input to the relation classifier. This simplifies handling of relations with arbitrary arity, and enables multi-task learning with related relations. We evaluate this framework in two important precision medicine settings, demonstrating its effectiveness with both conventional supervised learning and distant supervision. Cross-sentence extraction produced larger knowledge bases, and multi-task learning significantly improved extraction accuracy. A thorough analysis of various LSTM approaches yielded useful insight into the impact of linguistic analysis on extraction accuracy.

1 Introduction

Relation extraction has made great strides in newswire and Web domains. Recently, there has been increasing interest in applying relation extraction to high-value domains such as biomedicine. The advent of the $1000 human genome1 heralds the dawn of precision medicine, but progress in personalized cancer treatment has been hindered by the arduous task of interpreting genomic data using prior knowledge. For example, given a tumor sequence, a molecular tumor board needs to determine which genes and mutations are important, and what drugs are available to treat them. Already the research literature has a wealth of relevant knowledge, and it is growing at an astonishing rate. PubMed2, the online repository of biomedical articles, adds two new papers per minute, or one million each year. It is thus imperative to advance relation extraction for machine reading.

In the vast literature on relation extraction, past work focused primarily on binary relations in single sentences, limiting the available information. Consider the following example: “The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.” Collectively, the two sentences convey the fact that there is a ternary interaction between the three entities in bold, which is not expressed in either sentence alone: namely, tumors with the L858E mutation in the EGFR gene can be treated with gefitinib. Extracting such knowledge clearly requires moving beyond binary relations and single sentences.

N-ary relations and cross-sentence extraction have received relatively little attention in the past.
Prior work on n-ary relation extraction focused on single sentences (Palmer et al., 2005; McDonald et al., 2005) or entity-centric attributes that can be extracted largely independently (Chinchor, 1998; Surdeanu and Heng, 2014). Prior work on cross-sentence extraction often used coreference to gain access to arguments in a different sentence (Gerber and Chai, 2010; Yoshikawa et al., 2011), without truly modeling inter-sentential relational patterns. (See Section 7 for a more detailed discussion.) A notable exception is Quirk and Poon (2017), which applied distant supervision to general cross-sentence relation extraction, but was limited to binary relations.

1 http://www.illumina.com/systems/hiseq-x-sequencing-system.html
2 https://www.ncbi.nlm.nih.gov/pubmed

Figure 1: An example document graph for a pair of sentences expressing a ternary interaction (tumors with the L858E mutation in the EGFR gene respond to gefitinib treatment). For simplicity, we omit edges between adjacent words and edges representing discourse relations.

In this paper, we explore a general framework for cross-sentence n-ary relation extraction, based on graph long short-term memory networks (graph LSTMs). By adopting the graph formulation, our framework subsumes prior approaches based on chain or tree LSTMs, and can incorporate a rich set of linguistic analyses to aid relation extraction. Relation classification takes as input the entity representations learned from the entire text, and can be easily extended to arbitrary relation arity n. This approach also facilitates joint learning with kindred relations where the supervision signal is more abundant.

We conducted extensive experiments on two important domains in precision medicine. In both distant supervision and supervised learning settings, graph LSTMs that encode rich linguistic knowledge outperformed other neural network variants, as well as a well-engineered feature-based classifier. Multi-task learning with sub-relations led to further improvement. Syntactic analysis conferred a significant benefit to the performance of graph LSTMs, especially when the syntactic analysis was accurate.

In the molecular tumor board domain, PubMed-scale extraction using distant supervision from a small set of known interactions produced orders of magnitude more knowledge, and cross-sentence extraction tripled the yield compared to single-sentence extraction. Manual evaluation verified that the accuracy is high despite the lack of annotated examples.

2 Cross-sentence n-ary relation extraction

Let e1, . . . , em be entity mentions in text T. Relation extraction can be formulated as a classification problem of determining whether a relation R holds for e1, . . . , em in T. For example, given a cancer patient with mutation v in gene g, a molecular tumor board seeks to find out whether this type of cancer would respond to drug d.
Literature with such knowledge has been growing rapidly; we can help the tumor board by checking whether the Respond relation holds for the (d, g, v) triple. Traditional relation extraction methods focus on binary relations where all entities occur in the same sentence (i.e., m = 2 and T is a sentence), and cannot handle the aforementioned ternary relations. Moreover, as we focus on more complex relations and n increases, it becomes increasingly rare that the related entities will be contained entirely in a single sentence. In this paper, we generalize extraction to cross-sentence, n-ary relations, where m > 2 and T can contain multiple sentences. As will be shown in our experiments section, n-ary relations are crucial for high-value domains such as biomedicine, and expanding beyond the sentence boundary enables the extraction of more knowledge.

In the standard binary-relation setting, the dominant approaches are generally defined in terms of the shortest dependency path between the two entities in question, either by deriving rich features from the path or by modeling it using deep neural networks. Generalizing this paradigm to the n-ary setting is challenging, as there are n(n − 1)/2 shortest paths, one for each pair of entities. One apparent solution is inspired by Davidsonian semantics: first, identify a single trigger phrase that signifies the whole relation, then reduce the n-ary relation to n binary relations between the trigger and an argument. However, challenges remain. It is often hard to specify a single trigger, as the relation is manifested by several words, often not contiguous. Moreover, it is expensive and time-consuming to annotate training examples, especially if triggers are required, as is evident in prior annotation efforts such as GENIA (Kim et al., 2009). The realistic and widely adopted paradigm is to leverage indirect supervision, such as distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009), where triggers are not available.

Additionally, lexical and syntactic patterns signifying the relation will be sparse. To handle such sparsity, traditional feature-based approaches require extensive engineering and large amounts of data. Unfortunately, this challenge becomes much more severe in cross-sentence extraction, where the text spans multiple sentences.

To overcome these challenges, we explore a general relation extraction framework based on graph LSTMs. By learning a continuous representation for words and entities, LSTMs can handle sparsity effectively without requiring intense feature engineering. The graph formulation subsumes prior LSTM approaches based on chains or trees, and can incorporate rich linguistic analyses.

This approach also opens up opportunities for joint learning with related relations. For example, the Respond relation over (d, g, v) also implies a binary sub-relation over drug d and mutation v, with the gene underspecified. Even with distant supervision, the supervision signal for n-ary relations will likely be sparser than that for their binary sub-relations. Our approach makes it very easy to use multi-task learning over both the n-ary relations and their sub-relations.

3 Graph LSTMs

Learning a continuous representation can be effective for dealing with lexical and syntactic sparsity. For sequential data such as text, recurrent neural networks (RNNs) are quite popular.
They resemble hidden Markov models (HMMs), except that discrete hidden states are replaced with continuous vectors, and emission and transition probabilities with neural networks. Conventional RNNs with sigmoid units suffer from gradient diffusion or explosion, making training very difficult (Bengio et al., 1994; Pascanu et al., 2013). Long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) combat these problems by using a series of gates (input, forget and output) to avoid amplifying or suppressing gradients during backpropagation. Consequently, LSTMs are much more effective in capturing long-distance dependencies, and have been applied to a variety of NLP tasks. However, most approaches are based on linear chains and only explicitly model the linear context, which ignores a variety of linguistic analyses, such as syntactic and discourse dependencies.

In this section, we propose a general framework that generalizes LSTMs to graphs. While there is some prior work on learning tree LSTMs (Tai et al., 2015; Miwa and Bansal, 2016), to the best of our knowledge, graph LSTMs have not been applied to any NLP task yet.

Figure 2 shows the architecture of this approach. The input layer is the word embedding of the input text. Next is the graph LSTM, which learns a contextual representation for each word. For the entities in question, their contextual representations are concatenated and become the input to the relation classifiers. For a multi-word entity, we simply used the average of its word representations and leave the exploration of more sophisticated aggregation approaches to future work. The layers are trained jointly with backpropagation. This framework is agnostic to the choice of classifiers. Jointly designing classifiers with graph LSTMs would be interesting future work.

Figure 2: A general architecture for cross-sentence n-ary relation extraction based on graph LSTMs.

At the core of the graph LSTM is a document graph that captures various dependencies among the input words. By choosing which dependencies to include in the document graph, graph LSTMs naturally subsume linear-chain or tree LSTMs.

Compared to conventional LSTMs, the graph formulation presents new challenges. Due to potential cycles in the graph, a straightforward implementation of backpropagation might require many iterations to reach a fixed point. Moreover, in the presence of a potentially large number of edge types (adjacent-word, syntactic dependency, etc.), parametrization becomes a key problem.

In the remainder of this section, we first introduce the document graph and show how to conduct backpropagation in graph LSTMs. We then discuss two strategies for parametrizing the recurrent units. Finally, we show how to conduct multi-task learning with this framework.
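As a concrete reference point before the individual components are introduced, the following minimal sketch traces the pipeline of Figure 2, assuming the contextual word states have already been produced by the graph LSTM encoder. The function names, dimensions, and random inputs are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def entity_representation(hidden_states, mention_token_idxs):
    """Average the contextual word states of a (possibly multi-word) entity mention."""
    return np.mean(hidden_states[mention_token_idxs], axis=0)

def relation_probability(hidden_states, mentions, W, b):
    """Concatenate the n entity representations and apply a logistic-regression classifier.

    hidden_states: (num_tokens, hidden_dim) contextual representations from the graph LSTM
    mentions:      list of n token-index lists, one per entity (e.g., drug, gene, mutation)
    W, b:          classifier weight vector of shape (n * hidden_dim,) and scalar bias
    """
    features = np.concatenate([entity_representation(hidden_states, m) for m in mentions])
    logit = features @ W + b
    return 1.0 / (1.0 + np.exp(-logit))  # probability that the relation holds

# Illustrative usage with random hidden states for a 12-token text and a ternary candidate.
rng = np.random.default_rng(0)
H = rng.normal(size=(12, 150))            # 150-dimensional hidden vectors (see Section 4)
mentions = [[5], [8, 9], [11]]            # drug, gene (a two-token mention), mutation
W, b = rng.normal(size=3 * 150), 0.0
print(relation_probability(H, mentions, W, b))
```

In the full model, the classifier loss is backpropagated through the concatenation into the graph LSTM and the word embeddings, so all layers are trained jointly.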
3.1 Document Graph

To model the various dependencies from linguistic analysis at our disposal, we follow Quirk and Poon (2017) and introduce a document graph to capture intra- and inter-sentential dependencies. A document graph consists of nodes that represent words and edges that represent various dependencies such as linear context (adjacent words), syntactic dependencies, and discourse relations (Lee et al., 2013; Xue et al., 2015). Figure 1 shows the document graph for our running example; this instance suggests that tumors with the L858E mutation in the EGFR gene respond to the drug gefitinib.

This document graph acts as the backbone upon which a graph LSTM is constructed. If it contains only edges between adjacent words, we recover linear-chain LSTMs. Similarly, other prior LSTM approaches can be captured in this framework by restricting edges to those in the shortest dependency path or the parse tree.

3.2 Backpropagation in Graph LSTMs

Conventional LSTMs are essentially very deep feed-forward neural networks. For example, a left-to-right linear LSTM has one hidden vector for each word. This vector is generated by a neural network (recurrent unit) that takes as input the embedding of the given word and the hidden vector of the previous word. In discriminative learning, these hidden vectors then serve as input for the end classifiers, from which gradients are backpropagated through the whole network.

Generalizing such a strategy to graphs with cycles typically requires unrolling the recurrence for a number of steps (Scarselli et al., 2009; Li et al., 2016; Liang et al., 2016). Essentially, a copy of the graph is created for each step, serving as input for the next. The result is a feed-forward neural network through time, and backpropagation is conducted accordingly. In principle, we could adopt the same strategy. Effectively, gradients are backpropagated in a manner similar to loopy belief propagation (LBP). However, this makes learning much more expensive, as each update step requires multiple iterations of backpropagation. Moreover, loopy backpropagation could suffer from the same problems encountered in LBP, such as oscillation or failure to converge.

We observe that dependencies such as coreference and discourse relations are generally sparse, so the backbone of a document graph consists of the linear chain and the syntactic dependency tree. As in belief propagation, such structures can be leveraged to make backpropagation more efficient by replacing synchronous updates, as in the unrolling strategy, with asynchronous updates, as in linear-chain LSTMs. This opens up opportunities for a variety of strategies in ordering backpropagation updates.

In this paper, we adopt a simple strategy that performed quite well in preliminary experiments, and leave further exploration to future work. Specifically, we partition the document graph into two directed acyclic graphs (DAGs). One DAG contains the left-to-right linear chain, as well as the other forward-pointing dependencies. The other DAG covers the right-to-left linear chain and the backward-pointing dependencies. Figure 3 illustrates this strategy. Effectively, we partition the original graph into a forward pass (left to right), followed by a backward pass (right to left), and construct the LSTMs accordingly. When the document graph contains only linear-chain edges, the graph LSTM is exactly a bidirectional LSTM (BiLSTM).

Figure 3: The graph LSTMs used in this paper. The document graph (top) is partitioned into two directed acyclic graphs (bottom); the graph LSTM is constructed by a forward pass (left to right) followed by a backward pass (right to left). Note that information goes from dependency child to parent.
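The partition just described can be sketched as follows; the edge-list representation and helper names are our own assumptions for illustration.

```python
from collections import defaultdict

def partition_document_graph(edges):
    """Split document-graph edges into the two DAGs of Section 3.2.

    edges: iterable of (i, j, edge_type) triples over token positions, covering
           adjacent-word links, syntactic dependencies, discourse relations, etc.
    Returns (forward, backward): maps from a node to its (predecessor, edge_type) list.
    The forward DAG keeps left-to-right edges and is processed in increasing token
    order; the backward DAG keeps right-to-left edges and is processed in decreasing
    order, mirroring the two passes of a bidirectional LSTM.
    """
    forward, backward = defaultdict(list), defaultdict(list)
    for i, j, etype in edges:
        if i < j:
            forward[j].append((i, etype))
        elif i > j:
            backward[j].append((i, etype))
    return forward, backward

# Toy five-token example mixing adjacency, a dependency, and a discourse-style link.
edges = [(0, 1, "adj"), (1, 2, "adj"), (2, 3, "adj"), (3, 4, "adj"),
         (3, 1, "dep"), (0, 4, "discourse")]
fwd, bwd = partition_document_graph(edges)
print(dict(fwd))  # {1: [(0, 'adj')], 2: [(1, 'adj')], 3: [(2, 'adj')], 4: [(3, 'adj'), (0, 'discourse')]}
print(dict(bwd))  # {1: [(3, 'dep')]}
```

In practice, adjacency edges would be included in both directions so that every token has at least one predecessor in each pass.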
3.3 The Basic Recurrent Propagation Unit

A standard LSTM unit consists of an input vector (word embedding), a memory cell and an output vector (contextual representation), as well as several gates. The input gate and output gate control the information flowing into and out of the cell, whereas the forget gate can optionally remove information from the recurrent connection to a predecessor unit. In linear-chain LSTMs, each unit contains only one forget gate, as it has only one direct predecessor (i.e., the adjacent-word edge pointing to the previous word). In graph LSTMs, however, a unit may have several predecessors, including connections to the same word via different edges. We thus introduce a forget gate for each predecessor, similar to the approach taken by Tai et al. (2015) for tree LSTMs.

Encoding rich linguistic analysis introduces many distinct edge types besides word adjacency, such as syntactic dependencies, which opens up many possibilities for parametrization. This was not considered in prior syntax-aware LSTM approaches (Tai et al., 2015; Miwa and Bansal, 2016). In this paper, we explore two schemes that introduce more fine-grained parameters based on the edge types.

Full Parametrization  Our first proposal simply introduces a different set of parameters for each edge type, with the computation specified below:

$$
\begin{aligned}
i_t &= \sigma\Big(W_i x_t + \sum_{j \in P(t)} U_i^{m(t,j)} h_j + b_i\Big) \\
o_t &= \sigma\Big(W_o x_t + \sum_{j \in P(t)} U_o^{m(t,j)} h_j + b_o\Big) \\
\tilde{c}_t &= \tanh\Big(W_c x_t + \sum_{j \in P(t)} U_c^{m(t,j)} h_j + b_c\Big) \\
f_{tj} &= \sigma\big(W_f x_t + U_f^{m(t,j)} h_j + b_f\big) \\
c_t &= i_t \odot \tilde{c}_t + \sum_{j \in P(t)} f_{tj} \odot c_j \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

As in standard chain LSTMs, x_t is the input word vector for node t, h_t is the hidden state vector for node t, the W's are the input weight matrices, and the b's are the bias vectors. σ, tanh, and ⊙ represent the sigmoid function, the hyperbolic tangent function, and the Hadamard product (pointwise multiplication), respectively. The main differences lie in the recurrence terms. In graph LSTMs, a unit might have multiple predecessors P(t); for each predecessor j there is a forget gate f_tj and a typed weight matrix U^{m(t,j)}, where m(t, j) signifies the connection type between t and j. The input and output gates (i_t, o_t) depend on all predecessors, whereas the forget gate f_tj depends only on the predecessor with which the gate is associated. c_t and c̃_t represent intermediate computation results within the memory cell, which take into account the input and forget gates, and will be combined with the output gate to produce the hidden representation h_t.

Full parametrization is straightforward, but it requires a large number of parameters when there are many edge types. For example, there are dozens of syntactic edge types, each corresponding to a Stanford dependency label. As a result, in our experiments we resort to using only the coarse-grained types: word adjacency, syntactic dependency, etc. Next, we will consider a more fine-grained approach by learning an edge-type embedding.
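Before turning to the edge-type embedding, here is a minimal numpy sketch of the fully parametrized unit defined above. The dictionary-based weight layout and the shapes are assumptions made for readability, not the Theano implementation used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graph_lstm_unit(x_t, predecessors, W, U, b):
    """One step of the fully parametrized graph LSTM unit (Section 3.3).

    x_t:          input word embedding for node t, shape (input_dim,)
    predecessors: list of (h_j, c_j, edge_type) triples, one per predecessor j in P(t)
    W:            input weight matrices {'i','o','c','f'}, each of shape (hidden, input)
    U:            typed recurrent weights: U[gate][edge_type] is a (hidden, hidden) matrix
    b:            bias vectors {'i','o','c','f'}, each of shape (hidden,)
    Returns the new hidden state h_t and memory cell c_t.
    """
    # Typed recurrent contributions, summed over all predecessors.
    rec = {g: sum(U[g][etype] @ h_j for h_j, _, etype in predecessors)
           for g in ('i', 'o', 'c')}
    i_t = sigmoid(W['i'] @ x_t + rec['i'] + b['i'])
    o_t = sigmoid(W['o'] @ x_t + rec['o'] + b['o'])
    c_tilde = np.tanh(W['c'] @ x_t + rec['c'] + b['c'])
    # One forget gate per predecessor, parametrized by the edge type connecting it to t.
    c_t = i_t * c_tilde
    for h_j, c_j, etype in predecessors:
        f_tj = sigmoid(W['f'] @ x_t + U['f'][etype] @ h_j + b['f'])
        c_t = c_t + f_tj * c_j
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Example with two predecessors reached via different edge types.
hdim, idim = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(hdim, idim)) for g in 'iocf'}
U = {g: {t: rng.normal(size=(hdim, hdim)) for t in ('adjacency', 'dependency')} for g in 'iocf'}
b = {g: np.zeros(hdim) for g in 'iocf'}
preds = [(rng.normal(size=hdim), rng.normal(size=hdim), 'adjacency'),
         (rng.normal(size=hdim), rng.normal(size=hdim), 'dependency')]
h_t, c_t = graph_lstm_unit(rng.normal(size=idim), preds, W, U, b)
```

The edge-type embedding scheme described next keeps the same gate structure but replaces the per-type matrix lookup with a tensor contraction against a learned edge-type embedding.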
Edge-Type Embedding  To reduce the number of parameters and leverage potential correlation among fine-grained edge types, we learned a low-dimensional embedding of the edge types, and computed an outer product of the predecessor's hidden vector and the edge-type embedding to generate a "typed hidden representation", which is a matrix. The new computation is as follows:

$$
\begin{aligned}
i_t &= \sigma\Big(W_i x_t + \sum_{j \in P(t)} U_i \times_T (h_j \otimes e_j) + b_i\Big) \\
f_{tj} &= \sigma\big(W_f x_t + U_f \times_T (h_j \otimes e_j) + b_f\big) \\
o_t &= \sigma\Big(W_o x_t + \sum_{j \in P(t)} U_o \times_T (h_j \otimes e_j) + b_o\Big) \\
\tilde{c}_t &= \tanh\Big(W_c x_t + \sum_{j \in P(t)} U_c \times_T (h_j \otimes e_j) + b_c\Big) \\
c_t &= i_t \odot \tilde{c}_t + \sum_{j \in P(t)} f_{tj} \odot c_j \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The U's are now l × l × d tensors (l is the dimension of the hidden vector and d is the dimension of the edge-type embedding), and h_j ⊗ e_j is a tensor product that produces an l × d matrix. ×_T denotes a tensor dot product defined as T ×_T A = Σ_d (T_{:,:,d} · A_{:,d}), which produces an l-dimensional vector. The edge-type embedding e_j is jointly trained with the other parameters.
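The tensor operations above can be sanity-checked with a short numpy sketch; the dimensions match the hidden and edge-embedding sizes used later in Section 4 (l = 150, d = 3), and the function names are ours.

```python
import numpy as np

def typed_recurrent_term(U, h_j, e_j):
    """Compute U ×_T (h_j ⊗ e_j) for one gate and one predecessor.

    U:   (l, l, d) weight tensor
    h_j: (l,) hidden vector of predecessor j
    e_j: (d,) embedding of the edge type connecting j to t
    """
    typed_hidden = np.outer(h_j, e_j)               # h_j ⊗ e_j, an (l, d) matrix
    return np.einsum('abd,bd->a', U, typed_hidden)  # sum_d U[:, :, d] @ typed_hidden[:, d]

def typed_recurrent_term_fused(U, h_j, e_j):
    """Same contraction without materializing the outer product."""
    return np.einsum('abd,b,d->a', U, h_j, e_j)

l, d = 150, 3
rng = np.random.default_rng(0)
U, h, e = rng.normal(size=(l, l, d)), rng.normal(size=l), rng.normal(size=d)
assert np.allclose(typed_recurrent_term(U, h, e), typed_recurrent_term_fused(U, h, e))
print(typed_recurrent_term(U, h, e).shape)          # (150,): an l-dimensional vector, as stated
```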
3.4 Comparison with Prior LSTM Approaches

The main advantages of the graph formulation are its generality and flexibility. As seen in Section 3.1, linear-chain LSTMs are a special case in which the document graph is the linear chain of adjacent words. Similarly, tree LSTMs (Tai et al., 2015) are a special case in which the document graph is the parse tree.

In graph LSTMs, the encoding of linguistic knowledge is factored from the backpropagation strategy (Section 3.2), which makes the formulation much more flexible; for example, cycles can be introduced. Miwa and Bansal (2016) conducted joint entity and binary relation extraction by stacking an LSTM for relation extraction on top of another LSTM for entity recognition. In graph LSTMs, the two can be combined seamlessly using a document graph comprising both the word-adjacency chain and the dependency path between the two entities.

The document graph can also incorporate other linguistic information. For example, coreference and discourse parsing are intuitively relevant for cross-sentence relation extraction. Although existing systems have not yet been shown to improve cross-sentence relation extraction (Quirk and Poon, 2017), it remains an important future direction to explore incorporating such analyses, especially after adapting them to the biomedical domain (Bell et al., 2016).

3.5 Multi-task Learning with Sub-relations

Multi-task learning has been shown to be beneficial in training neural networks (Caruana, 1998; Collobert and Weston, 2008; Peng and Dredze, 2016). By learning contextual entity representations, our framework makes it straightforward to conduct multi-task learning. The only change is to add a separate classifier for each related auxiliary relation. All classifiers share the same graph LSTM representation learner and word embeddings, and can potentially help each other by pooling their supervision signals. In the molecular tumor board domain, we applied this paradigm to joint learning of both the ternary relation (drug-gene-mutation) and its binary sub-relation (drug-mutation). Experimental results show that this provides significant gains in both tasks.

4 Implementation Details

We implemented our methods using the Theano library (Theano Development Team, 2016). We used logistic regression for our relation classifiers. Hyperparameters were set based on preliminary experiments on a small development dataset. Training was done using mini-batched stochastic gradient descent (SGD) with batch size 8. We used a learning rate of 0.02 and trained for at most 30 epochs, with early stopping based on development data (Caruana et al., 2001; Graves et al., 2013). The dimension of the hidden vectors in LSTM units was set to 150, and the dimension of the edge-type embedding was set to 3. The word embeddings were initialized with the publicly available 100-dimensional GloVe word vectors trained on 6 billion words from Wikipedia and web text3 (Pennington et al., 2014). Other model parameters were initialized with random samples drawn uniformly from the range [−1, 1].

In multi-task training, we alternated among all tasks, each time passing through all the data for one task4 and updating the parameters accordingly. This was repeated for 30 epochs.

3 http://nlp.stanford.edu/projects/glove/
4 However, drug-gene pairs have much more data, so we sub-sampled the instances down to the same size as the main n-ary relation task.

5 Domain: Molecular Tumor Boards

Our main experiments focus on extracting ternary interactions over drugs, genes and mutations, which is important for molecular tumor boards. A drug-gene-mutation interaction is broadly construed as an association between the drug's efficacy and the mutation in the given gene. There is no annotated dataset for this problem. However, due to the importance of such knowledge, oncologists have been painstakingly curating known relations from reading papers. Such a manual approach cannot keep up with the rapid growth of the research literature, and the coverage is generally sparse and not up to date. However, the curated knowledge can be used for distant supervision.

5.1 Datasets

We obtained biomedical literature from PubMed Central5, consisting of approximately one million full-text articles as of 2015. Note that only a fraction of papers contain knowledge about drug-gene-mutation interactions. Extracting such knowledge from the vast body of biomedical papers is exactly the challenge. As we will see in later subsections, distant supervision enables us to generate a sizable training set from a small number of manually curated facts, and the learned model was able to extract orders of magnitude more facts. In future work, we will explore incorporating more known facts for distant supervision and extracting from more full-text articles.

We conducted tokenization, part-of-speech tagging, and syntactic parsing using SPLAT (Quirk et al., 2012), and obtained Stanford dependencies (de Marneffe et al., 2006) using Stanford CoreNLP (Manning et al., 2014). We used the entity taggers from Literome (Poon et al., 2014) to identify drug, gene and mutation mentions.

We used the Gene Drug Knowledge Database (GDKD) (Dienstmann et al., 2015) and the Clinical Interpretations of Variants In Cancer (CIVIC) knowledge base6 for distant supervision. The knowledge bases distinguish fine-grained interaction types, which we do not use in this paper.

5 http://www.ncbi.nlm.nih.gov/pmc/
6 http://civic.genome.wustl.edu

5.2 Distant Supervision

After identifying drug, gene and mutation mentions in the text, co-occurring triples with known interactions were chosen as positive examples. However, unlike the single-sentence setting in standard distant supervision, care must be taken in selecting the candidates. Since the triples can reside in different sentences, an unrestricted selection of text spans would risk introducing many obviously wrong examples. We thus followed Quirk and Poon (2017) in restricting the candidates to those occurring in a minimal span; i.e., we retain a candidate only if there is no other co-occurrence of the same entities in an overlapping text span with a smaller number of consecutive sentences. Furthermore, we avoid picking unlikely candidates where the triples are far apart in the document. Specifically, we considered entity triples within K consecutive sentences, ignoring paragraph boundaries. K = 1 corresponds to the baseline of extraction within single sentences. We explored K ≤ 3, which captured a large fraction of candidates without introducing many unlikely ones.
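A sketch of this candidate selection follows, under the assumption that entity mentions arrive as (entity name, sentence index) pairs from the taggers; the helper names are illustrative, and deduplication of identical spans is omitted.

```python
from itertools import product

def candidate_triples(drug_mentions, gene_mentions, mut_mentions, K=3):
    """Select (drug, gene, mutation) co-occurrences within K consecutive sentences,
    keeping only minimal spans, as in Section 5.2.

    Each *_mentions argument is a list of (entity_name, sentence_index) pairs.
    Returns a set of ((drug, gene, mutation), first_sentence, last_sentence) candidates.
    """
    spans = set()
    for (d, sd), (g, sg), (m, sm) in product(drug_mentions, gene_mentions, mut_mentions):
        lo, hi = min(sd, sg, sm), max(sd, sg, sm)
        if hi - lo < K:                 # all three mentions fall within K consecutive sentences
            spans.add(((d, g, m), lo, hi))
    kept = set()
    for triple, lo, hi in spans:
        # Minimal-span restriction: drop the candidate if the same entity triple also
        # co-occurs in an overlapping span covering fewer consecutive sentences.
        dominated = any(t == triple and o_lo <= hi and lo <= o_hi
                        and (o_hi - o_lo) < (hi - lo)
                        for t, o_lo, o_hi in spans)
        if not dominated:
            kept.add((triple, lo, hi))
    return kept

# The drug mention in sentence 1 makes the wider sentence 0-2 co-occurrence non-minimal,
# so only the tighter sentence 1-2 span is kept.
drugs = [("gefitinib", 0), ("gefitinib", 1)]
genes = [("EGFR", 2)]
muts = [("L858E", 2)]
print(candidate_triples(drugs, genes, muts))
```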
Only 59 distinct drug-gene-mutation triples from the knowledge bases were matched in the text. Even from such a small set of unique triples, we obtained 3,462 ternary relation instances that can serve as positive examples. For multi-task learning, we also considered drug-gene and drug-mutation sub-relations, which yielded 137,469 drug-gene and 3,192 drug-mutation relation instances as positive examples.

We generated negative examples by randomly sampling co-occurring entity triples without known interactions, subject to the same restrictions above. We sampled the same number as positive examples to obtain a balanced dataset7.

7 We will release the dataset at http://hanover.azurewebsites.net.

5.3 Automatic Evaluation

To compare the various models in our proposed framework, we conducted five-fold cross-validation, treating the positive and negative examples from distant supervision as gold annotation. To avoid train-test contamination, all examples from a document were assigned to the same fold. Since our datasets are balanced by construction, we simply report average test accuracy on held-out folds. Obviously, the results could be noisy (e.g., entity triples not known to have an interaction might actually have one), but this evaluation is automatic and can quickly evaluate the impact of various design choices.

We evaluated two variants of graph LSTMs: "Graph LSTM-FULL" with full parametrization and "Graph LSTM-EMBED" with edge-type embedding. We compared graph LSTMs with three strong baseline systems: a well-engineered feature-based classifier (Quirk and Poon, 2017), a convolutional neural network (CNN) (Zeng et al., 2014; Santos et al., 2015; Wang et al., 2016), and a bidirectional LSTM (BiLSTM). Following Wang et al. (2016), we used input attention for the CNN and an input window size of 5. Quirk and Poon (2017) only extracted binary relations. We extended their method to ternary relations by deriving features for each entity pair (with added annotation to signify the two entity types), and pooling the features from all pairs. For binary relation extraction, prior syntax-aware approaches are directly applicable, so we also compared with a state-of-the-art tree LSTM system (Miwa and Bansal, 2016) and a BiLSTM on the shortest dependency path between the two entities (BiLSTM-Shortest-Path) (Xu et al., 2015b).
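Before turning to the results, the document-level fold assignment described above can be sketched with scikit-learn's GroupKFold; this is purely illustrative, as the paper does not state which tooling produced the splits, and the arrays here are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: each relation-candidate example carries the id of its source document,
# and all examples from one document must land in the same fold.
X = np.zeros((10, 5))                                  # stand-in feature matrix
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])           # stand-in labels
doc_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 4, 4])     # document of origin per example

fold_accuracies = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=doc_ids):
    # No document contributes to both the training data and the held-out fold.
    assert set(doc_ids[train_idx]).isdisjoint(doc_ids[test_idx])
    # ... train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx] ...
    fold_accuracies.append(1.0)                        # placeholder accuracy for the sketch
print(np.mean(fold_accuracies))                        # average test accuracy over held-out folds
```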
Table 1 shows the results for cross-sentence, ternary relation extraction.

Model                   Single-Sent.   Cross-Sent.
Feature-Based               74.7           77.7
CNN                         77.5           78.1
BiLSTM                      75.3           80.1
Graph LSTM-EMBED            76.5           80.6
Graph LSTM-FULL             77.9           80.7

Table 1: Average test accuracy in five-fold cross-validation for drug-gene-mutation ternary interactions. Feature-Based used the best performing model in Quirk and Poon (2017), with features derived from shortest paths between all entity pairs.

All neural-network-based models outperformed the feature-based classifier, illustrating their advantage in handling sparse linguistic patterns without requiring intense feature engineering. All LSTMs significantly outperformed the CNN in the cross-sentence setting, verifying the importance of capturing long-distance dependencies. The two variants of graph LSTMs performed on par with each other, though Graph LSTM-FULL had a small advantage, suggesting that further exploration of parametrization schemes could be beneficial. In particular, the edge-type embedding might improve by pretraining on unlabeled text with syntactic parses.

Both graph variants significantly outperformed BiLSTMs (p < 0.05 by McNemar's chi-square test), though the difference is small. This result is intriguing. In Quirk and Poon (2017), the best system incorporated syntactic dependencies and outperformed the linear-chain variant (Base) by a large margin. So why didn't graph LSTMs make an equally substantial gain by modeling syntactic dependencies?

One reason is that linear-chain LSTMs can already capture some of the long-distance dependencies available in syntactic parses. BiLSTMs substantially outperformed the feature-based classifier, even without explicit modeling of syntactic dependencies. The gain cannot be entirely attributed to word embeddings, as LSTMs also outperformed CNNs.

Another reason is that syntactic parsing is less accurate in the biomedical domain. Parse errors confuse the graph LSTM learner, limiting the potential for gain. In Section 6, we show supporting evidence in a domain where gold parses are available.

We also report accuracy on instances within single sentences, which exhibited a broadly similar set of trends. Note that single-sentence and cross-sentence accuracies are not directly comparable, as the test sets are different (one subsumes the other).

We conducted the same experiments on the binary sub-relation between drug-mutation pairs. Table 2 shows the results, which are similar to the ternary case: Graph LSTM-FULL consistently performed the best for both single-sentence and cross-sentence instances. BiLSTMs on the shortest path substantially underperformed BiLSTMs or graph LSTMs, losing 4–5 absolute points in accuracy, which could be attributed to the lower parsing quality in the biomedical domain. Interestingly, the state-of-the-art tree LSTMs (Miwa and Bansal, 2016) also underperformed graph LSTMs, even though they encoded essentially the same linguistic structures (word adjacency and syntactic dependency). We attribute the gain to the fact that Miwa and Bansal (2016) used separate LSTMs for the linear chain and the dependency tree, whereas graph LSTMs learned a single representation for both.

Model                   Single-Sent.   Cross-Sent.
Feature-Based               73.9           75.2
CNN                         73.0           74.9
BiLSTM                      73.9           76.0
BiLSTM-Shortest-Path        70.2           71.7
Tree LSTM                   75.9           75.9
Graph LSTM-EMBED            74.3           76.5
Graph LSTM-FULL             75.6           76.7

Table 2: Average test accuracy in five-fold cross-validation for drug-mutation binary relations, with an extra baseline using a BiLSTM on the shortest dependency path (Xu et al., 2015b; Miwa and Bansal, 2016).

To evaluate whether joint learning with sub-relations can help, we conducted multi-task learning using Graph LSTM-FULL to jointly train extractors for both the ternary interaction and the drug-mutation and drug-gene sub-relations. Table 3 shows the results. Multi-task learning resulted in a significant gain for both the ternary interaction and the drug-mutation interaction.

               Drug-Gene-Mut.   Drug-Mut.
BiLSTM              80.1            76.0
+Multi-task         82.4            78.1
Graph LSTM          80.7            76.7
+Multi-task         82.0            78.5

Table 3: Multi-task learning improved accuracy for both BiLSTMs and graph LSTMs.
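The multi-task setup can be sketched as follows, combining the shared-representation idea of Section 3.5 with the alternating training scheme of Section 4. The encoder stub, task names, and plain per-example SGD update are simplifying assumptions; in the actual system, gradients also update the shared graph LSTM and word embeddings, which this sketch omits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

TASKS = ("drug-gene-mutation", "drug-mutation", "drug-gene")

def train_multitask(data_by_task, encode, feature_dim, lr=0.02, epochs=30, seed=0):
    """One logistic-regression head per task on top of a shared encoder.

    data_by_task: {task: list of (raw_example, label)} with label in {0, 1}
    encode:       shared function mapping a raw example to a feature vector
                  (stand-in for the graph LSTM plus entity concatenation)
    """
    rng = np.random.default_rng(seed)
    heads = {task: [rng.normal(scale=0.01, size=feature_dim), 0.0] for task in TASKS}
    for _ in range(epochs):
        for task in TASKS:                              # alternate among tasks ...
            W, b = heads[task]
            for example, label in data_by_task[task]:   # ... passing through all of its data
                x = encode(example)
                p = sigmoid(x @ W + b)
                grad = p - label                        # gradient of the logistic loss w.r.t. the logit
                W, b = W - lr * grad * x, b - lr * grad
            heads[task] = [W, b]
    return heads

# Illustrative usage with a dummy encoder and tiny random datasets.
rng = np.random.default_rng(1)
encode = lambda ex: ex                                  # pretend examples are already feature vectors
data = {t: [(rng.normal(size=8), rng.integers(0, 2)) for _ in range(20)] for t in TASKS}
heads = train_multitask(data, encode, feature_dim=8)
print({t: heads[t][1] for t in TASKS})                  # learned bias per task head
```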
Interestingly, the advantage of graph LSTMs over BiLSTMs is reduced with multi-task learning, suggesting that with more supervision signal, even linear-chain LSTMs can learn to capture long-range dependencies that were made evident by parse features in graph LSTMs. Note that there are many more instances for the drug-gene interaction than for the others, so we only sampled a subset of comparable size. Therefore, we do not evaluate the performance gain for the drug-gene interaction: in practice, one would simply learn from all available data, and the sub-sampled results are not competitive.

We included coreference and discourse relations in our document graph. However, we did not observe any significant gains, similar to the observation in Quirk and Poon (2017). We leave further exploration to future work.

5.4 PubMed-Scale Extraction

Our ultimate goal is to extract all knowledge from the available text. We thus retrained the best system from the automatic evaluation (i.e., Graph LSTM-FULL) on all available data. The resulting model was then used to extract relations from all PubMed Central articles.

Table 4 shows the number of candidates and extracted interactions. With as few as 59 unique drug-gene-mutation triples from the two databases8, we learned to extract orders of magnitude more unique interactions. The results also highlight the benefit of cross-sentence extraction, which yields 3 to 5 times more relations than single-sentence extraction.

8 There are more in the databases, but these are the only ones for which we found matching instances in the text. In future work, we will explore various ways to increase the number, e.g., by matching underspecified drug classes to specific drugs.

               Single-Sent.   Cross-Sent.
Candidates         10,873        57,033
p ≥ 0.5             1,408         4,279
p ≥ 0.9               530         1,461
GDKD + CIVIC           59

Table 4: Numbers of unique drug-gene-mutation interactions extracted from PubMed Central articles, compared with those from the manually curated KBs used in distant supervision. p signifies output probability.

Table 5 presents a similar comparison on the number of unique drugs, genes, and mutations. Again, machine reading covers far more unique entities, especially with cross-sentence extraction.

                          Drug   Gene   Mut.
GDKD + CIVIC                16     12     41
Single-Sent. (p ≥ 0.9)      68    228    221
Single-Sent. (p ≥ 0.5)      93    597    476
Cross-Sent. (p ≥ 0.9)      103    512    445
Cross-Sent. (p ≥ 0.5)      144   1344   1042

Table 5: Numbers of unique drugs, genes and mutations in extractions from PubMed Central articles, in comparison with those in the manually curated Gene Drug Knowledge Database (GDKD) and Clinical Interpretations of Variants In Cancer (CIVIC) used for distant supervision. p signifies output probability.

5.5 Manual Evaluation

Our automatic evaluations are useful for comparing competing approaches, but may not reflect the true classifier precision, as the labels are noisy. Therefore, we randomly sampled extracted relation instances and asked three researchers knowledgeable in precision medicine to evaluate their correctness. For each instance, the annotators were presented with the provenance: the sentences with the drug, gene, and mutation highlighted. The annotators determined in each case whether the instance implied that the given entities were related.
Note that the evaluation does not attempt to identify whether the relationships are true or replicated in follow-up papers; rather, it focuses on whether the relationships are entailed by the text.

We focused our evaluation efforts on the cross-sentence ternary-relation setting. We considered three probability thresholds: 0.9 for a high-precision but potentially low-recall setting, 0.5, and a random sample of all candidates. In each case, 150 instances were selected, for a total of 450 annotations. A subset of 150 instances was reviewed by two annotators, and the inter-annotator agreement was 88%.

Table 6 shows that the classifier indeed filters out a large portion of potential candidates, with an estimated instance accuracy of 64% at the threshold of 0.5, and 75% at 0.9. Interestingly, LSTMs are effective at screening out many entity mention errors, presumably because they include broad contextual features.

          Precision   Entity Error   Relation Error
Random       17%          36%             47%
p ≥ 0.5      64%           7%             29%
p ≥ 0.9      75%           1%             24%

Table 6: Sample precision of drug-gene-mutation interactions extracted from PubMed Central articles. p signifies output probability.

6 Domain: Genetic Pathways

We also conducted experiments on extracting genetic pathway interactions using the GENIA Event Extraction dataset (Kim et al., 2009). This dataset contains gold syntactic parses for the sentences, which offered a unique opportunity to investigate the impact of syntactic analysis on graph LSTMs. It also allowed us to test our framework in a supervised learning setting.

The original shared task evaluated on complex, nested events for nine event types, many of which are unary relations (Kim et al., 2009). Following Poon et al. (2015), we focused on gene regulation and reduced it to binary-relation classification for a head-to-head comparison. We followed their experimental protocol by sub-sampling negative examples to be about three times the number of positive examples.

Since the dataset is not entirely balanced, we report precision, recall, and F1. We used our best-performing graph LSTM from the previous experiments. By default, automatic parses were used in the document graphs, whereas in Graph LSTM (GOLD), gold parses were used instead.

Model                Precision   Recall   F1
Poon et al. (2015)      37.5      29.9    33.2
BiLSTM                  37.6      29.4    33.0
Graph LSTM              41.4      30.0    34.8
Graph LSTM (GOLD)       43.3      30.5    35.8

Table 7: GENIA test results on the binary relation of gene regulation. Graph LSTM (GOLD) used gold syntactic parses in the document graph.

Table 7 shows the results. Once again, despite the lack of intense feature engineering, linear-chain LSTMs performed on par with the feature-based classifier (Poon et al., 2015). Graph LSTMs exhibited a more commanding advantage over linear-chain LSTMs in this domain, substantially outperforming the latter (p < 0.01 by McNemar's chi-square test). Most interestingly, graph LSTMs using gold parses significantly outperformed those using automatic parses, suggesting that encoding high-quality analysis is particularly beneficial.

7 Related Work

Most work on relation extraction has been applied to binary relations between entities in a single sentence. We first review relevant work on the single-sentence binary relation extraction task, and then review related work on n-ary and cross-sentence relation extraction.

Binary relation extraction  Traditional feature-based methods rely on carefully designed features to learn good models, and often integrate diverse sources of evidence such as word sequences and syntactic context (Kambhatla, 2004; GuoDong et al., 2005; Boschee et al., 2005; Suchanek et al., 2006; Chan and Roth, 2010; Nguyen and Grishman, 2014).
Kernel-based methods design various subsequence or tree kernels (Mooney and Bunescu, 2005; Bunescu and Mooney, 2005; Qian et al., 2008) to capture structured information. Recently, models based on neural networks have advanced the state of the art by automatically learning powerful feature representations (Xu et al., 2015a; Zhang et al., 2015; Santos et al., 2015; Xu et al., 2015b; Xu et al., 2016).

Most neural architectures resemble Figure 2, where there is a core representation learner that takes word embeddings as input and produces contextual entity representations. Such representations are then taken by relation classifiers to produce the final predictions. For effectively representing sequences of words, both convolutional (Zeng et al., 2014; Wang et al., 2016; Santos et al., 2015) and RNN-based (Zhang et al., 2015; Socher et al., 2012; Cai et al., 2016) architectures have been successful. Most of these have focused on modeling either the surface word sequence or the hierarchical syntactic structure. Miwa and Bansal (2016) proposed an architecture that benefits from both types of information, using a surface sequence layer followed by a dependency-tree sequence layer.

N-ary relation extraction  Early work on extracting relations between more than two arguments was done in MUC-7, with a focus on fact/event extraction from news articles (Chinchor, 1998). Semantic role labeling in the PropBank (Palmer et al., 2005) or FrameNet (Baker et al., 1998) style is also an instance of n-ary relation extraction, with extraction of events expressed in a single sentence. McDonald et al. (2005) extract n-ary relations in a biomedical domain by first factoring the n-ary relation into pairwise relations between all entity pairs, and then constructing maximal cliques of related entities. Recently, neural models have been applied to semantic role labeling (FitzGerald et al., 2015; Roth and Lapata, 2016). These works learned neural representations by effectively decomposing the n-ary relation into binary relations between the predicate and each argument, by embedding the dependency path between each pair, or by combining features of the two using a feed-forward network. Although some re-ranking or joint inference models have been employed, the representations of the individual arguments do not influence each other. In contrast, we propose a neural architecture that jointly represents n entity mentions, taking into account long-distance dependencies and inter-sentential information.

Cross-sentence relation extraction  Several relation extraction tasks have benefited from cross-sentence extraction, including MUC fact and event extraction (Swampillai and Stevenson, 2011), record extraction from web pages (Wick et al., 2006), extraction of facts for biomedical domains (Yoshikawa et al., 2011), and extensions of semantic role labeling to cover implicit inter-sentential arguments (Gerber and Chai, 2010). These prior works have either relied on explicit coreference annotation, or on the assumption that the whole document refers to a single coherent event, to simplify the problem and reduce the need for powerful representations of multi-sentential contexts of entity mentions.
Recently, cross-sentence relation extraction models have been learned with distant supervision, using integrated contextual evidence of diverse types without reliance on these assumptions (Quirk and Poon, 2017), but that work focused on binary relations only and explicitly engineered sparse indicator features.

Relation extraction using distant supervision  Distant supervision has been applied to extraction of binary (Mintz et al., 2009; Poon et al., 2015) and n-ary (Reschke et al., 2014; Li et al., 2015) relations, traditionally using hand-engineered features. Neural architectures have recently been applied to distantly supervised extraction of binary relations (Zeng et al., 2015). Our work is the first to propose a neural architecture for n-ary relation extraction, where the representation of a tuple of entities is not decomposable into independent representations of the individual entities or entity pairs, and which integrates diverse information from multi-sentential context. To utilize training data more effectively, we show how multi-task learning for component binary sub-relations can improve performance. Our learned representation combines information sources within a single sentence in a more integrated and generalizable fashion than prior approaches, and can also improve performance on single-sentence binary relation extraction.

8 Conclusion

We explore a general framework for cross-sentence n-ary relation extraction based on graph LSTMs. The graph formulation subsumes linear-chain and tree LSTMs and makes it easy to incorporate rich linguistic analysis. Experiments on biomedical domains showed that extraction beyond the sentence boundary produced far more knowledge, and that encoding rich linguistic knowledge provided consistent gains. While there is much room to improve in both recall and precision, our results indicate that machine reading can already be useful in precision medicine. In particular, automatically extracted facts (Section 5.4) can serve as candidates for manual curation. Instead of scanning millions of articles to curate from scratch, human curators would just quickly vet thousands of extractions. The errors identified by curators offer direct supervision to the machine reading system for continuous improvement. Therefore, the most important goal is to attain high recall and reasonable precision. Our current models are already quite capable. Future directions include: interactive learning with user feedback; improving discourse modeling in graph LSTMs; exploring other backpropagation strategies; joint learning with entity linking; and applications to other domains.

Acknowledgements

We thank Daniel Fried and Ming-Wei Chang for useful discussions, as well as the anonymous reviewers and editor-in-chief Mark Johnson for their helpful comments.

References

Collin Baker, Charles Fillmore, and John Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics.

Dane Bell, Gustave Hahn-Powell, Marco A. Valenzuela-Escarcega, and Mihai Surdeanu. 2016. An investigation of coreference phenomena in the biomedical domain. In Proceedings of the Tenth Edition of the Language Resources and Evaluation Conference.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2).
Elizabeth Boschee, Ralph Weischedel, and Alex Zama- nian. 2005. Automatic information extraction. In Proceedings of the International Conference on Intelli- gence Analysis. Razvan C Bunescu and Raymond J Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Empirical Meth- ods in Natural Language Processing. Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. Bidirectional recurrent convolutional neural network for relation classification. In Proceedings of the Fifty- Fourth Annual Meeting of the Association for Compu- tational Linguistics. Rich Caruana, Steve Lawrence, and Lee Giles. 2001. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of The Fifteenth Annual Conference on Neural Information Processing Systems. Rich Caruana. 1998. Multitask learning. In Learning to learn. Springer. Yee Seng Chan and Dan Roth. 2010. Exploiting back- ground knowledge for relation extraction. In Proceed- ings of the Twenty-Third International Conference on Computational Linguistics. Nancy Chinchor. 1998. Overview of MUC-7/MET-2. Technical report, Science Applications International Corporation, San Diego, CA. Ronan Collobert and Jason Weston. 2008. A unified ar- chitecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the Twenty-Fifth International Conference on Machine learning. Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh Inter- national Conference on Intelligent Systems for Molecu- lar Biology. Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation. Rodrigo Dienstmann, In Sock Jang, Brian Bot, Stephen Friend, and Justin Guinney. 2015. Database of ge- nomic biomarkers for cancer drugs and clinical tar- getability in solid tumors. Cancer Discovery, 5. 112 Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of the Con- ference on Empirical Methods in Natural Language Processing. Matthew Gerber and Joyce Y. Chai. 2010. Beyond Nom- Bank: A study of implicit arguments for nominal predi- cates. In Proceedings of the Forty-Eighth Annual Meet- ing of the Association for Computational Linguistics. Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hin- ton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of The Thirty-Eighth IEEE International Conference on Acoustics, Speech and Signal Processing. Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In Proceedings of the Forty-Third Annual Meeting of the Association for Computational Linguistics. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8). Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the Forty- Second Annual Meeting of the Association for Compu- tational Linguistics, Demonstration Sessions. Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshi- nobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. 
In Proceed- ings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task. Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity- centric, precision-ranked rules. Computational Linguis- tics, 39(4). Hong Li, Sebastian Krause, Feiyu Xu, Andrea Moro, Hans Uszkoreit, and Roberto Navigli. 2015. Improvement of n-ary relation extraction by adding lexical semantics to distant-supervision rule learning. In Proceedings of the Seventh International Conference on Agents and Artificial Intelligence. Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In Proceedings of the Fourth International Conference on Learning Representations. Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. 2016. Semantic object parsing with graph LSTM. In Proceedings of European Conference on Computer Vision. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language pro- cessing toolkit. In Proceedings of the Fifty-Second Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Ryan McDonald, Fernando Pereira, Seth Kulick, Scott Winters, Yang Jin, and Pete White. 2005. Simple algo- rithms for complex relation extraction with applications to biomedical IE. In Proceedings of the Forty-Third Annual Meeting on Association for Computational Lin- guistics. Mike Mintz, Steven Bills, Rion Snow, and Dan Juraf- sky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Con- ference of the Forty-Seventh Annual Meeting of the As- sociation for Computational Linguistics and the Fourth International Joint Conference on Natural Language Processing. Makoto Miwa and Mohit Bansal. 2016. End-to-end re- lation extraction using LSTMs on sequences and tree structures. In Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguis- tics. Raymond J Mooney and Razvan C Bunescu. 2005. Subse- quence kernels for relation extraction. In Proceedings of The Nineteen Annual Conference on Neural Informa- tion Processing Systems. Thien Huu Nguyen and Ralph Grishman. 2014. Employ- ing word representations and regularization for domain adaptation of relation extraction. In Proceedings of the Fifty-Second Annual Meeting of the Association for Computational Linguistics. Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of seman- tic roles. Computational Linguistics, 31(1). Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of The Thirtieth International Conference on Machine Learning. Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for chinese social media with word segmentation representation learning. In Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguistics. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word repre- sentation. In Proceedings of the Conference on Empiri- cal Methods in Natural Language Processing. Hoifung Poon, Chris Quirk, Charlie DeZiel, and David Heckerman. 2014. Literome: PubMed-scale genomic knowledge base in the cloud. Bioinformatics, 30(19). Hoifung Poon, Kristina Toutanova, and Chris Quirk. 2015. 
Distant supervision for cancer pathway extraction from text. In Pacific Symposium on Biocomputing. Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent 113 dependencies for tree kernel-based semantic relation extraction. In Proceedings of the Twenty-Second Inter- national Conference on Computational Linguistics. Chris Quirk and Hoifung Poon. 2017. Distant supervi- sion for relation extraction beyond the sentence bound- ary. In Proceedings of the Fifteenth Conference on European chapter of the Association for Computational Linguistics. Chris Quirk, Pallavi Choudhury, Jianfeng Gao, Hisami Suzuki, Kristina Toutanova, Michael Gamon, Wen-tau Yih, and Lucy Vanderwende. 2012. MSR SPLAT, a language analysis toolkit. In Proceedings of the Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Demonstration Session. Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D Manning, and Daniel Jurafsky. 2014. Event extraction using distant supervision. In Proceed- ings of Eighth edition of the Language Resources and Evaluation Conference. Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the Fifty-Fourth Annual Meeting of the Association for Computational Linguistics. Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the Fifty-Third Annual Meeting of the Association for Com- putational Linguistics. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1). Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the Joint Conference on Empirical Methods in Natu- ral Language Processing and Computational Natural Language Learning. Fabian M Suchanek, Georgiana Ifrim, and Gerhard Weikum. 2006. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the Twelfth International Conference on Knowledge Discovery and Data Mining. Mihai Surdeanu and Ji Heng. 2014. Overview of the english slot filling track at the TAC2014 knowledge base population evaluation. In Proceedings of the U.S. National Institute of Standards and Technology Knowl- edge Base Population 2014 Workshop. Kumutha Swampillai and Mark Stevenson. 2011. Extract- ing relations within and across sentences. In Proceed- ings of the Conference on Recent Advances in Natural Language Processing. Kai Sheng Tai, Richard Socher, and Christopher D Man- ning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Fifty-Third Annual Meeting of the Association for Computational Linguistics. Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical ex- pressions. arXiv e-prints, abs/1605.02688. Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In Proceedings of the Fifty-Fourth Annual Meet- ing of the Association for Computational Linguistics. Michael Wick, Aron Culotta, and Andrew McCallum. 2006. Learning field compatibilities to extract database records from unstructured text. 
In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015a. Semantic relation classification via convolutional neural networks with simple negative sampling. In Proceedings of Conference on Empirical Methods in Natural Language Processing. Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015b. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of Conference on Empirical Methods in Natural Language Processing. Yan Xu, Ran Jia, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. 2016. Improved relation classification by deep recurrent neural networks with data augmentation. In Proceedings of the Twenty-Sixth International Conference on Computational Linguis- tics. Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The CoNLL-2015 shared task on shallow dis- course parsing. In Proceedings of the Conference on Computational Natural Language Learning, Shared Task. Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hi- rao, Masayuki Asahara, and Yuji Matsumoto. 2011. Coreference based event-argument relation extraction on biomedical text. Journal of Biomedical Semantics, 2(5). Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convo- lutional deep neural network. In Proceedings of the Twenty-Sixth International Conference on Computa- tional Linguistics. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 114 Shu Zhang, Dequan Zheng, Xinchen Hu, and Ming Yang. 2015. Bidirectional long short-term memory networks for relation classification. In Proceedings of Twenty- Ninth Pacific Asia Conference on Language, Informa- tion and Computation. 115 116