Under review as a conference paper at ICLR 2019

GRAPH CONVOLUTIONAL NETWORK WITH SEQUENTIAL ATTENTION FOR GOAL-ORIENTED DIALOGUE SYSTEMS

Anonymous authors
Paper under double-blind review

ABSTRACT

Domain specific goal-oriented dialogue systems typically require modeling three types of inputs, viz., (i) the knowledge base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances, and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and in the sentences of the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling and document dating, we propose a memory augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we can use the global word co-occurrence graph to enrich the representations of utterances. We experiment with the modified DSTC2 dataset and its recently released code-mixed versions in four languages and show that our method outperforms existing state-of-the-art methods across a wide range of evaluation metrics.

1 INTRODUCTION

Goal-oriented dialogue systems which can assist humans in various day-to-day activities have widespread applications in several domains such as e-commerce, entertainment, healthcare, etc. For example, such systems can help humans in scheduling medical appointments, reserving restaurants, booking tickets, etc. From a modeling perspective, one clear advantage of dealing with domain specific goal-oriented dialogues is that the vocabulary is typically limited, the utterances largely follow a fixed set of templates and there is an associated domain knowledge which can be exploited. More specifically, there is some structure associated with the utterances as well as the knowledge base. More formally, the task here is to generate the next response given (i) the previous utterances in the conversation history, (ii) the current user utterance (known as the query) and (iii) the entities and relationships in the associated knowledge base. Current state-of-the-art methods (Seo et al., 2017; Eric & Manning, 2017; Madotto et al., 2018) typically use variants of Recurrent Neural Networks (Elman, 1990) to encode the history and the current utterance, and an external memory network to store the entities in the knowledge base. The encodings of the utterances and memory elements are then suitably combined using an attention network and fed to the decoder to generate the response, one word at a time. However, these methods do not exploit the structure in the knowledge base as defined by entity-entity relations and the structure in the utterances as defined by a dependency parse.
Such structural information can be exploited to improve the performance of the system, as demonstrated by recent works on syntax-aware neural machine translation (Eriguchi et al., 2016; Bastings et al., 2017; Chen et al., 2017), semantic role labeling (Marcheggiani & Titov, 2017) and document dating (Vashishth et al., 2018), which use GCNs (Defferrard et al., 2016; Duvenaud et al., 2015; Kipf & Welling, 2017) to exploit sentence structure. In this work, we propose to use such graph structures for goal-oriented dialogues. In particular, we compute the dependency parse tree for each utterance in the conversation and use a GCN to capture the interactions between words. This allows us to capture interactions between distant words in a sentence as long as they are connected by a dependency relation. We also use GCNs to encode the entities of the KB, where the entities are treated as nodes and the relations as edges of the graph. Once we have a richer structure-aware representation for the utterances and the entities, we use a sequential attention mechanism to compute an aggregated context representation from the GCN node vectors of the query, history and entities. Further, we note that in certain situations, such as when the conversation is in a code-mixed language or a language for which parsers are not available, it may not be possible to construct a dependency parse for the utterances. To overcome this, we construct a co-occurrence matrix from the entire corpus and use this matrix to impose a graph structure on the utterances. More specifically, we add an edge between two words in a sentence if they co-occur frequently in the corpus. Our experiments suggest that this simple strategy acts as a reasonable substitute for dependency parse trees.

We perform experiments with the modified DSTC2 dataset (Bordes et al., 2017), which contains goal-oriented conversations for reserving restaurants. We also use its recently released code-mixed versions (Banerjee et al., 2018), which contain code-mixed conversations in four different languages, viz., Hindi, Bengali, Gujarati and Tamil. We compare with recent state-of-the-art methods and show that on average the proposed model gives an improvement of 2.8 BLEU points and 2 ROUGE points. Our contributions can be summarized as follows: (i) we use GCNs to incorporate structural information for encoding the query, history and KB entities in goal-oriented dialogues, (ii) we use a sequential attention mechanism to obtain query-aware and history-aware context representations, (iii) we leverage co-occurrence frequencies and PPMI (positive pointwise mutual information) values to construct contextual graphs for code-mixed utterances, and (iv) we show that the proposed model obtains state-of-the-art results on the modified DSTC2 dataset and its recently released code-mixed versions.

2 RELATED WORK

In this section we review previous work on goal-oriented dialogue systems and describe the introduction of GCNs in NLP.

Goal-Oriented Dialogue Systems: Initial goal-oriented dialogue systems (Young, 2000; Williams & Young, 2007) were based on dialogue state tracking (Williams et al., 2013; Henderson et al., 2014a;b) and included pipelined modules for natural language understanding, dialogue state tracking, policy management and natural language generation. Wen et al. (2017) used neural networks for these intermediate modules but still lacked absolute end-to-end trainability.
Such pipelined modules were restricted by the fixed slot-structure assumptions on the dialogue state and required per-module labelling. To mitigate this problem, Bordes et al. (2017) released a version of the goal-oriented dialogue dataset that focuses on the development of end-to-end neural models. Such models need to reason over the associated KB triples and generate responses directly from the utterances without any additional annotations. For example, Bordes et al. (2017) proposed a Memory Network (Sukhbaatar et al., 2015) based model to match the response candidates with the multi-hop attention weighted representation of the conversation history and the KB triples in memory. Liu & Perez (2017) further added highway (Srivastava et al., 2015) and residual connections (He et al., 2016) to the memory network in order to regulate the access to the memory blocks. Seo et al. (2017) developed a variant of an RNN cell which computes a refined representation of the query over multiple iterations before querying the memory. However, all these approaches retrieve the response from a set of candidate responses, and such a candidate set is not easy to obtain in any new domain of interest. To account for this, Eric & Manning (2017) and Zhao et al. (2017) adapted RNN based encoder-decoder models to generate appropriate responses instead of retrieving them from a candidate set. Eric et al. (2017) introduced a key-value memory network based generative model which integrates the underlying KB with RNN based encode-attend-decode models. Madotto et al. (2018) used memory networks on top of the RNN decoder to tightly integrate KB entities with the decoder and generate more informative responses. However, as opposed to our work, all these works ignore the underlying structure of the entity-relation graph of the KB and the syntactic structure of the utterances.

GCNs in NLP: Recently, there has been an active interest in enriching existing encode-attend-decode models (Bahdanau et al., 2015) with structural information for various NLP tasks. Such structure is typically obtained from the constituency and/or dependency parse of sentences. The idea is to treat the output of a parser as a graph and use an appropriate network to capture the interactions between the nodes of this graph. For example, Eriguchi et al. (2016) and Chen et al. (2017) showed that incorporating such syntactic structures via Tree-LSTMs in the encoder can improve the performance of Neural Machine Translation (NMT). Peng et al. (2017) use Graph-LSTMs to perform cross-sentence n-ary relation extraction and show that their formulation is applicable to any graph structure, with Tree-LSTMs being a special case of it. In parallel, Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2017) and their variants (Li et al., 2015) have emerged as state-of-the-art methods for computing representations of entities in a knowledge graph. They provide a more flexible way of encoding such graph structures by capturing multi-hop relationships between nodes. This has led to their adoption for various NLP tasks such as neural machine translation (Marcheggiani et al., 2018; Bastings et al., 2017), semantic role labeling (Marcheggiani & Titov, 2017), document dating (Vashishth et al., 2018) and question answering (Johnson, 2017; De Cao et al., 2018).
To the best of our knowledge, ours is the first work that uses GCNs to incorporate dependency structural information and the entity-entity graph structure in a single end-to-end neural model for goal-oriented dialogue. This is also the first work that incorporates contextual co-occurrence information for code-mixed utterances, for which no dependency structures are available.

3 BACKGROUND

In this section we describe Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017) for undirected graphs and then describe their syntactic versions which work with directed labeled edges of dependency parse trees.

3.1 GCN FOR UNDIRECTED GRAPHS

Graph convolutional networks operate on a graph structure and compute representations for the nodes of the graph by looking at the neighbourhood of each node. $k$ layers of GCNs can be stacked to account for neighbours which are $k$ hops away from the current node. Formally, let $G = (V, E)$ be an undirected graph where $V$ is the set of nodes (let $|V| = n$) and $E$ is the set of edges. Let $X \in \mathbb{R}^{n \times m}$ be the input feature matrix with $n$ nodes, where each node $x_u$ ($u \in V$) is represented by an $m$-dimensional feature vector. The output of a 1-layer GCN is the hidden representation matrix $H \in \mathbb{R}^{n \times d}$, where each $d$-dimensional representation of a node captures the interactions with its 1-hop neighbours. Each row of this matrix can be computed as:

$h_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W x_u + b)\Big), \quad \forall v \in V$   (1)

Here $W \in \mathbb{R}^{d \times m}$ is the model parameter matrix, $b \in \mathbb{R}^{d}$ is the bias vector and ReLU is the rectified linear unit activation function. $N(v)$ is the set of neighbours of node $v$ and is assumed to also include the node $v$ itself, so that the previous representation of node $v$ is also considered while computing its new hidden representation. To capture interactions with nodes which are multiple hops away, multiple layers of GCNs can be stacked together. Specifically, the representation of node $v$ after the $k$th GCN layer can be formulated as:

$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W^{k} h_u^{k} + b^{k})\Big), \quad \forall v \in V$   (2)

where $h_u^{k}$ is the representation of the $u$th node in the $(k-1)$th GCN layer and $h_u^{1} = x_u$.

3.2 SYNTACTIC GCN

In a directed labeled graph $G = (V, E)$, each edge between nodes $u$ and $v$ is represented by a triple $(u, v, L(u, v))$, where $L(u, v)$ is the associated edge label. Marcheggiani & Titov (2017) modified GCNs to operate over directed labeled graphs, such as the dependency parse tree of a sentence. For such a tree, in order to allow information to flow from head to dependents and vice versa, they added inverse dependency edges from dependents to heads, such as $(v, u, L(u, v)')$, to $E$ and made the model parameters and biases label specific. In their formulation,

$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W^{k}_{L(u,v)} h_u^{k} + b^{k}_{L(u,v)})\Big), \quad \forall v \in V$   (3)

Notice that unlike equation 2, equation 3 has parameters $W^{k}_{L(u,v)}$ and $b^{k}_{L(u,v)}$ which are label specific. Suppose there are $L$ different labels; then this formulation will require $L$ weights and biases per GCN layer, resulting in a large number of parameters. To avoid this, the authors use only three sets of weights and biases per GCN layer (as opposed to $L$), depending on the direction in which the information flows. More specifically, $W^{k}_{L(u,v)} = V^{k}_{dir(u,v)}$, where $dir(u,v)$ indicates whether information flows from $u$ to $v$, from $v$ to $u$, or $u = v$ (self loop). In this work, we also set $b^{k}_{L(u,v)} = b^{k}_{dir(u,v)}$ instead of having a separate bias per label. The final GCN formulation can thus be described as:

$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W^{k}_{dir(u,v)} h_u^{k} + b^{k}_{dir(u,v)})\Big), \quad \forall v \in V$   (4)
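To make equation 4 concrete, the following is a minimal NumPy sketch of one hop of the direction-specific GCN. It is only an illustration under our own naming conventions, not the authors' implementation: the function and variable names (gcn_layer, dep_edges) and the "fwd"/"rev"/"self" direction tags are assumptions, and a real implementation would batch these operations.

```python
# Minimal sketch of equation 4: one hop of a direction-specific GCN.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(h, edges, weights, biases):
    """One hop of the syntactic GCN (equation 4).

    h       : (n, d_in) node representations from the previous hop.
    edges   : list of (u, v, direction) triples; direction is "fwd" (along the
              original arc), "rev" (added inverse edge) or "self" (u = v).
    weights : dict mapping direction -> (d_in, d_out) matrix W^k_dir.
    biases  : dict mapping direction -> (d_out,) vector b^k_dir.
    """
    n, d_out = h.shape[0], next(iter(weights.values())).shape[1]
    out = np.zeros((n, d_out))
    for u, v, direction in edges:                 # accumulate messages into node v
        out[v] += h[u] @ weights[direction] + biases[direction]
    return relu(out)

# Toy example: a 4-token sentence with 3 dependency arcs (head, dependent).
rng = np.random.default_rng(0)
n, d_in, d_out = 4, 8, 8
dep_edges = [(1, 0, "fwd"), (1, 2, "fwd"), (2, 3, "fwd")]
edges = list(dep_edges)
edges += [(v, u, "rev") for (u, v, _) in dep_edges]   # inverse edges (dependent -> head)
edges += [(v, v, "self") for v in range(n)]           # self loops

weights = {d: rng.normal(scale=0.1, size=(d_in, d_out)) for d in ("fwd", "rev", "self")}
biases = {d: np.zeros(d_out) for d in ("fwd", "rev", "self")}

h = rng.normal(size=(n, d_in))              # h^1 = input node features
h = gcn_layer(h, edges, weights, biases)    # stack further calls for more hops
print(h.shape)                              # (4, 8)
```

Stacking this layer $k$ times, each hop with its own weights and biases, gives the $k$-hop formulation used throughout the model.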
4 MODEL

We first formally define the task of end-to-end goal-oriented dialogue generation. Each dialogue of $t$ turns can be viewed as a succession of user utterances ($U$) and system responses ($S$) and can be represented as $(U_1, S_1, U_2, S_2, \ldots, U_t, S_t)$. Along with these utterances, each dialogue is also accompanied by $e$ KB triples which are relevant to that dialogue and can be represented as $(k_1, k_2, k_3, \ldots, k_e)$. Each triple is of the form $(entity_1, relation, entity_2)$. These triples can be represented in the form of a graph $G_K = (V_K, E_K)$, where $V_K$ is the set of all entities and each edge in $E_K$ is of the form $(entity_1, entity_2, relation)$, where $relation$ signifies the edge label. At any dialogue turn $i$, given (i) the dialogue history $H = (U_1, S_1, U_2, \ldots, S_{i-1})$, (ii) the current user utterance as the query $Q = U_i$ and (iii) the associated knowledge graph $G_K$, the task is to generate the current response $S_i$ which leads to a completion of the goal. As mentioned earlier, we exploit the graph structure in the KB and the syntactic structure in the utterances to generate appropriate responses. Towards this end, we propose a model with the following components for encoding these three types of inputs.

4.1 QUERY ENCODER

The query $Q = U_i$ is the $i$th (current) utterance in the dialogue and contains $|Q|$ tokens. We denote the embedding of the $i$th token in the query as $q_i$. We first compute the contextual representations of these tokens by passing them through a bidirectional RNN:

$b_t = \mathrm{BiRNN}_Q(b_{t-1}, q_t)$   (5)

Now, consider the dependency parse tree of the query sentence, denoted by $G_Q = (V_Q, E_Q)$. We use a query specific GCN to operate on $G_Q$, which takes $\{b_i\}_{i=1}^{|Q|}$ as the input to the 1st GCN layer. The node representation in the $k$th hop of the query specific GCN is computed as:

$c_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W^{k}_{dir(u,v)} c_u^{k} + g^{k}_{dir(u,v)})\Big), \quad \forall v \in V_Q$   (6)

where $W^{k}_{dir(u,v)}$ and $g^{k}_{dir(u,v)}$ are edge direction specific query-GCN weights and biases for the $k$th hop and $c_u^{1} = b_u$.

[Figure 1: Illustration of the GCN and RNN+GCN modules which are used as encoders in our model. The notations are specific to the dialogue history encoder, but both encoders are the same for the query. The GCN encoder is the same for the KB except for the graph structure.]
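The graph $G_Q$ fed to this GCN is simply the set of dependency arcs of the query, augmented with inverse edges and self loops as in section 3.2. A hedged sketch of this construction is given below; the helper name, the toy sentence and its parse are illustrative assumptions, and in practice the heads would come from an off-the-shelf dependency parser. It produces edges in the same (u, v, direction) format consumed by the GCN layer sketched earlier, and the same construction is reused per sentence for the history encoder described next.

```python
# Illustrative construction of the directed, direction-tagged edge list for one utterance.
def build_dependency_graph(num_tokens, head_of):
    """head_of[i] is the index of token i's head, or None for the root."""
    edges = []
    for dep, head in enumerate(head_of):
        if head is not None:
            edges.append((head, dep, "fwd"))   # head -> dependent
            edges.append((dep, head, "rev"))   # added inverse edge
    edges += [(i, i, "self") for i in range(num_tokens)]   # self loops
    return edges

# "i need a cheap restaurant": a toy parse where "need" (index 1) is the root.
head_of = [1, None, 4, 4, 1]
print(build_dependency_graph(5, head_of))
```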
4.2 DIALOGUE HISTORY ENCODER

The history $H$ of the dialogue contains $|H|$ tokens and we denote the embedding of the $i$th token in the history by $p_i$. Once again, we first compute the hidden representations of these embeddings using a bidirectional RNN:

$s_t = \mathrm{BiRNN}_H(s_{t-1}, p_t)$   (7)

We now compute a dependency parse tree for each sentence in the history and collectively represent all the trees as a single graph $G_H = (V_H, E_H)$. Note that this graph will only contain edges between words belonging to the same sentence and there will be no edges between words across sentences. We then use a history specific GCN to operate on $G_H$, which takes $s_t$ as the input to the 1st layer. The node representation in the $k$th hop of the history specific GCN is computed as:

$a_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (V^{k}_{dir(u,v)} a_u^{k} + o^{k}_{dir(u,v)})\Big), \quad \forall v \in V_H$   (8)

where $V^{k}_{dir(u,v)}$ and $o^{k}_{dir(u,v)}$ are edge direction specific history-GCN weights and biases in the $k$th hop and $a_u^{1} = s_u$. Such an encoder with a single hop of GCN is illustrated in figure 1(b), and the encoder without the BiRNN is depicted in figure 1(a).

4.3 KB ENCODER

As mentioned earlier, $G_K = (V_K, E_K)$ is the graph capturing the interactions between the entities in the knowledge graph associated with the dialogue. Let there be $m$ such entities and let us denote the embedding of the node corresponding to the $i$th entity as $e_i$. We then operate a KB specific GCN on these entity representations to obtain refined representations which capture relations between entities. The node representation in the $k$th hop of the KB specific GCN is computed as:

$r_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (U^{k}_{dir(u,v)} r_u^{k} + z^{k}_{dir(u,v)})\Big), \quad \forall v \in V_K$   (9)

where $U^{k}_{dir(u,v)}$ and $z^{k}_{dir(u,v)}$ are edge direction specific KB-GCN weights and biases in the $k$th hop and $r_u^{1} = e_u$. We also add inverse edges to $E_K$, similar to the case of syntactic GCNs, in order to allow information flow in both directions for an entity pair in the knowledge graph.

[Figure 2: Illustration of the sequential attention mechanism in RNN+GCN-SeA.]

4.4 SEQUENTIAL ATTENTION

We use an RNN decoder to generate the tokens of the response and let the hidden states of the decoder be denoted as $\{d_i\}_{i=1}^{T}$, where $T$ is the total number of decoder timesteps. In order to obtain a single representation from the final layer ($k = f$) of the query-GCN node vectors, we use an attention mechanism as described below:

$\mu_{jt} = v_1^{\top} \tanh(W_1 c_j^{f} + W_2 d_{t-1})$   (10)
$\alpha_t = \mathrm{softmax}(\mu_t)$   (11)
$h_t^{Q} = \sum_{j'=1}^{|Q|} \alpha_{j't} c_{j'}^{f}$   (12)

Here $v_1, W_1, W_2$ are parameters. Further, at each decoder timestep, we obtain a query aware representation from the final layer of the history-GCN by computing an attention score for each node/token in the history based on the query context vector $h_t^{Q}$, as shown below:

$\nu_{jt} = v_2^{\top} \tanh(W_3 a_j^{f} + W_4 d_{t-1} + W_5 h_t^{Q})$   (13)
$\beta_t = \mathrm{softmax}(\nu_t)$   (14)
$h_t^{H} = \sum_{j'=1}^{|H|} \beta_{j't} a_{j'}^{f}$   (15)

Here $v_2, W_3, W_4$ and $W_5$ are parameters. Finally, we obtain a query and history aware representation of the KB by computing an attention score over all the nodes in the final layer of the KB-GCN using $h_t^{Q}$ and $h_t^{H}$, as shown below:

$\omega_{jt} = v_3^{\top} \tanh(W_6 r_j^{f} + W_7 d_{t-1} + W_8 h_t^{Q} + W_9 h_t^{H})$   (16)
$\gamma_t = \mathrm{softmax}(\omega_t)$   (17)
$h_t^{K} = \sum_{j'=1}^{m} \gamma_{j't} r_{j'}^{f}$   (18)

Here $v_3, W_6, W_7, W_8$ and $W_9$ are parameters. This sequential attention mechanism is illustrated in figure 2. For simplicity, we depict the GCN and RNN+GCN encoders as blocks in the figure; the internal structure of these blocks is shown in figure 1.
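The three attention steps above can be summarized in a short NumPy sketch for a single decoder timestep. This is a hedged illustration of equations 10-18, not the authors' implementation; the helper name attn_pool, the generic parameter constructor and the toy dimensions are our own.

```python
import numpy as np

def attn_pool(nodes, queries, v, Ws):
    """Additive attention shared by equations 10-18.

    nodes   : (n, d) final-layer GCN node vectors (c^f, a^f or r^f).
    queries : list of conditioning vectors (d_{t-1}, and possibly h^Q_t, h^H_t).
    v, Ws   : score vector and one projection matrix per input term.
    """
    proj = nodes @ Ws[0].T                      # project every node vector
    for W, q in zip(Ws[1:], queries):
        proj = proj + q @ W.T                   # add projected conditioning vectors
    scores = np.tanh(proj) @ v                  # (n,) unnormalized scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # softmax
    return weights @ nodes                      # attention-weighted sum of node vectors

rng = np.random.default_rng(0)
d = 16
c_f = rng.normal(size=(5, d))    # query-GCN nodes
a_f = rng.normal(size=(12, d))   # history-GCN nodes
r_f = rng.normal(size=(7, d))    # KB-GCN nodes
d_prev = rng.normal(size=d)      # previous decoder state d_{t-1}

def params(k):                   # one score vector and k projection matrices
    return rng.normal(scale=0.1, size=d), [rng.normal(scale=0.1, size=(d, d)) for _ in range(k)]

v1, W12 = params(2)              # W1, W2          -> equations 10-12
v2, W345 = params(3)             # W3, W4, W5      -> equations 13-15
v3, W6789 = params(4)            # W6, W7, W8, W9  -> equations 16-18

h_Q = attn_pool(c_f, [d_prev], v1, W12)                 # query context
h_H = attn_pool(a_f, [d_prev, h_Q], v2, W345)           # query-aware history context
h_K = attn_pool(r_f, [d_prev, h_Q, h_H], v3, W6789)     # query- and history-aware KB context
print(h_Q.shape, h_H.shape, h_K.shape)                  # (16,) (16,) (16,)
```

The key design point is the ordering: the query is attended over first, the resulting query context conditions the history attention, and both contexts condition the KB attention.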
4.5 DECODER

The decoder takes two inputs, viz., (i) the context, which contains the history and the KB, and (ii) the query, which is the last/previous utterance in the dialogue. We use an aggregator which learns the overall attention to be given to the history and KB components. These attention scores, $\theta_t^{H}$ and $\theta_t^{K}$, are dependent on the respective context vectors and the previous decoder state $d_{t-1}$. The final context vector is obtained as:

$h_t^{C} = \theta_t^{H} h_t^{H} + \theta_t^{K} h_t^{K}$   (19)
$h_t^{final} = [h_t^{C}; h_t^{Q}]$   (20)

where $[;]$ denotes the concatenation operator. At every timestep the decoder then computes a probability distribution over the vocabulary using the following equations:

$d_t = \mathrm{RNN}(d_{t-1}, [h_t^{final}; w_t])$   (21)
$P_{vocab} = \mathrm{softmax}(V' d_t + b')$   (22)

where $w_t$ is the decoder input at time step $t$, and $V'$ and $b'$ are parameters. $P_{vocab}$ gives us a probability distribution over the entire vocabulary, and the loss for time step $t$ is $l_t = -\log P_{vocab}(w_t^*)$, where $w_t^*$ is the $t$th word in the ground truth response. The total loss is an average of the per-time-step losses.

4.6 CONTEXTUAL GRAPH CREATION

For the dialogue history and query encoders, we used the dependency parse tree to capture structural information in the encodings. However, if the conversations occur in a language for which no dependency parsers exist, for example code-mixed languages like Hinglish (Hindi-English) (Banerjee et al., 2018), then we need an alternate way of extracting a graph structure from the utterances. One simple solution which worked well in practice was to create a word co-occurrence matrix from the entire corpus, where the context window is an entire sentence. Once we have such a co-occurrence matrix, for a given sentence we can connect an edge between two words if their co-occurrence frequency is above a threshold value. The co-occurrence matrix can either contain co-occurrence frequency counts or positive pointwise mutual information (PPMI) values (Church & Hanks, 1990; Dagan et al., 1993; Niwa & Nitta, 1994).
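A sketch of this contextual graph construction is given below: count sentence-level co-occurrences over the corpus, optionally convert them to PPMI, and connect two word positions of a sentence whose words score above a threshold. The helper names, the toy corpus and the threshold value are assumptions for illustration; the paper only specifies that a threshold is used, not its value.

```python
from collections import Counter
from itertools import combinations
import math

def cooccurrence_counts(corpus):
    """corpus: list of tokenized sentences. Returns sentence-level pair and word counts."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in corpus:
        word_counts.update(set(sent))
        for w1, w2 in combinations(sorted(set(sent)), 2):
            pair_counts[(w1, w2)] += 1
    return pair_counts, word_counts

def ppmi(pair_counts, word_counts, total_sents):
    scores = {}
    for (w1, w2), c in pair_counts.items():
        pmi = math.log((c * total_sents) / (word_counts[w1] * word_counts[w2]))
        scores[(w1, w2)] = max(pmi, 0.0)      # positive PMI
    return scores

def contextual_edges(sentence, scores, threshold):
    """Edges for one sentence: connect positions whose words score above the threshold."""
    edges = []
    for (i, w1), (j, w2) in combinations(enumerate(sentence), 2):
        key = (w1, w2) if (w1, w2) in scores else (w2, w1)
        if scores.get(key, 0.0) > threshold:
            edges.append((i, j))
    return edges

corpus = [["i", "need", "a", "cheap", "restaurant"],
          ["any", "cheap", "restaurant", "in", "the", "north"],
          ["thank", "you", "good", "bye"]]
pairs, words = cooccurrence_counts(corpus)
scores = ppmi(pairs, words, total_sents=len(corpus))
print(contextual_edges(corpus[0], scores, threshold=0.0))
```

Using raw pair counts instead of PPMI values in contextual_edges gives the frequency-based variant compared in the experiments.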
5 EXPERIMENTAL SETUP

In this section we describe the datasets used in our experiments, the various hyperparameters that we considered and the models that we compared.

5.1 DATASETS

The original DSTC2 dataset (Henderson et al., 2014a) was based on the task of restaurant reservation and contains transcripts of real conversations between humans and bots. The utterances were labeled with dialogue state annotations such as the semantic intent representation, requested slots and the constraints on the slot values. We report our results on the modified DSTC2 dataset of Bordes et al. (2017), where such annotations are removed and only the raw utterance-response pairs are present, with an associated set of KB triples for each dialogue. For our experiments with contextual graphs we report results on the code-mixed versions of modified DSTC2, which were recently released by Banerjee et al. (2018).[1] This dataset has been collected by code-mixing the utterances of the English version of modified DSTC2 in four languages, viz., Hindi (Hi-DSTC2), Bengali (Be-DSTC2), Gujarati (Gu-DSTC2) and Tamil (Ta-DSTC2), via crowdsourcing. Statistics about this dataset and example dialogues are shown in Appendix A.

[1] https://github.com/sumanbanerjee1/Code-Mixed-Dialog

Model                                      per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
Rule-Based (Bordes et al., 2017)           33.3           -     -        -        -        -
MEMNN (Bordes et al., 2017)                41.1           -     -        -        -        -
QRN (Seo et al., 2017)                     50.7           -     -        -        -        -
GMEMNN (Liu & Perez, 2017)                 48.7           -     -        -        -        -
Seq2Seq-Attn (Bahdanau et al., 2015)       46.0           57.3  67.2     56.0     64.9     67.1
Seq2Seq-Attn+Copy (Eric & Manning, 2017)   47.3           55.4  -        -        -        71.6
HRED (Serban et al., 2016)                 48.9           58.4  67.9     57.6     65.7     75.6
Mem2Seq (Madotto et al., 2018)             45.0           55.3  -        -        -        75.3
GCN-SeA                                    47.1           59.0  67.4     57.1     65.0     71.9
RNN+CROSS-GCN-SeA                          51.2           60.9  69.4     59.9     67.2     78.1
RNN+GCN-SeA                                51.4           61.2  69.6     60.2     67.4     77.9

Table 1: Comparison of GCN-SeA with other models on the English version of modified DSTC2.

Dataset    Model                   per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
Hi-DSTC2   Seq2Seq-Bahdanau Attn   48.0           55.1  62.9     52.5     61.0     74.3
           HRED                    47.2           55.3  63.4     52.7     61.5     71.3
           Mem2Seq                 43.1           50.2  55.5     48.1     54.0     73.8
           GCN-SeA                 47.0           56.0  65.0     55.3     63.0     72.4
           RNN+CROSS-GCN-SeA       47.2           56.4  64.7     54.9     62.6     73.5
           RNN+GCN-SeA             49.2           57.1  66.4     56.8     64.4     75.9
Be-DSTC2   Seq2Seq-Bahdanau Attn   50.4           55.6  67.4     57.6     65.1     76.2
           HRED                    47.8           55.6  67.2     57.0     64.9     71.5
           Mem2Seq                 41.9           52.1  58.9     50.8     57.0     73.2
           GCN-SeA                 47.1           58.4  67.4     57.3     64.9     69.6
           RNN+CROSS-GCN-SeA       50.4           59.1  68.3     58.9     65.9     74.9
           RNN+GCN-SeA             50.3           59.2  69.0     59.4     66.6     75.1
Gu-DSTC2   Seq2Seq-Bahdanau Attn   47.7           54.5  64.8     54.9     62.6     71.3
           HRED                    48.0           54.7  65.4     55.2     63.3     71.8
           Mem2Seq                 43.1           48.9  55.7     48.6     54.2     75.5
           GCN-SeA                 48.1           55.7  65.5     56.2     63.5     72.2
           RNN+CROSS-GCN-SeA       49.4           56.9  66.4     57.2     64.3     73.4
           RNN+GCN-SeA             48.9           56.7  66.1     56.9     64.1     73.0
Ta-DSTC2   Seq2Seq-Bahdanau Attn   49.3           62.9  67.8     56.3     65.6     77.7
           HRED                    47.8           61.5  66.9     55.2     64.8     74.4
           Mem2Seq                 44.2           58.9  58.6     50.8     57.0     74.9
           GCN-SeA                 46.4           62.8  68.5     57.5     66.1     71.9
           RNN+CROSS-GCN-SeA       50.8           64.5  69.8     59.6     67.5     78.8
           RNN+GCN-SeA             50.7           64.9  70.2     59.9     67.9     77.9

Table 2: Comparison of RNN+GCN-SeA and GCN-SeA with other models on all code-mixed datasets.

5.2 HYPERPARAMETERS

We used the same train, test and validation splits as provided in the original versions of the datasets. We minimized the cross entropy loss using the Adam optimizer (Kingma & Ba, 2015) and tuned the initial learning rate in the range of 0.0006 to 0.001. For regularization we used an L2 penalty with coefficient 0.001 in addition to a dropout (Srivastava et al., 2014) of 0.1. We used randomly initialized word embeddings of size 300. The RNN and GCN hidden dimensions were also chosen to be 300. We use GRU (Cho et al., 2014) cells for the RNNs. All parameters were initialized from a truncated normal distribution with a standard deviation of 0.1.
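For reference, the stated training setup can be collected into a single configuration. This is only a hedged summary of the hyperparameters listed above; the key names are our own, and anything not mentioned in the paper (e.g. batch size or number of epochs) is deliberately omitted.

```python
# Hedged summary of the hyperparameters reported in section 5.2 (key names are ours).
config = {
    "optimizer": "adam",
    "learning_rate_range": (0.0006, 0.001),   # initial learning rate, tuned
    "l2_coefficient": 0.001,
    "dropout": 0.1,
    "word_embedding_dim": 300,                # randomly initialized embeddings
    "rnn_hidden_dim": 300,                    # GRU cells
    "gcn_hidden_dim": 300,
    "init": ("truncated_normal", 0.1),        # standard deviation of the initializer
}
```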
5.3 MODELS COMPARED

We compare the performance of the following models.

(i) RNN+GCN-SeA vs GCN-SeA: We use RNN+GCN-SeA to refer to the model described in section 4. Instead of using the hidden representations obtained from the bidirectional RNNs, we also experiment with providing the token embeddings directly to the GCNs, i.e., $c_u^{1} = q_u$ in equation 6 and $a_u^{1} = p_u$ in equation 8. We refer to this model as GCN-SeA.

(ii) Cross edges between the GCNs: In addition to the dependency and contextual edges, we add edges between words in the dialogue history/query and KB entities if a history/query word exactly matches the KB entity. Such edges create a single connected graph which is encoded using a single GCN encoder and then separated into different contexts to perform the sequential attention. This model is referred to as RNN+CROSS-GCN-SeA (a sketch of this cross-edge construction is given after this list).

(iii) Frequency vs PPMI Contextual Graph: We experiment with the raw frequency co-occurrence graph structure and the PPMI graph structure for the code-mixed datasets, as explained in section 4.6. We refer to these models as GCN-SeA+Freq and GCN-SeA+PPMI. In both these models, the GCN takes inputs from a bidirectional RNN.

(iv) GCN-SeA+Random vs GCN-SeA+Structure: We experiment with a model where the graph is constructed by randomly connecting edges between two words in a context. We refer to this model as GCN-SeA+Random. We refer to the model which uses either the dependency or the contextual graph instead of random graphs as GCN-SeA+Structure.
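The cross edges of model (ii) can be sketched as follows, assuming an exact string match between utterance tokens and KB entity names. The helper name, the way entity nodes are indexed and the assumption that multi-word entities are pre-joined into single tokens are all illustrative choices, not details confirmed by the paper.

```python
def add_cross_edges(utterance_tokens, kb_entities):
    """Connect utterance token positions to KB entity nodes with the same surface form.

    utterance_tokens : list of tokens from the history/query.
    kb_entities      : list of entity names; entity i becomes graph node ('kb', i).
    Returns edges of the form (('utt', token_idx), ('kb', entity_idx)).
    """
    entity_index = {name: i for i, name in enumerate(kb_entities)}
    edges = []
    for t, token in enumerate(utterance_tokens):
        if token in entity_index:
            edges.append((("utt", t), ("kb", entity_index[token])))
    return edges

# Toy example with underscore-joined entity names (an assumption for illustration).
tokens = ["is", "restaurant_alimentum", "a", "cheap", "restaurant"]
entities = ["restaurant_alimentum", "pizza_hut_cherry_hinton"]
print(add_cross_edges(tokens, entities))   # [(('utt', 1), ('kb', 0))]
```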
6 RESULTS AND DISCUSSIONS

In this section we discuss the results of our experiments, as summarized in tables 1, 2 and 3. We use the BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics to evaluate the generation quality of responses. We also report the per-response accuracy, which computes the percentage of responses in which the generated response exactly matches the ground truth response. In order to evaluate the model's capability of correctly injecting entities in the generated response, we report the entity F1 measure as defined in Eric & Manning (2017).

Results on En-DSTC2: We compare our model with previous works on the English version of modified DSTC2 in table 1. For most of the retrieval based models, the BLEU or ROUGE scores are not available as they select a candidate from a list of candidates as opposed to generating it. Our model outperforms all of the retrieval and generation based models. We obtain a gain of 0.7 in per-response accuracy compared to the previous retrieval based state-of-the-art model of Seo et al. (2017), which is a very strong baseline for our generation based model. We call this a strong baseline because the candidate selection task of this model is easier than the response generation task of our model. We also obtain a gain of 2.8 BLEU points, 2 ROUGE points and 2.5 entity F1 points compared to current state-of-the-art generation based models.

Results on code-mixed datasets and effect of using RNNs: The results of our experiments on the code-mixed datasets are reported in table 2. Our model outperforms the baseline models on all the code-mixed languages. One common observation from the results over all the languages (including En-DSTC2) is that RNN+GCN-SeA performs better than GCN-SeA. Similar observations were made by Marcheggiani & Titov (2017) for the task of semantic role labeling.

Effect of using hops: As we increased the number of hops of the GCNs, we observed a decrease in performance. One reason for such a drop could be that the average utterance length is very small (7.76 words). Thus, there is not much scope for capturing distant neighbourhood information, and more hops can add noisy information. Please refer to Appendix B for detailed results on the effect of varying the number of hops.

Frequency vs PPMI graphs: We observed that PPMI based contextual graphs were slightly better than frequency based contextual graphs (see Appendix C). In particular, when using PPMI as opposed to the frequency based contextual graph, we observed a gain of 0.95 in per-response accuracy, 0.45 in BLEU, 0.64 in ROUGE and 1.22 in entity F1 score when averaged across all the code-mixed languages.

Effect of using random graphs: GCN-SeA+Random and GCN-SeA+Structure take the token embeddings directly instead of passing them through an RNN. This ensures that the difference in performance of the two models is not influenced by the RNN encodings. The results are shown in table 3 and we observe a drop in performance for GCN-SeA+Random across all the languages. This shows that a random graph does not contribute to the performance gain of GCN-SeA and that the dependency and contextual structures do play an important role.

Dataset    Model              per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
En-DSTC2   GCN-SeA+Random     45.9           57.8  67.1     56.5     64.8     72.2
           GCN-SeA+Structure  47.1           59.0  67.4     57.1     65.0     71.9
Hi-DSTC2   GCN-SeA+Random     44.4           54.9  63.1     52.9     60.9     67.2
           GCN-SeA+Structure  47.0           56.0  65.0     55.3     63.0     72.4
Be-DSTC2   GCN-SeA+Random     44.9           56.5  65.4     54.8     62.7     65.6
           GCN-SeA+Structure  47.1           58.4  67.4     57.3     64.9     69.6
Gu-DSTC2   GCN-SeA+Random     45.0           54.0  64.1     54.0     61.9     69.1
           GCN-SeA+Structure  48.1           55.7  65.5     56.2     63.5     72.2
Ta-DSTC2   GCN-SeA+Random     44.8           61.4  66.9     55.6     64.3     70.5
           GCN-SeA+Structure  46.4           62.8  68.5     57.5     66.1     71.9

Table 3: GCN-SeA with random graphs and frequency co-occurrence graphs on all DSTC2 datasets.

Ablations: We experiment with replacing the sequential attention by Bahdanau attention (Bahdanau et al., 2015). We also experiment with various combinations of RNNs and GCNs as encoders. The results are shown in table 8 (Appendix D). We observed that GCNs do not outperform RNNs independently. In general, RNN-Bahdanau attention performs better than GCN-Bahdanau attention. The sequential attention mechanism outperforms Bahdanau attention, as observed from the following comparisons: (i) GCN-Bahdanau attention vs GCN-SeA, (ii) RNN-Bahdanau attention vs RNN-SeA (in BLEU and ROUGE) and (iii) RNN+GCN-Bahdanau attention vs RNN+GCN-SeA. Overall, the best results are always obtained by our final model, which combines RNN, GCN and sequential attention.

7 CONCLUSION

We showed that structure aware representations are useful in goal-oriented dialogue and we obtain state-of-the-art performance on the modified DSTC2 dataset and its recently released code-mixed versions. We used GCNs to infuse structural information from dependency graphs and contextual graphs to enrich the representations of the dialogue context and the KB. We also proposed a sequential attention mechanism for combining the representations of (i) the query (current utterance), (ii) the conversation history and (iii) the KB. Finally, we empirically showed that when dependency parsers are not available for certain languages, such as code-mixed languages, we can use word co-occurrence frequencies and PPMI values to extract a contextual graph and use such a graph with GCNs for improved performance.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1409.0473.

Suman Banerjee, Nikita Moghe, Siddhartha Arora, and Mitesh M. Khapra. A dataset for building code-mixed goal oriented conversation systems. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3766-3780. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/C18-1319.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1957-1967. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/D17-1209.
Antoine Bordes, Y-Lan Boureau, and Jason Weston. Learning end-to-end goal-oriented dialog. International Conference on Learning Representations, 2017. URL http://arxiv.org/abs/1605.07683.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1936-1945. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1177. URL http://www.aclweb.org/anthology/P17-1177.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179. URL http://www.aclweb.org/anthology/D14-1179.

Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 1990. URL http://www.aclweb.org/anthology/J90-1003.

Ido Dagan, Shaul Marcus, and Shaul Markovitch. Contextual word similarity and estimation from sparse data. In 31st Annual Meeting of the Association for Computational Linguistics, 1993. URL http://www.aclweb.org/anthology/P93-1022.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3844-3852. Curran Associates, Inc., 2016.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 2224-2232. Curran Associates, Inc., 2015.

Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.

Mihail Eric and Christopher Manning. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 468-473. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/E17-2075.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pp. 37-49, 2017. URL https://aclanthology.info/papers/W17-5506/w17-5506.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 823-833. Association for Computational Linguistics, 2016. doi: 10.18653/v1/P16-1078. URL http://www.aclweb.org/anthology/P16-1078.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. The second dialog state tracking challenge. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 18-20 June 2014, Philadelphia, PA, USA, pp. 263-272, 2014a. URL http://aclweb.org/anthology/W/W14/W14-4337.pdf.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. The third dialog state tracking challenge. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pp. 324-329, 2014b. doi: 10.1109/SLT.2014.7078595. URL https://doi.org/10.1109/SLT.2014.7078595.

Daniel D. Johnson. Learning graphical state transitions. International Conference on Learning Representations, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015. URL http://arxiv.org/abs/1511.05493.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004. URL http://www.aclweb.org/anthology/W04-1013.

Fei Liu and Julien Perez. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pp. 1-10, 2017. URL https://aclanthology.info/papers/E17-1001/e17-1001.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1468-1478. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1136.

Diego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1506-1515. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/D17-1159.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 486-492. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-2078.
Nicola De Cao, Wilker Aziz, and Ivan Titov. Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920, 2018.

Yoshiki Niwa and Yoshihiko Nitta. Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING '94, pp. 304-309, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. doi: 10.3115/991886.991938. URL https://doi.org/10.3115/991886.991938.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311-318, 2002. URL http://www.aclweb.org/anthology/P02-1040.pdf.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101-115, 2017. ISSN 2307-387X. URL https://www.transacl.org/ojs/index.php/tacl/article/view/1028.

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. International Conference on Learning Representations, 2017.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 3776-3784, 2016. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11957.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014. URL http://dl.acm.org/citation.cfm?id=2670313.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2440-2448, 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1605-1615. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1149.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M. Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 438-449. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/E17-1042.

Jason D. Williams and Steve J. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393-422, 2007. doi: 10.1016/j.csl.2006.06.008. URL https://doi.org/10.1016/j.csl.2006.06.008.

Jason D. Williams, Antoine Raux, Deepak Ramachandran, and Alan W. Black. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, The 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 22-24 August 2013, SUPELEC, Metz, France, pp. 404-413, 2013. URL http://aclweb.org/anthology/W/W13/W13-4065.pdf.

Steve J. Young. Probabilistic methods in spoken-dialogue systems. Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 358(1769):1389-1402, 2000. ISSN 1364503X. URL http://www.jstor.org/stable/2666825.

Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 27-36. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/W17-5505.

A. CODE-MIXING STATISTICS AND DIALOGUE EXAMPLE

                                               Hindi   Bengali  Gujarati  Tamil
Vocabulary Size                                1676    1372     1858      2185
Code-Mixed English Vocabulary                  386     360      387       424
Native Language Vocabulary                     739     477      912       1214
Others Vocabulary                              551     535      559       547
Unique Utterances                              6549    6274     6417      6666
Utterances with code-mixed words               5750    5703     5643      5632
Pure Native Language utterances                348     210      340       420
Pure English utterances                        451     361      434       614
Average length of utterances                   8.16    7.74     8.04      6.78
Average # of code-mixed utterances per dialog  12.11   14.28    11.80     12.96

Table 4: Statistics about code-mixing in the dataset (Banerjee et al., 2018).
bot (English): Hello, welcome to the Cambridge restaurant system. You can ask for restaurants by area, price range or food type. How may I help you?
bot (Hindi-English): Hello, Cambridge restaurant system mein aapka swagat hai. Aap pa sakte hai restaurant by area, price range ya food type. Aap ko kaise help kar sakta hu main?

user (English): im looking for a moderately priced restaurant.
user (Hindi-English): mein moderately priced restaurant dhund raha hoon.

bot (English): What kind of food would you like?
bot (Hindi-English): Aap kaunse type ka food lena chahege?

user (English): it doesnt matter.
user (Hindi-English): matter nahi karta.

bot (English): What part of town do you have in mind?
bot (Hindi-English): aapke mind me town ka kaun sa part hai?

user (English): the south part of town.
user (Hindi-English): south part of town.

bot (both): api call R cuisine south moderate

KB Triples:
pizza hut cherry hinton | R post code | pizza hut cherry hinton post code
pizza hut cherry hinton | R cuisine | italian
pizza hut cherry hinton | R location | south
pizza hut cherry hinton | R phone | pizza hut cherry hinton phone
pizza hut cherry hinton | R address | pizza hut cherry hinton address
pizza hut cherry hinton | R price | moderate
pizza hut cherry hinton | R rating | 3
restaurant alimentum | R post code | restaurant alimentum post code
restaurant alimentum | R cuisine | european
restaurant alimentum | R location | south
restaurant alimentum | R phone | restaurant alimentum phone
restaurant alimentum | R address | restaurant alimentum address
restaurant alimentum | R price | moderate
restaurant alimentum | R rating | 10

bot (English): restaurant alimentum is a nice restaurant in the south of town serving modern european food.
bot (Hindi-English): restaurant alimentum ek bohot hi accha restaurant hai south town me jo modern european food serve karta hai

user (English): can i get the address?
user (Hindi-English): kya mujhe address mil sakta hai

bot (English): Sure, restaurant alimentum is on restaurant alimentum address.
bot (Hindi-English): Bilkul, restaurant alimentum restaurant alimentum address par hai

user (English): and whats the phone number for that?
user (Hindi-English): aur iska phone number kya hai?

bot (English): The phone number of restaurant alimentum is restaurant alimentum phone.
bot (Hindi-English): restaurant alimentum ka phone number hai restaurant alimentum phone

user (English): okay, thank you, good bye.
user (Hindi-English): theek, thank you, good bye.

bot (English): you are welcome.
bot (Hindi-English): welcome, alvida.

Table 5: Example dialogue from the DSTC2 dataset in English and in the Hindi-English code-mixed version. The English words mixed into the Hindi utterances are italicized in the original table.

B. EFFECT OF USING HOPS

Dataset    Model        per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
En-DSTC2   GCN-SeA+K=1  47.1           59.0  67.4     57.1     65.0     71.9
           GCN-SeA+K=2  48.4           59.7  68.5     58.4     66.2     72.8
           GCN-SeA+K=3  46.4           59.4  67.3     56.9     64.8     68.8
Hi-DSTC2   GCN-SeA+K=1  47.0           56.0  65.0     55.3     63.0     72.4
           GCN-SeA+K=2  40.4           53.2  61.8     50.5     59.7     60.2
           GCN-SeA+K=3  19.0           29.7  42.2     28.9     38.5     0.5
Be-DSTC2   GCN-SeA+K=1  47.1           58.4  67.4     57.3     64.9     69.6
           GCN-SeA+K=2  41.9           55.2  64.5     53.5     61.9     61.4
           GCN-SeA+K=3  7.0            25.6  34.3     16.8     25.0     2.4
Gu-DSTC2   GCN-SeA+K=1  48.1           55.7  65.5     56.2     63.5     72.2
           GCN-SeA+K=2  43.3           53.5  63.7     53.4     61.5     64.2
           GCN-SeA+K=3  20.8           36.5  47.3     34.1     45.1     17.3
Ta-DSTC2   GCN-SeA+K=1  46.4           62.8  68.5     57.5     66.1     71.9
           GCN-SeA+K=2  44.4           61.5  67.2     55.8     64.7     68.8
           GCN-SeA+K=3  36.4           56.1  62.2     49.9     59.9     56.0

Table 6: GCN-SeA with multiple hops on all DSTC2 datasets.

C. FREQUENCY VS PPMI CO-OCCURRENCE

Dataset    Model         per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
En-DSTC2   GCN-SeA+Freq  50.4           61.1  69.3     59.6     67.0     76.0
           GCN-SeA+PPMI  50.5           60.7  69.3     59.7     67.0     77.4
Hi-DSTC2   GCN-SeA+Freq  48.7           56.9  65.5     56.1     63.5     74.5
           GCN-SeA+PPMI  49.2           57.1  66.4     56.8     64.4     75.9
Be-DSTC2   GCN-SeA+Freq  49.0           59.0  68.2     58.5     65.7     72.7
           GCN-SeA+PPMI  50.3           59.2  69.0     59.4     66.6     75.1
Gu-DSTC2   GCN-SeA+Freq  48.4           56.1  66.2     56.7     64.0     73.3
           GCN-SeA+PPMI  48.9           56.7  66.1     56.9     64.1     73.0
Ta-DSTC2   GCN-SeA+Freq  49.2           64.1  69.5     59.0     67.1     76.7
           GCN-SeA+PPMI  50.7           64.9  70.2     59.9     67.9     77.9

Table 7: RNN+GCN-SeA with different contextual graphs on all DSTC2 datasets.
D. ABLATION RESULTS

Dataset    Model                    per-resp. acc  BLEU  ROUGE-1  ROUGE-2  ROUGE-L  Entity F1
Hi-DSTC2   Seq2seq-Bahdanau Attn    48.0           55.1  62.9     52.5     61.0     74.3
           GCN-Bahdanau Attn        38.5           50.4  58.9     47.7     56.7     59.1
           RNN+GCN-Bahdanau Attn    47.1           56.0  65.1     55.2     62.9     72.2
           RNN-SeA                  45.8           55.9  65.1     55.5     63.1     71.8
           RNN+GCN-SeA              49.2           57.1  66.4     56.8     64.4     75.9
Be-DSTC2   Seq2seq-Bahdanau Attn    50.4           55.6  67.4     57.6     65.1     76.2
           GCN-Bahdanau Attn        42.1           55.1  63.7     52.8     61.1     64.3
           RNN+GCN-Bahdanau Attn    47.0           57.7  67.0     57.4     64.6     70.9
           RNN-SeA                  46.8           58.5  67.6     58.1     65.1     71.9
           RNN+GCN-SeA              50.3           59.2  69.0     59.4     66.6     75.1
Gu-DSTC2   Seq2seq-Bahdanau Attn    47.7           54.5  64.8     54.9     62.6     71.3
           GCN-Bahdanau Attn        38.8           49.5  59.2     48.3     56.8     58.0
           RNN+GCN-Bahdanau Attn    46.5           55.5  65.6     55.9     63.4     70.6
           RNN-SeA                  45.4           56.0  66.0     56.6     63.9     69.8
           RNN+GCN-SeA              48.9           56.7  66.1     56.9     64.1     73.0
Ta-DSTC2   Seq2seq-Bahdanau Attn    49.3           62.9  67.8     56.3     65.6     77.7
           GCN-Bahdanau Attn        42.0           59.3  64.8     52.8     62.1     69.7
           RNN+GCN-Bahdanau Attn    46.3           63.2  68.0     57.2     65.6     72.1
           RNN-SeA                  46.8           64.0  69.3     59.0     67.1     74.2
           RNN+GCN-SeA              50.7           64.9  70.2     59.9     67.9     77.9
En-DSTC2   Seq2seq-Bahdanau Attn    46.0           57.3  67.2     56.0     64.9     67.1
           GCN-Bahdanau Attn        45.7           58.1  66.5     55.9     64.1     70.1
           RNN+GCN-Bahdanau Attn    47.4           59.5  67.9     57.7     65.6     72.9
           RNN-SeA                  47.0           60.2  68.5     58.9     66.2     72.7
           RNN+GCN-SeA              51.4           61.2  69.6     60.2     67.4     77.9

Table 8: Ablation results of various models on all versions of DSTC2.