key: cord-024501-nl0gsr0c
authors: Tan, Chunyang; Yang, Kaijia; Dai, Xinyu; Huang, Shujian; Chen, Jiajun
title: MSGE: A Multi-step Gated Model for Knowledge Graph Completion
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47426-3_33
doc_id: 24501
cord_uid: nl0gsr0c

Knowledge graph embedding models aim to represent entities and relations in a continuous low-dimensional vector space, benefiting many research areas such as knowledge graph completion and web search. However, previous works do not consider controlling the information flow, which makes it hard to extract useful latent information and limits model performance. As human beings, we usually make predictions in multiple steps, with every step filtering out irrelevant information and focusing on helpful information. In this paper, we integrate an iterative mechanism into knowledge graph embedding and propose a multi-step gated model which utilizes relations as queries to extract useful information from coarse to fine over multiple steps. A gate mechanism is first adopted to control the information flow through the interaction between entity and relation; the gate cell is then repeated several times to refine the information incrementally. Our model achieves state-of-the-art performance on most benchmark datasets compared to strong baselines. Further analyses demonstrate the effectiveness of our model and its scalability to large knowledge graphs.

Large-scale knowledge graphs (KGs), such as Freebase [1], YAGO3 [2] and DBpedia [3], have attracted extensive interest with the progress of artificial intelligence. Real-world facts are stored in KGs in the form of (subject entity, relation, object entity), denoted as (s, r, o), benefiting many applications and research areas such as question answering and semantic search. Meanwhile, KGs are still far from complete, with many valid triples missing. As a consequence, much research has been devoted to the knowledge graph completion task, which aims to predict missing links in knowledge graphs.

Knowledge graph embedding models try to represent entities and relations in a low-dimensional continuous vector space. Benefiting from these embedding models, we can perform complicated computations on KG facts and better tackle the KG completion task. Translation-distance-based models [4-8] regard predicting a relation between two entities as a translation from the subject entity to the object entity, with the relation as the medium, while bilinear models [9-13] propose energy functions that score the validity of a triple rather than measure the distance between entities. Apart from these shallow models, deeper models [14, 15] have recently been proposed to extract information at a deeper level. Though effective, these models do not consider: (1) controlling the information flow explicitly, i.e., keeping relevant information and filtering out useless information, which restricts model performance; and (2) the multi-step reasoning nature of a prediction process. An entity in a knowledge graph contains rich latent information in its representation. As illustrated in Fig. 1, the entity Michael Jordan has much latent information embedded in the knowledge graph, which will be learned into the representation implicitly. However, when a relation is given, not all latent semantics are helpful for predicting the object entity.
Intuitively, it is more reasonable to design a module that can capture useful latent information and filter out useless information. Meanwhile, in a complex graph an entity may entail a large amount of latent information, so one-step prediction is not enough for complicated predictions, yet almost all previous models ignore this nature. A multi-step architecture [16, 17] allows the model to refine information from coarse to fine over multiple steps and has been shown to greatly benefit the feature extraction procedure.

In this paper, we propose a Multi-Step Gated Embedding (MSGE) model for link prediction in KGs. Within every step, the gating mechanism is applied several times to decide, at the dimension level, which features are retained and which are excluded, corresponding to the multi-step reasoning procedure. For some datasets, the gate cell is further repeated iteratively to obtain more fine-grained information. All parameters are shared among the repeated cells, which allows our model to target the right features in multiple steps with high parameter efficiency. We perform link prediction experiments on six publicly available benchmark datasets and achieve better performance than strong baselines on most of them. We further analyse the influence of the gate mechanism and the number of steps to support our motivation.

Link prediction in knowledge graphs aims to predict correct object entities given a pair of subject entity and relation. A knowledge graph contains a huge number of entities and relations, which has inspired previous work to cast the prediction task as a scoring and ranking task. Given a known pair of subject entity and relation (s, r), a model defines a scoring function over triples (s, r, o), where o ranges over all entities in the knowledge graph. The model then ranks all these triples to find the position of the valid one; the goal is to rank all valid triples above the false ones.

Knowledge graph embedding models aim to represent entities and relations in knowledge graphs with low-dimensional vectors (e_s, e_r, e_o). TransE [4] is a typical distance-based model with the constraint e_s + e_r − e_o ≈ 0. Many other models extend TransE by projecting subject and object entities into relation-specific vector spaces, such as TransH [5], TransR [6] and TransD [18]. TorusE [7] and RotatE [8] are also extensions of distance-based models. Instead of measuring distances between entities, bilinear models such as RESCAL [9], DistMult [10] and ComplEx [11] use multiplication operations to score a triple. Tensor decomposition methods such as SimplE [12], CP-N3 [19] and TuckER [13] can also be seen as bilinear models with extra constraints. Apart from the above shallow models, several deeper non-linear models have been proposed to capture more underlying features. For example, R-GCNs [15] apply a specific convolution operator to model locality information according to the topology of the knowledge graph, and ConvE [14] first applied 2-D convolution to knowledge graph embedding, achieving competitive performance.

The main idea of our model is to control the information flow in a multi-step way. To the best of our knowledge, the work most related to ours is TransAt [20], which also mentions the two-step reasoning nature of link prediction. However, in TransAt the first step categorizes entities with K-means, after which a distance-based scoring function measures validity. This architecture is not end-to-end and therefore less flexible; moreover, errors propagate from the K-means clustering.
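To make the scoring-and-ranking formulation above concrete, the sketch below scores every candidate object for a query (s, r, ?) with a DistMult-style bilinear function and ranks the candidates. This is only an illustration of the task setup: the scoring function, dimensions and entity ids are assumptions for the example, not our model.

```python
import torch

# Illustrative DistMult-style scorer: link prediction cast as scoring and ranking.
# All sizes and ids below are made up for the example.
n_entities, n_relations, dim = 1000, 50, 200
E = torch.nn.Embedding(n_entities, dim)   # entity embedding matrix
R = torch.nn.Embedding(n_relations, dim)  # relation embedding matrix

def score_all_objects(s_idx: int, r_idx: int) -> torch.Tensor:
    """Score (s, r, o) for every candidate object o at once."""
    e_s, e_r = E.weight[s_idx], R.weight[r_idx]
    return (e_s * e_r) @ E.weight.t()     # one score per entity

scores = score_all_objects(s_idx=3, r_idx=7)
ranking = torch.argsort(scores, descending=True)        # candidates ordered by score
gold = 42                                               # assumed gold object id
rank_of_gold = (ranking == gold).nonzero().item() + 1   # 1-based rank of the gold object
```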
We denote a knowledge graph as G = {(s, r, o)} ⊆ E × R × E, where E and R are the sets of entities and relations respectively. The number of entities in G is n_e, the number of relations is n_r, and for simplicity we allocate the same dimension d to entities and relations. E ∈ ℝ^(n_e × d) is the embedding matrix for entities and R ∈ ℝ^(n_r × d) is the embedding matrix for relations. e_s, e_r and e_o denote the embeddings of the subject entity, relation and object entity respectively.

Fig. 2 (schematic diagram of our model with step length 3): e_s and e_r are the embeddings of the subject entity and relation; e_r^i is the query relation fed into the i-th step to refine the information. ẽ_s is the final output information, which is multiplied with the entity embedding matrix E; finally, the logistic sigmoid function restricts the score to the range (0, 1).

In order to obtain useful information, we need a specific module that extracts the needed information from the subject entity with respect to the given relation, which can be regarded as controlling the information flow guided by the relation. To model this process, we introduce a gate mechanism, which is widely used in data mining and natural language processing models to guide the transmission of information, e.g., in the Long Short-Term Memory (LSTM) [21] and the Gated Recurrent Unit (GRU) [22]. Here we adopt gating at the dimension level to control the information entailed in the embedding. To make the entity interact with the relation, we write the gate cell, with its two gates, as:

z = σ(W_z e_r + U_z e_s),  r = σ(W_r e_r + U_r e_s)  (1)

The two gates z and r are called the update gate and the reset gate respectively and control the information flow. The reset gate is designed to generate a new e_s, in other words new information:

e_s' = tanh(W_h e_r + U_h (r ⊙ e_s))  (2)

The update gate decides how much of the generated information is kept, according to formula (3):

ẽ_s = (1 − z) ⊙ e_s + z ⊙ e_s'  (3)

The Hadamard product ⊙ controls the information at the dimension level, and the values of the two gates are generated by the interaction between the subject entity and the relation. The logistic sigmoid function σ projects the results into the range (0, 1), where 0 means totally excluded and 1 means totally kept; this is the core mechanism for controlling the flow of information. We denote the gate cell as C. To verify the effectiveness of the gate mechanism, we also list the formula of a cell that excludes the gates, used in the ablation study:

ẽ_s = tanh(W_h e_r + U_h e_s)  (4)

Since the gate cell contains several gating operations, the architecture within one gate cell is already a multi-step way of controlling information. In fact, a single gate cell can generate useful information, since the two gating operations already hold great power for information control. However, for a complex dataset, finer and more precise features are needed for prediction. The iterative multi-step architecture allows the model to refine the representations incrementally: during each step, a query is fed into the model to interact with the features from the previous step and extract the relevant information for the next step.
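A minimal sketch of such a gate cell in PyTorch, assuming the GRU-style formulation above with each gate's two weight matrices folded into one linear layer over the concatenated inputs; the class name and the omission of biases and dropout details are illustrative choices, not the exact implementation.

```python
import torch
import torch.nn as nn

class GateCell(nn.Module):
    """GRU-style gate cell: the relation query is the input and the entity
    representation is the state being refined, dimension by dimension."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_z = nn.Linear(2 * dim, dim)  # update gate
        self.W_r = nn.Linear(2 * dim, dim)  # reset gate
        self.W_h = nn.Linear(2 * dim, dim)  # candidate information

    def forward(self, e_s: torch.Tensor, e_r: torch.Tensor) -> torch.Tensor:
        x = torch.cat([e_s, e_r], dim=-1)
        z = torch.sigmoid(self.W_z(x))      # how much new information to keep
        r = torch.sigmoid(self.W_r(x))      # which dimensions of e_s to reuse
        h = torch.tanh(self.W_h(torch.cat([r * e_s, e_r], dim=-1)))  # candidate e_s'
        return (1 - z) * e_s + z * h        # gated mixture at the dimension level
```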
As illustrated in Fig. 2, to generate the sequence used as the input for multi-step training, we first feed the relation embedding into a fully connected layer and reshape the output into a sequence of query relations [e_r^0, e_r^1, ..., e_r^k] = Reshape(FC(e_r)). This projection aims to obtain query relations covering different latent aspects, so that we can use them to extract diverse information across multiple steps. Such diversity of information increases the robustness of the model, which further benefits performance. The query relations are fed sequentially into the gate cell to interact with the subject entity and generate information from coarse to fine. Parameters are shared across all steps, so multi-step training is in fact performed iteratively. Our score function for a given triple can be summarized as

φ(s, r, o) = C^k(e_s, e_r) · e_o,

where C^k means repeating the gate cell for k steps, and during each step only the corresponding query relation e_r^i is fed in to interact with the output information from the previous step; see Fig. 2. After the final information ẽ_s = C^k(e_s, e_r) is extracted, it interacts with the object entity through a dot product to produce the final score.

In previous RNN-like models, a cell is repeated several times to process an input sequence, and the number of repetitions is determined by the length of that sequence. In contrast, we have two inputs e_s and e_r with totally different properties, namely the embeddings of the subject entity and the relation, which should not be treated as a sequence. Consequently, the gate cell in our model is used to capture interactive information between entities and relations iteratively, rather than to extract information from a single input sequence; see Fig. 3 for the differences.

Training. Finally, matrix multiplication is applied between the final output information and the embedding matrix E, known as 1-N scoring [14], to score all triples at once for efficiency and better performance. Following previous work, we also add a reciprocal triple for every instance in the dataset, i.e., for a given (s, r, o) we add the reverse triple (o, r^{-1}, s). We use the binary cross-entropy loss

L = −(1/n_e) Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ],

where p = σ(ẽ_s E^T) is the vector of scores over all candidate objects and y is the binary label vector. We add batch normalization to regularize our model, and dropout is used after layers. For optimization we use Adam for a stable and fast training process. Embedding matrices are initialized with Xavier normalization, and label smoothing [23] is also used to lessen overfitting.
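Putting the pieces together, the following sketch shows the multi-step scoring and one 1-N training step with binary cross-entropy. It reuses the GateCell sketch above, and all hyperparameters, class names and ids are assumptions made for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MSGEScorer(nn.Module):
    """Sketch of the multi-step scorer: project the relation embedding into k
    query relations, apply the shared gate cell k times, then score the refined
    entity vector against all entities (1-N scoring)."""
    def __init__(self, n_entities: int, n_relations: int, dim: int, k_steps: int):
        super().__init__()
        self.E = nn.Embedding(n_entities, dim)
        self.R = nn.Embedding(n_relations, dim)
        self.fc = nn.Linear(dim, dim * k_steps)   # fully connected layer producing the queries
        self.cell = GateCell(dim)                 # parameters shared across all steps
        self.k = k_steps

    def forward(self, s_idx: torch.Tensor, r_idx: torch.Tensor) -> torch.Tensor:
        h = self.E(s_idx)                                       # (batch, dim)
        queries = self.fc(self.R(r_idx)).chunk(self.k, dim=-1)  # k query relations
        for q in queries:                                       # refine from coarse to fine
            h = self.cell(h, q)
        return torch.sigmoid(h @ self.E.weight.t())             # scores for all candidate objects

# One 1-N training step: `targets` marks every valid object for each (s, r) pair.
model = MSGEScorer(n_entities=1000, n_relations=50, dim=200, k_steps=3)
s, r = torch.tensor([3]), torch.tensor([7])
targets = torch.zeros(1, 1000)
targets[0, 42] = 1.0                                            # assumed gold object id
loss = nn.functional.binary_cross_entropy(model(s, r), targets)
loss.backward()
```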
In this section we first introduce the benchmark datasets used in this paper, then report the empirical results to demonstrate the effectiveness of our model. Analyses and an ablation study are further reported to strengthen our motivation.

• UMLS [25]: entities are biomedical concepts from the Unified Medical Language System, such as diseases and antibiotics.
• Kinship [25]: contains kinship relationships among members of the Alyawarra tribe from Central Australia.

The details of these datasets are reported in Table 1. The evaluation metrics used in this paper are Mean Reciprocal Rank (MRR) and Hits@K. MRR is the mean reciprocal rank of the correct triple; the higher, the better. Hits@K is the proportion of gold triples ranked in the top K; a higher value indicates a better model. We select K from {1, 3, 10}, consistent with previous work. All results are reported in the 'Filter' setting, which removes all gold triples that appear in the train, valid or test data during ranking.

We report the test results according to the best MRR on the validation data, following previous work.

Table 3: Link prediction results on UMLS and Kinship.

The best setting of the number of iterations varies considerably across datasets. For FB15k and UMLS a single step gives the best performance, whereas for the other datasets the iterative mechanism helps boost performance: the best number of iterations is 5 for WN18, 3 for WN18RR, 8 for FB15k-237 and 2 for Kinship.

We perform the link prediction task on six benchmark datasets, comparing with several classical baselines such as TransE [4] and DistMult [10], as well as strong state-of-the-art baselines such as ConvE [14], RotatE [8] and TuckER [13]. For the smaller datasets UMLS and Kinship, we also compare with non-embedding methods such as NTP [26] and NeuralLP [27], which learn logic rules for prediction, and MINERVA [28], which uses reinforcement learning to reason over paths in knowledge graphs. The results are reported in Table 2 and Table 3. Overall, we conclude that our model achieves comparable or better performance than state-of-the-art models. Even on datasets without inverse relations, such as the more difficult WN18RR and FB15k-237, our model still achieves comparable performance.

To study the effectiveness of the iterative multi-step architecture, we list the performance for different numbers of steps on FB15k-237 in Table 4. The model settings are exactly the same except for the number of steps. The results on FB15k-237 show that the multi-step mechanism indeed boosts performance on a complex knowledge graph, which verifies our motivation that refining information over several steps can obtain more helpful information for complex datasets.

We report the convergence process of TuckER and MSGE on the FB15k-237 and WN18RR datasets in Fig. 4, re-running TuckER with exactly the same settings.

In Table 5, we report the parameter counts of ConvE, TuckER and our model. Our model achieves better performance on most datasets with far fewer parameters, which means it can be migrated to large knowledge graphs more easily. For TuckER, the current state-of-the-art method, the parameter count is dominated by the core interaction tensor W of size d_e × d_r × d_e; as the embedding dimension grows, this core tensor causes a large increase in parameter size. In contrast, our model is an iterative architecture, so only very few parameters are needed apart from the embeddings, and its complexity is O(n_e d + n_r d).

To evaluate time efficiency, we re-run TuckER and our model on a Tesla K40c. TuckER needs 29 s/28 s per epoch on FB15k-237/WN18RR respectively, while MSGE needs 17 s/24 s, demonstrating the time efficiency that comes from the small number of operations in our model.
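For completeness, all MRR and Hits@K numbers above are computed in the filtered setting; a minimal sketch of that computation is given below, with purely illustrative ids and ranks.

```python
import torch

def filtered_rank(scores: torch.Tensor, gold_idx: int, known_objects: set) -> int:
    """1-based rank of the gold object after removing the other known-valid
    objects from the candidate list (the 'Filter' setting)."""
    scores = scores.clone()
    others = [o for o in known_objects if o != gold_idx]
    if others:
        scores[torch.tensor(others)] = float('-inf')   # push other gold triples to the bottom
    return int((scores > scores[gold_idx]).sum()) + 1

# Aggregate MRR and Hits@10 over a set of per-query ranks (values are illustrative).
ranks = [1, 3, 12, 2]
mrr = sum(1.0 / r for r in ranks) / len(ranks)
hits_at_10 = sum(r <= 10 for r in ranks) / len(ranks)
```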
To further support our motivation that the gate mechanism and multi-step reasoning are beneficial for extracting information, we run an ablation study with the following settings:

• No gate: remove the gates in our model, to verify the necessity of controlling the information flow.
• Concat: concatenate the information extracted in every step and feed it into a fully connected layer to obtain another kind of final information; this verifies that the multi-step procedure produces more useful information.
• Replicate: replicate the relation to obtain k identical query relations for training; this verifies that extracting diverse information from multi-view query relations is more helpful than using the same relation k times.

The experimental results are reported in Table 6. All results support our motivation that controlling the information flow in a multi-step way is beneficial for the link prediction task in knowledge graphs; in particular, the gated cell is of great benefit for information extraction.

In this paper, we propose MSGE, a multi-step gated model for the link prediction task in knowledge graph completion. We utilize a gate mechanism to control the information flow generated by the interaction between subject entity and relation, and repeat the gated module to refine the information from coarse to fine. The empirical results show that applying the gated module for multiple steps is beneficial for extracting more useful information, which further boosts performance on link prediction; we also provide analyses from different views to support this conclusion. Note that all information contained in the embeddings is learned implicitly during training. In future work, we would like to aggregate more information for entities to enhance feature extraction, for example from neighboring nodes and relations.

References
[1] Freebase: a collaboratively created graph database for structuring human knowledge
[2] YAGO3: a knowledge base from multilingual Wikipedias
[3] DBpedia: a nucleus for a web of open data
[4] Translating embeddings for modeling multi-relational data
[5] Knowledge graph embedding by translating on hyperplanes
[6] Learning entity and relation embeddings for knowledge graph completion
[7] Knowledge graph embedding on a Lie group
[8] RotatE: knowledge graph embedding by relational rotation in complex space
[9] A three-way model for collective learning on multi-relational data
[10] Embedding entities and relations for learning and inference in knowledge bases
[11] Complex embeddings for simple link prediction
[12] SimplE embedding for link prediction in knowledge graphs
[13] Tensor factorization for knowledge graph completion
[14] Convolutional 2D knowledge graph embeddings
[15] Modeling relational data with graph convolutional networks
[16] ReasoNet: learning to stop reading in machine comprehension
[17] Gated-attention readers for text comprehension
[18] Knowledge graph embedding via dynamic mapping matrix
[19] Canonical tensor decomposition for knowledge base completion
[20] Translating embeddings for knowledge graph completion with relation attention mechanism
[21] Long short-term memory
[22] Learning phrase representations using RNN encoder-decoder for statistical machine translation
[23] Rethinking the Inception architecture for computer vision
[24] Observed versus latent features for knowledge base and text inference
[25] Statistical predicate invention
[26] End-to-end differentiable proving
[27] Differentiable learning of logical rules for knowledge base reasoning
[28] Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning