key: cord-178783-894gkrsk
authors: Zhang, Rui; Hristovski, Dimitar; Schutte, Dalton; Kastrin, Andrej; Fiszman, Marcelo; Kilicoglu, Halil
title: Drug Repurposing for COVID-19 via Knowledge Graph Completion
date: 2020-10-19
journal: nan
DOI: nan
sha: 
doc_id: 178783
cord_uid: 894gkrsk

Objective: To discover candidate drugs to repurpose for COVID-19 using literature-derived knowledge and knowledge graph completion methods. Methods: We propose a novel, integrative, and neural network-based literature-based discovery (LBD) approach to identify drug candidates from both PubMed and COVID-19-focused research literature. Our approach relies on semantic triples extracted using SemRep (via SemMedDB). We identified an informative subset of semantic triples using filtering rules and an accuracy classifier developed on a BERT variant, and used this subset to construct a knowledge graph. Five SOTA, neural knowledge graph completion algorithms were used to predict drug repurposing candidates. The models were trained and assessed using a time slicing approach and the predicted drugs were compared with a list of drugs reported in the literature and evaluated in clinical trials. These models were complemented by a discovery pattern-based approach. Results: Accuracy classifier based on PubMedBERT achieved the best performance (F1= 0.854) in classifying semantic predications. Among five knowledge graph completion models, TransE outperformed others (MR = 0.923, Hits@1=0.417). Some known drugs linked to COVID-19 in the literature were identified, as well as some candidate drugs that have not yet been studied. Discovery patterns enabled generation of plausible hypotheses regarding the relationships between the candidate drugs and COVID-19. Among them, five highly ranked and novel drugs (paclitaxel, SB 203580, alpha 2-antiplasmin, pyrrolidine dithiocarbamate, and butylated hydroxytoluene) with their mechanistic explanations were further discussed. Conclusion: We show that an LBD approach can be feasible for discovering drug candidates for COVID-19, and for generating mechanistic explanations. Our approach can be generalized to other diseases as well as to other clinical questions.

anteed. On the other hand, de novo development and approval of an effective antiviral therapy can take more than a decade. In the absence of an effective vaccine or other therapies, there have been significant efforts in repurposing drugs approved for other diseases for COVID-19 treatment, some of which have been tested in clinical trials (e.g., dexamethasone [9] , hydroxychloroquine and lopinavir/ritonavir [10] ).

Computational approaches to drug repurposing have also garnered much attention to accelerate discovery of therapies for COVID-19 [11, 12] . Common computational drug repurposing methods include drug signature matching, molecular docking, genome-wide association studies, and network analysis [13] . These data-driven approaches involve systematic analysis of various types of biological and clinical data (e.g., gene expression, chemical structure, genome and protein sequences, and electronic health records) to generate hypotheses regarding repurposed use of approved or investigational drugs [13] . The potential of recent advances in artificial intelligence (AI) and machine learning for COVID-19 drug repurposing has also been highlighted [14] and several studies using these techniques have reported promising results [15] [16] [17] [18] . In particular, approaches leveraging network medicine [19] principles and biological knowledge graphs have been emphasized [14] .

Most of these computational approaches have focused on biological data, such as gene expression, protein-protein and drug-target interactions, and used SARS-CoV-2-related data. However, COVID-19-specific data is meaningful in the context of the larger body of diverse knowledge underpinning medicine and life sciences, a primary source of which is the biomedical literature. While some COVID-19 drug repurposing studies incorporated some literature-based knowledge [15, 18] , their focus has remained largely COVID-19-specific. We argue that efficiently and safely repurposing drugs to treat COVID-19 requires more effective integration of literature-based knowledge with biological data collected via high-throughput methods.

In this paper, we propose a novel literature-based discovery [20, 21] approach for COVID-19 drug repurposing. Similar to related work [18], we cast drug repurposing as a task of knowledge graph completion (or link prediction). We use a large, literature-derived biomedical knowledge graph constructed from SemMedDB [22] as well as COVID-19 research literature [23] as our data source. We use several state-of-the-art, neural network-based algorithms [24] [25] [26] for the task, and complement them with an approach based on discovery patterns [27]. Furthermore, we highlight the role of discovery patterns in the search for mechanistic explanations for the proposed drugs. Unlike most approaches that focus on COVID-19-specific knowledge [15, 18], we consider a larger body of biomedical knowledge, as captured in the PubMed bibliographic database as well as in the COVID-19 research literature. Our results show that our approach can identify known drugs that have been used for COVID-19 and discover other novel drugs that can potentially be repurposed for COVID-19.

Significant computational work has already been done to prioritize FDA-approved drugs for repurposing to treat COVID-19 [11, 12]. For the most part, these studies can be categorized as molecular docking-based drug screening studies and network-based studies, with the majority belonging to the former category. In molecular docking studies, small molecules in compound libraries are screened for effectiveness against the host proteins in the SARS-CoV-2 host interactome. Many studies of this kind have been reported; some of the proposed drugs, such as ritonavir, ribavirin, remdesivir, and oseltamivir, have been used in practice, and many are being evaluated in clinical trials [28] [29] [30] [31] [32] [33] [34] [35].

While not as common as docking studies, network-based approaches to drug repurposing have also been explored. In one early study, a virus-related knowledge graph consisting of drug-target and protein-protein interactions and similarity networks from publicly available databases (e.g., DrugBank [36], ChEMBL [37], BioGRID [38]) was constructed, and network-based machine learning and statistical analysis were used to predict an initial list of COVID-19 drug candidates. This list was narrowed down based on text mining from the literature and gene expression profiles from COVID-19 patients, and a poly-ADP-ribose polymerase 1 (PARP1) inhibitor, CVL218, was proposed for therapeutic use against COVID-19 [15]. Cava et al. [39] used gene expression profiles from public datasets to construct a protein-protein interaction network in conjunction with pathway enrichment analysis to identify 36 potential drugs, including nimesulide, thiabendazole, and fluticasone propionate. In another study, network proximity analyses of drug targets and HCoV-host interactions in the human interactome were used to prioritize 16 potential repurposed drugs, including melatonin, mercaptopurine, and sirolimus, which were validated by enrichment analyses of drug-gene signatures and transcriptome data in human cell lines. Potentially useful drug combinations (e.g., melatonin plus mercaptopurine) were also suggested [16]. A follow-up study combined network medicine approaches based on the human interactome with clinical patient data from a COVID-19 registry to show that melatonin was associated with reduced likelihood of a positive SARS-CoV-2 laboratory test [17]. The approach was further extended to explore deep learning [18]. A comprehensive knowledge graph of drugs, diseases, and proteins/genes (named CoV-KGE) was constructed by combining molecular interaction information from the literature with knowledge from DrugBank. A knowledge graph embedding model named RotatE [25] was used to represent the entities and the relationships in the knowledge base in a low-dimensional vector space. Using the ongoing COVID-19 trial data as a validation set, 41 high-confidence repurposed drug candidates (including dexamethasone, indomethacin, niclosamide, and toremifene) were identified and further validated via an enrichment analysis of gene expression and proteomics data in SARS-CoV-2-infected human cells. Another study used node2vec graph embeddings and variational graph autoencoders for the same purpose [40]. A further study [41] evaluated three algorithms (graph neural network, network proximity, and network diffusion) on a network of drug protein targets and disease-associated proteins for COVID-19 drug repurposing. While the correlations across the three algorithms were low, an ensemble approach that combined the predictions of all algorithms was shown to outperform the individual methods, ranking ritonavir, chloroquine, and dexamethasone among the most promising candidates. Some limited literature knowledge relevant to COVID-19 has been incorporated into network-based approaches; however, their focus remains largely on structured molecular interaction information encoded in databases.

Literature-based discovery (LBD) [20, 21] is a method of automatic hypothesis generation pioneered by Swanson [42]. Based on the concept of "undiscovered public knowledge", LBD seeks to uncover valuable hidden connections between disparate research literatures, and has been proposed as a potential solution for the problem of "research silos" (the view that scientific research areas are largely isolated from one another). The primary LBD paradigm is the so-called ABC model. In the open discovery form of this model, a relationship between two concepts A and B is known in one research area and another relationship between concepts B and C is known in another, and a potential relationship between concepts A and C is proposed. Conversely, in closed discovery, relationship AC is known, and a concept B is proposed as an explanation for the relationship AC. Extensions to the ABC model have also been proposed, such as the discovery browsing model, which aims to elucidate more complex relationship paths between biomedical concepts [43, 44]. Most applications of LBD have been in the biomedical domain, beginning with Swanson's discovery of fish oil as a treatment for Raynaud disease [42], a hypothesis subsequently supported by clinical studies. While early LBD systems focused primarily on term co-occurrence [45, 46], semantic relations have been widely used in later years for representing the scientific content of biomedical publications [27, 47, 48, 49]. More recently, distributed vector representations based on term or semantic relation co-occurrence have been gaining popularity [50] [51] [52].
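The open discovery form of the ABC model can be illustrated with a minimal Python sketch. It is not the system used in this work: the function name `open_discovery` and the toy triples (loosely modeled on the fish oil example) are assumptions made for illustration only.

```python
from collections import defaultdict


def open_discovery(triples, start):
    """Propose A-C links: A-B and B-C are known, but A-C is not.

    Returns a dict mapping each candidate C to the set of bridging B concepts.
    """
    neighbors = defaultdict(set)
    for subj, _pred, obj in triples:
        neighbors[subj].add(obj)

    known = set(neighbors[start])          # concepts B directly linked to A
    hypotheses = defaultdict(set)
    for b in known:
        for c in neighbors.get(b, set()):  # concepts C linked to B
            if c != start and c not in known:
                hypotheses[c].add(b)
    return dict(hypotheses)


# Toy triples for illustration only (not taken from SemMedDB).
toy = [
    ("fish oil", "inhibits", "platelet aggregation"),
    ("fish oil", "affects", "blood viscosity"),
    ("platelet aggregation", "associated_with", "Raynaud disease"),
    ("blood viscosity", "associated_with", "Raynaud disease"),
]
links = open_discovery(toy, "fish oil")
```

Closed discovery would use the same structure in reverse: given a known A-C pair, it would list the B concepts that connect them.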

Drug repurposing has been one of the prominent applications of LBD [27, 53, 54, 55, 56, 57, 58]. For example, Hristovski et al. [27] used semantic discovery patterns following the ABC model to identify potential therapeutic uses for drugs. Zhang et al. [56] used discovery patterns and SemMedDB relations to identify potential prostate cancer drugs. Cohen et al. [55] used a vector representation approach based on semantic relations to predict a small number of active agents within a large library screened for activity against prostate cancer cells.

Knowledge graphs are represented as a collection of head entity-relation-tail entity triples (h, r, t), where entities correspond to nodes and relations to edges between them. Knowledge graph completion (or link prediction) is the task of predicting unseen relations between two existing entities, or of predicting the tail entity given the head entity and the relation (or the head entity given the tail entity and the relation). Recent approaches to knowledge graph completion rely on knowledge graph embedding methods [59], which learn a mapping from nodes and edges to a continuous vector space that preserves the proximity structure of the knowledge graph and is amenable to the application of machine learning methods.

Such methods include translational models, which use distance-based scoring functions (e.g., TransE [24], TransH [60], RotatE [25]), and semantic matching models, which use similarity-based scoring functions (e.g., RESCAL [61], DistMult [62], ComplEx [63], and Holographic Embeddings (HolE) [64]). Graph convolutional networks [65, 66] as well as methods that use a context-based encoding approach (KG-BERT [67], STELP [26]) have also been recently proposed.

Knowledge graph embedding techniques based on a network of drug, disease, and gene/protein entities have been used to support drug repurposing for rare diseases [68]. Graph convolutional networks were used to model drug side effects resulting from drug-drug interactions [69]. A multimodal graph of protein-protein interactions, drug-protein target interactions, and drug-drug interactions was constructed from publicly available datasets. Sang et al. (2018) [70] constructed low-dimensional knowledge graph embeddings from SemMedDB relations and trained a Long Short-Term Memory (LSTM) model using known drug therapies from the Therapeutic Target Database [71], proposing potential drugs using the trained model.

In this section, we first describe our data sources and the preprocessing steps that were taken to construct a literature knowledge graph from these data sources. Next, we discuss the knowledge graph completion methods that we used to predict candidate drugs for COVID-19 as well as the discovery patterns used for providing mechanistic explanations. Lastly, we detail various evaluation schemes that we used to automatically validate our predictions. A workflow diagram illustrating our approach is provided in Fig. 1 . 

We constructed our biomedical knowledge graph primarily from SemMedDB [22], a repository of semantic relations automatically extracted from biomedical literature using the SemRep natural language processing (NLP) tool [72, 73].

SemRep-extracted relations are in the form of subject-predicate-object triples (also called semantic predications) and are derived from unstructured text in Concepts are enriched with semantic type information (Disease or Syndrome, Pharmacologic Substance, etc.) and the relations are linked to the supporting article and sentence. SemMedDB has supported a wide range of computational applications, ranging from gene regulatory network inference [76] to in silico screening for drug repurposing [55] and medical reasoning [77] , and has also found widespread use for literature-based knowledge discovery and hypothesis generation [44, 48, [78] [79] [80] . In its most recent release (version 43, dated 8/28/2020) 1 , SemMedDB contains more than 107M relations from more than 

In this work, we focused on a subset of semantic relations derived from the combination of PubMed and CORD-19 datasets, predicted to be accurate and informative for drug repurposing.

First, we eliminated relations involving generic biomedical concepts (i.e., relations in which both subject and object were present in a GENERIC CONCEPT table of SemMedDB such as Pharmaceutical Preparations) and relations with identical subject and object arguments. Next, we excluded a subset of predicate types that were not expected to be useful for drug repurposing, such as part of and process of. The predicate types used are affects, associated with, augments, causes, coexists with, complicates, disrupts, inhibits, interacts with, manifestation of, predisposes, prevents, produces, stimulates, and treats. Lastly, we also excluded the relations in which the sub-

In the second step, we eliminated uninformative semantic relations using the log-likelihood ratio (G²) and network degree centrality for the concepts (in-degree and out-degree). We assigned each semantic relation a G² score indicating how strongly the terms within a triple are associated with each other [82]. A high G² score means that the observed and expected frequencies are significantly different, indicating that the triple is less likely to occur by chance. For computational purposes, we created two three-dimensional contingency tables with indices i, j, and k. The first table (OT) holds the observed frequencies of a triple from the knowledge graph and the second table (ET) contains the expected values assuming independence of the terms in each triple. G² was then calculated using the equation

$$G^2 = 2 \sum_{i,j,k} n_{ijk} \log \frac{n_{ijk}}{m_{ijk}},$$

where $n_{ijk}$ is the cell i, j, k in OT, $m_{ijk}$ is the cell i, j, k in ET, and $T = \sum_{i,j,k} n_{ijk}$.
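As a rough illustration of the log-likelihood ratio on a three-dimensional contingency table, the sketch below computes the statistic from observed counts. It assumes that the expected counts follow mutual independence of the three dimensions (the product of the three one-way marginals divided by the squared total); the function name `g2` and the example tables are hypothetical.

```python
import math
from itertools import product


def g2(observed):
    """Log-likelihood ratio for a 3-D table of counts given as nested lists.

    Expected counts assume mutual independence of the three dimensions
    (an assumption of this sketch).
    """
    I, J, K = len(observed), len(observed[0]), len(observed[0][0])
    cells = list(product(range(I), range(J), range(K)))
    total = sum(observed[i][j][k] for i, j, k in cells)

    # One-way marginal totals for each dimension.
    m_i = [sum(observed[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    m_j = [sum(observed[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    m_k = [sum(observed[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]

    score = 0.0
    for i, j, k in cells:
        n = observed[i][j][k]
        if n > 0:  # empty cells contribute nothing to the sum
            expected = m_i[i] * m_j[j] * m_k[k] / total ** 2
            score += n * math.log(n / expected)
    return 2.0 * score


# Counts spread evenly: observed equals expected, so the statistic is zero.
independent = [[[5, 5], [5, 5]], [[5, 5], [5, 5]]]
# Counts concentrated in two cells: a strongly associated triple.
associated = [[[10, 0], [0, 0]], [[0, 0], [0, 10]]]
```

With these toy tables, `g2(independent)` is zero and `g2(associated)` is large, which matches the reading that a high score marks a triple unlikely to occur by chance.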

Next, we normalized all three measures (G², in-degree, and out-degree) to the range [0, 1] and summed them into a final score. The lower the score, the more specific and informative the relation is. For example, the relation Operative Surgical Procedures-treats-Woman, with a high score, is more general than the relation interleukin-6-affects-Autoimmune Diseases. We kept all relations for which the score was less than a threshold value α. We manually tuned α to achieve a balance between the specificity of relations and their variability, and kept the 30% of all relations with the lowest scores in the data set. We also kept all biomedical concepts that refer to COVID-19 terms (CUIs: C5203670, C5203671, C5203672, C5203673, C5203674, C5203675, C5203676). At the end of the preprocessing stage, the knowledge graph consisted of 131 355 nodes and 2 558 935 relations.
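The normalize-sum-threshold step can be sketched as follows. This is an illustrative reading rather than the authors' script: it assumes each relation already carries a G² value and one in-degree and one out-degree value, and the numbers, the placeholder concept names, and the function names `min_max` and `keep_most_specific` are invented for the example.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def keep_most_specific(relations, keep_fraction=0.30):
    """Keep the fraction of relations with the lowest summed normalized score."""
    columns = [
        min_max([r[key] for r in relations])
        for key in ("g2", "in_degree", "out_degree")
    ]
    scored = [(sum(parts), r["triple"]) for parts, r in zip(zip(*columns), relations)]
    scored.sort(key=lambda pair: pair[0])      # lowest score first
    n_keep = int(len(scored) * keep_fraction)
    return [triple for _score, triple in scored[:n_keep]]


# Made-up values for illustration only.
relations = [
    {"triple": ("Operative Surgical Procedures", "treats", "Woman"),
     "g2": 900.0, "in_degree": 5000, "out_degree": 7000},
    {"triple": ("interleukin-6", "affects", "Autoimmune Diseases"),
     "g2": 40.0, "in_degree": 120, "out_degree": 300},
    {"triple": ("drug X", "inhibits", "protein Y"),
     "g2": 10.0, "in_degree": 10, "out_degree": 20},
    {"triple": ("Disease", "affects", "Patients"),
     "g2": 500.0, "in_degree": 9000, "out_degree": 4000},
]
kept = keep_most_specific(relations, keep_fraction=0.5)
```

A fixed fraction is used here instead of an explicit α; the two are equivalent once α is set to the score at the chosen quantile.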

The precision of semantic predications generated by SemRep varies by domain (e.g., molecular interactions are less precise than clinical relationships). To improve the precision of the relations used for drug repurposing, we extended the SemRep accuracy classifiers previously proposed [83, 84]. We fine-tuned a collection of Transformer-based pretrained language models to classify semantic predications as correct vs. incorrect. We used the following models: vanilla BERT (base size, cased and uncased) [85], BioBERT [86], BioClinicalBERT [87], BlueBERT [88], and PubMedBERT [89].

To extend the coverage of our existing classifiers, we used 6492 predications annotated as correct vs. incorrect with respect to their source sentences. We leveraged 6000 annotations from a previous study [84] (Cohen's κ of 0.80) and annotated 492 new predications. The annotation guidelines generated in the previous study were used. Two of the authors (HK and MF) and two health informatics graduate students annotated predications containing predicates of interest absent in the prior study (Fleiss' κ = 0.410, indicating moderate agreement).

The resulting annotated set was split into 80/10/10 training/validation/test sets. Hyperparameters were determined empirically: the learning rate was set to 1 × 10⁻⁵, the batch size was 16, and the maximum number of epochs was set to 10, with early stopping. Optimization was done using the Adam optimizer [90] with decoupled weight decay regularization, using betas (0.9, 0.999) and decay 0.01. The pooled output from the BERT model was fed through a linear layer to produce logits that then underwent a softmax transformation to return class probabilities. A single Tesla V100 GPU was used to train the models. We compared the performance of the above-mentioned transformers, and the best classifier was then used to filter incorrect semantic predications.
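The final classification step (pooled output, linear layer, softmax) can be sketched in plain Python. The weights, the three-dimensional "pooled" vector, the convention that class index 1 means "correct", and the 0.5 decision threshold are all assumptions made for illustration; a real BERT pooled output has hundreds of dimensions and learned weights.

```python
import math


def linear(pooled, weights, bias):
    """Linear layer: one logit per class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax returning class probabilities."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_predication(pooled, weights, bias, threshold=0.5):
    """Return (keep, probabilities); class 1 is taken to mean 'correct'."""
    probs = softmax(linear(pooled, weights, bias))
    return probs[1] >= threshold, probs


# Hypothetical toy values standing in for a pooled encoder output.
pooled = [1.0, -2.0, 0.5]
weights = [[0.1, 0.2, 0.0],    # class 0: incorrect
           [0.3, -0.4, 0.2]]   # class 1: correct
bias = [0.0, 0.0]
keep, probs = classify_predication(pooled, weights, bias)
```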

Consider a knowledge graph G = (E, R, T), where E refers to a set of entities, R denotes a set of possible relations, and T stands for a set of triples in the form (h)ead-(r)elation-(t)ail, formally denoted as {(h, r, t)} ⊂ E × R × E. The aim of knowledge graph completion is to infer new triples (h′, r′, t′) such that h′, t′ ∈ E and r′ ∈ R. In this setting, the knowledge graph completion problem can be represented as a ranking task in which we learn a prediction function ψ(h, r, t) : E × R × E → ℝ, which generates higher scores for true triples and lower scores for false triples.

We explored three classes of knowledge graph completion methods: TransE [24] and RotatE [25] for translational models, DistMult [62] and ComplEx [63] for semantic matching models, and STELP [26] for context-based encoding. These methods differ in the way that they encode entities and relations in a knowledge graph into a low-dimensional vector space (i.e., KG embedding). Such distributed vector representations can be used for downstream reasoning and machine learning tasks.

TransE [24] describes a triplet (h, r, t) as a translation between head entity h and tail entity t through relation r in a continuous vector space, i.e., h + r ≈ t, where h, r, t ∈ ℝ^d are the embeddings of h, r, and t, respectively. To measure the plausibility of relations, TransE employs a distance-based score function

$$s(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_{\ell_1/\ell_2}.$$

We chose TransE because of its simplicity and good prediction performance. However, TransE is able to model only one-to-one relations and fails to embed one-to-many, many-to-one, and many-to-many relations. To solve this problem, many other solutions have been proposed, including RotatE [25]. RotatE treats each relation in a complex vector space as a rotation from the head entity to the tail entity, i.e., $s(h, r, t) = \lVert \mathbf{h} \circ \mathbf{r} - \mathbf{t} \rVert_{\ell_1}$, where $\circ$ is the Hadamard (element-wise) product. We selected RotatE as a counterpart to TransE, as TransE reportedly does not perform well on some data sets (e.g., the FB15k family of data sets) that require symmetric pattern modeling.
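The two translational scoring ideas can be contrasted with a small, dependency-free sketch. It is not the DGL-KE implementation: the function names are hypothetical, the embeddings are two-dimensional toys, and the scores are returned as distances, so a smaller value means a more plausible triple.

```python
import cmath
import math


def transe_distance(h, r, t, norm=1):
    """Distance between h + r and t for real-valued embeddings (L1 or L2 norm)."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(diffs)
    return math.sqrt(sum(d * d for d in diffs))


def rotate_distance(h, phases, t):
    """L1 distance between the rotated head (h * e^{i*phase}) and t.

    Entities are complex vectors; the relation is given as phase angles,
    so every relation element has modulus 1 (a pure rotation).
    """
    return sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))


# Toy embeddings for illustration only.
h, r = [1.0, 2.0], [0.5, -1.0]
good_tail, bad_tail = [1.5, 1.0], [0.0, 0.0]

h_c = [1 + 0j, 0 + 1j]
phases = [math.pi / 2, math.pi]       # rotate by 90 and 180 degrees
rotated_tail = [0 + 1j, 0 - 1j]
```

A rotation by π is its own inverse, so under a relation with all phases equal to π, a pair that scores well in one direction also scores well in the other. This is the kind of symmetric pattern that a pure translation cannot represent unless the relation vector is zero.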

DistMult [62] is the simplest approach among semantic matching models. 

Semantic Triple Encoder for Link Prediction (STELP) [26] , is a contextbased encoding approach to knowledge graph completion. At its core is a Siamese BERT model that leverages sharing one set of weights across two models to produce encoded, contextual representations of the predications that are then 

where D is the set of correct triples, N (tp) is the set of corrupted triples for 

where γ is a scaling factor for the contribution of the contrastive loss.

At inference, STELP considers every entity-context combination for a given partial predication, (h, r) to find (t) or (r, t) to find (h), and ranks every pair using the sum of the positive class probability and the scaled negative Euclidean distance.

We replaced the vanilla base BERT model proposed in the STELP paper with BioBERT, trained on biomedical literature corpora. The 1 016 124 unique predications remaining after preprocessing were each corrupted to produce five negative predications, for a total of 5 080 620 negative predications and a grand total of 6 096 744 predications. The hyperparameters were set to the same values as in the original STELP paper; the learning rate was set to 1e-5, the batch size was 16, and the contrastive loss scaling factor was 1.0. Optimization was done using Adam with decoupled weight decay, with betas (0.9, 0.999) and decay 0.01. Training was run for 29 000 training iterations. Ranking was done by adding the scaled contrast score to the positive class probability, and entities were ordered by descending score.
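The inference-time ranking rule described above (positive class probability plus a scaled negative Euclidean distance) can be sketched as follows. This is only one possible reading: which two encoded representations the distance is taken between, and whether the same scaling factor γ is used at training and at inference, are assumptions here, and all names and numbers are hypothetical.

```python
import math


def stelp_style_score(p_positive, emb_a, emb_b, gamma=1.0):
    """Positive class probability plus gamma times the negative Euclidean distance."""
    return p_positive - gamma * math.dist(emb_a, emb_b)


def rank_candidates(candidates, gamma=1.0):
    """candidates: list of (name, p_positive, emb_a, emb_b); best candidate first."""
    scored = [(stelp_style_score(p, a, b, gamma), name) for name, p, a, b in candidates]
    return [name for _score, name in sorted(scored, reverse=True)]


# Toy candidates for a partial predication such as (?, treats, COVID-19).
candidates = [
    ("drug A", 0.90, [0.0, 0.0], [0.1, 0.0]),   # probable and close
    ("drug B", 0.95, [0.0, 0.0], [1.0, 0.0]),   # probable but far away
    ("drug C", 0.20, [0.0, 0.0], [0.0, 0.0]),   # improbable but identical encoding
]
ranking = rank_candidates(candidates)
```

In this toy case drug B has the highest class probability but is ranked last, which shows how the distance term can dominate when γ is 1.0.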

All preprocessing was done using custom Bash and Python scripts. TransE, RotatE, DistMult, and ComplEx link prediction models were implemented in PyTorch using the DGL-KE package [91] for learning large-scale KG embeddings. The BERT models were based on HuggingFace BERT implementations using PyTorch. Pre-trained weights for BioBERT (BioBERT-Base v1.1 (+ PubMed 1M)) 2 , BioClinicalBERT 3 , PubMedBERT 4 and BlueBERT (BlueBERT-Base, Uncased, PubMed+MIMIC-III) 5 came from various sources associated with each paper. STELP was also implemented using a combination of a HuggingFace BERT model and PyTorch. Our source code is also publicly available 6 .

Discovery patterns are defined as a set of constraints that need to be satisfied for the discovery of new relations between concepts [27] . Herein, we used discovery patterns for two purposes. First, we explored open discovery patterns to identify drugs that can be repurposed for COVID-19. Second, we used closed discovery patterns to propose plausible mechanisms for drugs identified via knowledge graph completion methods described above. Discovery patterns are expressed in terms of predication pairs (or predication chains). In particular, we focused on the following discovery pattern:

DrugA -inhibits|interacts with -ConceptB and 10/15/2020). We focus on the latter category in our qualitative evaluation below.

We semi-automatically generated a ground truth drug list, similar to the approach in other computational drug repurposing studies for COVID-19 [18] .

We downloaded the interventions used in COVID-19 drug trials from clinicaltrials.gov using the following search: https://clinicaltrials.gov/ct2/res

This yielded a set of 1167 clinical trials. We extracted the interventions from these studies and mapped the intervention terms to UMLS CUIs using MetaMap (v2016) [92] and filtered the resulting concepts by their semantic groups [93] , keeping only those concepts with the semantic group Chemicals & Drugs. We also considered the additional semantic types Therapeutic Procedure and Gene or Genome, which also appeared for some concepts in intervention lists. We removed the duplicates and some general concepts (e.g., Therapeutic procedure, Placebo) as well as incorrect mappings, which resulted in a final list of 285 concepts. The automatic evaluation (below) was performed against this set.

Time slicing is an evaluation technique often used in literature-based discovery and link prediction tasks [20] . The idea is to train models on data prior to a specific date and test them on data after that date and evaluate whether links that formed only after the cutoff date can be predicted from the trained model.

In this study, we trained our models on semantic relations extracted from publications dated 03/11/2020 or earlier and tested whether they can predict the drugs that have been proposed for COVID-19 since then or have been evaluated in clinical trials. This date was selected as cutoff, as it is the date on which WHO declared COVID-19 a pandemic. It is also a date by which enough biological knowledge about SARS-CoV-2 had accumulated, although COVID-19 therapies were still in their infancy, making it a suitable cutoff for time slicing experiments.
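A time-slicing split of this kind can be sketched as below. The sketch assumes each relation is paired with the publication date of its supporting article and that a relation belongs to the training set if it was first reported on or before the cutoff; the function name, the placeholder concepts, and the dates are invented for the example.

```python
from datetime import date

CUTOFF = date(2020, 3, 11)  # the cutoff date used in this study


def time_slice(relations, cutoff=CUTOFF):
    """Split (triple, publication_date) pairs by the date a triple first appeared."""
    first_seen = {}
    for triple, pub_date in relations:
        if triple not in first_seen or pub_date < first_seen[triple]:
            first_seen[triple] = pub_date
    train = {t for t, d in first_seen.items() if d <= cutoff}
    test = {t for t, d in first_seen.items() if d > cutoff}
    return train, test


# Hypothetical relations and dates for illustration only.
relations = [
    (("drug_a", "inhibits", "protein_p"), date(2019, 5, 2)),
    (("protein_p", "associated_with", "COVID-19"), date(2020, 2, 20)),
    (("drug_a", "treats", "COVID-19"), date(2020, 6, 30)),
    (("drug_a", "inhibits", "protein_p"), date(2020, 7, 1)),  # later repeat
]
train, test = time_slice(relations)
```

Only links that first appear after the cutoff end up in the test set, so a later repetition of an already known relation does not count as a new discovery.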

All five link prediction models were automatically assessed using the link prediction evaluation protocol proposed by Bordes et al. [24]. Suppose that X is a set of triples, Θ_E denotes the embeddings of entities E, and Θ_R denotes the embeddings of relations R. In the first step, corruption, we go through the set of triples and for each triple x = (h, r, t) ∈ X replace its head and tail with all other entities in E. Each triple is corrupted exactly 2|E| − 1 times. Formally, the corrupted triple is defined as:

where h′ ≠ h and t′ ≠ t. We employ the filtered setting protocol, which does not take into account any corrupted triple that already appears in the KG. In the second phase, scoring, original and corrupted triples are tested using the constructed classifier ψ. The intuition is that the model will assign a higher score to the original triple and a lower score to the corrupted triple. In the third phase, evaluation, the link prediction models are assessed using three measures: mean rank (MR), mean reciprocal rank (MRR), and Hits@k.
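The corruption step with the filtered setting can be sketched in a few lines. The function name and the three-entity toy graph are assumptions for illustration; a real run would iterate over the full entity set.

```python
def corrupt_filtered(triple, entities, known_triples):
    """Replace the head, then the tail, with every other entity.

    Corrupted triples that already appear in the graph are dropped
    (the filtered setting).
    """
    h, r, t = triple
    head_corruptions = [(e, r, t) for e in entities if e != h]
    tail_corruptions = [(h, r, e) for e in entities if e != t]
    return [c for c in head_corruptions + tail_corruptions if c not in known_triples]


# Toy graph with hypothetical identifiers.
entities = ["drug_a", "drug_b", "disease_x"]
known = {("drug_a", "treats", "disease_x"), ("drug_b", "treats", "disease_x")}
negatives = corrupt_filtered(("drug_a", "treats", "disease_x"), entities, known)
```

Here the corruption (drug_b, treats, disease_x) is removed because it is a true triple, so it cannot unfairly push the test triple down the ranking.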

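MR is the average rank assigned to the true predication, over all predications in a test set:

$$\mathrm{MR} = \frac{1}{2|X|} \sum_{i=1}^{|X|} \left( \mathrm{rank}_{h_i} + \mathrm{rank}_{t_i} \right),$$

where $\mathrm{rank}_{h_i}$ and $\mathrm{rank}_{t_i}$ denote the rank position: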

where the indicator function I[P ] is 1 iff P is true, and 0 otherwise.

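MRR is the average inverse rank over all test triples and is formally computed as:

$$\mathrm{MRR} = \frac{1}{2|X|} \sum_{i=1}^{|X|} \left( \frac{1}{\mathrm{rank}_{h_i}} + \frac{1}{\mathrm{rank}_{t_i}} \right).$$

Hits@k measures the percentage of predications in which the true triple appears in the top k ranked triples, where k ∈ {1, 3, 10}; formally:

$$\mathrm{Hits@}k = \frac{1}{2|X|} \sum_{i=1}^{|X|} \left( I[\mathrm{rank}_{h_i} \le k] + I[\mathrm{rank}_{t_i} \le k] \right).$$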

Our aim was to achieve low MR and high MRR and Hits@k.
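Given the head-side and tail-side ranks of each test triple, the three measures reduce to a few lines of Python. The function name and the example ranks are hypothetical, and Hits@k is returned as a fraction rather than a percentage.

```python
def link_prediction_metrics(head_ranks, tail_ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k from ranks (1 = best) of the true triples."""
    ranks = list(head_ranks) + list(tail_ranks)
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical ranks for three test triples.
head_ranks = [1, 2, 10]
tail_ranks = [1, 4, 50]
metrics = link_prediction_metrics(head_ranks, tail_ranks)
```

The single rank of 50 pulls MR up to about 11.3 while leaving MRR and Hits@k almost unchanged, which is why MR is usually read together with the other two measures.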

In addition, we also performed a qualitative evaluation. One of the authors (MF) used Neo4j browser to assess the plausibility of some of the drugs highly ranked by the knowledge completion models, guided by literature search and review, and following the closed and open discovery paradigms.

We report the performance of the semantic relation accuracy classifier as well as the knowledge graph completion methods in this section.

The full table of results for the comparison of various BERT models for the accuracy classifier is included below (Table 1).

The link prediction results for all employed models are presented in Table 3 .

For MR a lower score is considered better; for all other measures a higher score is considered better. The score for each method is the mean value over all triplets in the testing set.

sion, L1 norm, learning rate η = 0.01 and regularization coefficient λ = 2 × 10⁻⁸. Model training was limited to 20 000 epochs. The relatively small number of relations (15) ensures that all entities and relations can be smoothly embedded into the same vector space.

Next, we used the t-SNE (t-distributed stochastic neighbor embedding) [94] algorithm to graphically represent the computed concept embeddings in a two-dimensional space (Figure 2). The t-SNE algorithm reduces high-dimensional data to a low-dimensional space such that similar concepts are represented by nearby points. The plot demonstrates relatively good co-localization of selected concepts, especially for Suspected COVID-19 and paclitaxel.

Our results indicate that more complex knowledge graph completion models might be less efficient in drug repurposing tasks. Theoretical considerations suggest that TransE should be outperformed by its successors [25, 62, 63]. However, differences in performance among DistMult, ComplEx, and RotatE are relatively small. All three models achieved low performance on MRR, Hits@1, inherently model only one-to-one relations and fails to represent one-to-many, many-to-one or even many-to-many relations, it shows its efficiency in embedding a large-scale complex biomedical knowledge graph, such as the extended SemMedDB used here. Empirical evidence shows that DistMult and ComplEx usually perform well for high-degree entities but fail with low-degree entities [95]. Because we eliminated highly frequent concepts due to their lack of informativeness, it is possible that this is reflected in the lower performance scores of both models.

The context-encoding model, STELP, showed rather poor performance in evaluation. One possibility is that the model was only able to learn high-level groupings for the predicates. This is likely the case as it was observed the model versus affects COVID-19 etc., it did not learn more granular features that allow it to differentiate between subjects within the context of treats COVID-19.

However, analysis of the t-SNE embedding and the qualitative evaluation show that the model mostly clustered the ground truth drugs into a couple of large clusters.

To further compare the drug rankings between TransE and STELP, we performed the Wilcoxon signed-rank test (p = 0.851), which indicates that no correlation was found between how the two models were ranking novel predications. Spearman's rank correlation between the novel predication rankings for both models was found to be −0.003 and this further supports the results of the Wilcoxon test. Table 4 and Table 5 show that there is very little agreement between TransE and STELP, particularly in the top 1000 rankings for each model.

It is worth noting that there were 47 items in common in the top 1000 rankings for both models.

ranked triples for the specified model, calculating the absolute difference between the rankings from the two models for each of those triples, and calculating the statistics. For example, the triples that TransE ranked as the top 1000 triples were gathered, the absolute differences of rankings between TransE and STELP for those 1000 triples were calculated, and the statistics were calculated from those differences.
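The rank-comparison procedure described above can be sketched as follows. The function name, the choice of summary statistics, and the four-triple toy rankings are assumptions for illustration; the same code applies with a top-k of 1000.

```python
import statistics


def rank_difference_stats(ranks_a, ranks_b, top_k=1000):
    """Compare two models on the triples that model A ranks in its top k.

    ranks_a and ranks_b map each triple to its rank (1 = best) under each model.
    """
    top = [t for t, r in ranks_a.items() if r <= top_k]
    diffs = [abs(ranks_a[t] - ranks_b[t]) for t in top]
    return {
        "mean": statistics.mean(diffs),
        "median": statistics.median(diffs),
        "min": min(diffs),
        "max": max(diffs),
        # Triples that are in the top k of both models.
        "overlap": sum(1 for t in top if ranks_b[t] <= top_k),
    }


# Hypothetical rankings of four triples by two models.
ranks_transe = {"t1": 1, "t2": 2, "t3": 3, "t4": 4}
ranks_stelp = {"t1": 4, "t2": 2, "t3": 1, "t4": 3}
stats = rank_difference_stats(ranks_transe, ranks_stelp, top_k=2)
```

Swapping the two arguments gives the statistics from the other model's point of view, which is how a separate table for each model would be obtained.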

The can be possible to explore larger graphs than that explored in this work. On the other hand, with adequately large computational resources, it may be possible to optimize STELP hyperparameters and train over multiple random seeds to generate a model that obtains better results than TransE or RotatE, which are limited by their smaller representational capacity.

Discovery patterns based on semantic relations provide an intuitive way of exploring potential mechanistic links between biological phenomena. Neo4j and Cypher, its query language, are powerful tools that complement semantic relations nicely in quickly pinpointing promising research directions, although massive graphs present some challenges for effective query and retrieval. In addition, a domain expert is needed to sort out some of the noise in semantic relations (some of it obvious) due to text mining errors. However, given that predictions made by the knowledge graph completion models above are largely opaque, a human-in-the-loop discovery browsing approach based on discovery patterns [43, 44] remains an effective alternative to these more complex approaches, and also complements them by providing potential explanations.
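A two-step discovery pattern of the form drug-predicate-intermediate-predicate-disease can be sketched without a graph database. The example below is a minimal in-memory illustration in Python, not the Cypher query used in this study. The function name and the predicate sets are assumptions, and the handful of triples are taken from the patterns reported in this section.

```python
from collections import defaultdict

# A few semantic triples (subject, predicate, object) from the reported patterns.
TRIPLES = [
    ("paclitaxel", "inhibits", "interleukin-6"),
    ("interleukin-6", "causes", "COVID-19"),
    ("paclitaxel", "inhibits", "NF-kappa B"),
    ("NF-kappa B", "associated with", "COVID-19"),
    ("alpha 2-antiplasmin", "inhibits", "plasmin"),
    ("plasmin", "predisposes", "COVID-19"),
    ("alpha 2-antiplasmin", "interacts with", "IgY"),
    ("IgY", "associated with", "COVID-19"),
]

def open_discovery(triples, target, drug_predicates, disease_predicates):
    """Return paths (drug, p1, intermediate, p2, target) that match the pattern."""
    # Step 1: intermediates linked to the target by an allowed predicate.
    linked = defaultdict(list)
    for subj, pred, obj in triples:
        if obj == target and pred in disease_predicates:
            linked[subj].append(pred)
    # Step 2: drugs linked to those intermediates by an allowed predicate.
    paths = []
    for subj, pred, obj in triples:
        if pred in drug_predicates and obj in linked:
            for second_pred in linked[obj]:
                paths.append((subj, pred, obj, second_pred, target))
    return paths

paths = open_discovery(
    TRIPLES,
    target="COVID-19",
    drug_predicates={"inhibits"},
    disease_predicates={"causes", "associated with", "predisposes"},
)
for path in paths:
    print("-".join(path))
```

Restricting the first predicate to inhibits excludes the IgY path here; widening the predicate set recovers it, which shows how the choice of predicates shapes the hypotheses a domain expert is asked to review.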

The following classes of drugs have been used for the management of COVID-19 so far: antivirals (e.g., remdesivir), antibodies (e.g., convalescent plasma), anti-inflammatory agents (e.g., dexamethasone), immunomodulators (e.g., interleukin inhibitors), anticoagulants (e.g., heparin), antifibrotics (e.g., tyrosine kinase inhibitors), and adjuvants (e.g., vitamin D) [96, 97]. In addition, several trials have studied antimalarials (e.g., hydroxychloroquine) and antiparasites (e.g., ivermectin), but evidence from trials does not support their use.

The knowledge graph completion models did not predict antivirals, antimalarials, and antiparasites, except for antivirals from the class neuraminidase inhibitors and the antimalarial artemisone. All the other drug classes and most of their members were predicted by the models. Dexamethasone, currently considered the most effective drug for reducing mortality in patients receiving oxygen, was the highest ranking drug from the RotatE model. It is possible that the models missed specific antivirals and antiparasites due to their mechanism of action, which usually involves binding to specific receptors, a relation type on which SemRep does relatively poorly. Despite this issue, qualitative assessment of the drugs predicted by the models was overall positive.

Using the open discovery pattern approach, we identified five promising drugs that were ranked highly and were not, to our knowledge, discussed in the literature, which we discuss below (paclitaxel, SB 203580, alpha 2-antiplasmin, pyrrolidine dithiocarbamate, and butylated hydroxytoluene). The same approach also yielded other highly ranked substances that are currently being evaluated in clinical trials, such as quercetin, melatonin, losartan, estradiol, and simvastatin. Note that the knowledge graph completion models predicted 7 of these drugs (excluding SB 203580, alpha 2-antiplasmin, and pyrrolidine dithiocarbamate). Figure 3 shows the resulting network from this discovery pattern generated by Neo4j browser.

Paclitaxel is used to treat several cancer types, including ovarian cancer, breast cancer, lung cancer, cervical cancer, and pancreatic cancer. It stabilizes the microtubule polymer and protects it from disassembly. Chromosomes are thus unable to achieve a metaphase spindle configuration. This blocks the progression of mitosis and prolonged activation of the mitotic checkpoint triggers apoptosis or reversion to the G0-phase of the cell cycle without cell division [98] .

The following patterns support the paclitaxel discovery:

1. paclitaxel-inhibits-interleukin-6-causes-COVID-19

2. paclitaxel-inhibits-NF-kappa B-associated with-COVID-19

3. paclitaxel-inhibits-interleukin-1, beta-associated with-COVID-19

4. paclitaxel-inhibits-Granulocyte Colony-Stimulating Factor-associated with-COVID-19

5. paclitaxel-inhibits-interleukin-10-predisposes-COVID-19

6. paclitaxel-inhibits-interleukin-8-predisposes-COVID-19

7. paclitaxel-inhibits-Thromboplastin-associated with-COVID-19

The first six patterns support a role for paclitaxel in alleviating the cytokine storm of COVID-19, which is triggered by a dysfunctional immune response and mediates widespread lung inflammation. Paclitaxel may plausibly help as an immunosuppressive therapy against immune-mediated damage in COVID-19 [99]. Thromboplastin (pattern 7) is a complex enzyme found in brain, lung, and other tissues, and especially in blood platelets. It functions in the conversion of prothrombin to thrombin in the clotting of blood and may be elevated in patients with COVID-19. As pulmonary microvascular thrombosis plays an important role in progressive lung failure in COVID-19 patients, paclitaxel may reduce the state of hypercoagulability by acting as an inhibitor of thromboplastin [100].

The final pattern involves the interaction of paclitaxel with TLR4. Paclitaxel is known to have high affinity for TLR4 receptors. SARS-CoV-2 Spike protein binds with human innate immune receptors, mainly TLR4, increasing secretion of IL-6 and TNF-α and neuroimmune response. This suggests that paclitaxel may dislocate SARS-CoV-2 Spike proteins [101, 102] .

SB 203580 is a specific inhibitor of p38α, which suppresses downstream activation of MAPKAP kinase-2, involved in many cellular processes including stress and inflammatory responses and cell proliferation. The following patterns support the SB 203580 discovery:

ity, especially in patients with comorbidities such as hypertension, diabetes, and coronary heart disease [104] . The following patterns support the alpha 2-antiplasmin discovery:

1. Alpha 2-antiplasmin-inhibits-plasmin-predisposes-COVID-19

2. Alpha 2-antiplasmin-inhibits-fibrinogen-associated with-COVID-19

3. Alpha 2-antiplasmin-interacts with-IgY-associated with-COVID-19

More specifically, plasmin may cleave a newly inserted furin site in the S protein of SARS-CoV-2, which increases its infectivity and virulence in COVID-19.

In addition, fibrinogen levels are higher in COVID-19 patients and may contribute to hypercoagulability [104]. By inhibiting plasmin and fibrinogen (first two patterns), alpha 2-antiplasmin may confer protection against COVID-19. In addition, pattern 3 suggests a mechanism of protection via immunoglobulin Y (IgY). In immunology, IgY against acute respiratory tract infection has been under development for more than 20 years, and several IgY applications have been confirmed as effective in both human and animal health. IgY antibodies extracted from chicken eggs have been used in the therapy of bacterial and viral infections. IgY has been proposed as an immunization-based adjuvant therapy for the viral respiratory infection in COVID-19 [105]. Chickens have been immunized with alpha 2-antiplasmin, and the peptide-specific antibody (IgY) isolated from the egg yolks of the hens could potentially be used to protect COVID-19 patients [106].

Pyrrolidine dithiocarbamate is a family of drugs used for metal chelation, induction of G1 phase and cell cycle arrest. It binds to zinc and the resulting complex can enter the cell and inhibit viral RNA-dependent RNA polymerase [107] . It is supported by the following patterns:

1. pyrrolidine dithiocarbamate-inhibits-NF-kappa B-associated with-COVID-19

2. pyrrolidine dithiocarbamate-inhibits-interleukin-6-associated with-COVID-19

3. pyrrolidine dithiocarbamate-inhibits-TNF protein, human-associated with-COVID-19

The mechanisms suggested here are similar to those observed for the previous drugs. Pyrrolidine dithiocarbamate has antioxidant properties and prevents inflammatory changes. Through its antiviral activity, it inhibits the expression of IL-6, TNF, and NF-κB in virus-infected chorion cells. It has been proposed for the treatment of influenza [107], and it may have potential as a therapeutic option for COVID-19.

Butylated hydroxytoluene is a lipophilic compound useful for its antioxidant properties. It is widely used to prevent free radical-mediated oxidation in fluids and other materials and is generally recognized as safe as a food additive. It has been postulated in the past as an antiviral drug. Open discovery identified the following relevant patterns:

1. Butylated Hydroxytoluene-inhibits-CD69 protein, human-

2. Butylated Hydroxytoluene-inhibits-Free Radicals-associated with-COVID-19

3. Butylated Hydroxytoluene-inhibits-TNF protein, human-associated with-COVID-19

4. Butylated Hydroxytoluene-inhibits-hydrogen peroxide-associated with-COVID-19

The first pattern indicates butylated hydroxytoluene as an inhibitor of CD69. Studies have shown that CD69+ cells were detected in the lungs of patients with asthmatic and eosinophilic pneumonia, suggesting a crucial role for CD69 in the pathogenesis of such inflammatory diseases. CD69 is, potentially, a new therapeutic target for patients with intractable inflammatory disorders and tumors [108]. Therefore, by inhibiting CD69, butylated hydroxytoluene may halt potential inflammatory responses in COVID-19. However, CD69 does not appear to be a major player in the physiopathology of COVID-19 (the query "CD69 AND Covid-19" did not return any results in PubMed). Nonetheless, this is noteworthy because this pathway has been suggested as a novel and important pathway for all immune responses [108].

The crucial role of free radicals in COVID-19 has been acknowledged and an antioxidative therapeutic strategy for COVID-19 has been suggested [109]. Along these lines, patterns 2-3 point to antioxidant function of butylated hydroxytoluene by scavenging free radicals and inhibiting reactive oxygen species [110].

Our approach relies on the accuracy of the predications extracted by SemRep. SemRep precision is about 0.70 and its recall around 0.42 [73]. While the accuracy classifier helped us improve the accuracy of the predications used, the remaining errors were still significant, impacting the knowledge graph completion task.

In addition, despite aggressive filtering, the graph formed by the relations in extended SemMedDB is very large, making it difficult to apply computationally intensive models like STELP. In this study, we examined a sub-graph which, inevitably, results in a loss of information available to knowledge graph completion techniques. While we were still able to apply modeling techniques to a fairly large sub-graph focusing on drug repurposing, there exists a larger, complementary sub-graph that may provide further drug candidates.

As noted above, the TransE model benefited from hyperparameter tuning using a grid search to find an optimal configuration. STELP would likely benefit from similar tuning. For example, a single linear layer was used on the pooled output from the BioBERT model to produce the logits; increasing the representational capacity of this layer, by depth or width, might allow STELP to develop a richer model of the underlying space formed by the BioBERT contextualized embeddings.

Our methods were limited to knowledge from the literature. Other types of biological data (e.g., protein-protein interactions, drug-target interactions, gene/protein sequences, pharmacogenomic and pharmacokinetic data) are likely to benefit identification of drug candidates, as shown to some extent by other studies [14] , as well as our prior work [53] . However, the computational resources needed for training models based on such massive data can be prohibitive. TransE and similar methods seem more promising in that respect.

Lastly, with our in silico approach, we can of course only propose drug candidates for repurposing. To evaluate whether these drugs could indeed act as therapeutic agents for COVID-19, clinical studies are needed. However, the fact that we were able to identify some drugs known to have some benefit for COVID-19 (e.g., dexamethasone) via purely computational methods that rely only on automatically extracted literature knowledge is encouraging.

In this study, we proposed an approach that combines literature-based discovery and knowledge graph completion for COVID-19 drug repurposing. Unlike similar efforts that largely focused on COVID-19-specific knowledge, we incorporated knowledge from a wider range of biomedical literature. We used state-of-the-art knowledge graph completion models as well as simple but effective discovery patterns to identify candidate drugs. We also demonstrated the use of these patterns for generating plausible mechanistic explanations, showing the complementary nature of both methods.

The approach proposed here is not specific to COVID-19 and can be used to repurpose drugs for other diseases. It can also be generalized to answer other clinical questions, such as discovering drug-drug interactions or identifying drug adverse effects.

As the COVID-19 pandemic continues its spread and disruption around the globe, we are reminded that the spread of infectious diseases is increasingly common and future pandemics ever more likely. Innovative computational methods leveraging existing biomedical knowledge and infrastructure could help us plan for, respond to, and mitigate the effects of such global health crises. Drug repurposing is a key piece of this response, and our approach provides an efficient computational method to facilitate this goal.

SB 203580-inhibits-interleukin-6-causes

SB 203580-inhibits-TNF protein

SB 203580-inhibits-interleukin-1, beta-associated with

SB 203580-inhibits-NF-kappa B-associated with

SB 203580-inhibits-Interleukin-1-causes

SB 203580-inhibits-Granulocyte-Macrophage Colony-Stimulating Factor-associated with

SB 203580-inhibits-Macrophage Colony-Stimulating Factor-associated with

Similarly to paclitaxel, all patterns involving SB 203580 point to a potential inhibition of the hyperinflammatory response in COVID-19

of the protein kinases p38α in inflammation and innate immunity was found when the compound SB 203580 suppressed tumor necrosis factor (TNF) production in monocytes, and this resulted in inhibition of septic (inflammatory) shock

Alpha 2-antiplasmin is a serine protease inhibitor responsible for inactivating plasmin

Draft landscape of COVID-19 candidate vaccines

Placebo-controlled Study of AZD1222 for the Prevention of COVID-19 in Adults

Statement on AstraZeneca Oxford SARS-CoV-2 vaccine, AZD1222, COVID-19 vaccine trials temporary pause

Safety and immunogenicity of an rAd26 and rAd5 vector-based heterologous prime-boost COVID-19 vaccine in two formulations: two open, non-randomised phase 1/2 studies from Russia

Researchers highlight 'questionable' data in Russian coronavirus vaccine trial results

Dexamethasone in hospitalized patients with covid-19-preliminary report

Effect of Hydroxychloroquine in Hospitalized Patients with COVID-19: Preliminary results from a multi-centre, randomized, controlled trial

Current status of COVID-19 therapies and drug repositioning applications

COVID-19 drug repurposing: A review of computational screening methods, clinical trials, and protein interaction assays

Drug repurposing: progress, challenges and recommendations

Artificial intelligence in COVID-19 drug repurposing, The Lancet Digital Health

A data-driven drug repositioning framework discovered a potential therapeutic agent targeting COVID-19

Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2

A network medicine approach to investigation and population-based validation of disease manifestations and drug repurposing for covid-19

Repurpose open data to discover therapeutics for covid-19 using deep learning

Network medicine: a network-based approach to human disease

Literature based discovery: models, methods, and trends

Emerging approaches in literature-based discovery: techniques and performance review

SemMedDB: a PubMed-scale repository of biomedical semantic predications

CORD-19: The Covid-19 Open Research Dataset

Translating embeddings for modeling multi-relational data

RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space

Semantic triple encoder for fast open-set link prediction

Exploiting semantic relations for literature-based discovery

A sars-cov-2 protein interaction map reveals targets for drug repurposing

Discovery of sars-cov-2 antiviral drugs through large-scale compound repurposing

Analysis of therapeutic targets for sars-cov-2 and discovery of potential drugs by computational methods

Anti-hcv, nucleotide inhibitors, repurposing against covid-19

Virtual screening and repurposing of fda approved drugs against covid-19 main protease

Using integrated computational approaches to identify safe and rapid treatment for sars-cov-2

Fast identification of possible drug treatment of coronavirus disease-19 (covid-19) through computational drug repurposing study

Ribavirin, remdesivir, sofosbuvir, galidesivir, and tenofovir against sars-cov-2 rna dependent rna polymerase (rdrp): A molecular docking study

Drugbank: a knowledgebase for drugs, drug actions and drug targets

Chembl: a large-scale bioactivity database for drug discovery

Biogrid: a general repository for interaction datasets

In silico discovery of candidate drugs against covid-19

Predicting potential drug targets and repurposable drugs for covid-19 via a deep generative model for graphs

Network medicine framework for identifying drug repurposing opportunities for covid-19

Fish oil, Raynaud's syndrome, and undiscovered public knowledge., Perspectives in biology and medicine

Graph-based methods for discovery browsing with semantic predications

Semantic MEDLINE for discovery browsing: using semantic predications and the literature-based discovery paradigm to elucidate a mechanism for the obesity paradox

An interactive system for finding complementary literatures: a stimulus to scientific discovery

Using concepts in literature-based discovery: Simulating swanson's raynaud-fish oil and migraine-magnesium discoveries

Using the literature-based discovery paradigm to investigate drug mechanisms

Exploring relation types for literature-based discovery

Context-driven automatic subgraph creation for literature-based discovery

Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections

Finding schizophrenia's prozac emergent relational similarity in predication space

Embedding of semantic predications

Combining semantic relations and dna microarray data for novel hypotheses generation

Using literature-based discovery to identify novel therapeutic approaches

Predicting high-throughput screening results with scalable literature-based discovery methods

Exploiting literature-derived knowledge and semantics to identify potential prostate cancer drugs

A new method for prioritizing drug repositioning candidates extracted by literature-based discovery

Literature-based discovery of new candidates for drug repurposing

Knowledge graph embedding: A survey of approaches and applications

Knowledge graph embedding by translating on hyperplanes

A three-way model for collective learning on multi-relational data., in: ICML

Embedding entities and relations for learning and inference in knowledge bases

Complex embeddings for simple link prediction

Holographic embeddings of knowledge graphs

Convolutional 2d knowledge graph embeddings

Modeling relational data with graph convolutional networks

Kg-bert: Bert for knowledge graph completion

A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases

Modeling polypharmacy side effects with graph convolutional networks

Gredel: A knowledge graph embedding based method for drug discovery from biomedical literatures

Ttd: therapeutic target database

The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text

Broad-coverage biomedical relation extraction with semrep

The Unified Medical Language System

The Unified Medical Language System (UMLS): integrating biomedical terminology

Augmenting microarray data with literature-based knowledge to enhance gene regulatory network inference

A Reasoning And Hypothesis-Generation Framework Based On Scalable Graph Analytics Enabling Discoveries

Link prediction on the semantic medline network

Are abstracts enough for hypothesis generation?

Investigating the role of interleukin-1 beta and glutamate in inflammatory bowel disease and epilepsy using discovery browsing

Keep up with the latest coronavirus research

Extending the log-likelihood measure to improve collocation identification, Master's thesis

Mining biomedical literature to explore interactions between cancer drugs and dietary supplements

Evaluating active learning methods for annotating semantic predications

Bert: Pre-training of deep bidirectional transformers for language understanding

Biobert: a pre-trained biomedical language representation model for biomedical text mining

Proceedings of the 2nd Clinical Natural Language Processing Workshop

Transfer learning in biomedical natural language processing: An evaluation of bert and elmo on ten benchmarking datasets

Domain-specific language model pretraining for biomedical natural language processing

Adam: A method for stochastic optimization

Training knowledge graph embeddings at scale

An overview of MetaMap: historical perspective and recent advances

Aggregating UMLS semantic types for reducing conceptual complexity

Visualizing data using t-SNE

A capsule network-based embedding model for knowledge graph completion and search personalization

Pharmacologic treatments for coronavirus disease 2019 (covid-19): a review

Pathophysiology, transmission, diagnosis, and treatment of coronavirus disease 2019 (COVID-19): a review

How taxol/paclitaxel kills cancer cells

The trinity of COVID-19: immunity, inflammation and intervention

COVID-19: coagulopathy, risk of thrombosis, and the rationale for anticoagulation

The role of TLR4 in chemotherapy-driven metastasis

Is Toll-like receptor 4 involved in the severity of COVID-19 pathology in patients with cardiometabolic comorbidities?

What goes up must come down: molecular basis of MAP-KAP kinase 2/3-dependent regulation of the inflammatory response and its inhibition

Elevated plasmin (ogen) as a common risk factor for COVID-19 susceptibility

IgY-turning the page toward passive immunization in COVID-19 infection

Purification of human α2-antiplasmin with chicken IgY specific to its carboxy-terminal peptide

Antiviral function of pyrrolidine dithiocarbamate against influenza virus: the inhibition of viral gene replication and transcription

A new therapeutic target: the CD69-Myl9 system in immune responses

Tackle the free radicals damage in COVID-19

Understanding the chemistry behind the antioxidant activities of butylated hydroxytoluene (BHT): A review

We thank François-Michel Lang, Leif Neve, and Jim Mork for their assistance with processing the CORD-19 dataset with SemRep and providing updates to SemMedDB. We acknowledge Tom Rindflesch for his encouragement with the project.