key: cord-0932340-mulercjr authors: Reese, Justin T.; Unni, Deepak; Callahan, Tiffany J.; Cappelletti, Luca; Ravanmehr, Vida; Carbon, Seth; Shefchek, Kent A.; Good, Benjamin M.; Balhoff, James P.; Fontana, Tommaso; Blau, Hannah; Matentzoglu, Nicolas; Harris, Nomi L.; Munoz-Torres, Monica C.; Haendel, Melissa A.; Robinson, Peter N.; Joachimiak, Marcin P.; Mungall, Christopher J. title: KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response date: 2020-11-09 journal: Patterns (N Y) DOI: 10.1016/j.patter.2020.100155 sha: de2dbee710928cd10c997d7f2c997bb07086024d doc_id: 932340 cord_uid: mulercjr Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community varies drastically for different tasks—the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework can also be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics. 
Although most coronaviruses typically cause common-cold symptoms in humans, three betacoronaviruses have emerged in the last few decades that can cause a range of serious manifestations including pneumonia and death: the severe acute respiratory syndrome (SARS) coronavirus (SARS-CoV-1), the Middle East respiratory syndrome coronavirus (MERS-CoV), and the novel betacoronavirus that emerged in late 2019, subsequently named SARS-CoV-2, the agent of the disease COVID-19 (Gandhi et al., 2020). The rapid spread of SARS-CoV-2 has led to a global pandemic. COVID-19 is a complex disease involving many biological processes and pathways, each of which involves many genes. Initial symptoms of COVID-19 typically include fever, cough, fatigue, anorexia, anosmia, myalgia, and diarrhea. In some patients, severe illness ensues roughly one week after the initial onset of symptoms, and can present with rapidly progressive respiratory failure (Berlin et al., 2020). In addition to the symptoms highlighted, infections can lead to secondary health problems such as blood clots (Srivastava, 2020), tissue necrosis, organ damage, and, in some cases, cardiac failure. Given that the research community is still learning about COVID-19, understanding its symptoms and their underlying pathological mechanisms, which are still being uncovered, is of vital importance. Many possible treatments for different aspects and stages of COVID-19 are being actively pursued. Evidence suggests that remdesivir (DrugBank:DB14761), a broad-spectrum antiviral medication, can shorten the time to recovery in adults hospitalized with COVID-19 infection and pneumonia (though the effect is not statistically significant) (Beigel et al., 2020), and more recent evidence suggests that dexamethasone (DrugBank:DB01234), a corticosteroid that suppresses inflammation, may reduce mortality in patients with severe COVID-19 (Horby et al., 2020).
However, currently no treatment is available to prevent progression of COVID-19 to severe disease, and our knowledge of the causes and optimal medical management of the symptoms and resulting clinical complications of COVID-19 is limited. A large amount of biomedical and molecular data are available to aid the massive research effort to address the COVID-19 pandemic. Before the pandemic began, there existed a large amount of biomedical data for coronaviruses other than SARS-CoV-2 (SARS-CoV and MERS-CoV (de Wit et al., 2016) as well as many other pathogenic and non-pathogenic coronaviruses), such as viral genome and transcriptome sequences, viral/host gene interactions, gene function, epidemiological data, and clinical case data. Much of this information is now also available for SARS-CoV-2. There is also a large amount of data about drugs that may offer a treatment for COVID-19, as well as the protein targets for each drug. However, researchers are confronted with a number of technical challenges when trying to use existing data to discover actionable knowledge about COVID-19. The data needed to address a given question are typically isolated in different databases and employ different identifiers. These data sources are also often stored in different formats, requiring transformation or preprocessing in order to serve the task at hand. For example, to examine the function of proteins targeted by FDA-approved antiviral drugs, one must download and integrate drug, drug target, and FDA approval status data from, for example, Drug Central in a custom-made TSV format (Ursu et al., 2019) and functional annotations from, for example, Gene Ontology in GPAD format (The Gene Ontology Consortium, 2019). Furthermore, many data sets are updated periodically, which requires researchers to re-download and re-process data in order to perform their analysis on the most current data.
To tackle the daunting challenge of bringing together these disparate sources of information and extracting useful knowledge from them, we employed knowledge graphs (KGs). Knowledge graphs are a way to represent and integrate heterogeneous data and their interrelationships. In a KG, discrete entities or pieces of information form distinct nodes interconnected by edges, where both nodes and edges are typed using a hierarchical system such as an ontology (Nickel et al., 2016). For example, nodes of type 'protein' representing individual entities (such as human ACE2 or SARS-CoV-2 Spike) can be interconnected via edges of type 'orthologous to' or 'interacts with', and these nodes can be connected with other kinds of nodes representing diseases, drugs, and so on. This kind of representation is amenable to complex queries (e.g. "which drugs target a host protein that interacts with a viral protein?"), and also to graph-based machine learning (ML) techniques. There have been a few parallel efforts to construct KGs to integrate COVID-19 data, each integrating different data sources and constructed for different purposes. Several efforts have constructed KGs by ingesting and transforming scientific literature (Domingo-Fernández et al., 2020) (https://lg-covid-19hotp.cs.duke.edu/), some with a few additional types of data also included, such as confirmed case and mortality data (https://github.com/covidgraph/); clinical information, drug trial, and sequencing data (https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19); drug, drug trial and genome sequence data (https://ds-covid19.res.ibm.com/); diseases, chemicals, and genes.
Other KG efforts ingest a wider array of data, including diseases, genes, proteins and their structural data, drugs, and drug side effects (Khan et al., 2020); pathways, proteins, genes, drugs, diseases, anatomical terms, phenotypes, microbiome (https://spoke.ucsf.edu/); genes, proteins, diseases, phenotypes, genome sequences (Hassani-Pak et al., 2020) (https://knetminer.com/); geographic, viral genes, genes and proteins (https://github.com/sbl-sdsc/coronavirus-knowledge-graph). Several projects have focused specifically on integrating a wide variety of COVID-19 data to create KGs to investigate drug repurposing (Ge et al.; Li et al., 2020) (https://github.com/gnn4dr/DRKG). The effort described here is unique in that it allows users to more flexibly remix specific data types from specific data sources (by virtue of its use of the KGX tool), it integrates more tightly with ontologies (HPO, Mondo, and GO) and with downstream machine learning tools (i.e. Embiggen), it offers a more detailed summary of the contents of its KG in a machine readable format, it covers a wider range of input data sources, and it automatically incorporates new and updated data. Here, we present a comprehensive COVID-19 KG derived from 14 knowledge sources and containing 377,482 nodes and 21,433,063 edges. The KG is freely available for download at https://kg-hub.berkeleybop.io/kg-covid-19/, and the framework to produce the KG is freely available at https://github.com/Knowledge-Graph-Hub/kg-covid-19. The knowledge graph was constructed using modern ontology best practices whereby different data sources were normalized and merged. KG-COVID-19 allows flexible remixing of component subgraphs for users interested in specific areas. We demonstrate several use cases including graph-based machine learning. We created KG-COVID-19 to address the challenge of integrating data for COVID-19 response.
KG-COVID-19 is a framework that enables the creation of customized KGs containing COVID-19 knowledge for different applications. For example, a drug repurposing application would make use of protein data linked with approved drugs, while a biomarker application could utilize data on gene expression linked with pathways. The methodology is not limited to COVID-19, but could support data integration for any biomedical research effort. Additionally, KG-COVID-19 was designed to utilize a wide variety of human and non-human data resources in order to model important relationships and processes underlying human disease mechanisms. For example, in order to model host response factors in humans, it is necessary to also include mechanisms of virology and viral genes. Our process for generating the KG was designed to support interoperability, preserve provenance, and provide the ability to flexibly mix and match data from different sources. The workflow is divided into three steps: data download (fetch the input data), transform (convert the input data to KGX interchange format), and merge (combine all transformed sources) ( Figure 1 ). The download step retrieves data from multiple sources using a YAML file that specifies the source URLs ( Figure 1A) . Our experience has shown that this step is a frequent point of failure in many extract, transform, and load (ETL) pipelines and separating out this step helps mitigate this issue. The data sources we ingest are focused on our use case: drug repurposing (e.g., drug and drug target data, protein interaction data, ontologies important in disease such as the Human Phenotype Ontology (HPO) and the Mondo disease ontology). However, we also ingest data sources that our user community requests by opening tickets on our project GitHub page (https://github.com/Knowledge-Graph-Hub/kg-covid-19). The transform step ( Figure 1B ) involves parsing the input files and transforming them to a graph-based representation. 
We have devised a simple yet expressive format called KGX interchange format (https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md), a serialization (i.e. the process of converting an object into a format, usually text, that can be [...] The merge step is informed by a YAML file that specifies what data sets should be included, to allow for flexible remixing of subgraphs. In addition to selecting different component data sets to be merged, the user can also filter nodes and edges from each source by the node 'category' and 'edge_label', allowing fine-grained control of the resulting graph. By default, all nodes and edges from all component data sets are merged. Optionally, the merged graph can be loaded into any triple/RDF store or Neo4j database. While our framework offers flexibility in deciding how best to transform each data source, KG-COVID-19 follows some general design principles to maintain the quality of the resulting KG. Our framework is designed to allow users to easily reproduce the KGs used in downstream analysis. The download and transform steps save all ingested data and the transformed data locally after running the pipeline to produce a KG. In addition, we provide prebuilt versions of our KG (https://kg-hub.berkeleybop.io/kg-covid-19/). A new build is constructed each month, and also whenever changes are made to the code in the KG-COVID-19 framework. Each build contains the date the build was constructed, the exact commands that were run to produce the KG, the input data that was ingested, the transformed subgraphs for each source, detailed statistics about the contents of the build, and the KG itself in RDF, KGX TSV, and Blazegraph journal format. We use a core set of standardized ontologies and the Biolink Model (https://biolink.github.io/biolink-model), a biological data model for categorizing nodes and edges, to facilitate interoperability and data summarization.
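As a rough illustration of how a merge with optional 'category' and 'edge_label' filters could behave, the sketch below combines node and edge rows from two already-transformed sources. It is not the KGX implementation: the function name `merge_sources`, the in-memory row dictionaries, the `EX:` identifiers, and the example Biolink labels are all assumptions made for illustration. Only the column names 'category', 'edge_label', and 'provided_by' are taken from the paper.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Row = Dict[str, str]
Source = Tuple[List[Row], List[Row]]  # (node rows, edge rows)


def merge_sources(
    sources: Iterable[Source],
    keep_categories: Optional[Set[str]] = None,
    keep_edge_labels: Optional[Set[str]] = None,
) -> Tuple[List[Row], List[Row]]:
    """Merge sources; a filter of None keeps everything (the default)."""
    source_list = list(sources)

    # Pass 1: collect nodes that pass the category filter, one row per id.
    merged_nodes: Dict[str, Row] = {}
    for nodes, _edges in source_list:
        for node in nodes:
            if keep_categories is not None and node["category"] not in keep_categories:
                continue
            merged_nodes.setdefault(node["id"], node)

    # Pass 2: keep edges that pass the label filter and whose endpoints survived.
    merged_edges: Dict[Tuple[str, str, str], Row] = {}
    for _nodes, edges in source_list:
        for edge in edges:
            if keep_edge_labels is not None and edge["edge_label"] not in keep_edge_labels:
                continue
            if edge["subject"] not in merged_nodes or edge["object"] not in merged_nodes:
                continue
            key = (edge["subject"], edge["edge_label"], edge["object"])
            merged_edges.setdefault(key, edge)

    return list(merged_nodes.values()), list(merged_edges.values())


# Hypothetical transformed sources (identifiers are placeholders, not real data).
drug_source: Source = (
    [{"id": "EX:drug1", "category": "biolink:Drug"},
     {"id": "EX:protA", "category": "biolink:Protein"}],
    [{"subject": "EX:drug1", "edge_label": "biolink:interacts_with",
      "object": "EX:protA", "provided_by": "drug_source"}],
)
ppi_source: Source = (
    [{"id": "EX:protA", "category": "biolink:Protein"},
     {"id": "EX:protB", "category": "biolink:Protein"}],
    [{"subject": "EX:protA", "edge_label": "biolink:interacts_with",
      "object": "EX:protB", "provided_by": "ppi_source"}],
)

# Default merge: everything from both sources.
all_nodes, all_edges = merge_sources([drug_source, ppi_source])

# Filtered merge: a protein-only subgraph.
protein_nodes, protein_edges = merge_sources(
    [drug_source, ppi_source], keep_categories={"biolink:Protein"}
)
```

In this sketch an edge is dropped whenever one of its endpoints was filtered out, so a filtered merge never contains dangling edges; whether the real tooling behaves the same way is not stated in the text.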
To ensure Biolink Model compliance, a Biolink category and a Biolink predicate are required for the categorization of nodes and edges, respectively. Since Biolink predicates are typically very broad in scope, the edge can be further categorized by adding a more specific description in the 'relation' property using a term from the Relation Ontology (Smith et al., 2005) . Categorization using ontologies and the Biolink Model provides a convenient way to assess what types of data have been ingested from each source, record provenance information, and also facilitates interoperability with other transformed data sets. Only the subset of features in each data set that are likely to be useful for downstream applications are preserved, and only statements for authoritative or trusted sources are ingested (for example, assertions about protein interactions are not ingested from a drug database, a trusted resource like the IntAct Molecular Interaction Database would be preferred for protein interactions). Identifier (ID) normalization is crucial for ensuring connectedness and the utility of the graph. We refer to the Biolink Model to provide the preferential order of identifier prefixes to be used for a particular Biolink class. For example, in the case of Gene class (https://biolink.github.io/biolinkmodel/docs/Gene) the model prescribes HGNC, NCBIGene, ENSEMBL, where the order of prefixes matters: identifiers from HGNC namespace are given a higher priority than NCBIGene and ENSEMBL. In the case of Protein class, the model prescribes UniProtKB identifiers. For drugs and other chemical compounds, the model recommends the following: CHEBI, CHEMBL, DrugBank, PubChem. Identifiers can also be normalized by adding cross-references to other identifiers in the 'xrefs' property of nodes, which is the 'xrefs' column in the KGX interchange format TSV describing the nodes. 
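The prefix-priority rule described above can be sketched in a few lines. This is a minimal illustration, not the framework's code: the function name `choose_canonical_id`, the placeholder identifiers, and the use of `|` to join cross-references into one TSV cell are assumptions. The prefix order for genes (HGNC, NCBIGene, ENSEMBL) is the one given in the text.

```python
from typing import Dict, List, Sequence, Tuple

# Preferred prefix order for the Gene class, as stated in the text.
GENE_PREFIX_PRIORITY: Tuple[str, ...] = ("HGNC", "NCBIGene", "ENSEMBL")


def choose_canonical_id(
    ids: Sequence[str], prefix_priority: Sequence[str]
) -> Tuple[str, List[str]]:
    """Return (canonical id, remaining ids to store as cross-references)."""
    if not ids:
        raise ValueError("at least one identifier is required")
    rank = {prefix: position for position, prefix in enumerate(prefix_priority)}

    def sort_key(curie: str) -> int:
        prefix = curie.split(":", 1)[0]
        # Unknown prefixes sort after every listed prefix.
        return rank.get(prefix, len(rank))

    # sorted() is stable, so identifiers with equal rank keep their input order.
    ordered = sorted(ids, key=sort_key)
    return ordered[0], ordered[1:]


# Placeholder identifiers for one gene (not real accession numbers).
equivalent_ids = ["ENSEMBL:ENSG_EXAMPLE", "NCBIGene:0000", "HGNC:0000"]
canonical, xrefs = choose_canonical_id(equivalent_ids, GENE_PREFIX_PRIORITY)

# A node row in the spirit of the KGX TSV; the '|' delimiter is an assumption.
node_row: Dict[str, str] = {
    "id": canonical,
    "category": "biolink:Gene",
    "xrefs": "|".join(xrefs),
}
```

Keeping the non-canonical identifiers in 'xrefs' means that a later source using, say, only Ensembl identifiers can still be mapped onto the same node.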
Each ingest adds a 'provided_by' column in the edge TSV file, which ensures that graphs into which the data are merged ( Figure 1C ) contain a record of which ingest produced each edge. The preservation of all files used to generate the graph in the download step ( Figure 1A ) makes it possible to trace each node and edge to the entries in the input file that generated them. PubMed IDs are added to the 'publication' column, where available, to provide additional provenance. The KG-COVID-19 framework contains tooling for common graph operations. The framework can create training and test data sets in graph form for machine learning applications such as training classifiers or regressors for link prediction (see Experimental Procedures). It also includes a query function that can execute prewritten or custom SPARQL queries on a given SPARQL endpoint (by default, our endpoint: http://kg-hubrdf.berkeleybop.io/blazegraph/#query). A schematic diagram of all data sources currently ingested is shown in Figure 3 . While we designed KG-COVID-19 to allow flexible reuse and remixing of data to produce custom KGs, our immediate use case is to provide a COVID-19 KG that can be used for machine learning to produce actionable knowledge about COVID-19 ( Figure 4 ). This use case demonstrates several features of KG-COVID-19, namely: normalization and merging of disparate data sources with different namespaces and formats, flexible remixing of component subgraphs, and a regular update cycle to keep up with new knowledge. We follow the workflow described in Figure 1 to produce the KG-COVID-19 KG. From the final merged graph, KG-COVID-19 produces training and test data sets suitable for machine learning applications (see Experimental Procedures). Embiggen (paper in preparation), our implementation of node2vec and related machine learning algorithms, is applied to this KG to generate embeddings, vectors in a low dimensional space which capture the relationships in the KG. 
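To make the node2vec random walks mentioned above more concrete, the following sketch generates one biased second-order walk on a small undirected toy graph. It follows the published description of node2vec (Grover and Leskovec, 2016) and is not Embiggen's implementation; the function name `node2vec_walk`, the toy graph, and the seed are assumptions. The return parameter p and in-out parameter q weight the next step relative to the previously visited node; with p = q = 1 the walk reduces to a uniform random walk.

```python
import random
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]  # undirected adjacency sets


def node2vec_walk(
    graph: Graph,
    start: str,
    walk_length: int,
    p: float = 1.0,
    q: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Generate one walk of up to `walk_length` nodes starting at `start`."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = sorted(graph[current])  # sorted for reproducibility
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy graph with placeholder node names.
toy_graph: Graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
example_walk = node2vec_walk(toy_graph, "a", walk_length=8, p=1.0, q=1.0,
                             rng=random.Random(0))
```

In a full pipeline many such walks per node would be fed to a skip-gram model to learn the embeddings; that step is omitted here.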
Embiggen is trained iteratively to identify optimal node2vec hyperparameters (walk length, number of walks, p, q etc.) and to then train classifiers (e.g., logistic regression, random forest, support vector machines) that can be used for link prediction. The trained classifiers can then be applied to produce actionable knowledge: drug to disease links, drug to gene links, and drug to protein links. The last of these would indicate a drug that might be useful for COVID-19 treatment. To demonstrate the usefulness of KG-COVID-19 for machine learning applications, we created embeddings for nodes and edges from the KG-COVID-19 knowledge graph and visualized the embeddings in two dimensions using a t-SNE plot (Figure 6). While only the graph structure and no biological typing of nodes was used to generate the embeddings, nodes of the same type appear to be located closer to each other when projected into latent space than nodes of differing biological types (i.e. genes are closer to other genes than they are to drugs), a phenomenon that is often observed in hierarchically structured data (Kobak and Berens, 2019) and a feature for which t-SNE is known (Maaten and Hinton, 2008). This indicates that the embeddings encode biological information that can be used for machine learning. While the initial development of KG-COVID-19 has focused on our machine learning applications, other use cases have emerged. As part of the National Virtual Biotechnology Laboratory (NVBL), we have found it useful to perform hypothesis-based querying of the KG to identify viral and human proteins that make attractive drug targets (Office of Science, 2020). For example, we have queried the KG to retrieve host proteins that are known to interact with viral proteins, and these are further filtered according to whether these host proteins are targets of approved drugs (Figure 5). These data are then passed to downstream analyses to assess their suitability for drug repurposing.
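The logic of the hypothesis-based query described above (host proteins that interact with viral proteins, filtered to those targeted by approved drugs) can be sketched on a toy edge list. The actual queries are run against the KG itself; the plain-Python function below, its name `druggable_host_proteins`, the node kinds, the predicate names, and every `EX:` identifier are hypothetical and only illustrate the two-stage filter.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical toy data; none of these identifiers refer to real entities.
NODE_KIND: Dict[str, str] = {
    "EX:viral_protein_1": "viral_protein",
    "EX:host_protein_A": "host_protein",
    "EX:host_protein_B": "host_protein",
    "EX:drug_X": "approved_drug",
    "EX:drug_Y": "experimental_drug",
}
EDGES: List[Edge] = [
    ("EX:viral_protein_1", "interacts_with", "EX:host_protein_A"),
    ("EX:host_protein_B", "interacts_with", "EX:viral_protein_1"),
    ("EX:drug_X", "targets", "EX:host_protein_A"),
    ("EX:drug_Y", "targets", "EX:host_protein_B"),
]


def druggable_host_proteins(
    node_kind: Dict[str, str], edges: List[Edge]
) -> Dict[str, Set[str]]:
    """Map each virus-interacting host protein to the approved drugs targeting it."""
    # Stage 1: host proteins with a viral interaction partner (either direction).
    interacting: Set[str] = set()
    for subject, predicate, obj in edges:
        if predicate != "interacts_with":
            continue
        kinds = {node_kind.get(subject), node_kind.get(obj)}
        if kinds == {"viral_protein", "host_protein"}:
            host = subject if node_kind[subject] == "host_protein" else obj
            interacting.add(host)

    # Stage 2: keep only those host proteins targeted by an approved drug.
    hits: Dict[str, Set[str]] = {}
    for subject, predicate, obj in edges:
        if (predicate == "targets"
                and node_kind.get(subject) == "approved_drug"
                and obj in interacting):
            hits.setdefault(obj, set()).add(subject)
    return hits


HITS = druggable_host_proteins(NODE_KIND, EDGES)
```

Here `EX:host_protein_B` is excluded because its only drug is not approved, which mirrors the second filtering stage in the text.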
Our KG is also part of a federated query used by the NVBL to collate and share up to date information related to COVID-19 and SARS-CoV-2. In addition, the National COVID Cohort Collaborative (N3C) has incorporated our KG as an ontologically-informed way to combine their clinical data sets (by virtue of our integration with GO, HPO, and Mondo). The N3C also uses our KG to incorporate all of our transformed and harmonized data, saving them the onerous task of collecting and integrating all of those data sources individually. The idea behind a KG-Hub is to provide a platform for building and exchanging knowledge graphs by following a set of guidelines and design principles (https://knowledge-graphhub.github.io/) that facilitates interoperability and reproducibility. The goal of a KG Hub is to serve as a collective resource to simplify the process of generating biological and biomedical KGs and thus reducing the barrier for entry to new participants. It also serves as a central resource to enable discovery and exchange of knowledge graphs. KG-Hub is designed to be an open-source community-supported resource. We are committed to maintaining this resource and welcome new national and international collaborations to help support this work. Our KG-COVID-19 framework adopts KG-Hub design principles and thus can be considered as the first instance of KG-Hub. Since the usefulness of a KG depends on its connectedness, ID normalization is crucial. Normalization of IDs for SARS-CoV-2 entities in particular is challenging, for several reasons. First, SARS-CoV-2 produces identical cleavage products from different polyproteins, and UniProt assigns a different ID to each of these identical cleavage products. For example, UniProt uses PRO_0000338259 to identify the cleavage product nsp5, the 3C-like protease, if it is cleaved from replicase polyprotein 1a, and PRO_0000449623 if it is cleaved from replicase polyprotein 1ab. 
Protein Ontology, in contrast, uses PR_000050274, irrespective of the polyprotein from which it was cleaved. Note that the UniProt "PRO_" prefix is unrelated to the Protein Ontology namespace. For our KG, it is crucial that identical proteins be represented with a single node such that other information can be efficiently linked to them. We arbitrarily chose PRO_0000449623 as the ID to represent this cleavage product, and all other IDs for this cleavage product are stored as cross references for this node in our KG. Second, each cleavage product can have a large number of synonyms. For example, nsp5 has at least 40 synonyms that are used in the literature (e.g., 3CL-PRO, 3CLp, Mpro, 3C-like proteinase). Furthermore, some synonyms (e.g. 'S' for spike protein) are difficult to recognize when applying NLP to SARS-CoV-2 literature, which represents a further challenge for computationally identifying the occurrences of such entities in text. We have compiled our canonical IDs, synonyms, and cross references for each SARS-CoV-2 protein and cleavage product in our KG in a publicly available file in GPI format: https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-

Knowledge graphs provide a way of integrating heterogeneous data from different sources and combining different data modalities. KG-COVID-19 generates a KG for COVID-19 focused around molecular and chemical information, enabling users to conduct complex queries over relevant biological entities as well as machine learning analyses to generate graph embeddings for making predictions. The lightweight framework we have developed provides a rapid route for bringing together new sources of data and knowledge, including KGs from several different sources, to form a "hub" to support COVID response efforts.

Justin Reese, justinreese@lbl.gov, https://orcid.org/0000-0002-2170-2250

This study did not generate any physical material.
The Python code for KG-COVID-19 is available from the project wiki: [...]

The framework to produce our KG is essentially an ETL system with additional tooling to facilitate downstream uses (e.g. to produce subgraphs for ML training, run SPARQL queries, etc.). To ensure that the code remains functional and to detect breaking changes in data from upstream sources, we run our pipeline and unit tests regularly using a continuous integration system (https://www.jenkins.io/). This pipeline emits a KG that integrates all available data sources, in both TSV and RDF format, and also loads this KG into a Blazegraph database. A YAML file containing an inventory of the Biolink categories and Biolink associations of all data in the KG is also produced during the merge step (Figure 1). On a commodity server with 200 GB of memory, generation of the knowledge graph containing all source data requires a total of 3.7 hours (0.13 hours, 1.5 hours, and 2.1 hours for the download, transform, and merge step, respectively), with a peak memory usage of 34.4 GB and disk use of 37 GB. To [...]

We generated embeddings from our KG using Embiggen, our Python library for graph embedding and machine learning, using node2vec with a skip-gram model, 128 embedding dimensions, and parameters p and q both set to 1 (commonly used default values for node2vec) (Grover and Leskovec, 2016). Embiggen is freely available at https://github.com/monarch-initiative/embiggen. These embeddings were used to generate a t-SNE plot that represents the embeddings for each node in two-dimensional space, using MulticoreTSNE (https://github.com/DmitryUlyanov/Multicore-TSNE) (Figure 6).

Figure 1. The KG-COVID-19 framework for producing KGs. The framework is divided into three modular steps: download, transform, and merge. A) The download step retrieves all data sets needed for ingestion using a set of URLs specified in a YAML file.
B) The transform step applies Python code that is specific to each source to transform the most useful elements of each source and emit a graph in TSV format. C) The merge step uses a YAML file to read the user-specified data sets (among those produced in the transform step) and merge them into a single KG. Different YAML files can be constructed to mix and match different input data from B, but each merge operation yields a single merged graph. Both the transform and merge steps rely heavily on KGX, a powerful tool for manipulating knowledge graphs (https://github.com/NCATS-Tangerine/kgx).

A. In order to train classifiers for use in link prediction, training and test graphs are first produced from the original KG-COVID-19 graph (see Experimental Procedures). These graphs are used by Embiggen to generate random walks, embeddings, and finally a classifier. The test graphs are used to assess the performance of the classifier. This step is performed iteratively in order to identify optimal hyperparameters. B. The classifiers are applied to the KG-COVID-19 to perform link prediction in order to identify links that correspond to actionable knowledge: for example, links between drugs and the COVID-19 disease, links between drugs and SARS-CoV-2 protein targets, and links between drugs and host proteins that are involved in COVID-19 disease processes.

An effective response to the COVID-19 pandemic relies on integration of many different types of data available about SARS-CoV-2 and related viruses. KG-COVID-19 is a framework for producing knowledge graphs that can be customized for downstream applications including machine learning tasks, hypothesis-based querying, and browsable user interface to enable researchers to explore COVID-19 data and discover relationships.
- KG-COVID-19 is a framework for producing customized COVID-19 knowledge graphs
- Our knowledge graph and framework is free, open-source, and FAIR
- KG-COVID-19 integrates a wide range of COVID-19-related data in an ontology-aware way
- Our KG has been applied to use cases including ML tasks, hypothesis-based querying

Remdesivir for the Treatment of Covid-19 - Preliminary Report
Severe Covid-19
TTD: Therapeutic Target Database
Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology
Mild or Moderate Covid-19
ChEMBL: a large-scale bioactivity database for drug discovery
A data-driven drug repositioning framework discovered a potential therapeutic agent targeting COVID-19
node2vec: Scalable Feature Learning for Networks
KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species
COVID-19Base: A knowledgebase to explore biomedical entities related to COVID-19
The art of using t-SNE for single-cell transcriptomics
Network bioinformatics analysis provides insight into drug repurposing
Visualizing Data using t-SNE
Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data
The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species
A Review of Relational Machine Learning for Knowledge Graphs
The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases
The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease
Relations in biomedical ontologies
Association between COVID-19 and cardiovascular disease
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
The Gene Ontology Resource: 20 years and still GOing strong
Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems
PharmGKB: the Pharmacogenomics Knowledge Base
DrugCentral: online drug compendium
DrugCentral 2018: an update
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
SARS and MERS: recent insights into emerging coronaviruses
Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2