key: cord-0176353-vugtei9s
authors: Unni, Deepak R.; Moxon, Sierra A.T.; Bada, Michael; Brush, Matthew; Bruskiewich, Richard; Clemons, Paul; Dancik, Vlado; Dumontier, Michel; Fecho, Karamarie; Glusman, Gustavo; Hadlock, Jennifer J.; Harris, Nomi L.; Joshi, Arpita; Putman, Tim; Qin, Guangrong; Ramsey, Stephen A.; Shefchek, Kent A.; Solbrig, Harold; Soman, Karthik; Thessen, Anne T.; Haendel, Melissa A.; Bizon, Chris; Mungall, Christopher J.; Consortium, the Biomedical Data Translator
title: Biolink Model: A Universal Schema for Knowledge Graphs in Clinical, Biomedical, and Translational Science
date: 2022-03-25
journal: nan
DOI: nan
sha: 9b18fbe281496ad72bdd18e0a5883d235ebdfd87
doc_id: 176353
cord_uid: vugtei9s

Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness between core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs have left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates), representing biomedical entities such as gene, disease, chemical, anatomical structure, and phenotype.
The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.

The use of graphs to formalize the representation of human knowledge dates back to the origins of artificial intelligence (AI) and the use of semantic networks for knowledge representation [1,2]. The term 'knowledge graph' (KG) is gaining popularity and is generally used to encompass a range of graph-oriented representation frameworks, including Resource Description Framework (RDF) triple stores and labeled property-graph databases such as Neo4j. Examples of general-domain KGs include the Google Knowledge Graph and Wikidata [3]. Within the biomedical sciences, examples include SemMedDB [4], Hetionet [5], Implicitome [6], Monarch Initiative [7], the biological subset of Wikidata [8], SPOKE [9], and KG-COVID-19 [10]. While KGs have been defined in various ways, perhaps the most intuitive definition is a graph in which the nodes represent real-world entities and the edges represent known relationships between those entities [11]. In a KG, the knowledge or 'facts' are represented as statements, with each statement modeled as two nodes linked together by an edge representing the relationship between them. The statements can have additional properties, metadata, and qualifying attributes that further capture the meaning of the statement and characterize the properties of nodes and edges.
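The statement structure described above (two nodes, one edge, and optional properties) can be illustrated with a minimal sketch. The class names `Node` and `Statement`, the field names, and the `EX:` identifiers below are hypothetical and chosen only for illustration; they are not part of any particular KG framework.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Node:
    """A real-world entity in the graph (hypothetical structure)."""
    id: str      # placeholder identifier, e.g. "EX:disease_1"
    label: str   # human-readable name


@dataclass
class Statement:
    """One 'fact': two nodes linked by an edge, plus optional properties."""
    subject: Node
    predicate: str   # name of the relationship carried by the edge
    obj: Node
    # Additional metadata or qualifiers attached to the statement.
    properties: Dict[str, Any] = field(default_factory=dict)

    def triple(self) -> Tuple[str, str, str]:
        """Return the bare (subject, predicate, object) identifiers."""
        return (self.subject.id, self.predicate, self.obj.id)


# Illustrative data only; these identifiers do not refer to real records.
example = Statement(
    subject=Node("EX:disease_1", "example disease"),
    predicate="has_phenotype",
    obj=Node("EX:phenotype_1", "example phenotype"),
    properties={"publications": ["EX:publication_1"]},
)

if __name__ == "__main__":
    print(example.triple())
    print(example.properties)
```

Keeping the properties separate from the triple reflects the idea that the core fact and its supporting metadata can be read independently of each other.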
Because the basic structure of a KG is generic, the knowledge contained within a KG can be heterogeneous and mutable and still be representable in the graph. The representation of knowledge as simple connections between core entities makes iterative, rapid development of KGs possible. In addition, by leveraging the graph data structure and using various inference strategies, one can infer new edges or connections between nodes in a graph. Ontology-oriented KGs allow deductive inference through logical rules, from basic rules such as the Gene Ontology (GO) 'true path' rule [12] to more sophisticated methods like Description Logic inference [13]. Ontology-oriented KGs are also amenable to machine learning approaches such as embedding in vector space [14], which supports the application of deep neural networks for tasks such as link prediction and node classification. Within the biomedical sciences, ontology-oriented KGs have been used for tasks such as drug repurposing [5], target prioritization [15], and phenotype profile matching [7].

Several ontologies and schemas for representing biomedical knowledge are available. A constellation of domain-specific ontologies from the Open Biological and Biomedical Ontologies (OBO) Foundry [16] can be used for modeling knowledge. For example, the Semanticscience Integrated Ontology (SIO) [17] is used for representing scientific data and knowledge. The Wikidata Ontology [18] is used by Wikidata for representing knowledge. In terms of schemas, schema.org is used for representing metadata about entities and their relationships to other entities. BioSchemas is an extension of schema.org for representing metadata about biological entities. While existing efforts in modeling knowledge have been valuable, a unified data model that bridges multiple ontologies, schemas, and data models does not exist.
Here, we present Biolink Model as an open-source, universal data model that defines entities and the relationships between these entities within translational science.

Biolink Model is a data model for organizing data in biomedical KGs. The model serves both as a map for bringing together data from different sources under one unified model, and as a bridge between ontological domains. Biolink Model is composed of several modeling elements, including a hierarchy of defined Classes, Properties (with defined Types), Predicates, Mixins, and Associations (Table 1). Domain knowledge in a KG that conforms to Biolink Model is represented using Associations. An Association minimally includes a subject and an object (Biolink Model classes) related by a Biolink Model predicate, together comprising its core triple (statement or primary assertion). The subject and object of an Association are foundational domain concepts (e.g., genes, diseases, chemicals, phenotypes), whose Internationalized Resource Identifiers (IRIs) come from community standard ontologies (e.g., HGNC, MONDO, ChEBI, HPO). The predicate is a Biolink Model element that represents the relationship between the subject and object. Associations may also include slots to hold additional metadata about the core triple, primarily information about the provenance and evidence supporting the assertion (Figure 1).

Figure 1. An example of an Association represented in Biolink Model. In (a), the green ovals represent the subject and object classes, connected by a predicate. Together, the classes and the predicate constitute a statement or 'core triple' in the model. Edge properties provide further context and qualification to the core triple. The entire diagram, including the core triple and its provenance, represents a Biolink Model 'Association'.
In (b), we see a specific example of a 'biolink:DiseaseToPhenotypicFeatureAssociation', where the subject is 'biolink:Disease', the object is 'biolink:PhenotypicFeature', and the predicate is 'biolink:has_phenotype'. In addition, the 'biolink:publications' property (lavender oval) records the provenance of the core triple.

Biolink Model aims to address several challenges that obstruct interoperability between KGs, including: 1) the need for expertise to transform data between tabular, RDF, and graphical models; 2) sparse and/or inconsistent application of ontologies or other controlled vocabularies, as well as differences in the identifiers that are used for storing instances of nodes within a graph; and 3) the lack of a standard approach to model the intersection of ontological domains (e.g., the relationships between genes and diseases).

Using the framework provided by the Linked Data Modeling Language (LinkML), Biolink Model is distributed in a variety of formats, including YAML, JSON Schema, SQL DDL, Python/Java classes, and RDF. Additionally, Unified Modeling Language (UML) diagrams provide a visual representation of the model. Biolink Model is therefore accessible in frameworks familiar to a wide variety of developers and database engineers. Because the model can be distributed in different formats, the model elements can also be validated using toolchains that already exist (e.g., JSON Schema validation, SQL constraints), thus speeding up the reconciliation of tabular data, ontologies, and graphs.

The biomedical field has been a leader and champion of ontology development. However, this has sometimes led to the development of multiple ontologies or controlled vocabularies for the same domain concept. When this happens, KG creators must identify which vocabulary best suits their needs, as well as understand how to apply concepts from the chosen ontology to their class instances.
Biolink Model helps solve this challenge by indicating to users which ontologies should be used for instances of its classes via identifier prefixes (id_prefixes), mappings, and associations.

Biolink Model, while constantly evolving, supports a variety of use cases in clinical, biomedical, and translational science. We highlight several examples here.

The Translator Consortium has adopted Biolink Model as an open-source upper-level data model that supports semantic harmonization and reasoning across diverse Translator 'knowledge sources' [15]. The model serves a central role in the Translator program and forms the architectural basis of the Translator system, as described below. The Translator program aims to develop a comprehensive, relational, N-dimensional infrastructure designed to integrate disparate data sources (including objective signs and symptoms of disease, drug effects, chemical and genetic interactions, cell and organ pathology, and other relevant biological entities and relations) and to reason over the integrated data to rapidly derive biomedical insights [22]. The ultimate goal of Translator is to augment human reasoning and thereby accelerate translational science and knowledge discovery. To achieve this ambitious goal, the Translator project assembled a diverse interdisciplinary team and a variety of biomedical data sources, including electronic health record data, clinical trial data, genomic and other -omics data, chemical reaction data, and drug data. However, the Translator data sources were in formats that were not compatible or interoperable. Moreover, groups within the Translator Consortium had integrated the data sources as knowledge sources within independent KGs, but these KGs were developed using different technologies and formalisms, such as property graphs in Neo4j and semantically linked data via RDF and OWL.
To interoperate between knowledge sources and reason across KGs, Biolink Model was adopted as the common dialect to provide rich annotation metadata to the nodes and edges in disparate graphs, thus enabling queries over the entire Translator KG ecosystem, despite incompatibilities in the underlying data sources. The result was a federated, harmonized ecosystem that supports advanced reasoning and inference to derive biomedical insights based on user queries.

Investigators at the Precision Medicine Institute (PMI) at the University of Alabama at Birmingham posed the following natural-language question to the Translator Consortium: what chemicals or drugs might be used to treat neurological disorders such as epilepsy that are associated with genomic variants of RHOBTB2? The investigators noted that RHOBTB2 variants cause an accumulation of RHOBTB2 protein and that this accumulation is believed to be the cause of the neurological disorder. To answer the PMI investigators' question, Translator team members structured the query shown in Figure 2. Because of the hierarchical structure of Biolink Model, a query that uses biolink:related_to will also return more specific predicates such as biolink:negatively_regulates and biolink:positively_regulates. The objective was to identify drugs or chemicals that might regulate RHOBTB2 in some manner and thereby reduce the variant-induced accumulation of RHOBTB2 and associated neurological symptoms.

As all nodes and edges within the Translator KG ecosystem are annotated to Biolink Model classes and attributes, a Translator query can be constructed from a natural-language user question and return results across a multitude of independent data sources. In addition, because the model employs hierarchical classes, with inheritance and polymorphism, natural-language queries translated to graph queries using Biolink Model syntax can be constructed at varying levels of granularity and return results from all levels of the hierarchy.
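The hierarchical query behavior described above can be sketched as follows. This is not the Translator implementation: the `PARENT` table is a hypothetical, heavily simplified fragment of a predicate hierarchy (the intermediate `biolink:regulates` level is an assumption made for illustration), and the `EX:` identifiers and edges are invented placeholders rather than real query results.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject id, predicate, object id)

# Hypothetical child -> parent fragment of a predicate hierarchy.
# The real Biolink Model hierarchy is larger and may be arranged differently.
PARENT: Dict[str, str] = {
    "biolink:regulates": "biolink:related_to",
    "biolink:positively_regulates": "biolink:regulates",
    "biolink:negatively_regulates": "biolink:regulates",
    "biolink:has_phenotype": "biolink:related_to",
}

# Invented example edges; the identifiers are placeholders only.
EDGES: List[Edge] = [
    ("EX:chemical_1", "biolink:negatively_regulates", "EX:gene_1"),
    ("EX:chemical_2", "biolink:positively_regulates", "EX:gene_1"),
    ("EX:chemical_3", "biolink:negatively_regulates", "EX:gene_2"),
    ("EX:disease_1", "biolink:has_phenotype", "EX:phenotype_1"),
]


def expand(predicate: str) -> Set[str]:
    """Return the predicate together with all of its descendants."""
    result = {predicate}
    changed = True
    while changed:  # repeat until no new descendants are found
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def match(edges: List[Edge], predicate: str, obj: str) -> List[Edge]:
    """Find edges pointing at obj whose predicate is the query predicate
    or any more specific predicate beneath it in the hierarchy."""
    allowed = expand(predicate)
    return [e for e in edges if e[1] in allowed and e[2] == obj]


if __name__ == "__main__":
    # A broad query returns edges that carry more specific predicates.
    for edge in match(EDGES, "biolink:related_to", "EX:gene_1"):
        print(edge)
```

Querying with a narrower predicate, such as `biolink:negatively_regulates`, restricts the same search to that level of the hierarchy, which is how one query structure can be issued at varying granularity.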
Finally, because Biolink Model provides attributes on both edges and nodes that record provenance and evidence for these knowledge statements, each result is annotated with the trail of evidence that supports it. When Translator team members sent the query to the Translator system, it returned several candidates of interest to PMI investigators, including fostamatinib disodium (CHEMBL.COMPOUND:CHEMBL3989516) and ruxolitinib (CHEMBL.COMPOUND:CHEMBL1789941). A review of the supporting evidence provided by Translator indicates that these are approved drugs that either directly or indirectly reduce or otherwise regulate the expression of RHOBTB2. Thus, Biolink Model helped Translator teams bring data together into a single system, thereby reducing the burden on the user to find and manually assemble data from these independent resources.

While developed in concert with the use cases of the Translator Consortium, Biolink Model has been reused in other applications, including KGX and the Knowledge Graph Exchange Archive (KGEA), which rely on universal schemas for data model structure and data integration. In addition, both the Illuminating the Druggable Genome (IDG) project and the Monarch Initiative use Biolink Model as the schema for their integrated views of genomic, phenomic, and biochemical data. Both IDG and Monarch incorporate a broad spectrum of data from a variety of sources, with each source modeling its data using different approaches, independent identifier systems, and heterogeneous data representations. Biolink Model provides the semantic harmonization required to integrate these disparate data sources. Other initiatives that rely on Biolink Model for data and knowledge harmonization include KG-COVID-19 [10], ECO-KG, KG-ENVPOLYREG, and KG-Microbe.
The success of Biolink Model can be attributed to its community (biologists, clinicians, data curators, developers, subject matter experts, and ontologists), all of whom have contributed their requirements, perspectives, and expertise to help build a flexible semantic data model. Biolink Model provides a blueprint to harmonize existing data sources and accelerate the development of new knowledge by leveraging a multitude of domain and technical expertise, captured in a variety of ontologies and existing models (via semantic mappings), within a single modeling framework that is easy to read, write, reuse, and distribute. Moreover, Biolink Model is grounded in semantic web technologies (characterized by classes and slots with their own IRIs, SKOS mappings to existing ontologies, descriptions, identifier prefixes, and domain and range constraints) and captures biomedical expertise as a computable knowledge artifact that can be read and interpreted by machines and humans alike. Because Biolink Model is platform-agnostic, open-source, and publicly accessible, and because it can be translated into a variety of data modeling formats, it encourages people from different backgrounds and with different expertise to work together to evolve the model. Most importantly, Biolink Model supports the harmonization of KGs and underlying data sources in a manner that adheres to FAIR principles [23] and facilitates applications across a broad spectrum of biomedical use cases, thereby democratizing and accelerating translational science.

All authors declare no conflicts of interest.
Encyclopedia of Artificial Intelligence
National Physical Laboratory
Wikidata: a free collaborative knowledge base
Annotated Semantic Predications from SemMedDB
Systematic integration of biomedical knowledge prioritizes drugs for repurposing
The Implicitome: A Resource for Rationalizing Gene-Disease Associations
The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species
WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata. Database
Knowledge Network Embedding of Transcriptomic Data from Spaceflown Mice Uncovers Signs and Symptoms Associated with Terrestrial Diseases
KG-COVID-19: A Framework to Produce Customized Knowledge Graphs for COVID-19 Response. Patterns (N Y)
Creating the gene ontology resource: design and implementation
Use of OWL within the Gene Ontology
Word mover's distance for affect detection
Biomedical Data Translator Consortium. Toward A Universal Biomedical Data Translator
OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies
The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery
Collaborative Approach to Developing a Multilingual Ontology: A Case Study of Wikidata. Metadata and Semantic Research
Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data
Mondo Disease Ontology: harmonizing disease concepts across the world
OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax
Deconstructing the Translational Tower of Babel
The FAIR Guiding Principles for scientific data management and stewardship. Sci Data

The authors are grateful to members of the Publications Committees at the National Center for Advancing Translational Sciences, the National Institute of Environmental Health Sciences, and the National Institute on Aging for their review and approval of the manuscript for publication. Moreover, the authors are appreciative of the unwavering leadership and support provided by the Extramural Leadership Team and the Intramural Research Program at NCATS.