authors: D'Souza, Jennifer; Monteverdi, Anita; Haris, Muhammad; Anteghini, Marco; Farfar, Kheir Eddine; Stocker, Markus; Santos, Vitor A.P. Martins dos; Auer, Soren
title: The Digitalization of Bioassays in the Open Research Knowledge Graph
date: 2022-03-28

Background: Recent years have seen a growing impetus in the semantification of scholarly knowledge at the fine-grained level of scientific entities in knowledge graphs. The Open Research Knowledge Graph (ORKG) https://www.orkg.org/ represents an important step in this direction, with thousands of scholarly contributions as structured, fine-grained, machine-readable data. There is a need, however, to engender change in traditional community practices of recording contributions as unstructured, non-machine-readable text. For this, in turn, there is a strong need for AI tools designed for scientists that permit easy and accurate semantification of their scholarly contributions. We present one such tool, ORKG-assays. Implementation: ORKG-assays is a freely available AI micro-service in the ORKG, written in Python, designed to assist scientists in obtaining semantified bioassays as a set of triples. It uses an AI-based clustering algorithm which, in gold-standard evaluations over 900 bioassays with 5,514 unique property-value pairs for 103 predicates, shows competitive performance. Results and Discussion: As a result, semantified assay collections can be surveyed on the ORKG platform via tabulation or chart-based visualizations of key property values of the chemicals and compounds, offering smart knowledge access to biochemists and pharmaceutical researchers in the advancement of drug development.

Mainstream digital libraries and publishing platforms, such as general-domain encyclopedias, e-commerce catalogs, maps, etc., have undergone a transformative digitalization of their traditional document-based publications, adopting Knowledge Graphs (KGs), large semantic networks of fine-grained entities and relationships, as a new means of information organization and access. Such semantic knowledge can be automatically customized into knowledge views of smaller scope given a query. This means a semantic query (e.g., using SPARQL) over a KG of semantified product descriptions, for instance, could select and aggregate similar properties such as price, manufacturer, dimensions, etc., to generate a comparison of products. Thus, for digital libraries in general, current developments, such as the increasing dissemination of commercial research information systems (e.g., Pure by Elsevier) and social networks (e.g., ResearchGate), as well as non-European initiatives (e.g., the Open Knowledge Network https://www.nitrd.gov/nitrdgroups/index.php?title=Open_Knowledge_Network of the major US research funders), demonstrate that the transition from document-based to fine-grained, digitalized knowledge-based information flows is necessary and imminent. Historically, traditional libraries evolved into digital libraries to meet the access needs of their patrons. In the present day, the digital library is evolving toward digitalization not just of metadata and keywords, but also of content, to meet the information needs of patrons with the promise of smart access methods directly over the fine-grained knowledge.
Aligned with the digitalization impetus, the Open Research Knowledge Graph (ORKG) [6, 18] digital library project addresses scholarly content digitalization as a distributed, decentralized, and collaborative structured scholarly knowledge creation process that can be powered with automated semantification modules via a continuous, ongoing development cycle of autonomously maintained AI micro-services. The focus of the ORKG is to obtain fine-grained, semantified 'research results,' a.k.a. scholarly contributions, as knowledge graphs such that research progress can be made skimmable within user-definable knowledge scopes over key scientific entities and properties. To this end, this paper supports the rapid assimilation of digitalized knowledge in the scholarly data domain of biological assays (bioassays) by proposing an AI-based semantification service. The current coronavirus pandemic sheds critical light on the need to advance the drug development research lifecycle, for which bioassays are crucial; hence we focus on this domain. A bioassay is, by definition, a standard biochemical test procedure used to determine the concentration or potency of a stimulus (physical, chemical, or biological) by its effect on living cells or tissues [15, 17]. Toward machine-interpretability, a bioassay description can be represented as fine-grained semantified triples using the BioAssay Ontology (BAO) [31, 1], whose main information categories, such as perturbagen, format, design, detection technology, meta target, and endpoint, need to be captured for a meaningful semantic representation, and which imports other ontologies as well, such as the Cell Line Ontology (CLO) [27], the Gene Ontology (GO) [12], and the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Being highly diverse, bioassays are clearly a non-trivial semantification domain, posing challenges to standardizing and integrating the data with the goal of maximizing their scientific and, ultimately, their public health impact as the assay screening results are carried forward into drug development programs. See Appendix A for more information.

ORKG-assays, the semantification service proposed in this paper for bioassays, will be the first Life Science domain supported by an automated semantification micro-service in the ORKG. To our knowledge, it also fosters the development of the first end-to-end bioassay digitalization workflow in the overall scholarly community. The workflow involves four steps. 1) Querying a bioassay depositor for their unstructured or semi-structured assays. Commonly, bioassay raw data are obtained via the PubChem depository [32, 16], a major depositor of bioassays from various research institutes. 2) Semantifying the assay via the ORKG-assays AI model. 3) Linking the depositor-provided assay cross-references to their scientific articles. And 4) integrating the bioassay semantic graph into the ORKG. Programmed in Python, ORKG-assays provides web-based and programmatic tools for semantifying bioassay texts. Once entered in the ORKG, the semantified bioassay is editable via user-friendly frontend interfaces, surveyable via tabulations [23] or 2-D chart visualizations [33], and queryable for various scientific semantic ORKG relationships. ORKG-assays demonstrates high semantification performance with F1 scores above 80% and was chosen after diverse methodological tests, including against the state-of-the-art, bidirectional transformer-based SciBERT model discussed in prior work [3, 4, 5].
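To make the triple-based representation described above concrete, the following minimal Python sketch (our illustration, not the ORKG data model itself) holds a bioassay as BAO-style property-value statements and flattens them into subject-predicate-object triples. The property and value strings are illustrative examples of the kind shown later in the paper's figures.

# Minimal sketch: a bioassay held as BAO-style property-value statements,
# then flattened into (subject, predicate, object) triples.
# The property/value strings are illustrative, not an authoritative BAO excerpt.

assay_statements = {
    "has assay format": "cell-based format",
    "has assay footprint": "1536 well plate",
    "assay measurement type": "endpoint assay",
    "has assay control": ["negative control", "positive control"],
}

def to_triples(assay_id, statements):
    """Flatten property-value pairs into (subject, predicate, object) triples."""
    triples = []
    for prop, values in statements.items():
        for value in (values if isinstance(values, list) else [values]):
            triples.append((assay_id, prop, value))
    return triples

print(to_triples("Bioassay_1", assay_statements))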
The ORKG-assays semantification module is further explained in Section 3. Finally, note that the ORKG will not serve as a mere mirror of other bioassay depositories, but will itself be a unique application of a highly structured, science-wide knowledge graph of scholarly contributions which incorporates semantified bioassays as well. Summing up, ORKG-assays offers a highly accurate and pragmatic semantification model that alleviates the unrealistic expectation that scientists semantify their bioassays from scratch, instead offering them a merely curatorial role over the automatic annotations. The pace with which novel bioassays are being submitted suggests that we have only begun to explore the scope of possible assay formats and technologies to interrogate complex biological systems. Thus this data domain, specifically, promises long-standing future application discoveries, many of which remain untapped. Furthermore, inspired by the method we demonstrate, by drastically reducing the time required to semantify data for other scholarly domains as well, digitalization can be realistically advocated to become a standard part of the publication process.

A biochemist wants to compare existing bioassays that have been historically performed on the SARS virus. Assuming the availability of bioassays represented as machine-readable logical triple statements, a survey over the digitalized assay KG data can, in fact, be directly computed. Concretely, Fig. 1 shows a survey of two bioassays computed over the open research knowledge graph of bioassays in the ORKG platform frontend. Note that such an information access mechanism not only directly addresses the information need of the biochemist but also alleviates the effort of otherwise having to sift through volumes of unstructured bioassay descriptions to spot the key information. Specifically, taking advantage of the ORKG, the biochemist could even dynamically compute other surveys tailored to their information need. For example, they might want to gain a comprehensive view only of those bioassays for molecules tested against the SARS virus which did not elicit a significant effect against the pathology. From the insights in such a dynamically computed view, they can consequently avoid testing the same molecule again, while focusing their attention on discovering more effective molecules. As another example, by having an expansive view on the various bioassays already tested in the literature, it should be possible to easily choose whether to repeat experiments reproducing the same conditions as another research group or to try a new way of testing compound efficacy by changing the type of bioassay or the type of cell culture. Taking a broader view, experiment silos are a frequently raised problem in research, and comparisons of experimental models among research groups are rare. In the context of the proposed work on bioassays, the ORKG digitalization service alleviates this problem of experimental silos by connecting all bioassays in its scholarly knowledge graph (Appendix B illustrates the semantic description of one assay overlaid with its KG in the ORKG). Such KGs make the bioassay data precisely findable in accord with the FAIR standards [34] advocated for research information. Overall, this could lead to a considerable saving of research time, important especially during emergencies such as the current pandemic caused by the SARS-CoV-2 virus.
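To illustrate how such a survey can be computed once assays are available as property-value statements, the following self-contained Python sketch pivots a few hypothetical semantified assays into a comparison table over their shared properties, analogous to the tabulation in Fig. 1. The assay names and contents are placeholders, not real ORKG records.

# Illustrative sketch: computing a tabular survey over semantified assays.
# The assay contents below are hypothetical placeholders.

assays = {
    "Assay A": {"has assay format": "cell-based format",
                "has assay footprint": "1536 well plate"},
    "Assay B": {"has assay format": "biochemical format",
                "has assay footprint": "384 well plate"},
}

properties = sorted({p for statements in assays.values() for p in statements})

# One row per property, one column per assay (missing values shown as "-").
header = ["property"] + list(assays)
print("\t".join(header))
for prop in properties:
    row = [prop] + [assays[a].get(prop, "-") for a in assays]
    print("\t".join(row))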
In this section, we delve into the heart of the ORKG-assays workflow: the AI-based semantification module. In general, AI-based automated scholarly KG construction, or the semantification of scholarly knowledge, is addressed by identifying and classifying entities and relations in scientific articles [2, 7, 10, 13, 14, 21, 22] as sequence labeling and classification objectives, respectively. Instead, we address the problem of bioassay semantification with a clustering objective. We choose clustering based on our corpus observation that bioassays with similar text descriptions are semantified with similar sets of logical statements. Thus, the bioassays can be clustered based on their text descriptions, and each cluster can be collectively semantified by the labels of the trained cluster. In contrast, while entity and relation classification would also be sound strategies, they would rely unnecessarily on more complex and time-consuming methods. We refer the reader to our prior work [5], wherein we contrast a classification versus a clustering objective for bioassay semantification. Last but not least, an advantage of clustering is that it is, in principle, unsupervised and does not depend on gold-standard data. In the following subsections, the model formalism, implementation details, and experimental evaluation results are provided.

Formalism. Let K be the total number of clusters of bioassays, represented by the set C = {c_1, c_2, ..., c_K}. B_train = {b_1, b_2, ..., b_n} corresponds to the total bioassays in the training set used to obtain optimal cluster centroids, and V_train = {v_1, v_2, ..., v_n} is the vectorized representation of each bioassay used to fit the clustering model. Note that K < n. Further, each cluster c_x is associated with all the distinct logical statements of the bioassays in the respective cluster group. If cluster c_x is fitted with two bioassays b_p and b_q in the training set, then c_x is associated with the union of the logical statements of b_p and b_q. Thus, new logical statement sets ls_cx associated with the K clusters are formed as {ls_c1, ls_c2, ..., ls_cK}. After the clustering model is fitted with V_train, semantification is performed: each new bioassay b_test is assigned, based on v_test, to its closest cluster and semantified with the logical statement set of that cluster.

Limitations. 1. The semantification of new bioassays is limited to the triples/statements of the training data. Note that there is no available dataset of bioassays that completely encapsulates all labels of the BAO. Thus, if a new bioassay has new triples from the BAO, the training set would need to be expanded with additional assays annotated with the unseen triples, and the model retrained. 2. Non-ontologized statements of bioassays cannot be captured by the clustering approach. Other than the standardized ontologized statements, bioassays are also annotated with assay-specific statements such as "has temperature value → 25 degree Celsius" or "has incubation time value → 20 minute." Such statements require a different semantification strategy, such as a text reader of entities.

Implementation. Each bioassay text is first converted into a vector representation. We compare two different vectorizations: 1) TF-IDF [26]; and 2) SciBERT embeddings [8]. The TF-IDF vectorizer is fitted on a training collection of assays, whereas the SciBERT embeddings are directly queried for their pretrained 768-dimensional vectors.
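As a minimal sketch of the two vectorizations, the snippet below assumes scikit-learn for TF-IDF and the Hugging Face transformers checkpoint allenai/scibert_scivocab_uncased for SciBERT. The mean-pooling step is one common way to obtain a single 768-dimensional vector per text and is our assumption, not necessarily the exact pooling used in the reported experiments.

# Sketch of the two bioassay vectorizations compared in this work.
# Assumes scikit-learn, torch, and transformers are installed.
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from transformers import AutoModel, AutoTokenizer

train_texts = ["First bioassay description ...", "Second bioassay description ..."]

# 1) TF-IDF: fitted on the training collection of assay texts.
tfidf = TfidfVectorizer()
v_train_tfidf = tfidf.fit_transform(train_texts)          # shape: (n_assays, vocab_size)

# 2) SciBERT: pretrained contextual embeddings (768 dimensions).
tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def scibert_embed(text):
    """Mean-pool the last hidden layer into one 768-d vector (illustrative pooling)."""
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

v_train_scibert = [scibert_embed(t) for t in train_texts]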
We employ the K-means clustering algorithm [19]. To determine the optimal number of clusters K, we employ the elbow optimization strategy, which selects the smallest number of clusters accounting for the largest amount of variation in the data [30].

Dataset. For our model design and experimental observations, we relied on a corpus of annotated bioassays provided by the BAO group [11] (cloned in https://github.com/jd-coderepos/bioassays-ie). The dataset contains 983 human-annotated assays in all, with 5,514 unique semantic statements from the BAO. On average, each assay is annotated with 57 semantic statements, with a maximum of 162 and a minimum of 7 statements. As mentioned earlier, these statements are obtained from the BAO. For in-depth information on the BAO, we refer the reader to the original papers [31, 1].

Cross Validation and Metrics. For clustering, we performed 3-fold cross-validation experiments with a training/test set distribution of approximately 600 and 300 assays, respectively. The test set assays were selected such that they were unique between the folds. We measure the standard micro-precision, recall, and F1 scores for bioassay semantification per fold. The final scores are then averaged over the three folds.

Evaluation Models. 1. Naive method. The top-n most frequent statements are computed across the whole dataset and each bioassay is uniformly annotated with these top-n statements. 2. Clustering. This is the model built in this work. Additionally, we introduce a label frequency threshold parameter within the clusters. E.g., if a threshold of 4 is applied, then the test bioassays are semantified with only the statements that appeared 4 or more times within the cluster groups when the semantic statements from the various bioassays in that cluster were aggregated. Implicitly, applying label frequency thresholds drops some statements from consideration altogether, which, at the outset, imposes a performance disadvantage on the thresholded settings. Nevertheless, it still offers insights into the generality versus specificity of the semantification method. The empirical results are presented in Tables 1 and 2, and discussed below in detail under three research questions (RQs).

RQ 1: Is semantification by the top-n statements an effective method? In each of the five main columns in Table 1, viz. 'top 10' through 'top 50,' n corresponds to the number of the most frequent statements assigned to each assay. E.g., 'top 10' denotes the 10 most common statements; 'top 50' the 50 most common statements. The results show that an increase in the number of statements (n) for semantification increases recall insignificantly but at a significant cost to precision. In light of this, we asked ourselves: could the naive method achieve greater than 50% F1? The answer is no. For this to occur, either precision or recall would have to cross the 50% threshold while the other value remains close enough to average to a 50% F1. The results show that this is highly unlikely, since the highest recall of 0.09% ('top 50' column in Table 1) is achieved at a precision of 43.48%, which only declines steadily after having peaked at 64.52% for the 'top 20' statements. Thus, the semantification task cannot be solved by the naive method, since it cannot handle the semantification pattern variations across bioassays, as demonstrated on our dataset.
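The clustering-based semantifier being evaluated can be sketched as follows, assuming scikit-learn's KMeans over the vectorizations above. The elbow loop, the per-cluster aggregation of training statements, and the label frequency threshold mirror the formalism and the thresholded evaluation settings described in the text, but the helper functions and their names are our own illustration.

# Sketch of clustering-based semantification with a label frequency threshold.
from collections import Counter
from sklearn.cluster import KMeans

def fit_semantifier(v_train, train_statements, k):
    """Fit K-means and attach the aggregated statements of each cluster."""
    kmeans = KMeans(n_clusters=k, random_state=0).fit(v_train)
    cluster_statements = {c: Counter() for c in range(k)}
    for label, statements in zip(kmeans.labels_, train_statements):
        cluster_statements[label].update(statements)       # union, with frequencies
    return kmeans, cluster_statements

def semantify(kmeans, cluster_statements, v_test, freq_threshold=1):
    """Assign each new assay to its closest cluster; keep statements seen >= threshold times."""
    predictions = []
    for label in kmeans.predict(v_test):
        counts = cluster_statements[label]
        predictions.append({s for s, n in counts.items() if n >= freq_threshold})
    return predictions

def elbow_inertias(v_train, k_values):
    """Inertia per candidate K; the 'elbow' of this curve guides the choice of K."""
    return {k: KMeans(n_clusters=k, random_state=0).fit(v_train).inertia_ for k in k_values}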
RQ 2: Is clustering suitable for bioassay semantification? Examining the bold F1 scores in Table 2 shows that it is. Note that the lowest best F1 scores among the compared parameter settings are for 'Labels freq ≥ 4' at 0.63 and 0.66 for SciBERT and TF-IDF vectorizations, respectively. This shows the method can achieve a performance better than chance. On the other hand, the highest best F1 scores are for 'Labels freq ≥ 1' at 0.77 and 0.83 for SciBERT and TF-IDF vectorizations, respectively, which are strong performances for practical purposes.

RQ 3: What can be concluded from TF-IDF versus SciBERT vectorization? This is a case in point for computing data-specific vectors. While SciBERT [8] is pretrained on a dataset of computer science and biomedical scholarly articles, such articles are still characteristically distinct from bioassay texts in terms of length and sectional organization. Bioassays are short descriptions of one or two paragraphs with either none or very few sections. Thus, we hypothesized that a straightforward TF-IDF vectorization on a data source of bioassays would create better semantic representations of the data in vector space. Our hypothesis is empirically confirmed by the results in Table 2, where, in all experimental settings, TF-IDF vectorization outperforms the scholarly-articles-based pretrained SciBERT model. The highest F1 obtained with SciBERT is 0.77, while with TF-IDF it is 0.83. Note that the up and down arrows in the table reflect increasing or decreasing score trends. In this respect, the TF-IDF and SciBERT vectorizations show similar increases and decreases.

We now discuss the implementation of ORKG-assays with respect to the KG Lifecycle requirements [28], consisting of the graph creation, hosting, curation, and deployment modules. The ORKG-assays micro-service belongs to an early stage of graph creation, i.e., generating the graph itself. Thus, while the part of the graph creation module that handles the normalization of variously formatted graph data is beyond the scope of ORKG-assays, it does address extracting the assay texts from heterogeneous bioassay depositories, each with different file formats, and generating a BAO-based structured graph. The relevant details are discussed below. The design of a digital library of the future handling digitalized data should, based on a common understanding of data and information between the various stakeholders, integrate these technologies into the infrastructure and processes of search and knowledge exchange.

[Fig. 3: End-to-end ORKG-assays semantification pipeline, which practically realizes the digitalization of digitized data as a conceptual model involving data sources, data retrieval, an annotation service, and resulting triple statements.]

[Figure excerpt: a bioassay JSON record showing a free-text "description" of an assay on the apelin receptor (the GPCR APJ, angiotensin II receptor-like 1, AGTRL-1 and APLNR) alongside BAO-based labels such as "assay measurement type": "endpoint assay", "has assay control": "negative control" / "positive control", "has assay footprint": "1536 well plate", and "has assay format": "cell-based format".]

The ORKG as a data space of scholarly contributions satisfies this core design objective. Foremost, as open source software (https://gitlab.com/TIBHannover/orkg), it enables a large number of partners, users, and stakeholders to contribute. Programmed in the Kotlin language with the Spring framework, it has a cloud-native design to afford scalability and extensibility.
Technically, at its core, it consists of a scalable data management infrastructure with a flexible graph-based data model accessed via lightweight APIs. As exportable data formats, the service currently implements the long-established open standards RDF/RDF Schema in accordance with the FAIR Data Principles to provide maximum interoperability (https://www.orkg.org/orkg/export-data). All data and information stored in the ORKG are made available under an open license as open data and open knowledge, so that the community can use the data for integration with other services, new applications, or domain-specific analyses. Overall, the platform consists of three main subsystems: frontend, backend, and clients.

The frontend is a single-page application (SPA) providing a user-visible interface through which users can contribute, explore, and curate research data. It uses the React framework and, for styling, Bootstrap with the Reactstrap package. With this technology, various flexible UI elements are currently supported, such as a statement browser for curating and viewing structured scholarly contribution information (see Figs. 1 and 5). It also supports graph views of the data, thus offering an alternative and complementary way to interact with ORKG content (see the overlaid graph in Fig. 5). Next, and crucially, the ORKG supports creating templates that specify the structure of content types and using those templates when describing research contributions; e.g., https://www.orkg.org/orkg/template/R70247 is the template specification for bioassays.

The backend consists of several (micro-)services: the REST API, as well as components responsible for similarity, comparison, annotation, curation, and AI information extraction. These services are coordinated by an HTTP reverse proxy (Nginx). Furthermore, the REST API is built using the Hexagonal Architecture approach to facilitate splitting backend functionality into additional micro-services as needed, thus supporting easy extensibility. Moreover, this architecture permits the use of different storage technologies based on their suitability to the micro-service use cases, since the domain logic is isolated from storage concerns. As storage systems, it currently leverages the power of property graphs (Neo4j) and relational databases (PostgreSQL). A central aspect of data storage in the ORKG is the preservation of provenance and evolution (similar to wikis), so that changes can be tracked transparently at any time. Finally, the third subsystem consists of wrappers around the ORKG REST API that allow direct interaction from other projects, e.g., the ORKG Python client leveraged in Jupyter notebooks.

Data preparation. This step relies on the public availability of an assay depository's querying mechanism. PubChem, reported to have over 1 million assays [20], is queryable for its bioassays via its public REST API; some assays have depositor-provided cross-references to scientific articles in PubMed. Depending on the depositor, the data may be returned in JSON, XML, or CSV. We implemented a specific pipeline for "The Scripps Research Molecular Screening Center," which returned JSON query responses and reported nearly 1,600 bioassays. However, to prepare the data, the bioassay description-specific sections had to be located in the JSON response file and the text then extracted. The text was merged from two separate parts, viz. the assay overview and the assay protocol summary.
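As a rough sketch of this data preparation step, the snippet below queries PubChem's public PUG REST API for one assay description and merges its free-text parts. The endpoint pattern and, in particular, the JSON field names (PC_AssayContainer, descr, description, protocol) are assumptions based on PubChem's documented conventions; depositor-specific responses, such as those from the Scripps center, may require different parsing.

# Sketch of the data-preparation step: fetching a bioassay description from
# PubChem's PUG REST API and extracting its free-text parts.
# Endpoint pattern and JSON field names are assumptions and may need adjusting.
import requests

def fetch_assay_text(aid):
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/description/JSON"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    descr = response.json()["PC_AssayContainer"][0]["assay"]["descr"]
    # Merge the assay overview and the protocol summary into one text block.
    overview = " ".join(descr.get("description", []))
    protocol = " ".join(descr.get("protocol", []))
    return (overview + " " + protocol).strip() or None     # None if no parseable text

text = fetch_assay_text(1511)   # hypothetical assay identifier (AID)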
We noted that this parsing heuristic can be applied to most depositor responses, although there may be some exceptions. Of the 1,600 assays, 182 contained no parseable text descriptions.

Semantification. We designed this component using automated techniques and a user interface to help scientists curate their data with minimal effort. The hybrid design is based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to semantify their assays manually is unrealistic. Further, having scientists in the loop can help address annotations outside the scope of the training data, which can later be fed back to improve the model. Note that for assays outside the scope of the training data, the semantifier returns no annotations. For the queried research institute source, of the total assays with text, 496 were semantifiable by our tool, whereas the rest were not. Characteristically, an assay can belong to more than one paper and a paper can contain more than one assay.

We presented ORKG-assays, an end-to-end digitalization workflow for unstructured scholarly descriptions, specifically addressing the problem of the digitalization of bioassays within a next-generation digital library, the ORKG. By the nature of the design of the ORKG as a research infrastructure, it supports integrating the explicit semantic representation of contributions from publications with a large number of other information sources and infrastructures. These include: metadata through services like Crossref and ORCID; multimedia content, e.g., lectures and demos via the TIB AV-Portal; collaborative authoring via dokie.li; research data management, e.g., potentially the EOSC ELIXIR data platform; open courseware, e.g., SlideWiki.org; thesauri and ontologies via the OBO Foundry services API or the NCBI, Medline, and MeSH taxonomies; and data linking via DataCite. Owing to this vastly interconnected graph, the resulting data from the ORKG-assays micro-service in the ORKG will be Findable, Accessible, Interoperable, and Reusable; in other words, it will conform to the FAIR principles, which were offered as guidelines for the creation of scholarly data [34].

A typical bioassay involves a stimulus (e.g., chemicals) applied to a subject (e.g., animals, tissues, plants). The corresponding response (e.g., death) of the subject is thereby triggered and measured. Thus, a bioassay is a type of experiment with a domain-specific scope and purpose. Many early examples of bioassays used animals to test the carcinogenicity of chemicals. Animal bioassays have been used to evaluate the safety of chemicals in foods, drugs, and cosmetics, and to re-create known human diseases. Aside from these, environmental bioassays are also performed, in which chemicals primarily associated with occupational or environmental exposures are evaluated [9]. The bioassay is a centerpiece in human health risk assessment and the regulation of chemicals [24] and is pivotal in pre-clinical research for drug development. By revealing whether a compound or biologic has the desired effect on a biological target, bioassays can drive decision-making throughout the drug discovery process, ultimately bringing new drugs to patients. Results from bioassays influence decisions about whether a chemical warrants further study in human clinical trials, potentially leading to its release onto the market. Thus, as an important precursor step, bioassays should be carefully planned to ensure they are optimized for their specific purpose.
There are well-established noncommercial centers, such as the Molecular Libraries Probe Production Centers Network (MLPCN) [25] or EU-OPENSCREEN (https://www.eu-openscreen.eu/), and commercial facilities (e.g., https://www.aureliabio.com/) that develop and perform bioassays. In the digital space, PubChem (https://pubchem.ncbi.nlm.nih.gov/) is a major depository for unstructured or semi-structured bioassay descriptions and chemical compounds. In this vein, complementing these digital-space technologies, with the ORKG we take a step further by handling digitalized scholarly information, in the context of this paper bioassay data, via the ORKG-assays workflow. We propose advanced information access technology based on knowledge graphs for generating a FAIR-compliant [34] semantic description of unstructured bioassays so as to automatically compute surveys or search over their key information.

A natural follow-up question is: is there a standard vocabulary for the semantification of unstructured bioassays? Broadly speaking, ontologies have traditionally been used in biology to organize information within a domain and, to a lesser extent, to annotate experimental data. Consider the Gene Ontology [12], the hundreds of ontologies in the Open Biological and Biomedical Ontologies (OBO) Foundry [29], as well as the National Center for Biomedical Ontology (https://bioportal.bioontology.org/). Keeping with this precedent, bioassay properties and values have also been comprehensively organized in the BioAssay Ontology [31, 1], with information regarding assay formats (e.g., cell-based vs. biochemical), readout technologies, reagents employed, and details of the biological system interrogated. The BAO is thus leveraged as the reference vocabulary for bioassay semantification.

References
Evolving bioassay ontology (BAO): modularization, integration and applications
(ScienceIE): semi-supervised end-to-end entity and relation extraction
SciBERT-based semantification of bioassays in the Open Research Knowledge Graph
Representing semantified biological assays in the Open Research Knowledge Graph
Easy semantification of bioassays
Towards an open research knowledge graph
SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications
SciBERT: a pretrained language model for scientific text
Historical perspective on the use of animal bioassays to predict carcinogenicity: evolution in design and recognition of utility
Domain-independent extraction of scientific concepts from research articles
Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation
The Gene Ontology (GO) database and informatics resource
AI-KG: an automatically generated knowledge graph of artificial intelligence
SemEval-2021 task 11: NLPContributionGraph - structuring scholarly NLP contributions for a research knowledge graph
Uses of bioassay in entomology
PubChemSR: a search and retrieval tool for PubChem
Statistical method in biological assay
Open Research Knowledge Graph: next generation infrastructure for semantic scholarly knowledge
Literature information in PubChem: associations between PubChem records and scientific articles
UIUC BioNLP at SemEval-2021 task 11: a cascade of neural models for structuring scholarly NLP contributions
Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction
Comparing research contributions in a scholarly knowledge graph
Historical perspective of the cancer bioassay. Scandinavian journal of work
Open access high throughput drug discovery in the public domain: a Mount Everest in the making
TF-IDF
A bioinformatics analysis of the cell line nomenclature
Knowledge graph lifecycle: building and maintaining knowledge graphs
The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration
Integration K-means clustering method and elbow method for identification of the best customer profile cluster
Bioassay Ontology (BAO): a semantic description of bioassays and high-throughput screening results
PubChem's BioAssay database
Towards customizable chart visualizations of tabular data using knowledge graphs
The FAIR Guiding Principles for scientific data management and stewardship