key: cord-0977215-1p2gx23f
authors: Baltoumas, Fotis A.; Zafeiropoulou, Sofia; Karatzas, Evangelos; Paragkamian, Savvas; Thanati, Foteini; Iliopoulos, Ioannis; Eliopoulos, Aristides G.; Schneider, Reinhard; Jensen, Lars Juhl; Pafilis, Evangelos; Pavlopoulos, Georgios A.
title: OnTheFly2.0: a text-mining web application for automated biomedical entity recognition, document annotation, network and functional enrichment analysis
date: 2021-05-17
journal: bioRxiv
DOI: 10.1101/2021.05.14.444150
sha: 71c01be6a6d9771b2c85e9316dbd0239d6af4891
doc_id: 977215
cord_uid: 1p2gx23f

Extracting and processing information from documents is of great importance, as many experimental results and findings are stored in local files. Extracting and analysing biomedical terms from such files in an automated way is therefore essential. In this article, we present OnTheFly2.0, a web application for extracting biomedical entities from individual files such as plain texts, Office documents, PDF files or images. OnTheFly2.0 can generate informative summaries in popup windows containing knowledge related to the identified terms along with links to various databases. It uses the EXTRACT tagging service to perform Named Entity Recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and Gene Ontology terms. Multiple files can be analysed, and identified terms such as proteins or genes can be explored through functional enrichment analysis or be associated with diseases and PubMed entries. Finally, protein-protein and protein-chemical networks can be generated with the use of STRING and STITCH services. To demonstrate its capacity for knowledge discovery, we interrogated published meta-analyses of clinical biomarkers of severe COVID-19 and uncovered inflammatory and senescence pathways that impact disease pathogenesis. OnTheFly2.0 currently supports 197 species and is available at http://onthefly.pavlopouloslab.info.

Several inflammatory and senescence pathways that impact COVID-19 pathogenesis were uncovered by analyzing six clinical articles that mention clinical biomarkers of severe COVID-19.

The OnTheFly 2.0 pipeline consists of four steps (Figure 1): i) uploading of input files and conversion from their original format to HTML, ii) identification of bioentities with EXTRACT, iii) functional annotation of a set of selected identifiers and iv) network analysis. A detailed description of these steps is provided in the following subsections.

Once the files have been uploaded, users can annotate them with the help of the EXTRACT tagging service (5). EXTRACT performs dictionary-based NER using the highly efficient tagger software (33) to detect words and phrases that correspond to biomedical entities. In detail, EXTRACT is capable of identifying environment descriptive terms from Environment Ontology (e.g., desert, forest) (34), organism mentions from NCBI Taxonomy (35), tissue terms from BRENDA Tissue Ontology (36), disease mentions from Disease Ontology (37), phenotypes from Mammalian Phenotype Ontology (38), biological processes, cellular components and molecular functions from Gene Ontology (39, 40), small chemical molecules from PubChem (41), non-coding RNAs from RAIN (42), and protein-coding genes from STRING (15). In the implementation of OnTheFly 2.0, NER can be performed for a list of 197 organisms.
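To illustrate what dictionary-based NER involves, the following is a minimal Python sketch of case-insensitive, longest-match dictionary lookup. It is not the tagger or EXTRACT implementation: the dictionary entries, the identifiers (prefixed `EXAMPLE:`) and the function name `tag` are hypothetical and serve only as an illustration.

```python
import re

# Toy dictionary: surface form -> (entity type, identifier).
# Entries and identifiers are illustrative placeholders, not EXTRACT's dictionaries.
DICTIONARY = {
    "il-6": ("gene/protein", "EXAMPLE:IL6"),
    "interleukin 6": ("gene/protein", "EXAMPLE:IL6"),
    "pneumonia": ("disease", "EXAMPLE:PNEUMONIA"),
    "lung": ("tissue", "EXAMPLE:LUNG"),
}


def tag(text, dictionary=DICTIONARY):
    """Return (start, end, matched text, entity type, identifier) tuples."""
    # Longer names are listed first so they are preferred over shorter overlaps.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        entity_type, identifier = dictionary[m.group(1).lower()]
        hits.append((m.start(), m.end(), m.group(1), entity_type, identifier))
    return hits


hits = tag("Elevated IL-6 was observed in lung tissue of pneumonia patients.")
for hit in hits:
    print(hit)
```

A production tagger additionally handles orthographic variation, stop lists and organism-specific disambiguation, none of which is attempted here.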

Once the annotation parameters (entity types and organisms) have been set and the NER process has completed, OnTheFly 2.0 returns the annotated document with all of the recognized terms linked and highlighted in different colors (Figure 2). When a term is clicked, OnTheFly 2.0 generates a popup window with details about the biomedical entity and links to external databases. When a term is ambiguous (e.g., when it occurs in several organisms or corresponds to more than one entity type), OnTheFly 2.0 reports all of the possible options. For a more comprehensive summary, all of the identified terms along with their database identifiers and links are collected in an interactive table, which can be exported as a CSV file. The table can be filtered by entity type (e.g., genes/proteins, diseases) at any stage. The annotation process is presented in Figure 2.

Figure 2. Annotation of an example PDF abstract (43). A) The PDF abstract in its original, unannotated form. B) The annotated abstract using M. musculus as the organism. C) Popup window with information about an identified term. The term is colored according to its type and links to external databases are provided. D) A summary table with some of the identified terms.

OnTheFly 2.0 uses two tools, g:Profiler (27, 28) and aGOtool (26), to provide rich functional enrichment analysis for a selected set of genes/proteins collected from one or multiple files. The user can customize parameters for the enrichment analysis and choose from a list of 197 organisms. OnTheFly 2.0 uses g:Profiler to identify enriched functional terms from Gene Ontology (39, 40), pathways from KEGG (11), Reactome (12) and WikiPathways (44), protein complexes from CORUM (45), expression data from Human Protein Atlas (46), regulatory motifs from TRANSFAC (47) and miRTarBase (48), and phenotypes from the Human Phenotype Ontology (49). The analysis results from g:Profiler are complemented by further enrichment analyses from aGOtool, which also identifies enriched terms from the UniProt keyword classification system, protein families and domains from Pfam (50) and InterPro (51), as well as human diseases from the DISEASES database (52). g:Profiler and aGOtool test for statistically significant enrichment by using Fisher's exact test to compare the user-defined input dataset (foreground) to a background set of organism-specific genes annotated in the Ensembl database (53) and UniProt Reference Proteomes (54), respectively. The resulting p-values are corrected for multiple testing using either g:SCS (only in the case of g:Profiler), Bonferroni correction or Benjamini-Hochberg false discovery rate (FDR), all of which can be used as thresholds for the results. Enrichment analysis is performed using ENSEMBL IDs as input, while results can be reported as Entrez, UniProt, EMBL, ENSEMBL and RefSeq gene/protein names/identifiers, based on the user's choice.
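The foreground-versus-background comparison can be illustrated with a minimal Python sketch of a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. This is an illustration under stated assumptions rather than the code used by g:Profiler or aGOtool: the function name `enrichment_p_value`, the one-sided formulation and the example counts are assumptions made here.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: foreground genes annotated with the term
    n: foreground size
    K: background genes annotated with the term
    N: background size
    Returns P(X >= k) under the hypergeometric null distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 3 of 4 foreground genes carry a term that
# annotates 5 of 20 background genes.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(f"{p:.4f}")
```

A small p-value indicates that the term is seen in the foreground more often than expected from its frequency in the background; the raw p-values would then be corrected for multiple testing.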

Functional enrichment results are reported in interactive searchable tables displaying details about each functional term. One can expand each row of the table to see which of the identified genes/proteins were found to be associated with the functional term. For example, in the case of a KEGG pathway, one can see how many proteins or genes were found to be related to it and get redirected to the KEGG repository to see a static diagram of the pathway with all of the detected genes/proteins highlighted. In the case of g:Profiler, an interactive Manhattan plot is offered for a clearer overview. In this plot, functional terms are grouped along the x-axis and colored by their data source, whereas the y-axis shows the significance (p-value) of each term. Hovering over a data point reveals a tooltip with key information about the functional term. Finally, the most significant functional terms are shown as a bar chart, which the user can customize to show the desired number of terms. All of the aforementioned reports can be exported and saved in various file formats (CSV, XLS, PDF). An overview is shown in Figure 3.

OnTheFly 2.0 uses the aGOtool to allow users to find scientific articles that mention surprisingly many of the genes/proteins identified in the uploaded input files. While conceptually similar to the functional enrichment analyses just described, publication enrichment analysis serves a very different purpose, namely to help the user identify scientific publications of relevance to the gene/protein list. The publication enrichment analysis in aGOtool is based on a text corpus of all PubMed abstracts and full-text articles from the PubMed Central Open Access subset. These have been run through the same NER tagger used in EXTRACT and the results are updated with new documents on a weekly basis. Consequently, all documents have been automatically annotated with the genes mentioned within them, thus turning every document into a gene set. These millions of gene sets are then used by aGOtool in the same manner as all other gene sets.

We make use of this functionality to provide publication enrichment analysis in OnTheFly 2.0 for the list of 197 organisms. The user can select up to 1,000 of the genes/proteins identified in the uploaded files for analysis, which are then submitted to aGOtool to test each document from the precomputed corpus for statistically significant enrichment, again using Fisher's exact test. The resulting p-values as well as Bonferroni-corrected p-values and Benjamini-Hochberg FDR values can be used for filtering the results. Results are reported in interactive searchable tables displaying details about each literature term (scientific publication or disease). Links to PubMed are provided for publications and diseases. In addition, users are able to rank the most significant publications using bar charts and manually adjust the number of reported results with a slider. All of the aforementioned reports can be exported and saved in various file formats (CSV, XLS, PDF).
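The two corrections available for filtering can be sketched in a few lines of Python. This is a minimal illustration, not aGOtool's code: the function names, the document labels and the p-values are made up for the example, and the 0.05 threshold is an arbitrary choice.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-publication p-values; labels and numbers are placeholders.
doc_p = {"doc_A": 0.001, "doc_B": 0.02, "doc_C": 0.03, "doc_D": 0.5}
labels = list(doc_p)
raw = [doc_p[d] for d in labels]
fdr = benjamini_hochberg(raw)
bonf = bonferroni(raw)
significant = [d for d, q in zip(labels, fdr) if q <= 0.05]
print(significant)
```

With these placeholder values, Bonferroni retains fewer documents than the FDR threshold, which reflects the more conservative nature of family-wise error control.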

In addition to the aforementioned enrichment options, OnTheFly 2.0 offers the capability to construct and visualize biomolecular interaction networks for a set of 197 organisms. This task is performed using the APIs of the STRING (15) and STITCH (16) databases for protein-protein and protein-chemical interactions, respectively. The users may submit their dataset obtained from the uploaded documents to retrieve interactions and visualize the results as networks with the interacting entities presented as nodes and their interactions as edges. For computational efficiency reasons, in its current version, OnTheFly 2.0 allows a maximum of 500 proteins per request for STRING and 100 proteins or small molecules per request for STITCH.

STRING and STITCH classify interactions between two entities (proteins or small molecules) as either physical (i.e., part of the same biomolecular complex), or functional (i.e., involved in the same pathway/process). To this end, OnTheFly 2.0 requires users to select whether to include the Full set of interactions (both physical and functional) or the Physical subnetwork exclusively. Users can also specify the cutoff on the Interaction Score. Finally, users can choose whether each edge should show the type(s) of evidence (e.g., experiments or text mining) supporting it (Evidence mode) or if the thickness of the edge should instead show the interaction score (Confidence mode).
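As a rough illustration of how these choices could map onto a request, the sketch below builds a URL for the public STRING network endpoint without sending it. The endpoint path and the parameter names (`identifiers`, `species`, `required_score`, `network_type`) follow our reading of the STRING API documentation and should be verified against it; the function name and the way OnTheFly 2.0 actually issues its requests are assumptions.

```python
from urllib.parse import urlencode

MAX_STRING_PROTEINS = 500  # per-request limit stated in the text


def build_string_network_url(identifiers, species, required_score=400,
                             physical_only=False):
    """Build (but do not send) a STRING network request URL.

    required_score is assumed to be on STRING's 0-1000 scale.
    """
    if len(identifiers) > MAX_STRING_PROTEINS:
        raise ValueError(
            f"at most {MAX_STRING_PROTEINS} proteins are allowed per request"
        )
    params = {
        # Identifiers are separated by carriage returns (encoded as %0D).
        "identifiers": "\r".join(identifiers),
        "species": species,
        "required_score": required_score,
        "network_type": "physical" if physical_only else "functional",
    }
    return "https://string-db.org/api/tsv/network?" + urlencode(params)


url = build_string_network_url(
    ["IL6", "TNF", "CXCL8"], species=9606, required_score=700, physical_only=True
)
print(url)
```

Validating the list size before building the request keeps the limit check in one place, independent of how the request is eventually sent.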

In addition to the above, the edges of protein-chemical networks can be formatted based on Molecular Action or Binding Affinity. By choosing Molecular Action, the edges in the network will represent the type (activation, inhibition, catalysis etc.) as well as the effect (positive, negative or unspecified) of each protein-chemical interaction. By choosing Binding Affinity, the edge thickness will indicate the binding affinity between the proteins and bound chemicals. The resulting network is shown in a separate Network Viewer panel, preserving the characteristic STRING network layout and style. An example of such networks is shown in Suppl. Fig. 1. In addition, options are given to view the generated network in STRING (protein-protein) or STITCH (protein-chemical) for further analysis. Finally, one can export a network as an image or as a tab-delimited file compatible with external network visualization applications.
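A tab-delimited edge list of the kind accepted by external network tools can be produced with a few lines of Python. The column names, the function name and the edge values below are placeholders chosen for illustration; the exact layout of the file exported by OnTheFly 2.0 is not specified in the text.

```python
import csv
import io


def edges_to_tsv(edges):
    """Serialize (source, target, score) tuples as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "score"])  # hypothetical header
    writer.writerows(edges)
    return buffer.getvalue()


# Placeholder edges; the scores are not taken from STRING or STITCH.
edges = [("IL6", "TNF", 0.9), ("IL6", "CXCL8", 0.8)]
tsv = edges_to_tsv(edges)
print(tsv, end="")
```

A file in this form, with one interaction per row, can be imported as an edge table by most network visualization applications.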

OnTheFly 2.0 is a web application implemented in R, using the R/Shiny package as well as HTML, CSS and JavaScript. The Shiny and ShinyJS packages are used as mediators to establish the connection between the R and JavaScript functions. The API of the EXTRACT web service, which utilizes the tagger text mining utility, is used to perform NER. Functional enrichment analysis is performed using the gprofiler2 package (the R implementation of g:Profiler) and aGOtool. Biological networks are constructed and visualized using the APIs of the STRING and STITCH databases. OnTheFly 2.0 is available as a web service and as a standalone package through a GitHub repository. The standalone version is fully functional in native Linux and other Unix-based operating systems. It can also run on Windows, by utilizing the Windows Subsystem for Linux (WSL) or other similar compatibility layers (e.g., Cygwin). The web service is fully functional in all major web browsers (Google Chrome, Mozilla Firefox, Microsoft Edge, Tor, Apple Safari, Opera).

To demonstrate the capacity of OnTheFly 2.0 for rapid extraction of biological information and knowledge discovery, we analyzed six published meta-analysis reports on clinical biomarkers of severe COVID-19 (55-60) (Suppl. Table 1). Texts in PDF format were annotated by NER, and the results were manually filtered to remove false positives and jointly processed for functional enrichment analysis. Reassuringly, we found "Respiratory failure", "Pneumonia" and "COVID-19" to be among the most significantly enriched diseases (Suppl. Table 2). The GO enrichment for biological processes (Suppl. Table 3) identified several GO terms related to inflammation, cell activation and response to stress, in line with COVID-19 being associated with exaggerated lung inflammation and systemic immune dysfunction. Similarly, the annotated text terms were found to be enriched for molecular functions that are associated with cytokine activity and cytokine receptor signaling (Suppl. Table 4). These results were supported by the UniProt keyword analysis, which revealed "Cytokine", "Inflammatory response", "Host-virus interaction", and "Host cell receptor for virus entry" to all be enriched (Suppl. Table 5). Analysis of putative protein-protein interactions (physical and functional associations) through the STRING option of OnTheFly 2.0 uncovered a cluster of interacting cytokines and other immune components that is pertinent to the "cytokine storm" of severe COVID-19 (Figure 4). Cytokines are also a recurring theme in the publication enrichment results, which, as one would hope, further included several COVID-19 studies (Suppl. Table 6).

Cellular/extracellular components predicted to be associated with biomarkers of severe COVID-19 included extracellular space (GO:0005615, GO:0005576), plasma membrane (GO:0009897, GO:0009986, GO:0098552) and, interestingly, membrane microdomains (also called "membrane rafts"; GO:0098857, GO:0045121) (Suppl. Table 7). The latter emerge as important cellular components implicated in i) the initial binding of SARS-CoV-2 to the ACE2 receptor, ii) virus internalization and iii) cell-to-cell transmission (reviewed in (61)). Pertinent to knowledge discovery, this biological information was extracted in the absence of specific reference to membrane microdomains in any of the six meta-analysis reports that were interrogated.

Several relevant KEGG pathways were also extracted (Suppl. Table 8), including "coronavirus disease - COVID-19" (KEGG: 05171; Suppl. Fig. 2), "viral protein interaction with cytokine and cytokine receptor" (KEGG: 04061) and "cytokine-cytokine receptor interaction" (KEGG: 04060). Interestingly, "Yersinia infection" (KEGG: 05135) was also identified as a relevant KEGG pathway with high significance (p-value < 10^-8). Yersinia pestis is the causative pathogen of pneumonic plague, one of the world's deadliest infectious diseases. Yersinia pestis infects pneumocytes and alveolar macrophages, triggering inflammasome-mediated IL-1β/IL-18 cytokine release (62) that is followed by neutrophil influx, exaggerated inflammation and lung tissue damage (63). These immune and lung tissue reactions to Yersinia pestis are reminiscent of those to severe SARS-CoV-2 infection (61) and warrant further investigation into the immunological mechanisms of response to these unrelated pathogens. Of additional interest is the predicted involvement of the "IL-17 signaling pathway" (KEGG: 04657) in severe COVID-19 (Suppl. Table 8), which is supported by a recent study reporting T cell skewing towards Th17, a specialized CD4+ effector T cell lineage characterized by secretion of IL-17 and IL-17F cytokines, in patients with COVID-19 pneumonia (64).

We also explored the REACTOME option of OnTheFly 2.0 to map and analyze biological pathways that are over-represented in the validation example. As shown in Suppl. Table 9, several cytokine pathways were predicted to be significantly associated with biomarkers of severe COVID-19. We note that predicted REACTOME pathways included "cellular senescence" despite the absence of specific references to this biological term in any of the six annotated meta-analysis reports under study. In line with this prediction, COVID-19 pneumonia has recently been associated with immunosenescence (64) and accelerated aging of pneumocytes (65). Overall, the aforementioned analyses underscore the practical utility of OnTheFly 2.0 for rapidly extracting biological information from texts and hence assisting knowledge discovery (Figure 4).

OnTheFly 2.0 has been redeveloped to use current technologies and overcome many of the problems of its predecessor (66). The GUI has been completely rewritten so that it no longer relies on a Java applet and instead uses R, Shiny, CSS, HTML and JavaScript technologies. The backend document format conversion has also been considerably improved, replacing commercial Windows-based converters with open-source, Unix-based ones, which furthermore do a much better job of preserving the original document layout. Moreover, compared to its predecessor, OnTheFly 2.0 can identify a broader spectrum of term types and supports OCR technology for processing images. Uploaded files are stored only temporarily on the OnTheFly 2.0 server for parsing, and no file backups, copies or personal data are kept. A more detailed comparison between OnTheFly 1.0 and OnTheFly 2.0 is presented in Table 1.

OnTheFly 2.0 is a powerful tool for identifying terms in locally stored documents ranging from texts and PDFs to Office and image files. Users can identify terms such as proteins, genes, chemical compounds, organisms, tissues, environments, diseases, phenotypes and Gene Ontology terms, and perform functional enrichment and network analysis upon selecting a set of biomedical entities. Furthermore, popup windows with informative summaries about a term and its links to external repositories are also generated. OnTheFly 2.0 can aid researchers in annotating locally stored documents and further exploring and analysing their identified biomedical entities in a fully automated way. We believe that, due to its capabilities and ease of use, OnTheFly 2.0 will reach a broad spectrum of users ranging from experimentalists to bioinformaticians.

A survey of named entity recognition and classification

Text-mining solutions for biomedical research: enabling integrative biology

Text mining resources for the life sciences

Named Entity Recognition and Relation Detection for Biomedical Information Extraction

EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation

PubTator: a web-based text mining tool for assisting biocuration

HunFlair: An Easy-to-Use Tool for State-of-the-Art Biomedical Named Entity Recognition

BioTextQuest(+): a knowledge integration platform for literature mining and concept discovery

Towards reliable named entity recognition in the biomedical domain

OGER++: hybrid multi-type entity recognition

KEGG: integrating viruses and cellular organisms

The reactome pathway knowledgebase

The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases

The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions

The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets

STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data

Exploring Networks in the STRING and Reactome Database

Cytoscape: a software environment for integrated models of biomolecular interaction networks

Gephi: An Open Source Software for Exploring and Manipulating Networks

NORMA-The network makeup artist: a web tool for network annotation visualization

A survey of visualization tools for biological network analysis

A Guide to Conquer the Biological Network Era Using Graph Theory

DAVID-WS: a stateful web service to facilitate gene/protein list analysis

PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API

WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs

Avoiding abundance bias in the functional annotation of post-translationally modified proteins

g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update)

gprofiler2 -- an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler

Gene Set Analysis: Challenges, Opportunities, and Future Research

Gene set analysis methods: a systematic comparison

Online publishing via pdf2htmlEX

An Overview of the Tesseract OCR Engine

Real-time tagging of biomedical entities

The environment ontology: contextualising biological and biomedical entities

NCBI Taxonomy: a comprehensive update on curation, resources and tools

The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources

Human Disease Ontology 2018 update: classification, content and workflow expansion

The mammalian phenotype ontology: enabling robust annotation and comparative analysis

Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

The Gene Ontology resource: enriching a GOld mine

PubChem in 2021: new data content and improved web interfaces

RAIN: RNA-protein Association and Interaction Networks

NMI and IFP35 serve as proinflammatory DAMPs during cellular infection and injury

WikiPathways: connecting communities

CORUM: the comprehensive resource of mammalian protein complexes-2019

Tissue-based map of the human proteome

The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation

miRTarBase 2020: updates to the experimentally validated microRNA-target interaction database

The Human Phenotype Ontology in 2021

Pfam: The protein families database in 2021

The InterPro protein families and domains database: 20 years on

DISEASES: text mining and data integration of disease-gene associations

UniProt: the universal protein knowledgebase in 2021

Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis

A meta-analysis of potential biomarkers associated with severity of coronavirus disease 2019 (COVID-19)

Cytokine elevation in severe and critical COVID-19: a rapid systematic review, meta-analysis, and comparison with other inflammatory syndromes

Diagnostic and prognostic value of hematological and immunological markers in COVID-19 infection: A meta-analysis of 6320 patients

Predictors of adverse prognosis in COVID-19: A systematic review and meta-analysis

Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis

COVID-19 enters the expanding network of apolipoprotein E4-related pathologies

Yersinia pestis activates both IL-1β and IL-1 receptor antagonist to modulate lung inflammation during pneumonic plague

Early host cell targets of Yersinia pestis during primary pneumonic plague

Marked T cell activation, senescence, exhaustion and skewing towards TH17 in patients with COVID-19 pneumonia

Alveolar type II cells harbouring SARS-CoV-2 show senescence with a proinflammatory phenotype

OnTheFly: a tool for automated document-based text annotation, data linking and network generation