key: cord-271693-7tg21up3
authors: Zheng, Fan; Zhang, She; Churas, Christopher; Pratt, Dexter; Bahar, Ivet; Ideker, Trey
title: Identifying persistent structures in multiscale ‘omics data
date: 2020-10-03
journal: bioRxiv
DOI: 10.1101/2020.06.16.151555
sha: 
doc_id: 271693
cord_uid: 7tg21up3

In any ‘omics study, the scale of analysis can dramatically affect the outcome. For instance, when clustering single-cell transcriptomes, is the analysis tuned to discover broad or specific cell types? Likewise, protein communities revealed from protein networks can vary widely in size depending on the method. Here we use the concept of “persistent homology”, drawn from mathematical topology, to identify robust structures in data at all scales simultaneously. Application to mouse single-cell transcriptomes significantly expands the catalog of identified cell types, while analysis of SARS-COV-2 protein interactions suggests hijacking of WNT. The method, HiDeF, is available via Python and Cytoscape.

Significant patterns in data often become apparent only when looking at the right scale. For example, single-cell RNA sequencing data can be clustered coarsely to identify broad categories of cells (e.g. mesoderm, ectoderm), or analyzed more finely to delineate highly specific subtypes (e.g. pancreas islet β-cells, thymus epithelium) [1] [2] [3]. Likewise, protein-protein interaction networks can reveal groups of proteins spanning a wide range of spatial dimensions, from protein dimers (e.g. leucine zippers) to larger complexes of dozens or hundreds of subunits (e.g. proteasome, nuclear pore) to entire organelles (e.g. centriole, mitochondria) [4]. Many different approaches have been devised or applied to detect structures in biological data, including standard clustering, network community detection, and low-dimensional data projection [5] [6] [7], some of which can be tuned for sensitivity to objects of a certain size or scale (so-called 'resolution parameters') [8, 9]. Even tunable algorithms, however, face the dilemma that the particular scale(s) at which the significant biological structures arise are usually unknown in advance.

Guidelines for detecting robust patterns across scales come from the field of topological data analysis, which studies the geometric "shape" of data using tools from algebraic topology and pure mathematics [10] . A fundamental concept in this field is "persistent homology" [11] , the idea that the core structures intrinsic to a dataset are those that persist across different scales.

Recently, this concept has begun to be applied to analysis of 'omics data and particularly biological networks [12, 13] . Here, we sought to integrate concepts from persistent homology with existing algorithms for network community detection, resulting in a fast and practical multiscale approach we call the Hierarchical community Decoding Framework (HiDeF).

HiDeF works in three phases to analyze the structure of a biological dataset (Methods). To begin, the dataset is formulated as a similarity network, depicting a set of biological entities (e.g. genes, proteins, cells, patients, or species) and pairwise connections among these entities (representing similarities in their data profiles). The goal of the first phase is to detect network communities, i.e. groups of densely connected biological entities. Communities are identified continually as the spatial resolution is scanned, producing a comprehensive pool of candidates across all scales of analysis (Fig. 1a). In the second phase, candidate communities arising at different resolutions are aligned pairwise to identify those that have been redundantly identified and are thus persistent (Fig. 1b). In the third phase, persistent communities are analyzed to identify cases where a community is fully or partially contained within another (typically larger) community, resulting in a hierarchical assembly of nested and overlapping biological structures (Fig. 1c,d). HiDeF is implemented as a Python package and can be accessed interactively in the Cytoscape network analysis and visualization environment [14] (Availability of data and materials).

We first explored the idea of measuring community persistence via analysis of synthetic datasets [15] in which communities were simulated and embedded in the similarity network at two different scales (Supplementary Fig. 1a; Methods) . Notably, the communities determined to be most persistent by HiDeF were found to accurately recapitulate the simulated communities at the two scales (Supplementary Fig. 1b-g) . In contrast, applying community detection algorithms at a fixed resolution had limited capability to capture both scales of simulated structures simultaneously (Supplementary Fig. 2; Methods) .

We next evaluated whether persistent community detection improves the characterization of cell types. We applied HiDeF to detect robust nested communities within cell-cell similarity networks based on the mRNA expression profiles of 100,605 single cells gathered across the organs and tissues of mice (obtained from two datasets in the Tabula Muris project [16] ; Methods).

These cells had been annotated with a controlled vocabulary of cell types from the Cell Ontology (CO) [17], via analyses of cell-type-specific expression markers [16]. We used groups of cells sharing the same annotations to define a panel of 136 reference cell types and measured the degree to which each reference cell type could be recapitulated by a HiDeF community of cells (Methods). We compared these results to TooManyCells [18] and Conos [19], two recently developed methods that generate nested communities of single cells in divisive and agglomerative manners, respectively (Methods). Reference cell types tended to better match communities generated by HiDeF than those of other approaches, with 65% (89/136) having a highly overlapping community (Jaccard index > 0.5) in the HiDeF hierarchy (Fig. 2a,b, Supplementary Fig. 3a,b). This favorable performance was observed consistently when adjusting HiDeF parameters to formulate a simple hierarchy, containing only the strongest structures, or a more complex hierarchy including additional communities that are less persistent but still significant (Fig. 2c, Supplementary Fig. 3c).

The top-level communities in the HiDeF hierarchy corresponded to broad cell lineages such as "T cell", "B cell", and "epidermal cell". Finer-grained communities mapped to more specific known subtypes (Fig. 2d) or, more frequently, putative new subtypes within a lineage. For example, "epidermal cell" was split into two distinct epidermal tissue locations, skin and tongue; further splits suggested the presence of still more specific uncharacterized cell types (Fig. 2e) .

HiDeF communities also captured known cell types that were not apparent from 2D visual embeddings (Supplementary Fig. 4a,b) , and also suggested new cell-type combinations. For example, astrocytes were joined with two communities of neuronal cells to create a distinct cell type not observed in the hierarchies of TooManyCells [18] , Conos [19] , or a two-dimensional data projection with UMAP [20] (Fig. 2f, Supplementary Fig. 4c ). This community may correspond to the grouping of a presynaptic neuron, postsynaptic neuron, and a surrounding astrocyte within a so-called "tripartite synapse" [21] .

Next, we applied HiDeF to analyze protein-protein interaction networks, with the goal of characterizing protein complexes and higher-order protein assemblies spanning spatial scales.

We benchmarked this task by the agreement between HiDeF communities and the Gene Ontology (GO) [22] , a database that manually assigns proteins to cellular components, processes, or functions based on curation of literature (Methods). Application to protein-protein interaction networks from budding yeast and human found that HiDeF captured knowledge in GO more significantly than previous pipelines proposed for this task, including the NeXO approach to hierarchical community detection [23] and standard hierarchical clustering of pairwise protein distances calculated by three recent network embedding approaches [24] [25] [26] (Fig. 3a, Fig. 7) .

We also applied HiDeF to analyze a collection of 27 human protein interaction networks [27, 28] . We found significant differences in the distributions of community sizes across these networks, loosely correlating with the different measurement approaches used to generate each network. For example, BioPlex 2.0, a network characterizing biophysical protein-protein interactions by affinity-purification mass-spectrometry (AP-MS) [29] , was dominated by small communities of 10-50 proteins, whereas a network based on mRNA coexpression [30] tended towards larger-scale communities of >50 proteins. In the middle of this spectrum, the STRING network, which integrated biophysical protein interactions and gene co-expression with a variety of other features [31] , contained both small and large communities (Fig. 3c) . In agreement with the observation above, the hierarchy of BioPlex had a relatively shallow shape in comparison to that of STRING (and other integrated networks including GIANT and PCNet [27, 32] ), in which communities across many scales formed a deep hierarchy (Fig. 3d ,e; Availability of data and materials).

In contrast to clustering frameworks, HiDeF recognizes when a community is contained by multiple parent communities, which in the context of protein-protein networks suggests that the community participates in diverse pleiotropic biological functions. For example, a community corresponding to the MAPK (ERK) pathway participated in multiple larger communities, including RAS and RSK pathways, sodium channels, and actin capping, consistent with the central roles of MAPK signaling in these distinct biological processes [33] (Supplementary Fig. 8) . The hierarchies of protein communities identified from each of these networks have been made available as a resource in the NDEx database [34] (Availability of data and materials).

To explore multiscale data analysis in the context of an urgent public health issue, we considered a recent application of AP-MS that characterized interactions between the 27 SARS-COV-2 viral subunits and 332 human host proteins [35]. We used network propagation to select a subnetwork of the BioPlex 3.0 human protein interactome [36] proximal to these 332 proteins (1948 proteins and 22,835 interactions) and applied HiDeF to identify its community structure (Methods). Among the 251 persistent communities identified (Fig. 3f), we noted one consisting of human Transducin-Like Enhancer (TLE) family proteins, TLE1, TLE3, and TLE5, which interacted with SARS-COV-2 Nsp13, a highly conserved RNA synthesis protein in corona and other nidoviruses (Fig. 3g) [37]. TLE proteins are well-known inhibitors of the Wnt signaling pathway [38]. Inhibition of WNT, in turn, has been shown to reduce coronavirus replication [39] and has recently been proposed as a COVID-19 treatment [40]. If interactions between Nsp13 and TLE proteins can be shown to facilitate activation of WNT, TLEs may be of potential interest as drug targets.

Community persistence provides a basic metric for distilling biological structure from data, which can be tuned to select only the strongest structures or to include weaker patterns that are less persistent but still significant. This concept applies to diverse biological subfields, as demonstrated here for single cell transcriptomics and protein interaction mapping. While these subfields currently employ very different analysis tools which largely evolve separately, it is perhaps high time to seek out core concepts and broader fundamentals around which to unify some of the ongoing development efforts. To that effect, the methods explored here have wide applicability to analyze the multiscale organization of many other biological systems, including those related to chromosome organization, the microbiome and the brain.

Consider an undirected network graph $G$, representing a set of biological objects (vertices) and a set of similarity relations between these objects (edges). Examples of interest include networks of cells, where edges represent pairwise cell-cell similarity in transcriptional profiles characterized by single-cell RNA-seq, or networks of proteins, where edges represent pairwise protein-protein biophysical interactions. We seek to group these objects into communities (subsets of objects) that appear at different scales and identify approximate containment relationships among these communities, so as to obtain a hierarchical representation of the network structure. The workflow is implemented in three phases. Phase I identifies communities in $G$ at each of a series of spatial resolutions $\gamma$. Phase II identifies which of these communities are persistent by way of a pan-resolution community graph $G_C$, in which vertices represent communities, including those identified at each resolution, and each edge links a pair of similar communities arising at different resolutions. Persistent communities correspond to large components in $G_C$. Phase III constructs a final hierarchical structure $H$ that represents containment and partial containment relationships (directed edges) among the persistent communities (vertices).

Community detection methods generally seek to maximize a quantity known as the network modularity, as a function of community assignment of all objects [41]. A resolution parameter integrated into the modularity function can be used to tune the scale of the communities identified [9, 42, 43], with larger/smaller scale communities having more/fewer vertices on average (Fig. 1a). Of the several types of resolution parameter that have been proposed, we adopted that of the Reichardt-Bornholdt configuration model [42], which defines the generalized modularity as:

$$Q(G, \vec{p}, \gamma) = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(i, j)$$

where $\vec{p}$ defines a mapping from objects in $G$ to community labels; $k_i$ is the degree of vertex $i$; $m$ is the total number of edges in $G$; $\gamma$ is the resolution parameter; $\delta(i, j)$ indicates that vertices $i$ and $j$ are assigned to the same community by $\vec{p}$; and $A$ is the adjacency matrix of $G$. To determine 

Two values satisfying the above formula are defined as -proximal. The sampling step was set to 0.1 in practice, which sufficiently captured the interesting structures in the data; it is conceptually similar to the Nyquist sampling frequency in signal processing [44]. We used $\gamma_{min}$ = 0.001, which we found always resulted in the theoretical minimum number of communities, equal to the number of connected components in $G$. We used $\gamma_{max}$ = 20 for single-cell data ( Fig. 2 
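To make the role of the resolution parameter concrete, the following is a minimal sketch of the Reichardt-Bornholdt generalized modularity for an unweighted, undirected graph without self-loops. The function name `generalized_modularity` and the toy graph in the usage note are illustrative assumptions and are not taken from the HiDeF package; the sketch only evaluates the score for a given assignment and does not optimize it.

```python
from collections import defaultdict


def generalized_modularity(edges, labels, gamma=1.0):
    """Reichardt-Bornholdt modularity of a fixed community assignment.

    edges  : list of (u, v) pairs, each undirected edge listed once
             (assumed non-empty, no self-loops)
    labels : dict mapping every vertex to its community label
    gamma  : resolution parameter
    """
    edges = list(edges)
    m = len(edges)                      # total number of edges
    degree = defaultdict(int)           # k_i for each vertex
    internal = defaultdict(int)         # edges with both ends in a community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1

    degree_sum = defaultdict(int)       # summed degree per community
    for vertex, k in degree.items():
        degree_sum[labels[vertex]] += k

    # Grouping the double sum over vertex pairs by community gives
    # Q = sum_c [ e_c / m - gamma * (K_c / 2m)^2 ].
    return sum(
        internal[c] / m - gamma * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

On a toy graph of two triangles joined by a single edge, splitting the triangles scores higher than merging them at gamma = 1, while at gamma = 0.001 the single merged community scores higher, which mirrors the statement that very low resolutions yield one community per connected component.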

To identify persistent communities, we define the pairwise similarity between any two communities $a$ and $b$ as the Jaccard similarity of their sets of objects, $s(a)$ and $s(b)$:

$$J(a, b) = \frac{|s(a) \cap s(b)|}{|s(a) \cup s(b)|}$$
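The idea of linking similar communities across resolutions and keeping the large connected components can be sketched as follows. This is a simplified illustration under stated assumptions: any two communities from different resolutions are compared (the text suggests that only nearby resolutions are linked), persistence is taken to be the number of communities in a component, and the names `persistent_components`, `tau` and `chi` are hypothetical labels for the similarity and persistence thresholds.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of objects."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def persistent_components(communities, tau=0.75, chi=5):
    """Group similar communities found at different resolutions.

    communities : list of (resolution_index, members) pairs
    tau         : minimum Jaccard similarity for linking two communities
    chi         : minimum number of communities in a persistent component

    Returns a list of components, each a list of indices into `communities`.
    """
    parent = list(range(len(communities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(communities)), 2):
        (res_i, mem_i), (res_j, mem_j) = communities[i], communities[j]
        if res_i != res_j and jaccard(mem_i, mem_j) >= tau:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(communities)):
        groups.setdefault(find(i), []).append(i)
    return [group for group in groups.values() if len(group) >= chi]
```

A community that reappears almost unchanged at five resolutions forms one component of size five and is kept, whereas a community seen at a single resolution forms a singleton component and is dropped.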

We initialize a hierarchical structure represented by $H$, a directed acyclic graph (DAG) in which each vertex represents a persistent community. A root vertex is added to represent the community of all objects. The containment relationship between two vertices, $v$ and $w$, is quantified by the containment index (CI):

$$CI(v, w) = \frac{|s(v) \cap s(w)|}{|s(w)|}$$

which measures the fraction of the objects in $w$, denoted $s(w)$, that are shared with $v$. An edge is added from $v$ to $w$ in $H$ if $CI(v, w)$ is larger than a threshold $\sigma$ ($w$ is $\sigma$-contained by $v$). Since $J(v, w) < \tau$ for all $v$, $w$ (a property established by the procedure for connecting similar communities in phase II), setting $\sigma \geq 2\tau/(1 + \tau)$ guarantees $H$ to be acyclic. In practice we used a relaxed threshold $\sigma = \tau$, which we found generally maintains the acyclic property but includes additional containment relations.
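The bound on the containment threshold can be checked with a short derivation. The sketch below assumes the notation $s(v)$ for the set of objects in community $v$, $\tau$ for the Jaccard threshold of phase II, and $\sigma$ for the containment threshold. It treats only mutual containment between two communities and is offered as a plausibility check, not as the authors' proof.

```latex
\begin{align*}
&\text{Suppose } CI(v,w) \ge \sigma \text{ and } CI(w,v) \ge \sigma,
  \text{ and let } I = |s(v) \cap s(w)|. \\
&\text{Then } |s(w)| \le I/\sigma \text{ and } |s(v)| \le I/\sigma, \text{ so} \\
&J(v,w) = \frac{I}{|s(v)| + |s(w)| - I}
  \;\ge\; \frac{I}{2I/\sigma - I} = \frac{\sigma}{2 - \sigma}. \\
&\text{Because } \sigma/(2-\sigma) \text{ increases with } \sigma
  \text{ and equals } \tau \text{ at } \sigma = 2\tau/(1+\tau), \\
&\sigma \ge \frac{2\tau}{1+\tau} \;\Longrightarrow\; J(v,w) \ge \tau,
  \text{ which contradicts } J(v,w) < \tau.
\end{align*}
```

For a Jaccard threshold of 0.75 the strict bound evaluates to 6/7, or about 0.857, so a relaxed containment threshold equal to 0.75 does not exclude mutual containment on its own, in line with the cycle handling described next.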

In the (in our experience rare) event that cycles are generated in $H$, i.e. $CI(v, w) \geq \sigma$ and $CI(w, v) \geq \sigma$, we add a new community to $H$, the union of $v$ and $w$, and remove $v$ and $w$ from $H$.

Finally, redundant relations are removed by obtaining a transitive reduction [45] of $H$, which represents the hierarchy returned by HiDeF describing the organization of communities.
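The third phase can be illustrated with a small standard-library sketch that adds containment edges and then removes redundant ones. It assumes non-empty communities and an input without mutual containment (so the edge set is already acyclic); the merging of cyclic pairs is not implemented. The names `containment_edges`, `transitive_reduction` and `sigma` are illustrative and do not come from the HiDeF code base.

```python
def containment_index(parent_members, child_members):
    """Fraction of the child's objects that are shared with the parent."""
    child = set(child_members)
    return len(set(parent_members) & child) / len(child)


def containment_edges(communities, sigma=0.75):
    """communities: dict name -> set of objects. Returns (parent, child) edges."""
    edges = set()
    for v, s_v in communities.items():
        for w, s_w in communities.items():
            if v != w and containment_index(s_v, s_w) >= sigma:
                edges.add((v, w))
    return edges


def transitive_reduction(edges):
    """Drop every edge (u, w) implied by a longer path from u to w.

    Assumes the edges form a directed acyclic graph.
    """
    children = {}
    for u, w in edges:
        children.setdefault(u, set()).add(w)

    def reachable(start):
        # All vertices reachable from `start` by following edges.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in children.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set()
    for u, w in edges:
        # Keep (u, w) unless w can be reached through another child of u.
        if not any(w in reachable(c) for c in children[u] if c != w):
            reduced.add((u, w))
    return reduced
```

For a root community containing a community A, which in turn contains a community B, the containment step yields three edges (root to A, root to B, A to B) and the reduction removes the redundant root-to-B edge.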

The biological objects assigned to each community are expanded to include all objects assigned to its descendants. Throughout this study, we used the parameters $\tau$ = 0.75, $\chi$ = 5, $p$ = 75. Note that since $\chi$ is a threshold of minimum persistence, the results under a larger value $\chi'$ can be produced by simply removing communities with persistence lower than $\chi'$ (Figs. 2c, 3a- Fig. 9 ). Different combinations of the parameters $\tau$ and $p$ typically do not significantly change the performance of HiDeF in the benchmark tests on protein-protein interaction networks (Supplementary Fig. 6), except that certain parameters (e.g. $\tau$ = 0.9) are less robust to network perturbation (i.e. randomly deleting edges from networks). We found that combining HiDeF with node embedding resolved this issue and further improved the performance and robustness (Supplementary Fig. 7; see sections below).

Simulated network data were generated using the Lancichinetti-Fortunato-Radicchi (LFR) method [15] (Supplementary Figs. 1,2) . We used an available implementation (LFR benchmark graphs package 5 at http://www.santofortunato.net/resources) to generate benchmark networks with two levels of embedded communities, a coarse-grained (macro) level and a fine-grained (micro) level. Within each level, a vertex was exclusively assigned to one community. Two parameters, c and f, were used to define the fractions of edges violating the simulated community structures at the two levels. All other edges were restricted to occur between vertices assigned to the same community (Supplementary Fig. 1a) . We fixed other parameters of the LFR method to values explored by previous studies [9] . Some community detection algorithms include iterations of local optimization and vertex aggregation, a process that, like HiDeF, also defines a hierarchy of communities, albeit as a tree rather than a DAG. We demonstrated that without scanning multiple resolutions, this process alone was insufficient to detect the simulated communities at all scales (Supplementary Fig. 2) .

We used Louvain and Infomap [46, 47] , which have stable implementations and have shown strong performance in previous community detection studies [48] . For Louvain, we optimized the and other parameters to default. In general, these settings generated trees with two levels of communities. Note that Infomap sometimes determined that the input network was nonhierarchical, in which cases the coarse-and fine-grained communities were identical by definition.

Mouse single-cell RNA-seq data ( Fig. 2; Supplementary Fig. 3 Identical analyses were applied to the FACS and droplet datasets, yielding hierarchies of 273 and 279 communities, respectively (Fig. 2d). ScanPy 1.4.5 [49] was used to create tSNE or UMAP embeddings and associated two-dimensional visualizations [20] as baselines for comparison (Fig. 2e,f; Supplementary Fig. 3a,b). Through previous analysis of the single-cell RNA data, all cells in these datasets had been annotated with matching cell-type classes in the Cell Ontology (CO) [17]. Before comparing these annotations with the communities detected by HiDeF, we expanded the set of annotations of each cell according to the CO structure, to ensure the set also included all of the ancestor cell types of the type that was annotated. For example, CO has the relationship "[keratinocyte] (is_a) [epidermal_cell]", and thus all cells annotated as "keratinocyte" are also annotated as "epidermal cell". The CO was obtained from http://www.obofoundry.org/ontology/cl.html and processed by the Data Driven Ontology Toolkit (DDOT) [50], retaining "is_a" relationships only.
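The annotation expansion along "is_a" relationships amounts to collecting all ancestors of a term. A minimal sketch follows; the function name `expand_with_ancestors` and the dictionary representation of the ontology are assumptions made for illustration, and the actual processing in the study was done with DDOT.

```python
def expand_with_ancestors(annotation, is_a):
    """Return the annotated term together with all of its ancestor terms.

    annotation : the cell type a cell was annotated with
    is_a       : dict mapping a term to a list of its direct parent terms
    """
    expanded, stack = {annotation}, [annotation]
    while stack:
        term = stack.pop()
        for parent in is_a.get(term, ()):
            if parent not in expanded:      # guards against repeated visits
                expanded.add(parent)
                stack.append(parent)
    return expanded
```

With a toy ontology in which "keratinocyte" is_a "epidermal cell" and, purely for illustration, "epidermal cell" is_a "epithelial cell", a cell annotated as "keratinocyte" receives all three labels.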

We compared HiDeF to TooManyCells [18] and Conos [19] as baseline methods. The former is a divisive method which iteratively applies bipartite spectral clustering to the cell population until the modularity of the partition is below a threshold; the latter uses the Walktrap algorithm to agglomeratively construct the cell-type hierarchy [51]. We chose to compare with these methods because their ability to identify multiscale communities was either the main advertised feature or had been shown to be a major strength. TooManyCells (version 0.2.2.0) was run with the parameter "min-modularity" set to 0.025 as recommended in the original paper [18], with other settings set to default. This process generated dendrograms (binary trees) with 463 communities. The Walktrap algorithm was run from the Conos package (version 1.2.1) with the parameter "step" set to 20 as recommended in the original paper [19], yielding a dendrogram.

The greedyModularityCut method in the Conos package was used to select N fusions in the original dendrogram, resulting in a reduced dendrogram with 2N+1 communities (including N internal and N+1 leaf nodes). Here we used N = 125, generating a hierarchy with 251 communities (Fig. 2c) .

The communities in each hierarchy were ranked to analyze the relationships between celltype recovery and model complexity (Fig. 2c, Supplementary Fig. 3c) . HiDeF communities were ranked by their persistence; Conos and TooManyCells communities were ranked according to the modularity scores those methods associate with each branch-point in their dendrograms.

Conos/Walktrap uses a score based on the gain of modularity in merging two communities, whereas TooManyCells uses the modularity of each binary partition.

We obtained a total of 27 human protein interaction networks gathered previously by survey studies [27, 28] , along with one integrated network from budding yeast (S. cerevisiae) that had been used in a previous community detection pipeline, NeXO [23] . This collection contained two versions of the STRING interaction database, with the second removing edges from text mining (labeled STRING-t versus STRING, respectively; Fig. 3 ). Benchmark experiments for the recovery of the Gene Ontology (GO) were performed with STRING and the yeast network ( Fig.   3a,b, Supplementary Fig. 4) . The reference GO for yeast proteins was obtained from http://nexo.ucsd.edu/. A reference GO for human proteins was downloaded from http://geneontology.org/ via an API provided by the DDOT package [50] .

HiDeF was directly applied to all of the above benchmark networks. The NeXO communities were obtained from http://nexo.ucsd.edu/, with a robustness score assigned to each community. To benchmark communities created by hierarchical clustering, we first calculated three versions of pairwise protein distances (HC.1-3; Fig. 3a,b; Supplementary Fig. 4) using Mashup, DSD and deepNF [24] [25] [26]. Mashup was used to embed each protein as a vector, with 500 and 800 dimensions for yeast and human, respectively, as recommended in the original paper. A pairwise distance was computed for each pair of proteins as the cosine distance between the two vectors.
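For reference, the cosine distance between two embedding vectors can be computed as in the short sketch below. The function name is illustrative, the vectors are assumed to be non-zero and of equal length, and real embeddings would have hundreds of dimensions.

```python
import math


def cosine_distance(x, y):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

Orthogonal vectors have distance 1 and vectors pointing in the same direction have distance 0 regardless of their lengths, which is why the measure is convenient for embeddings whose magnitudes are not meaningful.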

Similarly, deepNF was used to embed each protein into a 500-dimensional vector by default. DSD generates pairwise distances by default. Given these pairwise distances, UPGMA clustering was applied to generate binary hierarchical trees. Following the procedure given in the NeXO and Mashup papers [23, 24] communities with <4 proteins were discarded.

Since all methods had slight differences in the resulting number of communities, communities from each method were sorted in decreasing order of score, enabling comparison of results across the same numbers of top-ranked communities. HiDeF communities were ranked by persistence. NeXO communities were ranked by the robustness value assigned to each community in the original paper [23]. To rank each community c of hierarchical clustering (branch in the dendrogram), a one-sided Mann-Whitney U-test was used to test for significant differences between two sets of protein pairwise distances: (set 1) all pairs consisting of a protein in c and a protein in the sibling community of c; (set 2) all pairs consisting of a protein in each of the two children communities of c. The communities were sorted by the one-sided p-value of significance that distances in set 1 are greater than those in set 2.

We adopted the average F1-score metric [52] to evaluate the overall performance of multiscale structure identification, focusing on the recovery of reference communities. Given a set of reference communities $C^*$ and a set of computationally detected communities $\vec{C}$, the score was defined as:

$$\overline{F1}(C^*, \vec{C}) = \frac{1}{|C^*|} \sum_{c_i \in C^*} F1(c_i, g(c_i))$$

where $g(c_i)$ is the best match of $c_i$ in $\vec{C}$, defined as follows:

$$g(c_i) = \arg\max_{c_j \in \vec{C}} F1(c_i, c_j)$$

and $F1(c_i, c_j)$ is the harmonic mean of Precision($c_i$, $c_j$) and Recall($c_i$, $c_j$). The calculations were conducted by the xmeasures package (https://github.com/eXascaleInfolab/xmeasures) [53].
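The scoring described above can be sketched in a few lines. This is an unweighted mean over reference communities, written for illustration only; the study used the xmeasures package, whose exact options are not reproduced here, and the function names `f1` and `average_f1` are assumptions.

```python
def f1(reference, detected):
    """Harmonic mean of precision and recall of `detected` against `reference`."""
    reference, detected = set(reference), set(detected)
    overlap = len(reference & detected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def average_f1(reference_communities, detected_communities):
    """Mean, over reference communities, of the F1 score of the best match."""
    best = [
        max(f1(ref, det) for det in detected_communities)
        for ref in reference_communities
    ]
    return sum(best) / len(best)
```

Because only the best-matching detected community counts for each reference community, adding extra detected communities can never lower the score, which is one reason the comparison in the text is made at equal numbers of top-ranked communities.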

HiDeF was directly applied to the original networks in most of our analyses of protein-protein interaction networks, and compared with the results of hierarchical clustering following the network embedding techniques [24, 26]. We sought to explore whether the strengths of network embedding and HiDeF could be combined to further improve performance and robustness to parameter choices (Supplementary Fig. 7). We borrowed the idea of the shared-nearest-neighbor (SNN) graph that we had used in the analyses of single-cell data. We wrote a custom script that uses the 500-dimensional node embeddings of the STRING network as the input to the Seurat FindNeighbors function [3]. The parameters of this function were left at their defaults. The output SNN graph has 1.65 × 10^6 edges, which is of the same order of magnitude as the original network (2.23 × 10^6 edges). We then applied HiDeF to this SNN graph with different combinations of parameters (Supplementary Fig. 7).

332 human proteins identified to interact with SARS-COV-2 viral protein subunits were obtained from a recent study [35] . This list was expanded to include additional human proteins connected to two or more of the 332 virus-interacting human proteins in the new BioPlex 3.0 network [36] .

These operations resulted in a network of 1948 proteins and 22,835 interactions. HiDeF was applied to this network with the same parameter settings as for other protein-protein interaction networks (see previous Methods sections), and enrichment analysis was performed via g:Profiler [54] (Fig. 3f,g) .
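The expansion rule stated above (add any protein connected to two or more virus-interacting proteins, then keep the interactions among the selected proteins) can be sketched as follows. The function names and the toy identifiers are hypothetical, and the sketch follows the neighbor-count rule as worded in this section; it does not implement a network propagation procedure.

```python
from collections import defaultdict


def expand_seeds(edges, seeds, min_seed_neighbors=2):
    """Add non-seed proteins linked to at least `min_seed_neighbors` seeds."""
    seeds = set(seeds)
    seed_neighbors = defaultdict(set)   # non-seed protein -> adjacent seeds
    for u, v in edges:
        if u in seeds and v not in seeds:
            seed_neighbors[v].add(u)
        if v in seeds and u not in seeds:
            seed_neighbors[u].add(v)
    added = {
        protein
        for protein, nbrs in seed_neighbors.items()
        if len(nbrs) >= min_seed_neighbors
    }
    return seeds | added


def induced_subnetwork(edges, nodes):
    """Keep only the interactions whose two endpoints are both selected."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]
```

In a toy network where protein A touches two seeds and protein B touches only one, A is added and B is left out, so the induced subnetwork retains only the seed-to-A interactions.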

Not applicable.

Not applicable. These models include the hierarchy of murine cell types (Fig. 2) , the hierarchies of yeast and human protein communities identified through protein network analysis, and the hierarchy of human protein complexes targeted by SARS-COV2 (Fig. 3) .

T.I. is cofounder of Data4Cure, is on the Scientific Advisory Board, and has an equity interest. T.I. . A yeast network [23] and the human STRING network [31] were used as the inputs of a and b, respectively. HC.1-3 represent UPGMA Hierarchical Clustering of pairwise distances generated by Mashup, DSD, and deepNF [24] [25] [26] , respectively. c, Distributions of community sizes (x-axis, number of proteins) for three human protein networks: BioPlex 2.0 [29] , Coexpr-GEO [30] , and STRING [31] . 

Supplementary Figure 1 . Exploring simulated networks. a, The LFR generative model [15] was used to simulate networks with 1000 vertices and average degree 10 (Methods). The simulation included two layers of communities, "coarse" (10-20 communities, 50-100 vertices per community) and "fine" (25- Companion plots to panels (b-d). Points represent identified communities, delineated by size (y axis) and persistence (x axis). Blue/gray point colors indicate a match/non-match to a true community in the simulated network (Jaccard similarity > 0.75). Note that when noise is low (e), the highest persistence communities correctly recover simulated communities with near-perfect accuracy, e.g. for persistence threshold >20.

HiDeF is compared with the Louvain and Infomap algorithms [46, 47] , with Louvain and Infomap fixed at their default single resolutions (Methods). The three plots (a-c) compare the performance of the three algorithms in recovering simulated communities at different settings of the coarse/fine mixing parameters (see Supplementary Fig. 1 Clustering following any of three protein pairwise distance functions (Mashup, DSD, and deepNF) [24] [25] [26] . 

Using the performance analysis depicted in Fig. 3b, the Area Under Curve (AUC) was computed for different sets of HiDeF parameters ($p$, $\tau$). This AUC was compared to that of the best baseline tool, HC.3 (i.e. hierarchical clustering of pairwise distances generated by deepNF [26]) to generate an equal number of communities (Methods). Note the ratio HiDeF AUC / HC.3 AUC is usually higher than 1, indicating the favorable performance of HiDeF except for very high values of the $\tau$ parameter. As per Fig. 3b, the analysis was undertaken using the STRING network and the GO Cellular Component branch. b, Similar analysis with subsampling of network edges (in which a random 10% of network edges are removed prior to community detection at each resolution). higher persistence (y axis) than a given threshold (x axis). e-f, Scatterplots of community size (y axis) versus persistence (x axis). The left column characterizes the single-cell transcriptomics data (Fig. 2, Supplementary Fig. 3). The right column (panel b, d, f) characterizes the yeast and human protein-protein interaction datasets (Fig. 3a-b).

The Human Cell Atlas

Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis

Integrating single-cell transcriptomic data across different conditions, technologies, and species

Molecules into cells: specifying spatial architecture

Data clustering: a review

Community detection in networks: A user guide

Visualizing Data using t-SNE

Analysis of the structure of complex networks at different resolution levels

Significant scales in community structure

Persistent homology-a survey

A topological paradigm for hippocampal spatial map formation using persistent homology

Homological scaffolds of brain functional networks

Cytoscape: a software environment for integrated models of biomolecular interaction networks

Benchmark graphs for testing community detection algorithms

Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris

The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability

TooManyCells identifies and visualizes relationships of single-cell clades

Joint analysis of heterogeneous single-cell RNA-seq dataset collections

Dimensionality reduction for visualizing single-cell data using UMAP

Tripartite synapses: astrocytes process and control synaptic information

Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

A gene ontology inferred from molecular networks

Compact Integration of Multi-Network Topology for Functional Analysis of Genes

Going the distance for protein function prediction: a new distance metric for protein interaction networks

deepNF: deep network fusion for protein function prediction

Systematic Evaluation of Molecular Networks for Discovery of Disease Genes

Assessment of network module identification across complex diseases

Architecture of the human interactome defines protein communities and disease networks

A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genomewide experimental datasets

Understanding multicellular function and disease with human tissue-specific networks

Activation and Function of the MAPKs and Their Substrates, the MAPK-Activated Protein Kinases

NDEx 2.0: A Clearinghouse for Research on Cancer Pathways

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Dual Proteome-scale Networks Reveal Cellspecific Remodeling of the Human Interactome

The Nonstructural Proteins Directing Coronavirus RNA Synthesis and Processing

Molecular functions of the TLE tetramerization domain in Wnt target gene repression

Inhibition of severe acute respiratory syndrome coronavirus replication by niclosamide

Broad Spectrum Antiviral Agent Niclosamide and Its Therapeutic Potential

Finding and evaluating community structure in networks

Statistical mechanics of community detection

Introduction to Digital Signal Processing

The Transitive Reduction of a Directed Graph

Fast unfolding of communities in large networks

Maps of random walks on complex networks reveal community structure

SCANPY: large-scale single-cell gene expression data analysis

DDOT: A Swiss Army Knife for Investigating Data-Driven Biological Ontologies

Computing communities in large networks using random walks

Overlapping community detection at scale: a nonnegative matrix factorization approach

Accuracy evaluation of overlapping and multiresolution clustering algorithms on large datasets

g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update)

The Reactome Pathway Knowledgebase

We are grateful for the helpful discussions with Drs. Jianzhu Ma, Karen Mei, and Daniel Carlin.

Reactome [55] .