key: cord-0544992-hazt4ve8
authors: Li, Qi; Milenkovic, Tijana
title: Improving supervised prediction of aging-related genes via dynamic network analysis
date: 2020-05-07
journal: nan
DOI: nan
sha: 79dd0b536a6a6467bd203d1166f84c5d0ff2e126
doc_id: 544992
cord_uid: hazt4ve8

Motivation: This study focuses on supervised prediction of aging-related genes from -omics data. Unlike gene expression methods that capture aging-specific information but study genes in isolation, or protein-protein interaction (PPI) network methods that account for PPIs but the PPIs are context-unspecific, we recently integrated the two data types into an aging-specific PPI subnetwork, which yielded more accurate aging-related gene predictions. However, a dynamic aging-specific subnetwork did improve prediction performance compared to a static aging-specific subnetwork, despite the aging process being dynamic. Results: So, here, we propose computational advances towards improving prediction accuracy from a dynamic aging-specific subnetwork. We develop a supervised learning model that when applied to a dynamic subnetwork yields extremely high prediction performance, with F-score of 91.4%, while the best model on any static subnetwork yields F-score of"only"74.3%. Hence, our predictive model could guide with high confidence the discovery of novel aging-related gene candidates for future wet lab validation. Contact: tmilenko@nd.edu

This unexpected finding could be because the dynamic subnetwork was inferred using the induced approach, which is quite naive. Recently, an alternative, methodologically more advanced notion of network propagation (NP) was used for inference of a static context-specific subnetwork in the context of cancer [19] , with prominent NP approaches being NetWalk [20] and HotNet2 [21] . NP maps expression levels (i.e., activities) onto the genes in the entire contextunspecific PPI network. Then, NP propagates the activities via random walks or diffusion, to assign condition-specific weights to the nodes (genes, i.e., their protein products) or edges (PPIs) in the entire network. Finally, NP assumes that it is the highest-weighted network regions that are the most relevant for the condition of interest, i.e., that such regions form the context-specific subnetwork. For information on methodological differences between NetWalk and HotNet2, please refer to their original publications or to work by Newaz et al. [22] .

Our group extended these NP approaches to allow for inference of a dynamic context-specific subnetwork in the context of aging [22] . Then, the NP-based subnetworks were evaluated in an unsupervised learning context. So, while the inferred NP-based subnetworks are relevant for this paper, the unsupervised analyses of the networks by Newaz et al. [22] fall outside of its scope.

Here, besides the induced dynamic aging-specific subnetwork by Faisal et al. [4] , we also analyze the NP dynamic subnetworks by Newaz et al. [22] , plus static aging-specific counterparts of all dynamic subnetworks, with a hypothesis that a dynamic aging-specific subnetwork will be superior to all static aging-specific subnetworks in the task of supervised prediction of aging-related genes.

In each (dynamic vs. static, induced vs. NP) aging-specific subnetwork, we label genes as aging-or (currently) non-aging-related by relying on GenAge, a trusted source of aging-related knowledge [23] .

For each subnetwork, we develop a predictive model that consists of a network feature, dimensionality choice, and classifier. We consider 11 features, seven of which are dynamic, i.e., can be extracted from a dynamic subnetwork, and four of which are static, i.e., can be extracted from a static subnetwork. For each feature, we examine its full version, as well as its seven (non-)linear dimensionality-reduced versions. We couple each feature (version) with nine classifiers. For each dynamic subnetwork, we consider 7 × (1 + 7) × 9 = 504 predictive models, i.e., combinations of feature, dimensionality choice, and classifier. For each static subnetwork, we consider 4 × (1 + 7) × 9 = 288 predictive models.

For each subnetwork, we evaluate each predictive model via cross-validation [18] : we train on a subset of the agingand non-aging-related genes and test prediction accuracy on the remaining genes. We evaluate accuracy of a predictive model in terms of the area under the precision-recall curve (AUPR), precision, recall, and F-score. Then, for each subnetwork, we choose the most accurate predictive model. Finally, we compare accuracy of the different subnetworks.

To validate our hypothesis, it would suffice for the best dynamic subnetwork to yield higher prediction accuracy than all considered static subnetworks. Indeed, we find this to hold, which justifies the need for our study. Importantly, our best model on a dynamic subnetwork yields extremely high accuracy when predicting aging-related genes, with AUPR of 95.6%, precision of 93.8%, recall of 89.2%, and F-score 91.4%.

During our comprehensive evaluation, we examine two additional questions. First, the above results are when using GenAge to label genes as aging-vs. non-aging-related for classification. Human genes in GenAge are sequence orthologs of aging-related genes in model species. Because not all human genes have sequence orthologs in model species [24, 25] , using GenAge may miss aspects of the aging process that are unique to human [26] . So, we consider another source of aging-related knowledge obtained by studying human directly (rather than doing so indirectly from model species), namely the genotype-tissue expression (GTEx) project [8] . Second, the above results are when using the induced and NP subnetworks inferred from a particular context-unspecific PPI network [27] . We also consider induced and NP subnetworks inferred from another context-unspecific PPI network [28, 29] . We discuss the findings when using the alternative aging-related data or context-unspecific PPI network in Section 3.

In this study, we consider aging-specific subnetworks inferred from one of two entire context-unspecific human PPI networks: HPRD [27] and HINT+HI2 [28, 29] . The latter is newer. HPRD contains PPIs from the corresponding database. HINT combines PPIs from eight databases. Then, HINT+HI2 combines PPIs from HINT and the HI2 database. The HPRD network has 8,938 nodes and 35,900 edges. The HINT+HI2 network has 9,858 nodes and 40,705 edges. 6,989 nodes and 14,791 edges are common to the two networks.

For each of the two entire context-unspecific networks (HPRD or HINT+HI2), we consider three dynamic aging-specific subnetworks (one induced and two NP-based), their compressed counterparts, and their static counterparts (described below and summarized in Table 1 ).

First, for HPRD and HINT+HI2, we use dynamic aging-specific subnetworks inferred by Faisal et al. [4] and Newaz et al. [22] , respectively. Both were inferred using the induced approach: given human gene expression data at 37 different ages between 20 and 99 [10] , a dynamic aging-specific subnetwork consisting of 37 subnetwork snapshots, one snapshot per age, was constructed; each subnetwork snapshot corresponded to the induced subgraph of those genes that are significantly expressed (i.e., are active) at the given age. Henceforth, we denote the resulting networks as Induced.

Second, for a given entire PPI network (HPRD or HINT+HI2), we use two dynamic subnetworks inferred by Newaz et al. [22] using NP. Key differences between the induced approach and NP are as follows [22] . The induced approach considers all PPIs from the entire context-unspecific network that exist between only active genes. On the other hand, with an assumption that not all PPIs between the active genes might be equally important, NP's weighting mechanism allows it to select only highest-weighted of all PPIs between the active genes. Also, with a different assumption that it is important to consider both active genes and non-active genes that critically connect the active genes in the entire network, NP propagates activities of highly expressed (i.e., active) genes to other (i.e., non-active) genes in the network. As such, it could assign a high weight to a non-active gene if it (directly or indirectly) links sufficiently many active genes. Consequently, NP could include such a high-weight non-active gene into its context-specific subnetwork.

Newaz et al. [22] extended two prominent NP approaches originally proposed for inference of a static condition-specific subnetwork in the context of cancer, NetWalk and HotNet2, to the problem of inferring a dynamic context-specific subnetwork in the context of aging. They evaluated the subnetworks produced by NetWalk against those produced by HotNet2, and found the former to be better, i.e., to overlap more with aging-related ground truth data. Consequently, in this paper, we only focus on the dynamic aging-specific subnetworks inferred using NetWalk. For this NP approach, Newaz et al. [22] considered two versions, referred to as NetWalk and NetWalk*. Their key difference is which gene expression levels (i.e., activity scores) they assign to which genes prior to propagating the gene activities through the network. Intuitively, the former assigns actual expression levels to only those genes that are significantly expressed (active) at a given age and "non-informative" (i.e., "dummy") expression levels to all other genes in the entire network, while the latter assigns actual expression levels to both active and non-active genes; for more information, see [22] . Multiple dynamic aging-specific subnetworks were produced by each of NetWalk and NetWalk*. Here, for each of HPRD and HINT+HI2, we select the best of all NetWalk dynamic subnetworks, and the best of all NetWalk* dynamic subnetworks; by "best", we mean the subnetwork that overlapped the most with the aging-related ground truth data according to the analysis by Newaz et al. [22] or that has resulted in the best prediction accuracy in our evaluation from this paper. Specifically, our selected NetWalk dynamic subnetwork based on HPRD was referred to as NetWalk-2 in the original publication [22] ; our selected NetWalk* dynamic subnetwork based on HPRD was referred to as NetWalk*; our selected NetWalk dynamic subnetwork based on HINT+HI2 was referred to as NetWalk; and our selected NetWalk* dynamic subnetwork based on HINT+HI2 was referred to as NetWalk*-2.

So, for each of HPRD and HINT+HI2, we consider three dynamic aging-related subnetworks: Induced, NetWalk, and NetWalk*.

Because our key question is whether aging-related genes can be predicted more accurately from a dynamic aging-related subnetwork than from a static subnetwork, for each dynamic subnetwork, we create its two static counterparts, which we refer to as compressed and static, as follows.

We obtain a compressed aging-related subnetwork by aggregating (i.e., taking the union of) all nodes and edges over all 37 snapshots of the corresponding dynamic subnetwork. That is, the compressed subnetwork contains any node and edge that appears in any of the 37 snapshots. In other words, we form a compressed subnetwork by first integrating gene expression data at each individual age with the entire context-unspecific PPI network, and then aggregating the network information over all ages. This allows for fairly studying the effect of dynamic vs. static network analysis by comparing a dynamic subnetwork against its compressed counterpart, as the two contain the same nodes and edges.

On the other hand, we form a static aging-related subnetwork by first aggregating gene expression data over all ages (so that a node is considered active if it is active at any one or more of the 37 ages), and then integrating the aggregated gene expression-based activity data with the entire context-unspecific PPI network. This way, we directly infer a static subnetwork, rather than doing so indirectly by first inferring a dynamic and only then a compressed subnetwork. For Induced, this procedure yields a static subnetwork that is the same as its compressed counterpart. For each of NetWalk and NetWalk*, nodes or edges could differ between its compressed subnetwork and its static subnetwork. For these two approaches, after the aggregated gene expression data is propagated over the entire context-unspecific network and weights are assigned to all edges in the entire network, we choose the edge weight threshold that results in the static subnetwork matching the number of edges in its compressed counterpart as closely as possible. This way, the compressed subnetwork and its static counterpart are fairly comparable to each other.

In total, for each of the two entire networks (HPRD and HINT+HI2), we have three groups of aging-specific subnetworks (Induced, NetWalk, and NetWalk*), each with three versions (Dynamic, Compressed, and Static). That is, we consider 2 × 3 × 3 = 18 aging-specific subnetworks, which are summarized in Table 1 . Table 1 : The sizes of the 18 considered aging-specific subnetworks. The size of a dynamic subnetwork is the average over its 37 subnetwork snapshots.

The purpose of using the following data sets is described in Section 2.1.4.

• GenAge [23] is a highly trusted source of human aging-related genes, most of which are sequence orthologs of experimentally validated aging-related genes in model organisms. Of all 307 human genes from GenAge, 84 are present in all 18 considered subnetworks. Henceforth, we denote the 84-gene set as GenAge. • GTEx [8] contains 863 genes whose expressions decrease with age (i.e., are down-regulated), of which 131 are present in all subnetworks. Henceforth, we denote the 131-gene set as GTEx-DAG. • GTEx contains 710 genes whose expressions increase with age (i.e., are up-regulated), of which 33 are present in all 18 considered subnetworks. Henceforth, we denote the 33-gene set as GTEx-UAG. • Lu et al. [9] identified 442 genes whose expressions vary with age, of which 87 are present in all 18 considered subnetworks. • Berchtold et al. [10] identified 8,277 genes whose expressions vary with age, of which 693 are present in all 18 considered subnetworks.

• Simpson et al. [12] identified 2,911 genes whose expressions vary across different stages of Alzheimer's disease, of which 272 genes are present in all 18 considered subnetworks.

For classification (discussed in the following sections), we need a set of genes labeled as aging-related (positive class) and a set of genes labeled as non-aging-related (a negative class). Because GenAge is considered to be one of the most confident aging-related ground truth data sources, our primary label definition is with respect to GenAge. However, human genes in GenAge are sequence orthologs of aging-related genes in model organisms, but human aging is a complex process that has unique biological aspects compared to the aging process in a model species [26] . So, to hopefully capture the human-unique aspects, we separately consider a secondary label definition with respect to GTEx, which was obtained by studying aging directly in human. Specifically, of the entire GTEx, we focus on GTEx-DAG, because we analyze PPI data, and because genes in GTEx-DAG were shown to be relevant for PPIs, unlike genes in GTEx-UAG [8] . The two label definitions are as follows.

Primary definition with respect to GenAge. We label as aging-related those 84 genes from GenAge that exist in all 18 considered subnetworks. Because we want our non-aging-related gene set to be as confident as possible, we label as non-aging-related those genes that exist in all 18 subnetworks but not in any of the six considered aging-related ground truth data sets; there are 315 such genes. We intentionally use the same 84-gene positive class and the same 315-gene negative class for classification in all 18 subnetworks. This allows us to fairly compare classification accuracy (i.e., accuracy of aging-related gene prediction) across all subnetworks.

Secondary definition with respect to GTEx-DAG. We label as aging-related those 131 genes from GTEx-DAG that exist in all 18 subnetworks. We label as non-aging-related those 315 genes that exist in all 18 subnetworks but not in any of the six considered aging-related ground truth data sets. Note that we use the GenAge definition separately from the GTEx-DAG definition, i.e., we perform classification for each of the two individually. This is because the two are very different (sequence-vs. expression-based) data sets that possibly capture different aspects of the aging process. Indeed, of the 84 GenAge-based and 131 GTEx-based aging-related genes, only 14 genes are common to the two gene sets.

• DGDV, dynamic graphlet degree vector [30] , counts how many times a node participates in each considered dynamic graphlet. Dynamic graphlets are an extension of static graphlets (small subgraphs on up to n nodes) to the dynamic context, where the temporal order in which events occur in a dynamic network is added onto edges of a static graphlet. As in the original DGDV paper [30] , we use up to 4-node and 6-event graphlets, resulting in a 3,727-dimensional DGDV node feature.

• GoT, graphlet orbit transitions [31] , of a node counts, for every pair of considered graphlets, how many times one graphlet changes into the other. As in the original GoT publication [31] , and for a fair comparison with DGDV, we use 4-node graphlets, resulting in a 121-dimensional GoT node feature.

• GDC, graphlet degree centrality [32] , is defined for a node in a static network (as discussed below). We use GDC to compute, for each node v, v's centrality in each of the 37 snapshots of a dynamic subnetwork. Then, we combine v's 37 GDC values into its 37-dimensional dynamic GDC feature. We do the same for all other considered centrality features (listed below), each of which results in a 37-dimensional node feature. Going back to the definition of GDC in a static network: it ranks a node v as central if v participates in many graphlets or in complex (large or dense) graphlets (or both). For GDC, we use the code by Faisal et al. [4] , which considers up to 5-node graphlets without an option to customize the graphlet size.

• ECC, eccentricity centrality [4] , of a node v calculates distance (i.e., the shortest path length) between v and each other node in the network, finds a node u that is the most distant to v, and computes the reciprocal of the distance between u and v.

• KC, k-coreness [4] , of a node v is the size of the largest network core to which v belongs. A network core is a subgraph in which each node is connected to at least k other nodes. When v belongs to a k-core, it also belongs to 1-core, 2-core, 3-core, ..., (k-1)-core. Then, KC of v is the size of its largest k-core.

• DegC, degree centrality [4] , of a node v measures how many other nodes in the network v is connected to.

• CentraMV, centrality mean and variation [17, 18] , of a node v measures, for each of the four considered centrality-based features (GDC, ECC, KC, and DegC), two quantities: the mean and variation over v's 37 centrality values. These two quantities for each of the four centrality-based features combined form an 8-dimensional CentraMV node feature.

• SGDV, static graphlet degree vector [33] , a static counterpart of DGDV, counts how many times a node participates in each considered static graphlet. Just as for DGDV and GoT, we consider up to 4-node static graphlets. This results in a 15-dimensional SGDV node feature. We have tested up to 5-node SGDV in this study, but this has not yielded an improvement (results not shown). So, for simplicity, we just report results for up to 4-node SGDV.

• cSGDV, colored SGDV [34] , is a heterogeneous analog of SGDV. It counts how many times a node participates in each heterogeneous graphlet on up to n nodes and with c colors. We consider up to four nodes and use aging-related information about genes to derive four colors, just as done in our recent work [18] . This results in a 225-dimensional cSGDV feature. We have verified that up to 5-node cSGDV does not yielded an improvement.

• UniNet [7] aggregates 14 node centralities (DegC, ECC, KC, average shortest path, betweenness, closeness, clustering coefficient, neighborhood connectivity, radiality, stress, topological coefficient, aging neighbor count, aging neighbor ratio, and binary aging neighbor representations) into a 14-dimensional node feature.

• mBPIs [6] works as follows. First, m highest-degree nodes in the network are found. Then, for a node v, v's feature has m dimensions corresponding to the m nodes, where each dimension d of node v's feature has a value of 1 if v interacts with the top m highest-degree node d, or otherwise it has a value of 0 if v does not interact with d. Just as [17, 18] , and as suggested in the original mBPIs publication, we use m = 30 in this paper. This results in a 30-dimensional mBPIs node feature.

A common problem in supervised classification is overfitting, often due to a high dimensions of the feature used in the classification task [35] . Some of our considered features have high dimensions, such as DGDV, GoT, and cSGDV. So, as in our previous work [17, 18] , for each feature, we consider: (i) the full feature (i.e., no dimensionality reduction), (ii) linear dimensionality reduction via principal component analysis (PCA) that considers as few principal components as needed to account for at least 90% of variation in the data corresponding to the given feature, and (iii)-(viii) nonlinear dimensionality reduction via t-distributed stochastic neighbor embedding (tSNE) under six different perplexity parameters (5, 13, 21, 30, 40, 50) . This totals to 1 + 1 + 6 = 8 considered dimensionality reduction choices.

Feature dimensionality reduction can be done prior to or during cross-validation [18] . There is a debate on which is better [36] . So, we have examined both options and found their results to be virtually indistinguishable. Because of this, and because reduction during cross-validation avoids any potential leakage of information from the testing data in the training step of cross-validation, unlike reduction prior to cross-validation, we report results only for the formerfeature dimensionality reduction during cross-validation.

We consider the same nine classifiers under the same parameters as in our previous work [17, 18] 

All methodology in this section is described when using the primary GenAge-based definition of aging-and non-agingrelated gene labels. Everything is analogous when using the secondary GTEx-DAG definition.

For each of the 18 age-specific subnetworks, we develop a predictive model that consists of a network feature, dimensionality choice, and classifier. For each feature, we consider each of the eight dimensionality choices. We run each feature with each dimensionality choice under each of the nine classifiers. Given that of the 11 features, seven are dynamic, for each dynamic subnetwork, we consider 7 × 8 × 9 = 504 combinations of a feature, dimensionality choice, and classifier, i.e., 504 predictive models. Given that of the 11 features, four are static, for each static subnetwork, we consider 4 × 8 × 9 = 288 predictive models.

Note that given that we do this for the two entire networks (HPRD and HINT+HI2), and for their three dynamic, three compressed, and three static subnetworks, this totals to 2 × (3 × 504 + 3 × 288 + 3 × 288) = 6, 480 predictive models. And that is just for the GenAge-based definition of aging-and non-aging-related gene labels. The number is the same for the GTEx-based definition. So, over the entire study, we develop and evaluate a total of 12,960 predictive models. We present this number to give an idea to the reader of how comprehensive our study is.

We run each predictive model under a systematic 5-fold cross-validation framework [18] . That is, we randomly split the 84 aging-and 315 non-aging-related genes into five equal-size subsets. We train on the union of four subsets of the aging-and non-aging-related genes and test the prediction accuracy on the remaining subset. We repeat this process five times, such that each time we are using a different subset as testing data. For each of the five folds of cross-validation, the output of a predictive model is a ranked list of the genes from the testing data based on their their predicted likelihood of being aging-related. Then, to make aging-related gene predictions from a model, a threshold needs to be selected so that those genes whose likelihoods of being aging-related are above the threshold are predicted as aging-related. We vary the prediction threshold from 1 to (84 + 315)/5 = 80 in increments of 1.

For aging-related genes predicted at a given threshold, we use their actual aging-and non-aging-related labels to compute the numbers of true positives (TPs), false positives (FPs), and false negatives (FNs). A TP is a gene that is predicted to be aging-related and is also labeled as aging-related. A FP is a gene that is predicted to be aging-related but is labeled as non-aging-related. A FN is a gene that is not predicted to be aging-related but is labeled as aging-related. Given these numbers, we evaluate accuracy of a predictive model in the given fold in terms of AUPR over all prediction thresholds, as well as via precision, recall, and F-score at the prediction threshold where (i) F-score is maximized, while at the same time (ii) precision is at least as large as recall. Precision is # of TPs/(# of TPs + # of FPs). Recall is # of TPs/(# of TPs + # of FNs). F-score is the harmonic mean of precision and recall. For all four measures, we report a predictive model's accuracy as the average over the five cross-validation folds. Note that we force the above condition (ii) because we believe that for potential wet lab validation of predictions, precision should be favored over recall, as long as recall is not too low [17, 18] . And maximizing F-score is exactly what ensures that recall is not too low, i.e., that both precision and recall are reasonably high.

For each of the 18 subnetworks, we select its best predictive model, i.e., the one that maximizes AUPR, to give it the best-case advantage. Then, we report AUPR, precision, recall, and F-score of the selected model.

First, we compare accuracy of each selected predictive model against accuracy of a random approach. The latter works as follows. For a given testing data set (i.e., cross-validation fold) and each prediction threshold g, we randomly select g testing genes and predict them as aging-related. Then, we repeat this for each of the five folds. To account for randomness, we run these two steps 30 times, which means that we perform random aging-related gene prediction 5 × 30 = 150 times. We report the accuracy (AUPR, precision, recall, and F-score) of the random approach as the average over the 150 runs. A predictive model is good only if its accuracy is statistically significantly higher than that of a random approach.

Second, we compare pairs of the selected predictive models (plus the random approach), to determine if one model's predicted gene set is statistically significantly more accurate than another model's set. We do this by comparing the two models' five accuracy scores corresponding to the five folds via the Wilcoxon rank sum test. Because we perform this test for multiple pairs of models, we apply the false discovery rate (FDR) correction to adjust the p-values. We consider one model to be statistically significantly better than another if the adjusted p-value is below 0.05.

To evaluate potential complementarity of gene sets predicted by different models, we measure their overlap via the Jaccard index. We compute the statistical significance of the overlap size using the hypergeometric test. Because we run this test for multiple pairs of predictive models, we correct the p-values using the FDR correction. We consider an overlap to be statistically significantly large if the adjusted p-value is below 0.05.

Sections 3.1-3.3 are for the nine subnetworks inferred from HPRD, when using the GenAge-based definition of agingand non-aging-related gene labels. Section 3.4 is for the other nine subnetworks, also when using the GenAge-based label definition. Section 3.5 is for both the nine subnetworks inferred from HPRD and the nine subnetworks inferred from HINT+HI2, but when using the GTEx-DAG-based label definition.

Recall that for each dynamic subnetwork (Induced-Dynamic, NetWalk-Dynamic, and NetWalk*-Dynamic), we consider 7 features × 8 dimensionality choices × 9 classifiers = 504 predictive models, and for each static subnetwork (Induced-Compressed, Induced-Static, NetWalk-Compressed, NetWalk-Static, NetWalk*-Compressed, and NetWalk*-Static), we consider 4 × 8 × 9 = 288 predictive models. Then, for each subnetwork, we select its best (most accurate) predictive model ( Table 2 ). Recall that Induced-Compressed and Induced-Static are the exact same network. So, their results match, which is why we report them jointly. The selected (i.e., best) predictive model for each of the three dynamic (red), three compressed (blue), and three static (green) aging-specific subnetworks inferred from HPRD. The predictive models are presented in the "feature + dimensionality choice + classifier" format. Note that for dimensionality choice, "None" corresponds to no dimensionality reduction.

The selected model matches only between the three compressed subnetworks (and thus also Induced-Static). All other selected models differ in at least one of feature, dimensionality choice, and classifier. In more detail, the nine selected models (corresponding to the nine subnetworks) encompass four features, two out of the seven features for the dynamic subnetworks and two out of the four features for the static subnetworks. They encompass two dimensionality choices, of which no reduction is preferred for eight out of the nine subnetworks. Finally, the selected models encompass three classifiers. So, while some individual components of a model are somewhat preferred over others (especially the UniNet feature on the static subnetworks, or no dimensionality reduction on all subnetworks), there is no model as a whole that is selected for more than one subnetwork except the one for the compressed subnetworks. However, as we will show in the following sections, the most accurate of all selected models is the one for NetWalk*-Dynamic. So, in summary, there is no clear pattern regarding which component of which model works well on which network.

This predictive model inconsistency across the subnetworks stresses out the need for our comprehensive evaluation of running all of the different combinations of feature, dimensionality choice, and classifier, in order to give each subnetwork the best-case advantage. Also, it emphasizes the importance of a shift in the field of machine learning towards development of interpretable predictive models, in order to allow for understanding the effect of each model component on the prediction outcome.

First, we note that all nine subnetworks result in significantly higher accuracy than the random approach (Fig. 1) ; all adjusted p-values are below 0.023 for all of AUPR, precision, recall, and F-score.

Second, we test our key hypothesis: whether the best dynamic subnetwork predicts aging-related genes more accurately than all static subnetworks (when using for each subnetwork its best predictive model; Section 3.1). Indeed, we find that the hypothesis holds. That is, the best of the dynamic subnetworks (NetWalk*-Dynamic) outperforms all static subnetworks in terms of all of AUPR, precision, recall, and F-score (Fig. 1 ). Its superiority is not statistically significant, which is at least partly due to the prediction accuracy of all subnetworks being relatively high (i.e., the lowest AUPR over all nine subnetworks is 67.9%, the lowest precision is 68.8%, the lowest recall is 53.5%, and the lowest F-score is 61.8%). Also, this could partly be due to the fact that we compute the significance of the difference between subnetworks (i.e., their predictive models) via the Wilcoxon test applied to the models' five accuracy scores (corresponding to the five folds of cross-validation). However, it is well known that comparing such small samples via the Wilcoxon test tends to decrease (though not invalidate) the power of the test. Importantly, the best predictive model yields extremely high accuracy when predicting aging-related genes. Namely, for NetWalk*-Dynamic, AUPR is 95.6%, precision is 93.8%, recall is 89.2%, and F-score is 91.4%. Figure 1 : Prediction accuracy in terms of AUPR, precision, recall, and F-score for the nine aging-specific subnetworks inferred from HPRD, each under its best predictive model, when using the primary GenAge-based definition of agingand non-aging-related gene labels.

Because all subnetworks perform well, meaning that they predict a majority (over 53.5%) of the genes that are labeled as aging-related (i.e., of the known aging-related genes), we expect the pairwise overlaps of their true positives to be reasonably large. Indeed, we find this to be the case. Given all pairs of the nine true positive sets (corresponding to the nine subnetworks), overlaps are statistically significantly high for over two thirds of the pairs. The largest overlap over all pairs is 80.3%, the smallest one is 47.8%, and the average one is 65.4% (Supplementary Fig. S1 ).

Similarly, we examine the pairwise overlaps of the subnetworks' predicted genes that are labeled as non-aging-related, i.e., of their novel aging-related gene predictions. We find that none of the pairwise overlaps between the nine subnetworks are statistically significant. The largest overlap over all pairs is 30.3%, the smallest one is 0%, and the average one is 14.0% ( Supplementary Fig. S2 ). Intuitively, the higher the precision of a subnetwork, by definition the fewer novel predictions it makes. In fact, the best-performing subnetwork, NetWalk*-Dynamic, given its extremely high precision score, makes only five novel predictions. The novel aging-related genes predicted from this subnetwork may be good candidates for future experimental validation (the five genes are: CSK, RHOG, SH2B2, ELK1, and CAPN3). This subnetwork's well-performing model can be used to predict even more of novel aging-related genes, e.g., by applying it to the genes from the dynamic subnetwork that were not in the training or testing data, and then selecting the genes with the highest predicted probabilities of being aging-related as the most confident novel predictions for future wet lab validation.

Recall the following from Section 1.1. The key motivation behind using NP in this study is that Induced-Dynamic did not improve upon Induced-Compressed (which is the same as Induced-Static) in our past work [17, 18] . So, we want to use a better dynamic aging-specific subnetwork. NP is expected to produce such a subnetwork, given that unlike Induced, it can remove low-weight PPIs between active genes (because such PPIs are hypothesized to be less important for the condition of interest, in our case, the aging process) or add to its subnetwork non-active genes that critically connect the active genes in the entire PPI network (along with their PPIs). Note that by removing PPIs between active genes, if all of the PPIs that a gene is incident to are removed, such a gene would become isolated and would thus removed from the subnetwork. So, NP might end up losing some of the active genes compared to Induced.

Indeed, as shown in Fig. 1 , NP outperforms Induced. That is, NetWalk*-Dynamic, the best of all NP subnetworks, is better than all of the Induced subnetworks.

Not only is NP better than Induced, but also, recall that in our past work [17, 18] , we did not observe Induced-Dynamic to be better than Induced-Compressed/Induced-Static. We confirm this in the current study as well. So, while for Induced, using a dynamic subnetwork is not superior to using a static subnetwork, doing so is superior for NP, because NetWalk*-Dynamic is superior to both NetWalk*-Compressed and NetWalk*-Static (Fig. 1) . This further stresses out the power of NP. Figure 2 : Overlaps of all genes (black numbers) and genes labeled as aging-related (red numbers) between Induced, NetWalk, and NetWalk* subnetworks that are inferred from HPRD. To be precise, for each circle (e.g., NetWalk), the pink number (e.g., 1,555 for NetWalk) corresponds to the size of the node intersection of all three corresponding HPRD-based subnetworks (e.g., NetWalk-Dynamic, NetWalk-Compressed, and NetWalk-Static for NetWalk). So, the overlaps are between (i) the intersection of the three Induced subnetworks, (ii) the intersection of the three NetWalk subnetworks, and (iii) the intersection of the three NetWalk* subnetworks. Labeling only the 84 genes from the very center that are in all subnetworks and in GenAge as aging-related (and labeling the 315 genes that are in all subnetworks but not in any of the six ground truth data sets as non-aging-related) for classification allows for a fair comparison of prediction accuracy over all subnetworks.

We want to comment on additional observations regarding (dis)similarity between Induced and NP. First, regarding Induced vs. NetWalk: All NetWalk nodes are in Induced (Fig. 2) . This means the NetWalk does not add non-active nodes to its subnetworks, i.e., NetWalk "just" removes low-weight edges from Induced (and consequently some active nodes too, which become isolated because of the edge removal). That is, both Induced and NetWalk contain only active nodes. NetWalk results in similar performance as Induced (Fig. 1) .

Second, regarding Induced vs. NetWalk*: The latter has nodes that are not in Induced (Fig. 2 ). This means that NetWalk* does add non-active genes to its subnetwork, unlike Induced (and NetWalk). Also, Induced has nodes that are not in NetWalk*. As all genes in Induced have to be active, this means that NetWalk* is removing some low-weight edges between active genes (like NetWalk), because only removal of such edges can cause an active gene to be removed from NetWalk* by becoming isolated.

Combing these two observations, it seems that it is both of the topological changes (removal of low-weight edges between active genes and addition of non-active genes) of NetWalk* compared to Induced that result in NetWalk* outperforming Induced (Fig. 1 ).

All results reported in Sections 3.1-3.3 are for the nine HPRD-based subnetworks, and when using the GenAge-based gene labels. Here, we analyze the nine HINT+HI2-based subnetworks, also when using the GenAge-based labels. Note that for HINT+HI2, for NetWalk*-Dynamic, due to its large size (Table 1) , we could not run one of the dynamic features (DGDV). All other analyses remain the same as in Sections 3.1-3.3.

We again select the best predictive model for each of the nine HINT+HI2-based subnetworks (Supplementary Table  S1 ). We find that the key result for HINT+HI2 is qualitatively the same as the result for HPRD: the best dynamic subnetwork is better than all static subnetworks (Supplementary Figs. S3-S6). One difference is that now NP does not necessarily improve upon Induced, because both Induced-Dynamic and NetWalk*-Dynamic are the best of all networks, and the two are almost tied. Another difference is that the prediction accuracy of all subnetworks is somewhat lower for HINT+HI2 than for HPRD.

So, some although not all of our results are robust to the choice of entire context-unspecific network used to infer an aging-specific subnetwork.

All results reported in Sections 3.1-3.4 are for when using the GenAge-based gene labels. Here, we report results when using the GTEx-DAG-based labels, for both the HPRD-based and HINT+HI2-based subnetworks. The subnetworks' best predictive models when using GTEx-DAG are shown in Supplementary Tables S2 and S3, respectively. First, we comment on results for the nine HPRD-based subnetworks. We already showed that for the three Induced subnetworks, prediction accuracy is not nearly as good under the GTEx-DAG-based gene labels as it is under the GenAge-based gene labels [18] . Here, we confirm that the same holds for the six NP (NetWalk and NetWalk*) subnetworks. That is, although in this study all nine subnetworks are statistically significantly more accurate than the random approach (adjusted p-values below 0.04) in terms of precision, recall, and F-score, although not in terms of AUPR, their prediction accuracy is only in the ∼40%-55% range ( Supplementary Fig. S7) , i.e., not nearly as high as for the GenAge-based labels (Fig. 1) . Like for the GenAge-based labels, we do see that multiple dynamic subnetworks are (marginally) better than all static subnetworks. So, this result is robust to the choice of data for labeling genes. Just as for the GenAge-based labels, NP (both NetWalk and NetWalk*) marginally improve upon Induced, which again justifies the need for NP. However, the bottom line is that using GenAge results in much higher accuracy than using GTEx-DAG.

Second, we comment on results for the nine HINT+HI2-based subnetworks. HINT+HI2 is newer than HPRD. Nonetheless, even for the former, all subnetworks perform much worse when using the GTEx-DAG-based labels ( Supplementary Fig. S8 ) than when using the GenAge-based ones (Supplementary Figs. S3-S5). Again, all nine subnetworks are statistically significantly more accurate than the random approach (adjusted p-values below 0.04) in terms of precision, recall, and F-score, although not in terms of AUPR. However, unlike in all of the above analyses, now it is a static subnetwork, NetWalk-Static, that is (marginally) superior to all other subnetworks, including the dynamic ones. This subnetwork is NP-based, which again justifies a need for NP.

Because for both HPRD and HINT+HI2, using GTEx-DAG yields much lower accuracy than using GenAge, we do not analyze any further the true positives and novel predictions resulting from using GTEx-DAG.

The inferior performance of GTEx-DAG compared to GenAge is surprising for two reasons. First, genes in GTEx-DAG have been linked to aging directly in human, while those in GenAge have been linked to aging indirectly based on their sequence similarities to aging-related genes in model species. So, GTEx-DAG should better capture aspects of the aging process that are unique to human. Given this, and given that we have aimed to extract human aging-related knowledge from human PPI subnetworks, one would expect GTEx-DAG to perform better. However, it does not. This could be because GenAge may be more accurate than GTEx-DAG. Second, the considered subnetworks have been formed by integrating gene expression data with PPI data. Given that GTEx-DAG is also gene expression-based, while GenAge is sequence-based, one would expect GTEx-DAG to perform better. However, it does not. This could be because GTEx and GenAge are highly complementary (only 14 genes are labeled as aging-related in both), and because our predictive models may be better suited for the aging-related knowledge present in GenAge.

Unlike traditional approaches that aim to extract aging-related knowledge from static PPI data, we develop a supervised learning model that when applied to a dynamic aging-specific PPI subnetwork yields extremely high prediction performance, significantly outperforming all models on all static subnetworks. Hence, our model could guide with high confidence the discovery of novel aging-related predictions for future web lab validation.

We have used NP in hope of improving upon Induced that has been used traditionally. Indeed, this is what we observe.

Our findings are partly robust to the choice of entire context-unspecific network that is used for inference of an agingspecific subnetwork, but not to the choice of ground truth aging-related data used to label genes for cross-validation. Namely, our predictive models works better for sequence-based aging-related data from model species than for gene expression-based aging-related data from human. Development of novel predictive models that will be capable of capturing well human-unique aspects of the aging process that do not exist in model species is important.

We have studied human aging from PPI data. Nonetheless, our work can be applied to other types of biological networks, other species, or other dynamic biological processes, such as disease progression.

We thank Khalique Newaz for discussion on how to infer the NP-based static subnetworks, and for sharing the dynamic subnetworks from his recent unsupervised learning study [22] .

This work is supported by the U.S. National Science Foundation (NSF) CAREER award (CCF-1452795).

Aging, cellular senescence, and cancer

The aging kidney revisited: a systematic review

A review of supervised machine learning applied to ageing research

Dynamic networks reveal key players in aging

Global network alignment in the context of aging

A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related

Tibor Vellai, and András Benczúr. Prediction and characterization of human ageing-related proteins by using machine learning

An analysis of aging-related genes derived from the genotype-tissue expression project (GTEx)

Gene regulation and DNA damage in the ageing human brain

Gene expression changes in the course of normal brain aging are sexually dimorphic

Induction of a common microglia gene expression signature by aging and neurodegenerative conditions: a co-expression meta-analysis

Microarray analysis of the astrocyte transcriptome in the aging brain: relationship to Alzheimer's pathology and APOE genotype

Understanding the odd science of aging

Classifying aging genes into DNA repair or non-DNA repair-related categories

An extensive empirical comparison of probabilistic hierarchical classifiers in datasets of ageing-related genes

Identification of co-evolving temporal networks

Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork

Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork

Network propagation: a universal amplifier of genetic associations

Use of data-biased random walks on graphs for the retrieval of context-specific networks from genomic data

Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes

Improving inference of the dynamic biological network underlying aging via network propagation

Human Ageing Genomic Resources: new and updated databases

The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species

More than just orphans: are taxonomicallyrestricted genes important in evolution?

Bacteria in the ageing gut: did the taming of fire promote a long human lifespan?

Human Protein Reference Database-2009 update

HINT: High-quality protein interactomes and their applications in understanding human disease

A proteome-scale map of the human interactome network

Exploring the structure and function of temporal networks with dynamic graphlets

Graphlet-orbit transitions (GoT): A fingerprint for temporal network comparison

Dominating biological networks

Uncovering biological network function via graphlet degree signatures

From homogeneous to heterogeneous network alignment via colored graphlets

Overfitting in prediction models-is it a problem only in high dimensions

A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

Scikit-learn: Machine learning in Python