key: cord-0003130-j80hnhpb
authors: Noh, Heeju; Shoemaker, Jason E; Gunawan, Rudiyanto
title: Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection
date: 2018-04-06
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkx1314
sha: 3ffe5f150d7f8fe18d126f3c5035ed9425f9f4b6
doc_id: 3130
cord_uid: j80hnhpb

Genome-wide transcriptional profiling provides a global view of cellular state and how this state changes under different treatments (e.g. drugs) or conditions (e.g. healthy and diseased). Here, we present ProTINA (Protein Target Inference by Network Analysis), a network perturbation analysis method for inferring protein targets of compounds from gene transcriptional profiles. ProTINA uses a dynamic model of the cell-type specific protein–gene transcriptional regulation to infer network perturbations from steady state and time-series differential gene expression profiles. A candidate protein target is scored based on the gene network's dysregulation, including enhancement and attenuation of transcriptional regulatory activity of the protein on its downstream genes, caused by drug treatments. For benchmark datasets from three drug treatment studies, ProTINA was able to provide highly accurate protein target predictions and to reveal the mechanism of action of compounds with high sensitivity and specificity. Further, an application of ProTINA to gene expression profiles of influenza A viral infection led to new insights of the early events in the infection.

The identification of the molecular targets of pharmacologically relevant compounds is vital for understanding the mechanism of action (MoA) of drugs, as well as for exploring off-target effects. While the definition of a target can be quite arbitrary, the term generally refers to a molecule whose interaction with the compound is connected to the compound's effects (1) . In this study, transcription factors (TFs) and their protein interaction partners represent the target molecules, while differential gene expression profiles represent the effects. Among existing technologies for protein target discovery (e.g. biochemical affinity purification, RNAi knockdown or gene knockout experiments) (2) , gene expression profiling has received much recent attention due to its relative ease of implementation as well as the availability of large-scale public databases and well-established experimental protocols and data analytical methods. A complication when using gene expression profiling for target discovery is that the data give only indirect indications of the drug's action. As illustrated in Figure 1A , the interaction between a compound and its protein target(s) is expected to result in the differential expression of downstream genes that are regulated by the protein target(s). But, the expression of the protein targets themselves may not--and often do not--change (3) . Consequently, target discovery using gene expression profiles requires computational methods to identify the (upstream) targets from the (downstream) effects.

Existing computational strategies for compound target identification using gene expression profiles can generally be classified into two groups: comparative analysis and network-based analysis (4) . Comparative analysis methods use the gene expression profiles as drug signatures. Here, the similarity between the differential gene expression of a drug treatment and those of reference compounds or experiments with known targets, implies a closeness in the molecular targets and the MoA. A notable example of such an approach is the Connectivity Map (5) , which provides gene expression profiles of human cell lines treated by ∼5000 small molecule compounds as queryable signatures for evaluating drug-drug similarities (6) . The obvious drawback of comparative analysis methods is their dependence on an extensive and accurate target annotation of the reference gene expression profiles.

In network-based analysis, one adopts a system-oriented view by using cellular networks, such as gene regulatory network (GRN) and/or protein-protein interaction network (PIN). A number of network-based analytical methods relied on dynamic models of the GRN to infer network perturbations caused by drug treatments (7) (8) (9) (10) (11) . Several notable methods include Network Identification by multiple Regression (NIR) (7) , Mode of action by Network Identification (MNI) (8) , Sparse Simultaneous Equation Model (SSEM) (9) and DeltaNet (10, 11) . In these methods, the GRN is inferred from a training dataset of gene expression profiles using a linear regression derived from a dynamic mechanistic model of the gene transcriptional process. Subsequently, the inferred GRN is utilized for target identification to evaluate deviations in the differential gene expression caused by drug treatments (7) (8) (9) (10) (11) or in disease (12) . A major pitfall of the above methods is that the inference of GRN from gene transcriptional profiles is highly challenging (13) , as the inference problem often becomes underdetermined (i.e. the GRN may not be inferable) (14, 15) . In addition, as mentioned above, the expressions of the drug targets are often unaffected by the drug treatment (3) .

Another group of network-based analytical methods utilizes cellular network graphs, either curated from the literature knowledge or inferred from gene expression data, to formulate statistical hypothesis tests for ranking drug targets (3, (16) (17) (18) (19) (20) (21) (22) . Several methods in this category prioritize targets based on the enrichment of the downstream or neighbouring molecules in the network for differentially expressed genes, following the principle of 'guilt by association' (3, (16) (17) (18) (19) (20) . Another set of methods rank targets by scoring hypotheses that are generated based on causal relationships in the biological networks (21, 22) . A recent method called Detecting Mechanism of Action by Network Dysregulation (DeMAND), combines the GRN and PIN to create a molecular interaction network, where the drug targets are scored based on the statistical significance of drug-induced alterations in the joint gene expression distribution between two connected genes in the network (23) . The methods in this group make use of only the (static) topology of cellular networks without much consideration of the dynamics of the gene transcriptional process, and thus are unable to fully exploit information contained in time-series datasets.

In this work, we developed ProTINA (Protein Target Inference by Network Analysis), a network perturbation analysis method for protein target identification using gene transcriptional profiles. The analysis involves two key steps: (a) the creation of a model of tissue or cell type-specific protein-gene regulatory network (PGRN) and (b) the calculation of protein target scores based on the enhancement or attenuation of the protein-gene regulations. In developing ProTINA, we addressed some of the drawbacks in the existing methods. First, the PGRN in ProTINA is based on a dynamic model of the gene transcriptional process, and is therefore able to take advantage of time-series gene expression profiles that are commonly generated by drug treatment studies. In addition, ProTINA leverages on the availability of comprehensive maps of protein-protein and protein-DNA interactions for the construction of the PGRN, which serves as prior information to alleviate network inferability issue. Finally, ProTINA scores candidate targets based on drug-induced perturbations to the expression of genes regulated by the targets, rather than the expression of the targets themselves. We demonstrated the superiority of ProTINA over the state-of-the-art method De-MAND and differential gene expression analysis (DE), in predicting the protein targets of drugs using human and mouse datasets from NCI-DREAM drug synergy challenge (24) , genotoxicity study (25) and chromosome drug targeting study (26) . Besides protein targets of compounds, we presented the application of ProTINA to study hostpathogen interactions, specifically for elucidating the targets of influenza A viral proteins.

We applied ProTINA to three datasets of drug treatments from NCI-DREAM drug synergy challenge (24) , genotoxicity study (25) and chromosome drug targeting study (26) , and to gene expression data of human lung cancer cell Calu-3 from influenza A viral infection studies (27) (28) (29) (30) . For NCI-DREAM drug synergy challenge, we obtained the raw Affymetrix Human Genome U219 microarray data from Gene Expression Omnibus (GEO) database (31) (accession number: GSE51068). The raw data were first nor-PAGE 3 OF 11 Nucleic Acids Research, 2018, Vol. 46, No. 6 e34 malized and transformed into log 2 -scaled expressions using justRMA function in the affy package of Bioconductor (32) . Then, the log 2 fold change (log 2 FC) differential expressions and their statistical significance (Benjamini-Hochberg adjusted P-values) were calculated using a linear fit model and empirical Bayes method in the limma package of Bionconductor. Three samples from the drug treatment using the low dose of Aclacinomycin A were dropped because all of the log 2 FC expressions were close to 1 and thus not statistically significant. The probe sets were mapped to gene symbols using hgu219.db annotation package (Entrez Gene database as of 27 September 2015). In the case of multiple probe sets mapping to a gene symbol, we assigned the log 2 FC from the probe set with the smallest average adjusted P-value over the samples.

The raw microarray data from genotoxicity study (25) in human HepG2 cell line were obtained from GEO (accession numbers: GSE28878 using Affymetrix GeneChip Human Genome U133 Plus 2.0 array and GSE58235 using Affymetrix HT Human Genome U133+ PM array). As with the drug synergy data, the microarray data were first normalized using justRMA, and the log 2 FCs and their adjusted P-values were calculated using limma in Bioconductor. Because the data came from different microarray platforms, the gene symbols were matched separately for each platform using hgu133plus2.db annotation package (Entrez database of 27 September 2015) and HT HG-U133 Plus PM annotation file in Affymetrix, respectively. Likewise, in the case of multiple probe sets matching a gene symbol, the probe set with the smallest average adjusted P-value across all samples was chosen.

The raw data from the chromosome-targeting study using mouse pancreatic alpha and beta cells (26) were also obtained from GEO database (accession number: GSE36379). Again, the raw data were normalized using justRMA, and the log 2 FCs and their adjusted P-values were calculated by limma. The probes were mapped to the corresponding gene symbols using moe430a.db package (Entrez Gene database as of 27 September 2015) in Bioconductor. In the case of multiple probe sets mapping to a gene symbol, we selected the probe set with the smallest average adjusted P-value among the samples.

For influenza A infection analysis, we obtained the raw microarray data of four influenza studies (27) (28) (29) (30) from GEO database (accession numbers: GSE40844, GSE37571, GSE33142 and GSE28166). The raw data were background-corrected and normalized using normexp and quantile methods in limma package of Bioconductor. The log 2 FCs and their adjusted P-values were again calculated by limma. The probes were mapped to the corresponding gene symbols using hgug4112a.db package (Entrez Gene database as of 27 September 2015). Like before, for genes with multiple probe sets, we chose the log 2 FC value corresponding to the probe set with the smallest average adjusted P-value.

Protein-gene regulatory network. In ProTINA, the PGRN is a bipartite graph with weighted, directed edges pointing from a protein to a gene (see Figure 1A ). The edges in the PGRN describe the regulation of gene expression by TFs and their protein partners, the molecular targets of interest in this work. The bipartite PGRN above is able to capture feedback loops in the gene transcriptional regulation, even though these loops are not drawn explicitly. An example of such a feedback loop is when a protein regulates the expression of its own transcription factor(s). The PGRN is constructed by combining two types of networks, namely the TF-gene network and PIN. For the construction of human cell type-specific PGRNs, we relied on the Regulatory Circuit resource that provides 394 cell type and tissue-specific TF-gene interactions (33) . More specifically, for the analysis of the NCI-DREAM drug synergy, genotoxic compound study, and influenza A viral infection study datasets, we used the TF-gene networks of human lymphoma cells, pleomorphic hepatocellular carcinoma cells, and epithelium lung cancer cells, respectively. We included only TF-gene interactions with a Regulatory Circuit confidence score greater than 0.1. The confidence score indicates the normalized promoter activity level in a given cell type (0: not active, 1: maximally active) (33) . For the analysis of mouse pancreatic cell dataset, we obtained the mouse pancreatic TF-gene interactions from CellNet (34) . In the construction of the PGRNs, any TF-gene interactions involving unmeasured genes were excluded. In summary, the TFgene network for human lymphoma, hepatocellular carcinoma cell, and epithelium lung cancer cell lines included 31 392 edges pointing from 515 TFs to 5153 genes, 3868 edges pointing from 413 TFs to 953 genes, and 42 145 edges pointing from 515 TFs to 7125 genes, respectively. The mouse pancreatic PGRN included 2922 edges, involving 95 TFs and 588 genes.

For human PIN, we combined the protein-protein interactions from two databases, namely Enrichr (35) and STRING (36) . For mouse pancreatic cells, we obtained mouse (Mus musculus) PIN from the STRING database (36) . For each TF, we identified its protein partners, defined as proteins that are within a network distance of 2 from the TF in the PIN. When using the STRING database, we included all direct protein partners of TFs, and proteins with a network distance of 2 from TFs with a confidence score reported on STRING larger than 0.5. For human lymphoma, hepatocytes, and lung cancer cells, we identified 11 090 protein partners for a subset of 499 TFs (out of 515 TFs), 10 834 protein partners for a subset of 403 TFs (out of 413 TFs) and 6 175 protein partners for a subset of 504 TFs (out of 515 TFs), respectively. For mouse pancreatic cells, we found 6620 protein partners for a subset of 89 TFs (out of 95 TFs).

Finally, in the construction of the PGRNs, we assigned a directed edge from a TF or from a protein partner of a TF, to every gene regulated by the TF. In summary, the cell typespecific PGRN for human lymphoma cells included 21 teractions or by including proteins with a network distance from TFs larger than 2, would allow the scoring of a higher number of proteins, such strategy often lowers the accuracy of the protein target predictions.

Gene transcription model. The edges in the PGRN have weights, whose magnitudes represent the strength of the gene regulation and whose signs indicate the direction or the mode of the regulation: positive for gene activation and negative for gene repression. The weights are inferred from the gene expression dataset by adapting a procedure described in our previous method DeltaNet (10,11) (see Figure 1B ). The inference of the edge weights is based on an ordinary differential equation (ODE) model of the mRNA production of a gene:

where r k (t) is the mRNA concentration of gene k at time t, u k and d k denotes the mRNA transcription and degradation rate constants respectively, and a kj denotes the gene regulatory influence (or edge weight) of the jth protein on the kth gene.

While the regulatory edges in the model above usually describe TF-gene interactions, in ProTINA, we further accounted for the (indirect) regulation of a gene by proteins that interact with the TFs. For this purpose, we considered a modified ODE model:

where a positive (negative) b kjq describes the activation (repression) of the kth gene by a protein q through its interaction with the TF protein j. The variables n TF and n P denote the numbers of TFs and their protein partners, respectively. The multiplication of two variables r j and r q implies that the regulation of gene k by protein q requires the TF protein j (a non-zero r j ). The model in Equation (2) can be simplified into:

where a * kj denotes the overall regulatory influence of each protein j, including TFs and their protein partners, on the expression of gene k. Note that the model in Equation (3) is mathematically equivalent to that in Equation (1).

By taking the pseudo steady-state assumption (i.e. the synthesis rate of mRNA balances the degradation rate, leading to dr k /dt = 0 in Equation (3)), the inference of edge weights (a * kj ) of the PGRN can be rewritten as the following linear regression problem (see derivation in (10)):

a * kj c ji + p ki (4) where c ki denotes the log 2 FC expression for gene k in sample i. The variable p ki represents the part of log 2 FC of gene k expression in sample i that cannot be accounted for by the log 2 FC of its protein regulators. In other words, p ki indicates the perturbations to the expression of gene k. As detailed below, ProTINA relies on the magnitude and directions of such network perturbations (dysregulations) to identify proteins with altered gene regulatory activity. The dynamical information contained in time-series gene expression profiles could greatly improve the inference of the edge weights above. But, the pseudo steady-state assumption hinders the application of the linear regression in Equation (4) to time-series data. As previously described in (11), time-series information could be accounted for by adding the following linear constraint on the linear regression problem:

where s ki is the time derivatives (slope) of the log 2 FC of gene k in sample i. In contrast to Equation (4), Equation (5) was derived without assuming pseudo steady-state, which was necessary to account for the dynamics of gene expressions. The slopes of the log 2 FC at each sampling time point were computed using a second-order accurate finite difference approximation (37) . In summary, the estimation of edge weights in ProTINA involved the following linear regression problem:

where C k and S k are the 1 × m vectors of log 2 FC expressions and time-derivatives of gene k across m samples, the subscript R k refers to the set of (n TF,k +n P,k ) protein regulators of gene k in the cell type-specific PGRN, C R k and S R k denote the (n TF +n P,k ) × m matrices of log 2 FCs and their slopes across m samples, A k is the 1 × (n TF +n P ) vector of weights for edges in the PGRN pointing to gene k, and P k is the 1 × m vector of dysregulation impacts of gene k over m samples. In ProTINA, the vectors A k and P k for each gene k in Equations (6) and (7) were estimated by ridge regression. The ridge regression provides a solution to an underdetermined linear regression problem of the standard form: y = Xβ + ε, using a penalized least square objective function:

where λ is a shrinkage parameter for the L 2 -norm penalty.

Equations (6) and (7) are rewritten into the standard linear norm. Self-loops were excluded in the regression, and thus the diagonal entries of A k were set to 0. In the applications of ProTINA, we employed 10-fold cross validations to determine the optimal λ, one that gives the minimum average prediction error. Here, we used the GLMNET package (38) for both the MATLAB and R versions of ProTINA.

Protein target scoring. In ProTINA, each candidate protein target is assigned a score based on the deviation of the expression of its downstream genes. More specifically, we computed the residuals of the linear regression problem in Equation (6) for each gene k, i.e.

where r k is the 1 × m vector of residuals for m samples.

For each drug treatment, there often exist multiple gene expression profiles, taken at different time points or different doses. Correspondingly, we evaluated the z-score z lk for each drug treatment l and for each gene k, according to

wherer lk denotes the average residual of gene k among the drug treatment samples, σ k denotes the sample standard deviation of the residuals in all samples besides the drug treatment, and n l denotes the number of samples from the drug treatment. A positive (negative) z-score indicates that the expression of gene k in the particular sample was higher (lower) than expected based on the expression of its regulators. The greater the magnitude of the z-score, the more significant is the gene dysregulation.

The target score of a TF or protein for a drug is calculated by combining the z-scores of the target genes in the PGRN, as follows (39):

where z ki denotes the z-score of gene k and s ji denotes the score of the TF/protein j in the drug treatment sample i. The weighting coefficients w kj are set equal to the edge weights a kj divided by the maximum magnitude of a kj across all j. In other words, the weight w kj reflects the fraction of the regulation of gene k expression that could be attributed to protein j. When w kj (or a kj ) and z ki have the same signs, w kj z ki thus takes a positive value. As illustrated in Figure 1C , a positive w kj z ki implies an enhanced regulatory activity of protein j on gene k, since the activation (inhibition) of gene k expression by protein j is stronger in this sample than expected by the PGRN model. In contrast, a negative w kj z ki indicates an attenuation of the regulatory influence of protein j on gene k, since the activation (inhibition) of gene k expression by protein j is weaker than predicted by the PGRN model. Consequently, a highly positive (negative) score s ji is an overall indicator of strongly enhanced (attenuated) regulatory activity of protein j by the drug treatment in sample i (see Figure 1D ). The protein targets in each drug treatment sample are ranked in decreasing magnitude of the scores s ji .

For DeMAND analysis, we employed the public R subroutines available from the website: http://califano.c2b2. columbia.edu/demand. Following the procedure detailed in the original publication (23), we computed the RMA (Robust Multi-array Average) normalized gene expression values as inputs to the analysis. In DeMAND analysis, we used the same cell type-specific PGRNs as those in ProTINA. For each candidate protein target, DeMAND evaluated the P-value of the deviations in the gene expression relationship between the protein target and each of the genes connected to this protein in the PGRN. The drug targets were ranked in increasing magnitude of the combined P-values.

In differential gene expression (DE) analysis, we calculated the log 2 FC differential expression of each protein in the PGRN, as described in section Gene expression data above. Here, we used the log 2 FC values directly as the target scores. Correspondingly, we ranked the candidate protein targets in decreasing magnitude of the log 2 FC gene expression values.

For comparing the performance of different methods, we computed the area under the receiver operating characteristic curve (AUROC), i.e. the area under the plot of true positive rate against false positive rate, following the procedure adopted in DREAM challenges (40, 41) . For each method and each drug treatment, we generated a ranked list of protein targets according to decreasing magnitudes of the protein scores in ProTINA, increasing P-values of network dysregulation from DeMAND, and decreasing magnitudes of log 2 FC gene expression from DE analysis.

For influenza A virus study, we performed a gene set enrichment analysis (GSEA) of the protein target predictions from ProTINA, DeMAND and DE analysis for the KEGG biological pathways (42), using the R package GAGE (Generally Applicable Gene-set/pathway Enrichment analysis) with Kolmogorov-Smirnov tests (43) . In the case of ProTINA and DeMAND, target proteins with zero score were excluded from the GSEA.

The reference protein targets of compounds in drug treatment studies were compiled from five different public databases of chemical-protein interactions: DrugBank (44), Therapeutic Target Database (TTD) (45), MATADOR (46), Comparative Toxicogenomics Database (CTD) (47) , and STITCH (48) . DrugBank and TTD provided information on the mechanism of drug actions as well as the proteins that have physical binding interactions with drugs. Meanwhile, MATADOR, CTD, and STITCH gave interactions between proteins and chemical compounds, curated from text mining and experimental evidences. When retrieving the protein targets of drugs from these databases, we collected proteins that directly bind to the queried drugs. The reference targets for each dataset in this study are provided (49) , where 1292 host proteins that likely physically bind to viral proteins of influenza type A/WSN/33 in human embryonic kidney cells (HEK293) were identified by wholegenome co-immunoprecipitation assays.

ProTINA takes advantage of the availability of comprehensive protein-protein and protein-DNA interaction databases to construct, when possible, a tissue or cell type-specific PGRN. The method considers a PGRN with weighted directed edges (see Figure 1A) , describing direct and indirect gene transcriptional regulation by TFs and their protein partners. The edge weights are determined by applying ridge regression using the gene expression data based on a kinetic model of the gene transcriptional process (see Figure 1B and Materials and Methods). Here, a positive weight indicates a gene activation, while a negative weight implies a gene repression. Because of the underlying kinetic model, ProTINA is able to incorporate dynamical gene expression data, a common type of data from drug treatment studies (5, (24) (25) (26) . The scoring of drug targets is based on the enhancement or attenuation of protein-gene regulatory interactions caused by the drug treatment. A drug-induced gene regulatory enhancement occurs when the expression of genes that are positively (negatively) regulated by a candidate target, becomes higher (lower) in drug treated samples than what is predicted by the PGRN model (see Figure  1C) . A drug-induced attenuation describes the opposite scenario, where the expression of positively (negatively) regulated genes of a target is lower (higher) than expected from the model. For any given differential gene expression sample, a candidate protein target is scored based on the overall enhancement and/or attenuation of its regulatory influence on the downstream genes (see Figure 1D and Materials and Methods). Thus, a protein target with a more positive (negative) score is considered a more likely target of the drug, in which the drug treatment enhances (attenuates) the gene regulatory activity.

We tested ProTINA's performance in predicting drug targets using gene expression data from three drug treatment studies employing human and mouse cell lines. The first dataset came from the NCI-DREAM drug synergy study using human diffuse large B cell lymphoma OCI-LY3 (24), the second from the compound genotoxicity study using human liver cancer cells HepG2 (25) , and the third from the chromatin-targeting compound study using mouse pancreatic cells (26) . We compared ProTINA to the state-ofthe-art network-based analytical method DeMAND (23) , and to the traditional DE analysis. For the analysis of datasets from human cell lines, we constructed cell-type specific PGRNs by combining human PIN from STRING (36) and Enrichr database (35) and human cell-type specific protein-DNA networks from Regulatory Circuit resource (33) . Meanwhile, for the construction of mouse pancreatic 

Besides high AUROCs, ProTINA also provided accurate and specific indications on the MoA of the compounds. In the NCI-DREAM synergy study, roughly half of the compounds are known to induce DNA damage response, including DNA topoisomerase inhibitors (camptothecin, doxorubicin and etoposide), DNA crosslinker (mitomycin C), oxidative DNA damaging agent (methothrexate), and histone deacetylase (HDAC) inhibitors (trichostatin A). In demonstrating ProTINA's ability to reveal the compound MoA, we focused on the canonical p53 DNA damage response pathway (23) , as illustrated in Figure 3 . Here, the activation of p53 in response to DNA damage is expected to induce the transcription of Cyclin Dependent Kinase Inhibitor 1A (CDKN1A) and Growth Arrest and DNA Damage Inducible Alpha (GADD45A) (50, 51) . In turn, CDKN1A and GADD45A--through their interactions with Proliferating Cell Nuclear Antigen (PCNA)--regulate the DNA replication and repair process (52) . GADD45A also inhibits the catalytic activity of Aurora Kinase A (AU-RKA) (53), leading to a lowered activation of Polo-like Kinase 1 (PLK1) and Cyclin B1 (CCNB1) in a phosphorylation cascade (54, 55) . As shown in Figure 4A , except for trichostatin A, the six proteins in the canonical p53 pathway above were ranked highly by ProTINA among the genotoxic compounds in the study (median rank <500), consistent with their known MoA. Note that the same six proteins were ranked much lower among the non-DNA damaging compounds (median rank > 500), signifying a high specificity of ProTINA predictions (see also Supplementary Figure S1 ). Equally important, ProTINA was able to accurately identify the direction of the drug-induced alterations caused by the DNA damaging compounds. The signs of protein target scores from ProTINA indicated drug-induced enhancement (positive scores) of CDKN1A, PCNA and GADD45A, and attenuation (negative scores) of CCNB1, AURKA and PLK1 (see Supplementary Table  S4) , consistent with the expected response of these proteins to DNA damage in Figure 3 .

As illustrated in Figure 4A , DeMAND and DE analysis also performed reasonably well in predicting the com-pounds' MoA. But, the directions of the perturbations predicted by DE analysis were not always consistent with the expected response to DNA damage (see Supplementary Table S5 and S6). Meanwhile, DeMAND did not provide any information on the directions of the drug perturbations. In addition, the protein target scores of ProTINA provided a clearer demarcation between the genotoxic and the nongenotoxic agents among the compounds in the dataset, than DeMAND and DE analysis (see Supplementary Figure S1 ). Besides the canonical p53 response pathway, we further looked at the ranking of proteins involved in the overall DNA damage repair (DDR) and its associated pathways (56) (see Supplementary material 2). As depicted in Figure  4B , ProTINA ranked these proteins much higher than De-MAND and DE analysis, with DE performing the poorest among the methods considered.

In comparison to DeMAND and DE analysis, ProTINA was further able to detect a specific MoA of mitomycin C, whose DNA crosslinking activity is expected to prompt a particular DNA repair process called the fanconi anemia pathway (57) . The fanconi anemia pathway relies on a specific protein complex to ubiquitinate Fanconi Anemia Group D2 Protein (FANCD2) and Fanconi Anemia Group I Protein (FANCI), as well as two homologous recombination (HR) repair proteins, namely Breast Cancer Type 1 Susceptibility Protein (BRCA1) and RAD51 Recombinase (RAD51) (58) . In ProTINA analysis, the average rank of FANCD2, FANCI, BRCA1, and RAD51 was within top 100 for mitomycin C, while the average rank of those proteins was much >100 for the other DNA damaging agents (see Supplementary Table S7 ). However, the specific activation of the fanconi anemia pathway by mitomycin C was not detected by DeMAND or DE analysis. Thus, ProTINA provided more sensitive and specific indications for the mechanism of action of compounds than De-MAND and DE.

We (27) (28) (29) (30) . We employed ProTINA to compute the overall protein target scores using the gene expression data of Calu-3 from the four studies above, by averaging the scores from the early phase of the influenza infection between 0 and 12 h. We checked the target predictions of ProTINA against the findings from a genome-wide co-immunoprecipitation analysis of host and viral protein interactions (49) . More specifically, the aforementioned study reported 1292 host proteins that co-immunoprecipitated with viral proteins of influenza A/WSN/33 using human embryonic kidney cells (HEK293). Despite the discrepancy in the cell types and influenza viral strains between the co-immunoprecipitation analysis and the gene expression profiling, influenza A viruses share similar features and common protein interac- tions (59, 60) . Besides ProTINA, we also evaluated the accuracy of viral target predictions from DeMAND and DE for the same dataset. Figure 5 gives the receiver operating characteristic (ROC) curves of the target predictions from ProTINA, DeMAND and DE analysis. ProTINA outperformed the two other methods, providing the highest AUROC (ProTINA: 0.76 versus demand: 0.69 and DE: 0.65). We further performed a gene set enrichment analysis (GSEA) for the target predictions from each of the methods (see Materials and Methods) to elucidate the key pathways involved in the viral infection and the accompanying host response. The results of the GSEA are summarized in Figure 6 . Both DeMAND and DE target predictions were enriched for only a few pathways (q-value < 0.01), while ProTINA prediction had a much higher number of overrepresented pathways.

The common enriched pathways among ProTINA, De-MAND and DE (top of Figure 6 ) included known mechanisms related to viral entry, replication and assembly, including endocytosis (61) , protein processing in endoplasmic reticulum (62) , ubiquitin mediated proteolysis (63, 64) and RIG-I-like receptor signaling pathway (65, 66) . Both ProTINA and DE analysis indicated the modulation of host cell cycle (67) , mRNA surveillance (68) and DNA damage response (69) . Only ProTINA prediction was significantly enriched for focal adhesion and actin cytoskeleton, which have been shown to regulate influenza virus entry at the early stage of infection (70) . In addition, ProTINA target predictions were also enriched for a broad set of host response pathways to viral infection, including host defense mechanism (e.g. T-and B-cell receptor pathways, phagocytosis, leukocyte migration, chemokine signaling pathways), DNA damage repair (e.g. nucleotide excision repair, p53 signaling pathway, homologous recombination) and apop- tosis. As several influenza proteins are known to interfere with interferon production (which in turn regulates several cytokines) (65, 66) , these findings suggest that, overall, ProTINA provided a broader picture of the early events in the influenza A viral infection, than DeMAND and DE analysis.

ProTINA is a novel and highly effective network-based analytical method for inferring the protein targets of compounds from gene expression profiling data. ProTINA combines the information of TF-gene and protein-protein interactions and data of differential gene expressions to create a tissue or cell type-specific PGRN model. Similar to network-based analysis methods such as NIR (7), MNI (8), SSEM (9) and DeltaNet (10), ProTINA uses a dynamic mechanistic model of the gene transcriptional process to compute deviations in the differential gene expression profiles that are induced by drug treatments. However, as mentioned earlier, the expression of the targets of a drug is often unaffected by the drug treatment (3) . For this reason and as illustrated in Figure 1C and D, ProTINA further transforms the deviations in the differential gene expression into alterations in the protein-gene regulatory edges in the PGRN model. Finally, the target scoring is based on edgetic perturbations of the PGRN, specifically enhancement or attenuation of gene regulatory interactions, caused by the compound.

Nucleic Acids Research, 2018, Vol. 46, No. 6 e34 Like ProTINA, the state-of-the-art method DeMAND also relies on gene transcriptional dysregulations to score drug targets. But, DeMAND does not consider the mode nor the dynamics of the gene regulations, and is unable to predict the direction of the drug-induced dysregulations. DeMAND calculates protein dysregulation scores (P-values) for a given gene regulatory network, by statistical comparison between samples from drug treatment and from control experiments. Consequently, DeMAND requires only few samples to generate its prediction (provided that the network can be defined a priori). On the other hand, ProTINA makes use of the available differential gene expression profiles from a study or a cell line (i.e. not only from the specific drug treatment), to assign the edge weights of the PGRN by ridge regression. Importantly, in the regression analysis, the PGRN model used in ProTINA accounts for the network perturbations. The ability of ProTINA to incorporate data from other drug treatments or conditions in the scoring of protein targets makes this method particularly suited to take advantage of the extensive and still growing number of gene transcriptional profiles from online databases, such as GEO. As demonstrated in the applications to three benchmark drug treatment datasets using human and mouse cell lines, ProTINA significantly outperformed DeMAND and the standard DE analysis. The target predictions of ProTINA also provide indications for the MoA of compounds, including the directions of the network perturbations, with high sensitivity and specificity.

Besides its intended use to predict targets of compounds, we also demonstrated that the analysis of network perturbations using ProTINA could provide insights into the mechanism of diseases. In the application to gene expression profiles of Calu-3 cells from influenza A infection studies, ProTINA again outperformed DeMAND and DE analysis in identifying host factors that bind with viral proteins. Furthermore, the GSEA of ProTINA target predictions revealed the spectrum of cellular processes involved in the early phase of influenza A infection, including pathways involved in viral entry, replication and assembly, and those related to cellular response to viral infection. Among the pathways with the highest significance (lowest q-value) was focal adhesion, which has been shown to regulate influenza viral entry as well as viral replication (70) . Meanwhile, the target predictions of DeMAND and DE analysis had fewer enriched pathways, and thus were less informative than the target analysis by ProTINA.

The PGRN model (see Equation (1)) belongs to a class of modeling framework called Biochemical Systems Theory, specifically the S-systems model (71) . In addition to gene regulatory networks, S-system modeling have also been used to describe other cellular networks, including signal transduction pathways and metabolic reaction networks (72) . Therefore, the principle used in ProTINA could be readily adapted to infer perturbations in cellular signalling or metabolic networks, for example from proteomic and metabolomics profiles, respectively. Besides PGRNs and gene transcriptional profiles, we have not applied ProTINA to analyze other types of cellular networks and data, as such an application was beyond the scope of our work.

ProTINA requires a cell type-or tissue-specific PGRN as an input, which may hinder its application to analyze data from lesser studied organisms. In the case studies, we leveraged on the extensive online databases of proteinprotein interactions and TF-gene networks to manually curate PGRNs for human and mouse cells (33, 34, 36) . Alternatively, provided that a large dataset of gene expression profiles are available for the cell of interest, the PGRN could be inferred using existing network inference methods (73, 74) . Another potential limitation in applying ProTINA is the requirement for differential gene expression data for inferring the edge weights of the PGRN. While the minimum number for implementing ridge regression with a 3-fold cross validation (lowest fold in GLMNET) is three, the accuracy of the weights and thus the target predictions from ProTINA would generally deteriorate with lower sample sizes. Nevertheless, ProTINA was still able to provide reasonably accurate predictions using a total of 18 samples in the influenza A virus case study above.

The performance of ProTINA, like any other networkbased analytical methods, depends on the fidelity of the network used in the analysis. Uncertainty in the PGRN model, both in the structure and the edge weights, is expected to negatively affect the accuracy of the target prediction. Here, structural uncertainty is associated with the reliability of the information used to construct the PGRN, which in our study, comes from online databases of PIN and TF-gene networks. On the other hand, the uncertainty in the edge weights is associated with multiple factors, including the information content of the gene transcriptional profiles and the mathematical formulation used for the weight inference. The information content of the gene expression data is in turn related to measurement uncertainty and richness in the experimental perturbations. Keeping the same number of treatments, datasets with more replicates and less correlated gene expression profiles (i.e. the treatments induce more distinct perturbations to the network), would have a higher degree of information. Meanwhile, we have previously shown that the validity of the model assumption (e.g. pseudo steady-state condition) has an effect on the accuracy of the inferred weights and thus the target prediction accuracy (10). While we have circumvented the issue arising from the violation of the pseudo steady-state assumption in ProTINA (see Equation (7)), (in)validating all model assumptions may be difficult, if not impossible, in practice. A common strategy, as implemented in this study, is to test the performance of the method against benchmark datasets (13) . The results of applying ProTINA to drug treatment and influenza A viral infection datasets give confidence to the suitability of the mathematical formulation used in this work.

MATLAB and R versions of ProTINA can be downloaded from Github repository (https://github.com/CABSEL/ ProTINA).

Drugs, their targets and the nature and number of drug targets

Target identification and mechanism of action in chemical biology and drug discovery

Drug target prioritization by perturbed gene expression and network information

Discovering the targets of drugs via computational systems biology

The connectivity map: A new tool for biomedical research

Transcriptional data: a new gateway to drug repositioning?

Inferring genetic networks and identifying compound mode of action via expression profiling

Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks

Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia

Inferring gene targets of drugs and chemical compounds from gene expression profiles

Inferring causal gene targets from time course expression data. IFAC-PapersOnLine

Silencing HoxA1 by intraductal injection of siRNA lipidoid nanoparticles prevents mammary tumor progression in mice

Wisdom of crowds for robust gene network inference

Assessment of network inference methods: How to cope with an underdetermined problem

Ensemble inference and inferability of gene regulatory networks

KEA: Kinase enrichment analysis

A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers

Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers

Drug target prediction and repositioning using an integrated network-based approach

Finding the targets of a drug by integration of gene expression data with a protein interaction network

Causal reasoning on biological networks: Interpreting transcriptional changes

Assessment of network perturbation amplitudes by applying high-throughput data to causal biological networks

Elucidating compound mechanism of action by network perturbation analysis

A community computational challenge to predict the activity of pairs of compounds

A transcriptomics-based in vitro assay for predicting chemical genotoxicity in vivo

Chromatin-targeting small molecules cause class-specific transcriptional changes in pancreatic endocrine cells

Conserved host response to highly pathogenic avian influenza virus infection in human cell culture, mouse and macaque model systems

Host regulatory network response to infection with highly pathogenic H5N1 avian influenza virus

A network integration approach to predict conserved regulators related to pathogenicity of influenza and SARS-CoV respiratory viruses

Pathogenic influenza viruses and coronaviruses utilize similar and contrasting approaches to control interferon-stimulated gene responses

NCBI GEO: archive for functional genomics data sets -Update

Bioconductor: open software development for computational biology and bioinformatics

Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases

CellNet: Network biology applied to stem cell engineering

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

STRING v10: protein-protein interaction networks, integrated over the tree of life

Finite Difference Calculus, Numerical Partial Differential Equations for Enviornmental Scientists and Engineers: A first Practical Course

Regularization paths for generalized linear models via coordinate descent

Combining probability from independent tests: The weighted Z-method is superior to Fisher's approach

Lessons from the DREAM2 challenges: a community effort to assess biological network inference

Towards a rigorous assessment of systems biology models: The DREAM3 challenges

KEGG: New perspectives on genomes, pathways, diseases and drugs

GAGE: generally applicable gene set enrichment for pathway analysis

DrugBank 4.0: Shedding new light on drug metabolism

Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery

SuperTarget and Matador: resources for exploring drug-target relationships

The comparative toxicogenomics database: update 2017

STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data

Influenza virus-host interactome screen as a platform for antiviral drug development

Multiple roles of the cell cycle inhibitor p21CDKN1A in the DNA damage response

Gadd45a, a p53-and BRCA1-regulated stress protein, in cellular response to DNA damage

PCNA: structure, functions and interactions

Gadd45a interacts with aurora-A and inhibits its kinase activity

Polo-like kinase-1 is activated by aurora A to promote checkpoint recovery

Polo-like kinase 1 phosphorylates cyclin B1 and targets it to the nucleus during prophase

Therapeutic opportunities within the DNA damage response

DNA interstrand crosslink repair and cancer

Fanconi anaemia proteins, DNA interstrand crosslink repair pathways, and cancer therapy

Cellular proteins in influenza virus particles

Structure homology and interaction redundancy for discovering virus-host protein interactions

A comprehensive map of the influenza A virus replication cycle

Folding of influenza hemagglutinin in the endoplasmic reticulum

Influenza virus infection causes specific degradation of the largest subunit of cellular RNA polymerase II

Ubiquitin in influenza virus entry and innate immunity

The multifunctional NS1 protein of influenza A viruses

Influenza virus protein PB1-F2 inhibits the induction of type I interferon by binding to MAVS and decreasing mitochondrial membrane potential

Integrated network analysis reveals a novel role for the cell cycle in 2009 pandemic influenza virus-induced inflammation in macaque lungs

Non-structural protein 1 of influenza viruses inhibits rapid mRNA degradation mediated by double-stranded RNA-binding protein, staufen1

Influenza infection induces host DNA damage and dynamic DNA damage responses during tissue regeneration

Novel roles of focal adhesion kinase in cytoplasmic entry and replication of influenza A viruses

Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists

Recent developments in parameter estimation and structure identification of biochemical and genomic systems

ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context

Genome-wide identification of post-translational modulators of transcription factor activity in human B cells