key: cord-0755984-1npzolrr
authors: Llabrés, Mercè; Valiente, Gabriel
title: Alignment of virus-host protein-protein interaction networks by integer linear programming: SARS-CoV-2
date: 2020-12-07
journal: PLoS One
DOI: 10.1371/journal.pone.0236304
sha: 5be576f9d812018d8da5f757d0b992512e86fb5f
doc_id: 755984
cord_uid: 1npzolrr

MOTIVATION: Beyond its socio-economic consequences, the COVID-19 pandemic, caused by the newly discovered coronavirus SARS-CoV-2, has had a deep impact on the scientific community, which has considerably increased its efforts to discover the infection strategies of the new virus. Among the extensive and crucial research carried out in recent months, the analysis of the virus-host relationship plays an important role in drug discovery. Virus-host protein-protein interactions are the active agents in virus replication, and the analysis of virus-host protein-protein interaction networks is fundamental to the study of the virus-host relationship.

RESULTS: We have adapted and implemented a recent integer linear programming model for protein-protein interaction network alignment to virus-host networks, and obtained a consensus alignment of the SARS-CoV-1 and SARS-CoV-2 virus-host protein-protein interaction networks. Despite the lack of shared human proteins in these virus-host networks, and the low number of preserved virus-host interactions, the consensus alignment revealed aligned human proteins that share a function related to viral infection, as well as human proteins of high functional similarity that interact with SARS-CoV-1 and SARS-CoV-2 proteins, whose alignment would preserve these virus-host interactions.

The present outbreak of a coronavirus-associated acute respiratory disease, the COVID-19 pandemic, has forced the scientific community to rapidly analyze the virus-host relationships of the new coronavirus (SARS-CoV-2) human infection.
Thus, in less than a month, several databases, such as [1-3], were created to collect SARS-CoV-2 and COVID-19 data, and the SARS-CoV-2-human protein-protein interaction network was built [4]. As stated in [5], "The Coronaviridae Study Group (CSG) of the International Committee on Taxonomy of Viruses [. . .] has assessed the placement of the human pathogen, tentatively named 2019-nCoV, within the Coronaviridae. Based on phylogeny, taxonomy and established practice, the CSG recognizes this virus as forming a sister clade to the prototype human and bat severe acute respiratory syndrome coronaviruses (SARS-CoVs) of the species Severe acute respiratory syndrome-related coronavirus, and designates it as SARS-CoV-2." Therefore, the closest known human pathogen to SARS-CoV-2 is the coronavirus SARS-CoV that appeared in 2003 [6], also called SARS-CoV-1.

Understanding the mechanism of the SARS-CoV-2 infection is a crucial step towards the discovery of antiviral drugs and vaccines. Every viral infection operates through interactions between viral proteins and host proteins, by which the virus uses the host cells to replicate. In this line of research, virus-host protein-protein interaction networks, a particular form of protein-protein interaction networks, have become an appropriate tool to analyze virus-host relationships, and information on well-known and studied virus-host protein-protein interaction networks can be carried over to new ones by way of protein-protein interaction network comparison and alignment. See [7, 8] for comprehensive reviews.

The general problem of protein-protein interaction network alignment has been explored over the last two decades, and several tools have already been proposed and implemented [9-14]. However, the particular case of virus-host protein-protein interaction network alignment has not been fully studied yet.
We have recently developed a compact reformulation of a quadratic programming model for the protein-protein interaction network alignment problem as an integer linear program, which has been shown to be suitable for the alignment of virus-host protein-protein interaction networks [15]. Our proposed model can be solved using state-of-the-art mathematical modeling software such as AMPL [16] and integer linear programming software tools such as IBM ILOG CPLEX Optimization Studio and Gurobi Optimizer. In this work, we adapt and implement a modification of the aforementioned alignment method to align the virus-host protein-protein interaction networks of SARS-CoV-1 and SARS-CoV-2, in order to elucidate information on the infection mechanism of SARS-CoV-2 based on current knowledge on the infection mechanism of SARS-CoV-1.

In the integer linear programming formulation of the protein-protein interaction network alignment problem, described in [15], a virus-host protein-protein interaction network is represented by an undirected bipartite graph $G = (U, V, E)$, with a node $u \in U$ for each virus protein, a node $v \in V$ for each host protein, and an edge $\{u, v\} \in E$ for each virus-host protein-protein interaction. Notice that these bipartite graphs need not be connected. Let $G = (U, V, E)$ and $G' = (U', V', E')$ be the two virus-host protein-protein interaction networks to be aligned, and let $A = (a_{ij})$ and $B = (b_{k\ell})$ be their weighted adjacency matrices, where the weight of an entry $a_{ij} \in [0, 1]$ is the confidence score of the interaction $\{i, j\} \in E$, and the weight of an entry $b_{k\ell} \in [0, 1]$ is the confidence score of the interaction $\{k, \ell\} \in E'$. Let also $S = (s_{ik})$ be a similarity matrix between the nodes of the two networks, with each $s_{ik} \in [0, 1]$ the similarity score of $i \in U \cup V$ and $k \in U' \cup V'$.
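The network representation above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `build_network` and the node ordering (virus proteins first, then host proteins) are assumptions, and the confidence scores in the toy input are made up, although the three protein identifiers are taken from the text.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, float]  # (virus protein, host protein, confidence)


def build_network(interactions: List[Interaction]):
    """Return (nodes, index, adjacency) for a bipartite virus-host network.

    Nodes are ordered with the virus proteins (U) first and the host
    proteins (V) after them; adjacency is a symmetric matrix of weights.
    """
    virus = sorted({u for u, _, _ in interactions})
    host = sorted({v for _, v, _ in interactions})
    nodes = virus + host
    index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    adjacency = [[0.0] * size for _ in range(size)]
    for u, v, weight in interactions:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"confidence score outside [0, 1]: {weight}")
        adjacency[index[u]][index[v]] = weight
        adjacency[index[v]][index[u]] = weight  # undirected edge {u, v}
    return nodes, index, adjacency


# Toy fragment: the identifiers appear in the text, but the confidence
# scores 0.9 and 0.8 are hypothetical values chosen for illustration.
toy = [("P59594", "O00303", 0.9), ("P59594", "Q9BYF1", 0.8)]
nodes, index, A = build_network(toy)
```

Host-host entries of the matrix stay at zero, which matches a bipartite network in which only virus-host interactions are recorded.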
Let us define a binary variable $x_{ik}$ for each $i \in U \cup V$ and each $k \in U' \cup V'$, where $x_{ik} = 1$ if node $i$ of the first network is aligned with node $k$ of the second network, and $x_{ik} = 0$ otherwise. Then, an alignment of two virus-host protein-protein interaction networks $G = (U, V, E)$ and $G' = (U', V', E')$ is represented by the binary matrix $X = (x_{ik})$. Let us also define an integer variable $y_{ik}$ for each $i \in U \cup V$ and each $k \in U' \cup V'$, where each integer variable $y_{ik}$ is intended to represent

$$y_{ik} = x_{ik} \sum_{j} \sum_{\ell} a_{ij} b_{k\ell} x_{j\ell}.$$

In this way, if $x_{ik} = 0$ then $y_{ik} = 0$, and if $x_{ik} = 1$ then $y_{ik}$ is the weight of those edges incident to node $i$ in $G$ that are preserved by the alignment. Then, the goal of the integer linear programming model is to maximize

$$\lambda \sum_{i} \sum_{k} s_{ik} x_{ik} + (1 - \lambda) \sum_{i} \sum_{k} y_{ik}$$

subject to the linear constraints of the model described in [15], where $\lambda$ is a parameter, with $0 \le \lambda \le 1$, to control the balance between protein similarity scores and protein-protein interaction weights: only node scores are considered when $\lambda = 1$, and only edge scores are taken into account when $\lambda = 0$.

It is easy to see that this integer linear programming formulation is equivalent to the integer quadratic programming formulation of the network alignment problem given in [9]. In fact, the previous constraints entail

$$\sum_{i} \sum_{k} y_{ik} = \sum_{i} \sum_{j} \sum_{k} \sum_{\ell} a_{ij} b_{k\ell} x_{ik} x_{j\ell}.$$

The objective function comes from the PathBLAST [11] idea that protein-protein network alignment be based on a log-probability-like criterion, with matching terms corresponding to both proteins and interactions [9]. The first sum in the objective function,

$$\sum_{i} \sum_{k} s_{ik} x_{ik},$$

represents the global similarity of the aligned proteins, while the second sum,

$$\sum_{i} \sum_{j} \sum_{k} \sum_{\ell} a_{ij} b_{k\ell} x_{ik} x_{j\ell},$$

represents the weight of those edges that are preserved by the alignment; that is, those pairs of edges $(i, j) \in E$ and $(k, \ell) \in E'$ such that node $i$ is aligned with node $k$ and node $j$ is aligned with node $\ell$.

Let $m = |U| + |V|$ and $n = |U'| + |V'|$. The resulting integer linear programming formulation of the virus-host protein-protein network alignment problem has $O(mn)$ binary variables, integer variables, and constraints.
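To make the roles of the node term, the edge term, and the parameter λ concrete, the following sketch evaluates the quadratic form of the objective for a given alignment and finds the best alignment of a toy instance by exhaustive search. It is not the integer linear program itself and does not use AMPL or Gurobi. Several things here are assumptions rather than details from the paper: the function names, the restriction to one-to-one alignments that map every node of the smaller network, the absence of any virus-to-virus restriction, and all numbers in the toy instance. With symmetric adjacency matrices, each preserved edge contributes twice to the edge term (once from each endpoint).

```python
from itertools import permutations


def edge_terms(A, B, X):
    """y[i][k] = x_ik * sum_j sum_l a_ij * b_kl * x_jl."""
    m, n = len(A), len(B)
    return [[X[i][k] * sum(A[i][j] * B[k][l] * X[j][l]
                           for j in range(m) for l in range(n))
             for k in range(n)] for i in range(m)]


def objective(A, B, S, X, lam):
    """lam * (node similarity) + (1 - lam) * (weight of preserved edges)."""
    m, n = len(A), len(B)
    node_score = sum(S[i][k] * X[i][k] for i in range(m) for k in range(n))
    edge_score = sum(sum(row) for row in edge_terms(A, B, X))
    return lam * node_score + (1 - lam) * edge_score


def best_alignment(A, B, S, lam):
    """Exhaustive search over injective maps (requires len(A) <= len(B))."""
    m, n = len(A), len(B)
    best_value, best_image = float("-inf"), None
    for image in permutations(range(n), m):
        X = [[1 if image[i] == k else 0 for k in range(n)] for i in range(m)]
        value = objective(A, B, S, X, lam)
        if value > best_value:
            best_value, best_image = value, image
    return best_value, best_image


# Hypothetical toy instance: network 1 has a virus node 0 and a host node 1;
# network 2 has a virus node 0 and host nodes 1 and 2.
A = [[0.0, 1.0],
     [1.0, 0.0]]
B = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
S = [[1.0, 0.0, 0.0],
     [0.0, 0.9, 0.2]]

balanced = best_alignment(A, B, S, 0.5)    # trades similarity for edge weight
nodes_only = best_alignment(A, B, S, 1.0)  # ignores interactions entirely
```

In this toy instance the balanced setting aligns host node 1 with the less similar host node 2 because that preserves the heavier interaction, while λ = 1 picks the most similar host. Exhaustive search is only feasible for a handful of nodes, which is why the actual instances are handed to an integer linear programming solver.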
There are 130 interactions between 29 SARS-CoV-1 proteins and 109 human proteins in the March 2020 release of the VirHostNet database [1], as well as 332 interactions between 26 SARS-CoV-2 proteins and 332 human proteins from [4] in release 4.2.13 of the IntAct database [2]. Thus, the SARS-CoV-1-Human network has 138 nodes and 130 edges, while the SARS-CoV-2-Human network has 358 nodes and 332 edges. Notice that only 6 of these 109 and 332 human proteins (P27448, Q5JRX3, Q7KZI7, Q9BW92, Q9H4F8, and Q9Y6E2) interact with both SARS-CoV-1 and SARS-CoV-2 proteins.

Notice also that we have excluded any interactions among the 109 human proteins in the SARS-CoV-1-Human network, as well as any interactions among the 332 human proteins in the SARS-CoV-2-Human network. These host-host interactions do not contribute to improving the quality of the virus-host protein-protein network alignment; rather, they introduce noise. In fact, the inclusion of both virus-host and host-host interactions in the SARS-CoV-1-Human and SARS-CoV-2-Human networks results in the alignment of four of the SARS-CoV-1 proteins (nsp7, nsp9, nsp12, and orf7a) with host proteins (P61019, P06280, Q9Y375, and Q9Y5J7, respectively) instead of SARS-CoV-2 proteins. The SARS-CoV-1-Human and SARS-CoV-2-Human virus-host protein-protein interaction networks have 11 and 26 connected components, respectively.

We have obtained the amino acid sequences for the SARS-CoV-1 and human proteins from UniProt/SwissProt (122 sequences), UniProt/TrEMBL (2 sequences), and NCBI RefSeq (14 sequences), and for the SARS-CoV-2 and human proteins from UniProt/SwissProt (332 sequences) and from the Supplementary material in [4] (26 sequences).
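The number of connected components of such a network can be counted directly from its list of virus-host interactions. The sketch below uses a union-find structure; the function name and the toy edge list are hypothetical, and the code is not taken from the paper. Running it on the real interaction lists should give the 11 and 26 components reported above, provided every node takes part in at least one interaction.

```python
def count_components(edges):
    """Count connected components of the graph spanned by the given edges."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two components
    return len({find(x) for x in list(parent)})


# Hypothetical toy network: viral proteins S, M, N and host proteins h1..h3.
toy_edges = [("v:S", "h:1"), ("v:S", "h:2"), ("v:M", "h:3"), ("v:N", "h:3")]
toy_components = count_components(toy_edges)
```

The prefixes `v:` and `h:` are only a device to keep viral and host names distinct in the toy data.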
We have taken the global alignment score between the amino acid sequences of two proteins, computed by dynamic programming with the algorithm of [17] as implemented in BioPython [18], with a gap opening penalty of −7 and a gap extension penalty of −1, and normalized to [0, 1], as the similarity score between the proteins. In the protein sequences of P07203 and Q9BQE4, we substituted C (cysteine) for the rare amino acid U (selenocysteine), which appears only once in the protein sequences of P07203 and Q9BQE4 over a dataset of 310,717 amino acids in 496 viral and human protein sequences, in order to compute global sequence alignments using the BLOSUM62 amino acid substitution matrix, which does not cover selenocysteine.

The corresponding integer linear programming problem instance has 83,628 variables, half of which are binary, and 84,069 constraints. The alignment of the virus-host protein-protein interaction networks of SARS-CoV-1 and SARS-CoV-2 was computed with AMPL version 2018.10.22 [16] and Gurobi Optimizer version 8.1.0, using a personal computer with an Intel Core i7-8550U quad-core processor at 1.80 GHz and 32 GB of memory running Ubuntu 18.04 LTS. The optimal alignment was found in 517.35 seconds of AMPL time, plus 3.16697 seconds of solver time, for SARS-CoV-1 to SARS-CoV-2, and in 538.112 seconds of AMPL time, plus 3.45882 seconds of solver time, for SARS-CoV-2 to SARS-CoV-1. We set λ = 0.5 in both cases, and took the consensus between them as the alignment of the two virus-host protein-protein interaction networks.

Protein similarity can be assessed by comparing the annotated Gene Ontology (GO) terms for the proteins along three classifications: the molecular function ontology (MFO), the biological process ontology (BPO), and the cellular component ontology (CCO). We considered the host proteins that interact with the viral proteins in the consensus alignment, for each of the two networks.
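The consensus step mentioned above can be sketched as follows, under the assumption (not spelled out in the text) that the consensus keeps exactly those pairs on which the two directional alignments agree. The function name is hypothetical; the viral identifiers are the spike, membrane, and nucleocapsid proteins named in the text, while `hostA`, `hostB`, and `hostX` are placeholders.

```python
def consensus(forward, backward):
    """Keep the pairs (i, k) with forward[i] == k and backward[k] == i.

    forward maps nodes of network 1 to nodes of network 2, and backward
    maps nodes of network 2 to nodes of network 1.
    """
    return {i: k for i, k in forward.items() if backward.get(k) == i}


# Toy directional alignments; the host entries are hypothetical placeholders.
forward = {"P59594": "P0DTC2", "P59596": "P0DTC5", "P59595": "P0DTC9",
           "hostA": "hostX"}
backward = {"P0DTC2": "P59594", "P0DTC5": "P59596", "P0DTC9": "P59595",
            "hostX": "hostB"}
agreed = consensus(forward, backward)
```

Here the three viral pairs survive, and the host pair is dropped because the two directions disagree on it.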
For these human proteins we obtained their GO term annotations using GOnet [19] , and measured the functional similarity between aligned human proteins using GOGO [20] , which computes the average best semantic similarity between the GO term annotations for the proteins based on their shortest paths in the GO classifications. Tables 1 (structural proteins), 2-4 (non-structural proteins), and 5 (accessory proteins) show the alignment of viral proteins in the consensus alignment, along with the alignment of the human proteins they interact with, their MFO score, their BPO score, and their CCO score. We can observe that the four structural proteins in one network were aligned with the corresponding protein in the other network. Also, most of the non-structural proteins and half of the accessory proteins in one network were aligned with the corresponding protein in the other network. On the other hand, for each pair of aligned viral proteins, the highlighted proteins in the same column of a viral protein are the human proteins it interacts with. For instance, human proteins O00303 and Q9BYF1 interact with the SARS-CoV-1 spike protein P59594, while human protein Q7Z5G4 interacts with the SARS-CoV-2 spike protein P0DTC2. Table 1 (a) shows that O00303 is aligned with P48556, Q9BYF1 is aligned with Q7L8L6, and O95295 is aligned with Q7Z5G4. Missing data are due to lack of GO term annotation for the two interacting proteins. As can be seen in these tables, most of the aligned proteins have a cellular component ontology score above 0.700. This means that, despite the low number of conserved interactions, the aligned proteins share their cellular location. For instance, those human proteins that interact with the spike protein in SARS-CoV-1 are aligned with human proteins that interact with the membrane protein in SARS-CoV-2. 
However, some biological process ontology scores between aligned human proteins are very low. This can be explained by the lack of biological process ontology GO term annotation for one of the two interacting proteins. With respect to molecular function ontology, it is remarkable that we obtained high scores for aligned proteins that interact with structurally different viral proteins.

Indeed, one of the measures used to test the correctness of a protein-protein interaction network alignment is the edge correctness score, which measures the ratio of conserved edges in a given alignment. Edge correctness assumes that one of the aims of the alignment is to find similar regions between the two aligned networks, in terms of network topology. In the context of protein-protein interaction networks, it is also assumed that two proteins interact when they together carry out some biological function. For virus-host protein-protein interaction networks, viral proteins interact with host proteins to perturb the intracellular networks of their hosts to their advantage, and many virus-host interactions occur at the level of physical protein-protein interactions [7]. This means that a viral protein interacts with a host protein to carry out a cellular process, and this pathway of virus-host interactions constitutes the infection mechanism of the virus. The question then arises: when a viral protein in one network is aligned with a viral protein in another network, should the host proteins that interact with one viral protein be aligned with those host proteins that interact with the other viral protein? Clearly the answer is yes when the virus-host protein-protein interactions correspond to a similar stage of the infectious process. Therefore, aligned virus-host interactions must entail conserved stages in the infectious process. However, non-conserved edges do not necessarily imply incorrect alignments.
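As a concrete reading of the edge correctness score discussed above, the sketch below computes the fraction of edges of the first network whose endpoints are both aligned and whose images form an edge of the second network. Normalizing by the number of edges of the first network is an assumption (it is a common convention, but the text only speaks of "the ratio of conserved edges"), and the function name is hypothetical. The example uses the spike-protein fragment described in the text, restricted to three interactions, so the resulting value says nothing about the full networks.

```python
def edge_correctness(edges1, edges2, mapping):
    """Fraction of edges of network 1 that the alignment maps onto edges of network 2."""
    if not edges1:
        raise ValueError("network 1 has no edges")
    target = {frozenset(edge) for edge in edges2}
    conserved = sum(
        1 for u, v in edges1
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in target
    )
    return conserved / len(edges1)


# Fragment from the text: spike interactions in both networks.
sars1 = [("P59594", "O00303"), ("P59594", "Q9BYF1")]
sars2 = [("P0DTC2", "Q7Z5G4")]

# Alignment pairs reported in the text for this fragment.
reported = {"P59594": "P0DTC2", "O00303": "P48556", "Q9BYF1": "Q7L8L6"}
# Hypothetical alternative that would preserve one spike interaction.
alternative = {"P59594": "P0DTC2", "O00303": "Q7Z5G4", "Q9BYF1": "Q7L8L6"}

ec_reported = edge_correctness(sars1, sars2, reported)
ec_alternative = edge_correctness(sars1, sars2, alternative)
```

On this fragment the reported pairs conserve no spike interaction, whereas the hypothetical alternative conserves one of the two, which illustrates the tension between edge preservation and functional similarity.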
Indeed, when we analyze in more depth the functional description [3] of the aligned human proteins that interact with Coronavirus proteins, we realize that they share a function related to viral infection, although their alignment introduces a non-preserved interaction. This is the case of the following pairs of proteins:

O14920 and Q9UHD2. The molecular function ontology score of these proteins is 1.000. Human protein O14920 interacts with viral protein P59596 (membrane) of SARS-CoV-1, which is aligned with protein P0DTC5 (membrane) of SARS-CoV-2. On the other hand, human protein Q9UHD2 interacts with viral protein P0DTD1-PRO_0000449630 (orf9b) of SARS-CoV-2. However, the functional description of the aligned human proteins reflects correctness of the consensus alignment:
• O14920: Serine kinase that plays an essential role in the NF-kappa-B signaling pathway which is activated by multiple stimuli such as inflammatory cytokines, bacterial or viral products.
• Q9UHD2: Serine/threonine kinase that plays an essential role in regulating inflammatory responses to foreign agents. Following activation of toll-like receptors by viral or bacterial components, associates with TRAF3 and TANK and phosphorylates interferon regulatory factors (IRFs) IRF3 and IRF7 as well as DDX3X.

Q92499 and Q9NR30. The molecular function ontology score of these proteins is 0.850. Human protein Q9NR30 interacts with viral protein P0DTC9 (nucleocapsid) of SARS-CoV-2, which is aligned with protein P59595 (nucleocapsid) of SARS-CoV-1. On the other hand, human protein Q92499 interacts with viral protein P0C6X7-PRO_0000037320 (proofreading exoribonuclease in replicase polyprotein 1ab) of SARS-CoV-1. The functional description of the aligned human proteins is:
• Q92499: Helicase required for Coronavirus IBV replication. Antiviral defense.
• Q9NR30: Component of a multi-helicase-TICAM1 complex that acts as a cytoplasmic sensor of viral double-stranded RNA (dsRNA) and plays a role in the activation of a cascade of antiviral responses including the induction of proinflammatory cytokines via the adapter molecule TICAM1.

P49703 and P62330. The molecular function ontology score of these proteins is 0.850. Both are GTP-binding proteins. Human protein P49703 interacts with viral protein NP_828865 (nsp7) of SARS-CoV-1. Human protein P62330 interacts with viral protein P0DTD1-PRO_0000449632 (nsp15) of SARS-CoV-2, which, as reported in [4], "has uridine-specific endoribonuclease (endoU) activity and is essential for viral RNA synthesis," with the endoU domain being "one of the most conserved proteins among CoVs and related viruses, suggesting important functions in the viral replicative cycle." The functional description of the aligned human proteins is:
• P49703: Small GTP-binding protein which cycles between an inactive GDP-bound and an active GTP-bound form, and the rate of cycling is regulated by guanine nucleotide exchange factors (GEF) and GTPase-activating proteins (GAP).
• P62330: GTP-binding protein involved in protein trafficking that regulates endocytic recycling and cytoskeleton remodeling. Activation is generally mediated by a guanine exchange factor (GEF), while inactivation through hydrolysis of bound GTP is catalyzed by a GTPase activating protein (GAP).

Therefore, it is not clear whether edge preservation should always be required in a correct alignment of virus-host protein-protein interaction networks. To reinforce this idea, we considered the functional similarity of all pairs of human proteins whose alignment would preserve edges, given the consensus alignment of 24 viral proteins.
For each pair of aligned viral proteins (say, membrane proteins) we considered the biological process, the cellular component, and the molecular function ontology scores of all pairs of human proteins that interact with the aligned viral proteins (say, all pairs of human proteins such that the first protein interacts with viral membrane protein P59596 of SARS-CoV-1 and the second protein interacts with viral membrane protein P0DTC5 of SARS-CoV-2). The cellular component ontology score is above 0.800 for most of these pairs of human proteins, but their highest molecular function ontology score is 0.852, while it is 1.000 in the consensus alignment, and their highest biological process ontology score is 0.670, while it is 0.863 in the consensus alignment. Nevertheless, some of these pairs of human proteins whose alignment would preserve virus-host interactions do show high functional similarity scores, and it could be interesting to further study their role in the viral mechanism of host infection. Table 5 shows some of the highest ranking pairs of human proteins across biological process, cellular component, and molecular function ontology scores for the structural viral proteins in the consensus alignment, in descending order of average score. See the Supplementary material for more details.

We observed that, based on current knowledge, SARS-CoV-1 and SARS-CoV-2 share only 6 human proteins in their virus-host protein-protein interaction networks. On the one hand, aligned viral proteins in the consensus alignment obtained with our method show a sequence similarity of over 75% on average, and most of the SARS-CoV-1 proteins are aligned with SARS-CoV-2 proteins that belong to the same category (spike, envelope, membrane, nucleocapsid, and the various non-structural and accessory proteins) in the genome organization of the viruses. On the other hand, the proposed alignment method does not preserve the virus-host interactions.
This suggests that these viruses, despite their classification as human pathogens within the Coronaviridae family, do not follow the same detailed mechanism of host infection. We believe that further research on these aligned human proteins with high molecular function ontology scores will help to elucidate the viral mechanism of infection and replication, which is necessary to accomplish the goal of antiviral drug or vaccine discovery.

Table 5. Human proteins that interact with SARS-CoV-1 and SARS-CoV-2 structural proteins, whose alignment would preserve virus-host interactions.

References
[1] VirHostNet 2.0: Surfing on the web of virus/host molecular interactions data.
[2] The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases.
[3] UniProt: A worldwide hub of protein knowledge.
[4] A SARS-CoV-2 protein interaction map reveals targets for drug repurposing.
[5] The species Severe acute respiratory syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2.
[6] The genome sequence of the SARS-associated coronavirus.
[7] Interactome networks and human disease.
[8] Computational analysis of protein interaction networks for infectious diseases.
[9] Alignment of protein interaction networks by integer quadratic programming.
[10] PINALOG: A novel approach to align protein interaction networks: implications for complex detection and function prediction.
[11] PathBLAST: A tool for alignment of protein interaction networks.
[12] HubAlign: An accurate and efficient method for global alignment of protein-protein interaction networks.
[13] Lagrangian graphlet-based network aligner.
[14] AligNet: Alignment of protein-protein interaction networks.
[15] Alignment of biological networks by integer linear programming: Virus-host protein-protein interaction networks.
[16] AMPL: A Modeling Language for Mathematical Programming. Cengage Learning.
[17] A general method applicable to the search for similarities in the amino acid sequence of two proteins.
[18] Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics.
[19] GOnet: A tool for interactive Gene Ontology analysis.
[20] GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms.