key: cord-255755-5jccb3nh authors: Saha, Sovan; Chatterjee, Piyali; Basu, Subhadip; Nasipuri, Mita title: Detection of spreader nodes and ranking of interacting edges in Human-SARS-CoV protein interaction network date: 2020-04-23 journal: bioRxiv DOI: 10.1101/2020.04.12.038216 sha: doc_id: 255755 cord_uid: 5jccb3nh The entire world has recently witnessed the commencement of coronavirus disease 19 (COVID-19) pandemic. It is caused by a novel coronavirus (n-CoV) generally distinguished as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). It has exploited human vulnerabilities to coronavirus outbreak. SARS-CoV-2 promotes fatal chronic respiratory disease followed by multiple organ failure which ultimately puts an end to human life. No proven vaccine for n-CoV is available till date in spite of significant research efforts worldwide. International Committee on Taxonomy of Viruses (ICTV) has reached to a consensus that the virus SARS-CoV-2 is highly genetically similar to Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) outbreak of 2003. It has been reported that SARS-CoV has ∼89% genetic similarities with n-CoV. With this hypothesis, the current work focuses on the identification of spreader nodes in SARS-CoV protein interaction network. Various network characteristics like edge ratio, neighborhood density and node weight have been explored for defining a new feature spreadability index by virtue of which spreader nodes and edges are identified. The selected top spreader nodes having high spreadability index have been also validated by Susceptible-Infected-Susceptible (SIS) disease model. Initially, the proposed method is applied on a synthetic protein interaction network followed by SARS-CoV-human protein interaction network. Hence, key spreader nodes and edges (ranked edges) are unmasked in SARS-CoV proteins and its connected level 1 and level 2 human proteins. The new network attribute spreadability index along with generated SIS values of selected top spreader nodes when compared with the other network centrality based methodologies like Degree centrality (DC), Closeness centrality (CC), Local average centrality (LAC) and Betweeness centrality (BC) is found to perform relatively better than the existing-state-of-art. The pandemic COVID-19 registered its first case on 31 December 2019 [1] . It laid its foundation in the Chinese city of Wuhan (Hubei province) [2] . Soon, it made several countries all over the world [3] as its victim by community spreading which ultimately compelled World Health Organization (WHO) to declare a global health emergency on 30 January 2020 [4] for the massive outbreak of COVID-19. Owing to its expected fatality rate, which is about 4%, as projected by WHO [5] , researchers from nations, all over the world, have joined their hands to work together to understand the spreading mechanisms of this virus SARS-CoV-2 [6] - [9] and to find out all possible ways to save human lives from the dark shadow of COVID-19. Coronavirus belongs to the family Coronaviridae. This single stranded RNA virus not only affects humans but also mammals and birds too. Due to coronavirus, common fever/flu symptoms are noted in humans followed by acute respiratory infections. Nevertheless SARS-CoV-2 is under the same Betacoronavirus genus as that of MERS and SARS coronavirus [12] . It comprises of several structural and non-structural proteins. The structural proteins includes the envelope (E) protein, membrane (M) protein, nucleocapsid (N) protein and the spike (S) protein. Though SARS-CoV-2 has been identified recently but there is an intense scarcity of data as well as necessary information needed to gain immunity against SARS-CoV-2. Studies has revealed the fact that SARS-CoV-2 is highly genetically similar to SARS-CoV based on several experimental genomic analyses [9] , [12]- [14] . This is also the reason behind the naming of SARS-CoV-2 by International Committee on Taxonomy of Viruses (ICTV) [15] . Due to this genetic similarity, immunological study of SARS-CoV may lead to the discovery of SARS-CoV-2 potential drug development. In the proposed methodology, Protein-protein interaction network (PPIN) has been used as the central component in identification of spreader nodes in SARS-CoV. PPIN has been found to be very effective module for protein function determination [16] - [18] as well as in the identification of central/essential or key spreader nodes in the network [19] - [23] . Compactness of the network and its transmission capability is estimated using centrality analysis. Anthonisse et al. [23] proposed a new centrality measure named as Betweenness Centrality (BC) . BC is actually defined as the measure of impact of a particular node over the transmission between every pair of nodes with a consideration that this transmission is always executed over the shortest path between them. It is defined as: where ( , ) is the total number of shortest paths from node to node and ( , , ) is the number of those paths that pass through . Another centrality measure, called as closeness centrality (CC), was defined by Sabidussi et al. [24] . It is actually a procedure for detecting nodes having efficient transmission capability within a network. Nodes which have high closeness centrality value are considered to have the shortest distance to all available nodes in the network. It can be mathematically expressed as: where | | denotes the number of neighbors of node u and ( , ) is the distance of the shortest path from node to node . Two other important centrality measures: Degree centrality (DC) [19] and Local average centrality (LAC) [21] are also found to be very effective in this area of research. DC is considered to be the most simplest among the available centrality measures which only count the number of neighbors of a node. Nodes having high degree is said to be highly connected module of the network. It is defined as: where | | hold the same meaning as state above. LAC is defined to be the local metric to compute the essentiality of node with respect to transmission ability by considering its modular nature, the mathematical model of which is highlighted as: where is the subgraph induced by and )* + , is the total number of nodes which are directly connected in . Due to high morbidity and mortality of SARS-CoV2, it has been felt that there is a pressing need to properly understand the way of disease transmission from SARS-CoV-2 PPIN to human PPIN [16] - [18] . But since SARS-CoV-2 PPIN is still not available, SARS-CoV PPIN is considered for this research study due to its high genetic similarity with SARS-CoV-2. In the proposed methodology, at first SARS-CoV-Human PPIN (up to level 2) is formed from the collected datasets [25] - [27] . Once it is formed, spreader nodes are identified in each of SARS-CoV proteins, its level 1 and level 2 of human network by the application of a new network attribute i.e. spreadability index which is a combination of three terminologies: 1) edge ratio [28] 2) neighborhood density [28] and 3) node weight [29] . The detected spreader nodes are also validated by the existing SIS epidemic disease model [30] . Then the edges connecting two spreader nodes are ranked based on the average of spreadability index of spreader nodes themselves to access to the spreading ability of the corresponding edge. The ranked edges thus highlight the path of entire disease propagation from SARS-CoV to human level 1 and then from human level 1 to human level 2 proteins. Spreader nodes and edges both play a crucial role in transmission of infection from one part of the network to another. Generally in a disease specific PPIN models, at least two entities are involved: one is pathogen/Bait and the other is host/Prey [31] . In this research work, SARS-CoV takes the role of the former while human later. SARS-CoV first transmits infection to its corresponding interaction human proteins which in turn affect its next level of proteins. So, the transmission occurs through connected nodes and edges. Not all the nodes or proteins in the PPIN transmit infection. So, proper identification of nodes transmitting infection (spreader nodes) is required. It is simultaneously also true that the transmission is not possible without the edges connecting two spreader nodes. Thus these connecting edges are called spreader edges. The proposed methodology involves a proper study and assessment of various existing established PPIN features followed by the identification of spreader nodes, which has been also verified by SIS model. Ego network of node (-) [28] is defined as the grouping of node itself along with its corresponding level 1 neighbors and interconnections. N (-) [28] consists of the set of nodes which belong to the ego network,i.e. { } ∪ 1( ). Edge ratio of node [28] is defined by the following equation: where 2 I 9 : is the total number of outgoing edges from the ego networkand 2 J 9 : is the total number of interactions among node 's neighbors, respectively. Γ ( ) denotes the level 1 neighbors of node .is considered to be Ego network. 1 9 : (K) denotes node K's neighbors which belongs -. In the edge ratio, 2 I 9 : is positively related to the non-peripheral location of node . Large number of interactions resulting from the ego network actually denotes that the node has high level of interconnectivity between its neighbors. On the other hand, 2 J 9 : is negatively related to the inter-module location of node . It represents the fact that the interconnectivity between neighbors is usually connected to the number of structural holes available around the node. When neighbor's interconnectivity is low, the root or the central node gain more control of the flow of transmission among the neighbors. Jaccard dissimilarity [32] of node and K ( L M43 N ( , K)) is defined as: where |Γ(i) ∩ Γ(j) | refers to the number of common neighbors of neighbors of and K. |Γ(i) ∪ Γ(j)| is the total number of neighbors of and K. Similarity between two nodes is determined by Jaccard similarity based on their common neighbors. The similarity degree between and K is considered to be more when they have more number of common neighbors. Whereas, when dissimilarity between the neighbors of a node is high, it guarantees that the only common node among the neighbors is the central node, which is termed as structural hole situation [28] . The neighborhood diversity [28] is defined as: When a node's neighborhood density reaches its greatest value, it reveals the fact that the neighbors have no other closer path. Hence, the neighbors should transmit or communicate through this node. Node weight Y of node ∈ Z in PPIN [29] is interpreted as the average degree of all nodes in [ \ ] , a sub-graph of a graph [ \ . It is considered as another measure to determine the strength of connectivity of a node in a network. Mathematically, it is represented by where, Z" is the set of nodes in [ \ ] . | Z dd | is the number of nodes in [ \ ] . And )*( ) is the degree of a node ∈ Z". Spreadability index of node is defined as the ability of node to transmit an infection from one node to another. Mathematically it can be defined as: Nodes having high spreadability index are termed as spreader nodes i.e. these nodes have the capability to transmit the infection to its maximum number of interconnected neighbors in a much short amount of time in comparison to the other nodes in PPIN. Figure 1 Synthetic PPIN1 consisting of 33 nodes and 53 edges where each node represents a protein and its interaction is represented by edges. Figure 1 represents a sample PPIN where each protein is denoted as nodes while its interaction by edges. In this network, spreadability index is computed by using basic network characteristics as stated earlier which has been highlighted and compared to DC, BC, CC and LAC in Table 1 to Table 5 . 8 3 15 11 3 16 12 3 17 13 3 18 16 3 19 15 3 20 20 3 21 32 3 22 10 2 23 14 2 24 18 2 25 6 2 26 28 2 27 29 2 28 30 2 29 25 2 30 26 2 31 27 2 32 31 2 33 33 2 In Figure 1 , it can be observed that nodes 1, 24 are clearly the most essential spreaders. Node 1 connects the four densely connected modules of the network which turns this node to stand in the first position having highest spreadability index by our proposed methodology. This node has been correctly ranked by all the methods except LAC and DC. Node 24 though has moderate edge ratio and node weight, but is one of the most densely connected modules itself in spite of getting isolated from the main network module of node 1. Moreover, node 24 has the highest neighborhood density. It establishes the fact that the only path of transmission for nodes 26, 27, 25, 28, 29, 30, 31, 32 and 33 is node 24. Thus if node 24 gets affected, then all the connected nodes with it will be immediately affected due to the lack of the connectivity of the neighbors with other central nodes. So, node 24 holds the second position with respect to spreadability index in our proposed methodology. Node 24 is not properly identified as the second most influential spreader nodes by the other methods. Further assessment of the remaining nodes highlights the fact that the performance of the new attribute spreadability index in our proposed methodology is relatively better in comparison to the others. The SIS Epidemic Model [30] is Table 1 to Table 5 that the proposed methodology has the highest SIS infection rate of 2.46 (see Table 1 ) in comparison to others for their corresponding top 10 spreader nodes in the synthetic network as shown in Figure 1 . To show the ranking of interacting spreader edges, another synthetic PPIN 2 has been considered as network 1 while the previous synthetic PPIN 1 in section 2.2 ( Figure 1 ) has been represented as network 2 in Figure 2 . Node D, E and F are the selected top spreader nodes in network 1 by spreadability index in the same way as that of synthetic PPIN 1 in Figure 1 while to avoid the complexity in the diagram, top 5 nodes in network 2 (see Table 1 ) are selected as spreader nodes in network 2. Red colored edges are the interconnectivities between the nodes of network 1 while black colored edges show the interconnectivity between nodes of network 2. Green colored spreader edges (i.e. edges connected with spreader nodes) show the interconnectivity between network 1 and 2. Ranking of a spreader edge measures the effectiveness of transmission ability of a spreader edge i.e. how many interactive nodes gets infected through that edge. Thus all the spreading edges are ranked based on the average of the spreadability index of its connected spreader nodes. The ranked spreader edges in Figure 2 are highlighted in Table 6 . Figure 2 Ranking of spreader edges between two interconnected PPIN based on spreadability index. Thickness of the edges vary with the order of ranking. Table 6 Ranking of spreader edges for network 1 and network 2 in Figure 2 Spreader Edges The proposed methodology leads to the identification of spreader nodes and edges through a network characteristic, called spreader index which has been also checked and validated by SIS model. Initially the whole working module is implemented on synthetic networks as shown in Figure 1 and Figure Figure 3 . In Figure 3A , at first SARS-CoV PPIN is displayed in which each protein is marked in red. Thereafter spreader nodes in SARS-CoV PPIN are identified by spreadability index which are denoted as blue nodes among red. Once the spreader nodes are active (in Figure 3B) , it transmits the infection to its corresponding direct partners i.e. human level 1proteins (marked in green). In Figure 3C , spreader nodes are identified in SARS-CoV level 1 human proteins (marked in blue) and the disease will be transmitted further. The same will continue till SARS-CoV level 2 human proteins in which green nodes are the spreaders and thus the infection will penetrate further in human PPIN resulting in a significant fall in human immunity level followed by severe acute respiratory syndrome. In Figure 4 , the PPIN of SARS-CoV network has been highlighted. There are mainly 9 proteins which include E, M, ORF3A, ORF7A, S, N, ORF8A, ORF8AB, and ORF8B. The computed spreadability index of each of this protein and its corresponding validation by SIS model is highlighted in Table 7 . It is also compared with other central/ influential spreader node detection methodologies like DC, CC, LAC and BC which are shown in Table 8 , Table 9 , Table 10 and Table 11 respectively. Similarly spreader nodes are also identified in SARS-CoV's level 1 neighbors and level 2 neighbors (see Figure 5 and Figure 6 ). Spreadability index definitely plays an important role in this proposed methodology. In fact, spreader nodes are successfully identified by the virtue of this scoring technique which covers all the aspects of transmitting infection from one node to another in a PPIN. It should be mentioned here that while identifying spreader nodes in SARS-CoV level 2 human proteins, it has been noted that the number of nodes are getting increased largely with the increment of successive levels. So, high, medium and low threshold [33] have been applied and the entire disease transmission through spreadability index is computationally assessed at each threshold. The network statistics of spreader nodes at each level of threshold is shown in Table 12 . It can be observed that threshold application is only implemented at SARS-CoV level 2 human proteins not on others. This is because of the availability of very less number of nodes and edges. Only nodes and edges having extremely low spreadability index have been discarded at the first level. Beside identification of spreader nodes, spreader edges are also identified. The ranked edges between SARS-CoV spreaders and its level 1 human spreaders is highlighted in Table 13 while the ranked edges between SARS-CoV s level 1 and level 2 human spreaders at high, medium and low threshold are highlighted in A- Table 1 , A- Table 2 and A- Table 3 in the appendix respectively. The entire network view of SARS-CoV and human PPIN has been generated online under three circumstances: 1) All the nodes and edges are considered as spreader nodes and edges respectively and are ranked accordingly. In the above generated network views, the blue, yellow and green color represent SARS-CoV spreaders, its level 1 human spreaders and its level2 human spreaders respectively. The remaining nodes are in indigo. Spreadability index is thus proved to be effective in the detection of spreader nodes and edges in SARS-CoV-human PPIN along with the cross validation by SIS model. Spreader nodes are the most critical nodes in the network which actually transmit the infection to its successors. Simultaneously it is also true if the spreader nodes are not connected with spreader edges infection transmission would not have been possible. In a nutshell, it can be said that the proposed work exploits the possibility of understanding the entire disease propagation/transmission from SARS-CoV network to human network. It should be borne in mind that SARS-CoV2 is ~89% genetically similar to its predecessor SARS-CoV [34] , [35] . It strongly reveals the fact that the human proteins chosen as spreaders of SARS-CoV might be the potential targets of SARS-CoV 2 also. If our future research work reveals this assumption, then it will definitely explore a new direction in the identification of essential drugs/vaccine for SARS-CoV2. [10] "WHO | Update 49 -SARS case fatality ratio, incubation period." https://www.who.int/csr/sars/archive/2003_05_07a/en/ (accessed Apr. 06, 2020). [11] "WHO | Middle East respiratory syndrome coronavirus (MERS-CoV)." https://www.who.int/emergencies/mers-cov/en/ (accessed Apr. 06, 2020). [15] "Naming the coronavirus disease (COVID-19) and the virus that causes it." https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-thecoronavirus-disease-(covid-2019)-and-the-virus-that-causes-it (accessed Apr. 06, 2020). Table 2 Ranked spreader edges between SARS-CoV s level 1 and level 2 human proteins at medium threshold World-Health-Organization Coronavirus disease (COVID-19) outbreak A novel coronavirus outbreak of global health concern World Map | CDC Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV) statement-on-the-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-outbreak Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China Data sharing and outbreaks: best practice exemplified Potential inhibitors for 2019-nCoV coronavirus M protease from clinically approved medicines A pneumonia outbreak associated with a new coronavirus of probable bat origin Protein function prediction from dynamic protein interaction network using gene expression data Protein function prediction from proteinprotein interaction network using gene ontology based neighborhood analysis and physico-chemical features Funpred 3.0: Improved protein function prediction using protein interaction network Lethality and centrality in protein networks Centers of complex networks A local average connectivity-based method for identifying essential proteins from the network level High-betweenness proteins in the yeast protein interaction network The rush in a directed graph The centrality index of a graph The SARS-Coronavirus-host interactome: Identification of cyclophilins as target for pan-Coronavirus inhibitors Large-Scale Analysis of Disease Pathways in the Human Interactome Identifying influential spreaders based on edge ratio and neighborhood diversity measures in complex networks Detecting overlapping protein complexes in PPI networks based on robustness The mathematical theory of infectious diseases and its applications Analysis of protein targets in pathogen-host interaction in infectious diseases: a case study on Plasmodium falciparum and Homo sapiens interaction network THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE A method for predicting protein complex in dynamic PPI networks Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan Authors are thankful to the CMATER research laboratory of the Computer Science Department, Jadavpur University, India, for providing infrastructure facilities during progress of the work. This project is partially supported by the CMATER research laboratory of the Computer Science and Engineering Department, Jadavpur University, India, and DBT project (No.BT/PR16356/BID/7/596/2016), Ministry of Science and Technology, Government of India.