key: cord-261959-pvufajw4 authors: Karathanou, Konstantina; Lazaratos, Michalis; Bertalan, Éva; Siemers, Malte; Buzar, Krzysztof; Schertler, Gebhard F.X.; del Val, Coral; Bondar, Ana-Nicoleta title: A graph-based approach identifies dynamic H-bond communication networks in spike protein S of SARS-CoV-2 date: 2020-09-10 journal: J Struct Biol DOI: 10.1016/j.jsb.2020.107617 sha: doc_id: 261959 cord_uid: pvufajw4 Corona virus spike protein S is a large homo-trimeric protein anchored in the membrane of the virion particle. Protein S binds to angiotensin-converting-enzyme 2, ACE2, of the host cell, followed by proteolysis of the spike protein, drastic protein conformational change with exposure of the fusion peptide of the virus, and entry of the virion into the host cell. The structural elements that govern conformational plasticity of the spike protein are largely unknown. Here, we present a methodology that relies upon graph and centrality analyses, augmented by bioinformatics, to identify and characterize large H-bond clusters in protein structures. We apply this methodology to protein S ectodomain and find that, in the closed conformation, the three protomers of protein S bring the same contribution to an extensive central network of H-bonds, and contribute symmetrically to a relatively large H-bond cluster at the receptor binding domain, and to a cluster near a protease cleavage site. Markedly different H-bonding at these three clusters in open and pre-fusion conformations suggest dynamic H-bond clusters could facilitate structural plasticity and selection of a protein S protomer for binding to the host receptor, and proteolytic cleavage. From analyses of spike protein sequences we identify patches of histidine and carboxylate groups that could be involved in transient proton binding. The surface of the Severe Acute Respiratory Syndrome (SARS)-CoV-2 virion is decorated with large membrane-anchored spike proteins S (Figure 1 ) that bind to Angiotensin Converting Enzyme 2 (ACE2) receptor of the host cell (Briefing, 2020; Hoffmann et al., 2020; Li et al., 2003; Xiao et al., 2003; Zhou et al., 2020) . Interactions between the spike protein and the host receptor, and large-scale structural rearrangements of the spike protein, are essential for virus entry (Belouzard et al., 2012) . Protein conformational plasticity, as required for large-scale rearrangements, may originate in weak hydrogen(H) bonds (Joh et al., 2008) and dynamic H-bond clusters (Bondar and White, 2012) . Here, we present a graphbased methodology to characterize hydrogen(H)-bond clusters in protein S and in protein S -ACE2 interaction complexes, and use this methodology to identify interaction networks with potential role in selecting spike S protein conformations for receptor binding and proteolytic activation. Corona spike glycoprotein S (Figure 1 ) is arranged as a homotrimer ( Figure 1B ), each protomer consisting of a large N-terminus ectodomain that contains domains S1 and S2 ( Figure 1A ). S1 contains the Receptor Binding Domain (RBD) (Babcock et al., 2004; Xiao et al., 2003 ) that binds to ACE2 via an extended loop denoted as the Receptor Binding Motif (RBM) (Graham and Baric, 2010; Li, 2013) . The S2 domain includes the fusion peptide (FP), the two Heptad Repeats HR1 and HR2 (also denoted as helical regions HR-A and HR-B (Colman and Lawrence, 2003) ), a TM segment that anchors the protein into the viral envelope, and a relatively short C-terminus segment exposed to the interior of the virion ( Figure 1A ). Exposure of the fusion peptide requires two proteolytic cleavage events, first at the boundary between S1 and S2 (the S1/S2 site, Figure 1A) , and subsequently at the S2' site within the S2 domain (Belouzard et al., 2009; Millet and Whittaker, 2014; Walls et al., 2017) . Main conformations of spike proteins are typically described in terms of a native state (prior to proteolytic cleavage), pre-hairpin (after receptor binding), hairpin, and post-fusion (Eckert and Kim, 2001; Xu et al., 2004) . Upon binding to the host receptor, protein S undergoes a conformational change that reduces the height of the spike by ~10Å (Beniac et al., 2007) . In the pre-hairpin intermediate of the HIV envelope protein, the transmembrane subunit gp41 spans the membranes of both pre-fusion conformation (Wrapp et al., 2020) one of the protomers with the RBD is in the up conformation ( Figures 1B, S2) . Similar structural dynamics, whereby the inactive conformation of protein S is three-fold symmetrical with all three RBDs in a down orientation incompatible with receptor binding, and the active-like conformation is asymmetric, with one RBD in up orientation compatible with receptor binding, were observed before for SARS-CoV (Gui et al., 2017) . In a cryo-EM study of a trimeric SARS-CoV protein S ectodomain with mutations that stabilize the pre-fusion conformation, the majority of the protein was captured with one RBD up, a lesser amount with two RBDs up, and a minority of the proteins with all three RBDs in the up conformation (Kirchdoerfer et al., 2018) . The binding interface between SARS-CoV-2 RBD and ACE2 has a major contribution from polar interactions (Wang et al., 2020a; Yan et al., 2020) , and the more extensive interactions between ACE2 and SARS-CoV-2 RBD as compared with SARS-CoV RBD might contribute to 4-fold higher binding affinity measured for the former complex (Wang et al., 2020a) . Whether and how conformational dynamics at the RBD couples to dynamics elsewhere in the protein is largely unclear. Binding of the RBD to the receptor, and the RBD up conformation, may facilitate transition to the post-fusion conformation of protein S, for which weak interactions between protomers of protein S at the S2 domain, mediated largely by polar groups, might be important (Li et al., 2019) . Downstream the sequence of the RBD, the D614G mutation became prevalent during the corona pandemic (Korber et al., 2020) , but functional implications of the mutation remain controversial. Lack of inter-protomer H-bonding between D614 and T859, as observed in the cryo-EM structure of the pre-fusion conformation (Wrapp et al., 2020) , was suggested could impact local flexibility and glycosylation of N616 (Korber et al., 2020) . Since the D614 site is non-critical for antibody binding, an impact of the D614G mutation on vaccine development was suggested to be unlikely (Grubaugh et al., 2020) . Studies with simpler protein models suggest dynamic H-bonds and H-bond clusters shape protein conformational plasticity. Individual inter-helical H-bonds bring modest contributions to protein stability, on the order of ~0.6kcal/mol on the average (Joh et al., 2008) , though more significant contributions of up to 3.5-5.6kcal/mol, were measured for salt-bridges (Hong et al., 2006) . When located close to each other in a cluster, H-bonding protein sidechains can engage in clusters of dynamic H-bonds help that stabilize protein conformations and shift populations during the reaction cycle of the protein (Bondar and White, 2012) . A protein complex as large as the spike protein S (Figure 1) , and for which experimental information about functional roles of specific amino acid residues is somewhat scarce, brings about the challenge of how to identify intra-molecular interactions that could shape conformational dynamics. To tackle this challenge, here we compute H-bond interaction networks, and then rank H-bonding groups according to centrality measures. The usefulness of using centrality measures for interaction networks was illustrated by earlier works indicating high centrality values for amino acid residues that are conserved and/or are located at enzyme active sites (Amitai et al., 2004) and a negative correlation between centrality and evolutionary rate (Fokas et al., 2016) ; more recently, centrality measures were used to identify functional interactions in a membrane transporter (Harris et al., 2020) . To derive clues about intra-molecular interactions with potential role in shaping structural dynamics of protein S, we computed two-dimensional graphs of all Hbonds of protein S in structures proposed for the closed, open, and pre-fusion conformation, and for ACE2 bound to an RBD fragment. We then used centrality measures to identify amino acid residues with central roles for the connectivity within local H-bond clusters. As some of the H-bond clusters we identified are large, we refined centrality computations by considering unique paths between all Hbonding groups of each cluster. We find that the closed conformation of the ectodomain of protein S hosts an extensive, central cluster of H-bonds, contributed symmetrically by the three protomers. Relatively close to the ACE2 binding interface, each protomer of the closed conformation has H-bond clusters contributed by the same groups. In the open and, even more in the pre-fusion conformation, this symmetry of H-bond clusters is largely perturbed. In structures of the ACE2-RBD complex, four clusters of H-bonds mediate the binding interface. Taken together, the analyses presented here suggest the reaction coordinate of protein S includes rearrangements of extensive H-bond clusters, such that a three-fold compositional symmetry that characterizes the closed conformation of protein S is lost in the open and pre-fusion conformations. A H-bond network that extends deeply across the interface between protein S and ACE2 potentially contributed to the strong binding affinity. Datasets of spike-like proteins used for bioinformatics analyses. We prepared two sets of sequences of spike protein S. Set-A consists of protein sequences from various organisms, and thus can show relatively large variations in the amino acid sequence; Set-B contains sequences of SARS-CoV-2 isolated from human hosts. We extracted and curated the sequences as summarized below. Protein S sequences for Set-A were extracted by performing blastp and blastx (Altschul et al., 1997) database searches against the SARS-CoV-2 Database hosted at the NCBI (National Center for Biotechnology Information, accessed March 26, 2020). For Set-B sequences we used the Virus Variation Resource, VVR (Hatcher et al., 2017) to extract proteins available for SARS-CoV-2 genomes, selected manually sequences of S proteins according to the database annotation, and removed partial hits. For the both Set-A and Set-B, redundant sequences were removed automatically using a threshold of 100% similarity such that, when two sequences were identical, only one was kept. The resulting datasets included 48 (Set-A) and 14 (Set-B) protein sequences. We aligned sequences of S proteins from Set-A and Set-B separately using MAFFT (Katoh and Toh, 2008; Katoh and Standley, 2013) . Each alignment was manually inspected and curated, and figures of sequence alignments were prepared using Easy Sequencing in PostScript, ESPript 3.x (Robert and Gouet, 2014) . These software packages were used for all sequence alignments described below. Likewise, we inspected and hand curated all sequence alignments. Sequence region corresponding to SARS-CoV-2 RBD. To inspect the sequence variation in the region corresponding to the RBD of SARS-CoV-2 protein S we started with the sequence alignment for Set-A generated as described above, and extracted the sequences corresponding to SARS-CoV-2 RBD groups R319 to F541 . Sequence regions corresponding to SARS-CoV-2 RBD were then combined into a single multifasta file; identical sequences were removed. The resulting sequences were realigned with MAFFT using SARS-CoV-2 as reference. Dataset of ACE2 protein sequences. We prepared two sets of ACE2 sequences. Set-C consists of orthologue protein sequences from various organisms, which we extracted by performing blastp (Altschul et al., 1997) database searches against the NCBI non-redundant protein database (accessed March 29, 2020) . Set-D contains human sequences of ACE2 from the 1000 human Genome project (Auton and Brooks, 2015) ; for this set, we used the Ensembl project (Hunt et al., 2018) and extracted the different protein haplotypes existing in the 1000 Genome Project for ACE2 using the GRCh38 human genome assembly as reference. For both sets, we removed redundant sequences according to a threshold of 100% similarity such that, when two sequences were identical, only one was kept. The resulting Set-C and Set-D datasets included 46 and 22 ACE2 sequences, respectively. The length of a protein sequence is given by the total number of amino acid residues. To characterize the amino acid composition of S proteins we used the program pepstats from the Open Source suite EMBOSS (Rice et al., 2000) to calculate, for each sequence in Set-A, the total number of charged and polar groups grouped into i) Asp and Glu, which are negatively charged at standard protonation; ii) positively charged Arg and Lys; iii) His groups, which can be neutral or protonated; iv) polar groups Asn, Gln, Ser, Thr, Trp, and Tyr. Motif searches for protein S. We used Set-A of protein S sequences to identify motifs that include Asp, Glu, and His sidechains, i.e., motifs with groups that could change protonation depending on pH. To identify motifs of interest we first inspected the distribution of Asp, Glu, and His in the sequence of SARS-CoV-2 protein S. (Rice et al., 2000) and analyzed using own scripts. pre-fusion conformations. Spike protein S is a homotrimer of three polypeptide chains, or protomers, which for simplicity we label here as A, B, and C. For the starting coordinates of protein S ectodomain in the open-and closed-conformations we used, respectively, the cryo-EM structures PDB ID:6VYB (3.2Å resolution) and PDB:ID 6VXX (2.8Å resolution) (Walls et al., 2020) . For the pre-fusion conformation we used the cryo-EM structure PDB ID:6VSB (3.5Å resolution) (Wrapp et al., 2020) . Disulfide bridges were included as set in the Protein Data Bank (Berman et al., 2000) entry. In the closed conformation, each protomer (chain) has 12 disulfide bridges. In the open conformation, chains A and C have 12 disulfide bridges each, and chain B has 11 disulfide bridges. In the pre-fusion conformation, chains A and B have 12 disulfide bridges each, and chain C has 11. The number of amino acid residues for each of the structures used for analyses is reported in Table 1 . All structures were prepared by considering standard protonation for titratable amino acid residues, i.e., Asp/Glu are negatively charged, Arg/Lys are positively charged, and His groups are singly protonated on the N atom. For simplicity, and since our analyses focus on internal H-bond networks of the spike protein, sugar moieties were not included. The three cryo-EM structures used for analyses (see below) lack coordinates for water molecules. Coordinates for missing internal amino acid residues and H atoms of the proteins were generated using CHARMM-GUI (Jo and Kim, 2008; Lee et al., 2016) and CHARMM (Brooks et al., 1983) . Molecular graphics illustrating the protein segments for which we constructed coordinates are presented in Figure S3 . (Walls et al., 2020) a) Conformation of the protein as proposed in the publication referenced here. b) Number of amino acid residues in each chain after constructing coordinates for missing internal groups. Relative to the structure of the pre-fusion conformation, the cryo-EM structures of protein S in open and closed conformations report coordinates for one additional amino acid residue, S1147. Computations of the electrostatic potential surface were performed with the Adaptive Poisson Boltzmann Solver, APBS (Baker et al., 2001) , in PyMol 2.0 (Schrödinger, 2015) . We used the PDB2PQR web interface (Dolinsky et al., 2004) to assign partial atomic charges and atom radii according to the CHARMM force field (MacKerell Jr. et al., 1998) . Electrostatic potential computations were restricted to protein atoms. We analyzed H-bond networks in three structures of ACE2-protein S complexes as summarized in Tables 2 and S2. As computations of average H-bond graphs require the same number of amino acid residues in the graphs to be averaged, where needed we used Modeller 9.21 (Marti-Renom et al., 2000) to construct coordinates for missing amino acid residues. Lists of amino acid residues whose coordinates were constructed for each structure are presented in Table S1 . For all computations of graphs of H-bonds for ACE2-protein S complexes we used regions S19 to D615 of ACE2, and T333 to P527 for protein S. a) The protein S fragment is a chimera of the SARS-CoV-2 RBM, except for a RBM loop, and SARS-CoV for the remaining of the amino acid residues . b) The local resolution at the interface between ACE2 and the RBD is 3.5Å . To identify H-bonds, we used the same geometry-based criteria as in our previous work (Siemers et al., 2019) , whereby we consider that two groups are H-bonded when the distance between heavy atoms of the H-bonding groups is ≤ 3.5Å, and the angle between the acceptor heavy atom, the H atom, and the heavy donor atom is ≤ 60º. For all structures included here in analyses we computed H-bonds between protein sidechains, and between protein sidechains and backbone groups. Interactions between charged groups (Asp, Glu, Arg, Lys) and between polar sidechains are treated with the same H-bonding criteria. Tests for H-bond criteria. Analyses of H-bonding based on experimental electron densities used a set of H-heavy atom distances ranging from 1.56Å to 2.59Å, which is compatible with the 3.5Å heavy-atom to heavy-atom distance and 60º H-bond angle criteria we used (Espinosa et al., 1999) . To illustrate how the distance criteria may influence details of H-bond cluster, we recalculated interactions of three selected H-bond clusters by considering only heavy atoms, and used as H-bond criterion a distance 2.8Å, 3.0Å, 3.5Å, or 4.0Å ( Figure S4 ). As anticipated, the stricter the distance criterion used, the fewer the Hbonds in a cluster ( Figure S4 ). At a heavy-atom distance criterion of 3.5Å, H-bond clusters are slightly larger than clusters we reported with both distance and angle criterion ( Figure S4 ). We further used 10 coordinate snapshots from a simulation on the soluble protein SecA to calculate the total number of protein Hbonds with i) the combined criteria of ≤ 3.5Å distance between heavy atoms and ≤ 60º H-bond angle; ii) the criterion of a ≤ 2.5Å distance between the H atom and the heavy atom. We obtained for SecA a total of 220-236 H-bonds in test i), and 223-240 H-bonds in test ii), suggesting the combined 60º angle and 3.5Å heavy-atom distance criterion is largely equivalent to using a distance of 2.5Å between the heavy atom and the H atom. In two additional tests on SecA simulations that start from two different protein conformations , we used the combined 3.5Å distance and 60º-angle criterion to compute the frequency for all H-bonds present in starting crystal structures. We found that about 90-95% of Hbonds that appear as weak in a protein crystal structure, with the distance between the heavy atoms from 3.0Å to 3.5Å, are nevertheless sampled over the course of the two prolonged simulations, and almost 60-75% of the weak H-bonds are sampled during at least 10% of the two simulations. Pursuant to the considerations above, we suggest the combined ≤ 3.5Å distance and ≤ 60º angle H-bond criterion we used here is reasonable. A graph, or network, of H-bonds, consists of a collection of vertices (also denoted as nodes or points) and edges that inter-connect the vertices. In graph theory, each vertex can be assigned a degree that gives the number of edges incident on that vertex, such that in an undirected graph a vertex of degree 1 (usually denoted as a leaf) has only one edge (Bender and Williamson, 2010; Sadavare and Kulkarni, 2012; West, 1996) . The definition of leafs is related to that of a tree, which is a connected, acyclic graph whose edges are called branches, and whose nodes can be internal or leafs (Bender and Williamson, 2010) . In a tree graph, a node is denoted as a leaf when it has degree of 1 or in the case of a rooted tree, when it has no child nodes (Bender and Williamson, 2010) . The definition of leaf nodes can be used in the context of more general graphs that are not trees. Two nodes are denoted as adjacent if they directly connect to a common edge; two edges are denoted as adjacent edges if they share a node (Bender and Williamson, 2010) . Here we report graphs of H-bond interactions. These graphs have as nodes amino acid residues or heavy atoms that H-bond. The edges of the graph represent H-bonds, i.e., non-covalent interactions (Scheme 1). An H-bond path is defined as a continuous chain of H-bonds that connects two nodes (amino acid residues) of the graph. The shortest path between two nodes is the path that connects these two nodes via the least number of intermediate nodes (Cormen et al., 2009) . The path length L of a shortest-distance path is given by the number of H bonds that are inter-connected in that path (Scheme 1). Centrality measures and a new measure of the unique shortest paths. To assess connectivity within the network and rank the relative importance of nodes in a graph of H bonds, we used centrality measures as defined below (Scheme 1). The Degree Centrality (DC) of a node (or vertex) n i gives the number of edges (H-bonds) of the node (Freeman, 1979 ) (Scheme 1). The normalized DC value of node n i is computed by dividing its DC by the maximum possible edges to n i (which is N-1, where N is the number of nodes in the graph). In the case of H-bond graphs, the DC value of a protein group indicates the number of its unique direct H-bonds, thus providing information on the local H-bond environment of that protein group. The Betweenness Centrality (BC) of a node n i gives the number of shortestdistance paths between any two other nodes n j and n k that pass via node n i divided by the total number of shortest paths that connect n j and n k irrespective of whether they pass via node n i (Brandes, 2001; Freeman, 1977; Freeman, 1979 The cluster size  is given by the total number of nodes (H-bonding amino acid residues) of that cluster (Scheme 1C). A value of the cluster size  = 2 is thus equivalent with a H-bond. Histograms of the distribution of path lengths in a H-bond cluster give an overview of the H-bond connections within that cluster. We compute all shortest paths between all pairs of H-bonding groups (nodes) in all clusters of H-bonds. For each path, we store its length. We then select high-BC groups (nodes) n i of particular interest, and the shortest-distance H-bond paths that pass through that node n i . We calculate from these selected H-bond paths histograms that we use to illustrate the size of the network of node n i . Changes in nodes that delineate the periphery of a large H-bond cluster indicate rearrangements in that cluster. We denote these peripheral nodes as anchors or leafs (Scheme 3). For a given H-bond cluster identified from a Connected Components search for a root node n i , we compute all shortest paths from node n i to all other nodes in the Hbond cluster (Scheme 3A), construct a directed graph using all shortest paths whose edges point away from root node n i , and query this directed graph to identify all nodes n k that have only inward edges (Scheme 3B). Nodes n k are the marked anchor nodes of the cluster (Scheme 3C). The node density of a cluster, , is given by the number of anchor nodes divided by the total number of nodes of the graph. AP L , the distance between two anchors of the cluster, is given by the number of Hbonds that constitute the path that connects two anchor points passing via N c . Conserved H-bond networks in ACE2-RBD complexes. A node or an edge of a Hbond graph is conserved when present in all H-bond networks analyzed. We consider that nodes and edges of a graph are conserved when present in all chains of the four ACE2-protein S structures we analyzed (Table 2) . Thus, we excluded from the computations of conserved graphs 21 amino acid residues that are different between the chimera RBD from PBD ID:6VW1 , and the wild-type RBD from the other three structures we analyzed (Table 2) For each of the four ACE2-RBD structures (Table 2) we first identified all connected components of the H-bond network using the Networkx package (Hagberg et al., 2008) . In the second step, we selected those components that include direct Hbonds between amino acid residues of ACE2 and of the RBD. This led, for each structure, to 3-4 H-bond clusters that we labeled as a, b, c, and d. The assignment of specific amino acid residues to clusters a -d is the same for the four structures. We implemented a protocol that relies on computations of H-bond graphs and Table S2 ). In total, there are ~798-902 H-bonds for each conformation of protein S (Figure 2A , Table S2 ). These H We found that most of the H-bonding amino acid residues of protein S have rather small centrality values ( Figures 3C, 3D) , suggesting most of these groups lack extensive local connections or participation in large numbers of H-bond paths. Importantly, in all three conformations of protein S we analyzed, BC values ≥ 15 belong to groups with coordinates solved experimentally (Table S3, Figure S6 ) Table S3 . A scatter plot of the DC vs. BC values of selected amino acid residues in pre-fusion is presented in Figure S8 , and H-bond clusters of the open and closed conformations, in Figure S9 . In a static protein structure, the DC value of an amino acid residue is limited by To evaluate the relationship between BC value, cluster size, and cluster shape, we analyzed H-bond clusters that contain high-BC groups (Figures 3, 4, S9) , the distribution of BC values in clusters with  ≥ 2 (Figures 2, 4 , 5, S10), and the distribution of USP values as a function of the cluster size ( Figure S13 ). Highest BC values tend to be obtained for H-bonding groups that are part of large clusters (Figures S10, S11, S12). In a cluster with many H-bonding groups (large ) a group located centrally can be part of many shortest-distance H-bond paths, and hence have large BC. Low BC values, including BC = 0, can also be observed in clusters with relatively large  (Figures S10, S11); these are H-bonding groups located at the periphery of the cluster. In the case of large clusters, computations of BC values by accounting for all shortest paths can lead to very large BC values, e.g., in the closed conformation R1039 is part of a cluster with  = 33 and has BC = 218. The refined centrality computation we introduced here, USP, based on unique shortest paths, provides a qualitatively similar, but somewhat more intuitive picture of the connectivity within a cluster (Figures 4, S13) . The largest UBC value, of 30, is obtained for R1039 in the closed conformation ( Figures 4A, 4C) . Except for the PA_R509 cluster of pre-fusion, where R509 has UBC = 19 ( Figure 4E ), all other H-bond clusters have UBC values within 10, i.e., maximum 10 shortest paths can include one H-bond group (Figure 4 ). The size of H-bond clusters ( Figure 5) and BC values of specific H-bonding groups ( Figure 3) can vary significantly among the three conformations of protein S, indicating that conformational changes of protein S associate with structural rearrangements within clusters of H-bonds. We inspected closely H-bond clusters located at functionally important regions of protein S. D663 is relatively close in the sequence to the R685 S1/S2 cleavage site ( Figure 1A ). The pre-fusion protomer A, whose RBD is in the up conformation ( Figure Taken together, the analyses above indicate that conformational transitions of protein S associate with changes in H-bonding near the S1/S2 and S2' proteolytic cleavage sites. At these sites, compositional symmetry of H-bond clusters D663 and E819 is lost when the protein changes conformation from closed to pre-fusion. Figure S13 . A H-bond cluster of the RBD rearranges drastically during conformational dynamics of protein S. In SARS-CoV mutating to Ala the R495 (corresponding to SARS-CoV-2 R509) decreases binding to ACE2 (Chakraborti et al., 2005) (Table S4) . As summarized below, we find that R509 is part of three-fold symmetrical clusters in the closed conformation; smaller, largely symmetrical R509 H-bond clusters are present in two protomers of the open conformation; in pre-fusion, the symmetrical composition of the H-bond cluster is lost, with an extensive H-bond cluster observed only for the RBD up conformer (Figures 5, S16 ). Except for one protomer of pre-fusion, R509 is within H-bond distance of D442 (Figures 6, S16 G&H); mutating to Ala D442 (which corresponds to D429 in SARS-CoV-2) abolishes ACE2 binding of SARS-CoV protein S (Chakraborti et al., 2005) . In each protomer of the closed conformation, the D442-R509 H-bond is part of Hbond clusters constituted by the same groups including N448 and N450 (Figures 6, S16 A-C). Adjacent in the sequence, Y449 H-bonds to ACE2 (Lan et al., 2020) . In the open conformation, two protein conformers have the D442-R509 H-bond as part of relatively large and similar clusters ( Figure 5) , whereas in the third protomer this H-bond is singular ( Figure S16E ). The loss of three-fold symmetry of the R509 and D578 H-bond clusters in prefusion conformation could be important to enable binding of protein S to ACE2, and/or for proteolytic activation of protein S. Structural asymmetry in protomers was also observed in the crystal structure of the trimeric membrane transporter AcrB, being thought that each protomer was captured in a different conformation sampled during the reaction cycle of the transporter (Seeger et al., 2006) . The most prominent H-bond cluster of the closed conformation is located close to the stalk of the ectodomain: The R1039 cluster includes no fewer than 33 groups, 11 from each protomer (cluster CC_R1039 in Figures 4A, 6) . This is the largest H-bond cluster we identified for all three protein conformations (Figures 3, 4) . We denote the R1039 cluster of the closed conformation as the central H-bond cluster (Figures 6, S19 , S20). 6, S20 A,B) . This core network of R1039-E1031 branches out via S1037 (Figures 6, S19 A-B). In the closed conformation, each S1037 connects via H1048 to a local, intra-protomer H-bond cluster that includes R509 ( Figure 6 ). These core and local H-bond clusters are present, but disconnected, in the open conformation (Figures 6, S19 ). In pre-fusion, the three-fold compositional symmetry of the R1039 H-bond cluster is lost, as each R1039 interacts with E1031 of a different protomer ( Figures 6, S19 ). In two of the protomers, R1039 further connects to S1037 and, in just one of the protomers, S1037 still connects to H1048 (Figures S19 I-K, S20D). We thus found that the closed conformation of protein S has a remarkable three- Figure S21 . coronavirus largely depends on the affinity with which protein S binds to ACE2 (Li, 2013) . Structures of ACE2 bound to SARS-CoV-2 protein S fragments (Lan et al., 2020; Yan et al., 2020) , or to a chimera protein S fragment , identified H bonds and salt bridges between the RBD and ACE2 (Lan et al., 2020; Shang et al., 2020; Yan et al., 2020) . We wondered whether, instead of single Hbonds and salt-bridges, entire clusters of H-bonds, as we identified above for protein S, could mediate binding of protein S to ACE2, and computed graphs of H-bonds for ACE2-RBD complexes (Figures 9, S24 -S34) and centrality values (Tables S4, S5) . We found that the ACE2-RBD binding interface has 3-4 H-bond clusters, which we denote as the local interface clusters as a, b, c, and d (Figures 9, S24-S34 ). In Figure 9 we present H-bonds of the full-length ACE2 bound to the RBD, that are present in all ACE2-RBD structures we analyzed; each of these structures can have more H-bonds contributing to the interface clusters ( Figures S24-S29 ). Graph of H-bonds with labels for high-centrality groups that participate in conserved networks. Small spheres indicate C atoms, and are colored according to average centrality values. Pink lines represent inter-protein interactions. We label here only groups with high centrality, and give additional labels in Figure S29 ; the complete H-bond graph is presented in Figure S30 . In full-length ACE2 bound to RBD (PDB ID:6M17, chains B, E) cluster a has 16 RBD groups, and 10 ACE2 groups ( Figure 9C ). In structure PDB ID:6M0J cluster a is significantly larger, with 31 and 14 groups contributed by the RBD and ACE2, respectively ( Figure S29, S33B) . The larger size of the interface cluster a in the latter structure could be due the resolution being slightly higher (Table 2) , or to the protein conformation being different. The four interface clusters include groups whose functional role has been probed experimentally (Table S4 ), groups that have relatively high centrality values (Tables S5, S6) , and groups that are highly conserved ( Figure 9C , Tables S4, S7). N501 of SARS-COV-2 protein S is T487 in SARS-CoV protein S, in which the methyl group of the Thr is thought important for the binding of the RBD to ACE2 (Li et al., 2005) ; T487 is among the amino acid residues essential for the binding of SARS-CoV to ACE2. Y449 corresponds to Y436 of SARS-CoV protein S, and is conserved in SARS spike proteins that use ACE2 (Hoffmann et al., 2020) . N437 and D442 of the interface cluster a ( Figure 9C ) are also part of the high-centrality cluster R509 identified in the up protomer of the pre-fusion conformation of isolated protein S ( Figures 3A, 3B, 6) . In cluster b, Q493 is part of a H-bond network that includes ACE2 K31 ( Figure 6C ) a group considered a virus binding hot spot . This cluster contains two other ACE2 groups, and two from the RBD. Given the extensive H-bond network we observe for ACE2 ( Figure 9A ), and the large number of ACE2 groups that participate in interface H-bonding ( Figures 9C,D) , we wondered whether ACE2 groups that participate in interface H-bond clusters might reach the vicinity of the catalytic site of the enzyme, or groups known to be otherwise important for the functioning of ACE2. R273, H345, E375, H505, and Y515 delineate an inhibitor-binding site of ACE2 (Guy et al., 2005) , whereas H374, H378, and E402 coordinate the Zn 2+ ion (Towler et al., 2004) . According to our centrality computations, R273 is within 4 amino acid residues of the high-centrality group D269 ( Figures 9A,B) , and H378 has relatively high average DC value. H-bond clusters of R273, H345, H505, H378, H374, and E402 ( Figure S34 ) lack common H-bonds with the conserved interface clusters presented in Figure 9A . Nevertheless, the H378 H-bond cluster ( Figure 9B ) is relatively close to the interface ( Figures 9B, S34C ), N394 can be part of interface cluster a ( Figure S33B ), and N397 is part of the H405 cluster of ACE2 ( Figure S34B ). Thus, depending on the protein conformation, interface H bonding of ACE2 can extend deep into the protein, towards the active site. Together with the interface H-bond clusters (Figure 9 ), the overall picture that emerges is that the binding interface between the RBD and ACE2 is mediated by extensive H-bonding, which could explain the strong binding observed in experiments (Tai et al., 2020) . That an entire local, dynamic H-bond network, could be important for protein binding, was observed before for arrestin (Ostermeier et al., 2014) , and molecular dynamics simulations of an extrinsic subunit of the photosystem II complex indicated participation of protein sidechains in dynamic interaction clusters could be beneficial for protein binding (del Val and Bondar, 2017) : In a dynamic H-bond cluster, multiple H-bonds that break and reform rapidly provide structural plasticity, such that a protein sidechain essential for protein-protein binding can sample orientations compatible with binding; by contrast, when in the unbound state, i.e., prior to protein-protein binding, that sidechain is engaged in a stable interaction whose perturbation is energetically costly, the availability of the sidechain for new interactions needed to mediate binding may be reduced. Corona spike protein S sequences carry a significant net negative charge and contain patches with carboxylate and carboxylate-histidine motifs. Knowledge of amino acid residues that could transiently bind protons during the functioning of protein S is important, as protonation and changes in protonation can alter protein dynamics (del Val et al., 2014; Lazaratos et al., 2020) . In vitro experiments on the recombinant ectodomain of SARS-CoV suggested that, at pH between 5 and 6, monomers form trimers irreversibly (Li et al., 2006) . As fusion of protein S with the host cell can occur at neutral pH (Xiao et al., 2003) , and conformational changes of the ectodomain can occur independent of the pH (Walls et al., 2017) , it remains unclear whether or not protein S binds protons, and whether proton binding shapes conformational dynamics relevant to viral infectivity. In the future, high-resolution structures might inform on likely protonation states of specific protein groups and facilitate experiments and computations to probe pH sensitivity. To evaluate whether the sequence of protein S contain clues about sites where a proton might bind, here we analyzed sequences of spike proteins to calculate an estimated net charge and identify conserved sequence motifs of carboxylate and histidine groups. We found that the 47 protein S sequences included in Set-A have full length in the range of 1235-1363 amino acid residues ( Figure 10A , see also Supplementary Sequence Analyses). SARS-CoV and SARS-CoV-2 proteins S are among the shorter sequences of Set-A. All protein sequences have an estimated net negative charge between -3e and -25e ( Figure 10B ); for SARS-CoV-2 protein S, the net estimated charge is -7e. The shortest spike protein sequences, 1236 amino acid residues, are of the mouse hepatitis virus; this sequence also carries the smallest estimated charge. The four longest sequences, 1363 amino acid residues, are of corona viruses isolated from rabbit, deer, bovine, and horse; these are also the sequences with significant negative charges of -20e or -23e ( Figure 10B ). The large estimated net charges arise from the numerous charged and polar groups carried by the sequences (Figures S35, S36) , with ~100-118 Asp/Glu ( Figure S35A ), and ~85-105 Arg/ Lys ( Figure 35B ). Protein S is also subject to mutations that can affect Hbonding groups (Korber et al., 2020) (Table S8) , and thus its overall polarity. A wide range for the net charge carried by sequences of a protein from different organisms, as found here for spike proteins S, was observed before for other proteins (del Val and Bondar, 2017; del Val and Bondar, 2020) , suggesting the net estimated charge might be related to details of the cellular environments and/or protein interactions. The actual net charge of a protein will depend on protonation in the cellular environment in which the protein is found. When aligning the RBD of SARS-CoV-2 with the sequences of the other 47 protein S sequences from Set-A, we obtained regions with 142 -241 amino acid residues ( Figure S37A ) and an estimated net charge that tends to be positive ( Figure S37B ). The variation in the length of the region corresponding to SARS-CoV-2 RBD, and the associated variation in estimated charge, is due to numerous deletions and insertions of amino acid residues in other protein S sequences from Set-A (see Supporting Information Sequence Alignments). A slightly positive net charge of the RBD of protein S could be important for the binding of protein S to ACE2: The electrostatic potential surfaces of both SARS-CoV-2 protein S and ACE2 have patches that are predominantly negative or positive; at the binding interface, ACE2 exposes a predominantly negative surface, whereas the potential energy surface for the RBD is predominantly positive ( Figures S39-S43 ). The complementary electrostatic potential surfaces at the region where the RBD binds to ACE2 are compatible with the presence of multiple H-bonds and H-bond clusters we identified (Figure 9 ). Given the large number of carboxylate groups and net negative charge carried by protein S sequences, we sought to find out whether protein S has patches of carboxylate and histidine groups, as such patches are of potential interest for proton binding (Bondar and Lemieux, 2019; Checover et al., 2001; Gerland et al., 2020; Guerra and Bondar, 2015; Kemmler et al., 2019; Lorch et al., 2015; Shutova et al., 2007) . We found that the sequence of SARS-CoV-2 protein S has multiple positions where Asp, Glu or His groups are adjacent or separated by 1-2 amino acid residues; of the 17 His groups of the protein, 7 are within the C-terminal ~220 residue fragment ( Figures 10C, S38A) . Moreover, particular arrangements, or motifs, appear multiple times in the sequence of SARS-CoV-2 protein S ( Figures 10C, S38, S44A ). There are several positions with 2-3 carboxylates consecutive in the sequence; close to the C-terminus, there is a DEDSE motif ( Figure S44B ). Such carboxylate and carboxylate-histidine motifs appear to be a more general feature of the spike proteins. Of the 48 sequences from Set-A, only two lack any carboxylate-histidine motif; 2 vs. 3 motifs are present in 26 vs. 13 sequences ( Figure 10D ). Motifs with two consecutive carboxylate groups appear relatively frequently: 22 of the Set-A sequences have 8 -9 such motifs ( Figure 10D ). The more restrictive DELE or EELD, and EDDSE, motifs are present only once in 34 and 25 sequences, respectively ( Figure 10D ). SARS-CoV-2 protein S has a (D)EDDSE motif at the Ctail, immediately upstream the Cys-rich region ( Figure S38 ) where palmitoylation occurs ( Figure 1A ) (Petit et al., 2007) . Such a motif is also observed in SARS-CoV protein S, whereas several other coronavirus proteins S have here two carboxylates adjacent to a highly conserved Cys group (Petit et al., 2007) . It is unclear whether this carboxylate motif is related to palmitoylation or to another functional role. Further experiments and computations will be needed to ascertain whether the conserved carboxylate and carboxylate-histidine motifs we identified might be involved in transient proton binding, or whether they might instead contribute to, e.g., shaping protein dynamics and protein interactions. Binding of the coronavirus protein S to the host ACE2 receptor initiates a series of reactions that include large-scale structural rearrangements of protein S (Shulla and Gallagher, 2009) , changes in the structure and expression of the receptor, proteolitic cleavage followed by conformational change thought to assist viral entry (Kam et al., 2009) , local dehydration and ordering of the membrane upon interaction with the fusion peptide (Lai et al., 2017) , and culminating with membrane fusion and virus entry (Wang et al., 2020b) . This is a highly complex reaction coordinate whose molecular movie could facilitate the development of therapeutics. As a first step towards deciphering interactions that govern structural plasticity of SARS-CoV-2 protein S, here we focused on H-bonding and H-bond clusters, as Hbonding and H-bond clusters shape protein conformational dynamics (Bondar and White, 2012; Joh et al., 2008) , and H-bond networks are central to working models of long-distance conformational couplings in proteins (Karathanou and Bondar, 2018; Venkatakrishnan et al., 2019) . Using graph-based approaches to identify and characterize H-bond clusters of protein S (Schemes 1, 2, 3 The RBDs of the closed conformation host a H-bond cluster centered at R509, with the same groups contributing to the clusters of each protomer ( Figures 5A, 6, 11 ). In the vicinity of the S1/S2 cleavage site, each protomer has a H-bond cluster centered at D578 (Figures 5A, 6, 11 ). Both the R509 and D578 clusters are relatively large, with 8 and, respectively, 10 H-bonding groups ( Figure 6 ). Close to the stalk region of the ectodomain in closed conformation, the same 11 groups of each protomer participate in the central cluster, in which the core H-bond network of the E1031, S1037, and R1039 groups branches out via H1048 to local networks centered at R905 ( Figures 5A, 6, 11) . In the open and pre-fusion conformations, the three R509, D578, and R1039 clusters have rearranged significantly. The R509 H-bond cluster of the open conformation is reduced to 5-6 H-bonds, or to even one H-bond (Figures 6, 11 , S16). In pre-fusion, only the RBD up protomer has a R509 H-bond cluster ( Figures 5A-C, 6 , S16). This cluster includes N437 and N439, which are also part of the Hbond cluster that contributes to the binding interface between protein S and ACE2 ( Figures 9, 11) . The D578 H-bond cluster is reduced by about half in one of the protomers of the open conformation, and replaced by singular H-bonds in two protomers of the pre-fusion conformation (Figures 6, 11 , S17, S18). The central R1039 cluster of the open conformation is separated into the central core and the R905 branch; in pre-fusion the core H-bond cluster is absent, though R1039 still makes bridges to E1031 of another protomer (Figures 6, 11 , S19). Thus, three major H-bond clusters of protein S, which in the closed conformation are contributed by the same groups of each of the three protomers, rearrange drastically in the open and pre-fusion conformation, and are distinct in each of the conformers. Such structural rearrangement, whereby the three protomers of protein S experience different H-bonding, could facilitate conformational selection of a protomer for the binding to ACE2, and/or for proteolytic cleavage for activation. The three major H-bond clusters we identified for protein S, and the H-bond clusters at the RBD-ACE2 interface, include carboxylate and histidine groups, and sequences of protein S contain short patches of such groups ( Figures 10D) . Patches of closely spaced carboxylate groups could be involved in transient binding of protons (Checover et al., 2001; del Val and Bondar, 2017; Shutova et al., 2007) , and/or for enhanced local protein dynamics (del Val and Bondar, 2017) . Whether carboxylate-and histidine-containing H-bond clusters change protonation during the functioning of protein S, making the conformational dynamics of SARS-CoV-2 protein S sensitive to the local pH, is unclear. A sensitivity to pH was noted before for the spike protein of HA, which is thought to undergo a conformational change at low pH in the endosome (Millet and Whittaker, 2018) , low pH enabling transition of HA from the pre-fusion to the fusogenic conformation (Eckert and Kim, 2001) , and the fusion peptide to approach the host membrane (Bullough et al., 1998) . In the future, experiments and computations will be needed to derive accurate information on sites where protons might bind on the surface of protein S, and evaluate couplings between protonation change and protein dynamics. A caveat of our approach is that it lacks description of the hydrophobic interactions, which could bring important contributions to protein stability, protein binding interfaces, and membrane fusion. Moreover, hydrophobic packing of Hbonding sidechains shapes H-bond dynamics (del Val et al., 2014) . Our preliminary tests suggest, for example, that hydrophobic clusters are located close to the R509 and D578 H-bond clusters in the closed conformation of protein S. In the future, we plan to extend our graph-based approach to identify and dissect hydrophobic interactions and the interplay between H-bonding and hydrophobic interactions during the dynamics of protein S. The work presented here relied on static coordinate snapshots of protein S and of protein S fragments bound to ACE2. We anticipate that, as structures solved at increasingly higher resolution might become available, the conformational dynamics of protein S could be described more accurately, and that knowledge about the kinetics of transitions between intermediate conformations, and of protonation states of specific protein groups, will allow us to derive a more complete picture of the reaction coordinate of protein S in fluid, physiologic environments. The graph-based analyses presented here, combined with prolonged molecular dynamics simulations could enable estimations of the energetics associated with breaking and reforming of H-bonds within clusters of protein S, and of other large protein complexes. -We apply graph-based approaches to identify H-bond clusters in protein complexes -Three conformations of spike protein S have distinct H-bond clusters at key sites -Hydrogen-bond clusters could govern structural plasticity of spike protein S -Protein S binds to ACE2 receptor via H-bond clusters extending deep across interface Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Network analysis of protein structures identifies functional residues A global reference for human genetic variation Amino acids 270 to 510 of the severe acute respiratory syndrome spike protein are required for interaction with the receptor Electrostatics of nanosystems: application to microtubules and the ribosomes Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites Mechanism of coronavirus cell entry mediated by the viral spike protein Lists, Decisions and Graphs. With an Introduction to Probability University of California Conformational reorganization of the SARS coronavirus spike following receptor binding: implications for membrane fusion The Protein Data Bank Hydrogen bond dynamics in membrane protein function Reactions at membrane interfaces Severe acute respiratory syndrome coronavirus (SARS-CoV) infection inhibition using spike protein heptat repeat-derived peptides A faster algorithm for betweenness centrality Anatomy of a killer CHARMM: a program for macromolecular energy, minimization, and dynamics calculations Structure of influenza haemagglutinin at the pH of membrane fusion The SARS coronavirus S glycoprotein receptor binding domain: The mapping and functional characterization Dynamics of the proton transfer reaction on the cytoplasmic surface of bacteriorhodopsin The structural biology of type I viral membrane fusion Charged groups at binding interfaces of the PsbO subunit of photosystem II: A combined bioinformatics and simulation study Diversity and sequence motifs of the bacterial SecA protein motor Coupling between inter-helical hydrogen bonding and water dynamics in a proton transporter PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations Mechanisms of viral membrane fusion and its inhibition Topological analysis of the electron density in hydrogen bonds Residue geometry networks: a rigidity-based approach to the amini acid network and evolutionary rate A set of measures of centrality based on betweenness Centrality in social networks pH-dependent protonation of surface carboxylates in PsbO enables local buffering and triggers structural changes Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission Making sense of mutation: What D614G means for the COVID-19 pandemic remains unclear Dynamics of the plasma membrane proton pump Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding Identification of critical active-site residues in angiotensin-converting enzyme-2 (ACE2) by sitedirected mutagenesis Exploring network structure, dynamics, and function using NetworkX Mechanism of inward proton transport in an Antartic microbial rhodopsin Virus Variation Resource -improved response to emergent viral outbreaks SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor Cellular entry of the SARS coronavirus Electrostatic couplings in OmpA ion-channel gating suggest a mechanism for pore opening VMD: visual molecular dynamics Ensembl variation resources Phylogenetic analysis and structural modeling of SARS-CoV-2 spike protein reveals an evolutionary distinct and proteolytically sensitive activation loop Modest stabilization by most hydrogen-bonded side-chain interactions in membrane proteins Cleavage of the SARS coronavirus spike glycoprotein by airway proteases enhances virus entry into human bronchial epithelial cells in vitro Dynamic hydrogen bonds in bacterial protein secretion Using graphs of dynamic hydrogen-bond networks to dissect conformational coupling in a protein motor Recent developments in the MAFFT sequence alignment program MAFFT multiple sequence alignment software version 7: improvements in performance and usability Dynamic water bridging and proton transfer at a surface carboxylate cluster of photosystem II Stabilized coronavirus spikes are resistant to conformational changes induced by receptor recognition or proteolysis The SARS-CoV fusion peptide forms an extended bipartite fusion platform that perturbs membrane order in a calcium-dependent manner Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor Find Network Components Graphs of dynamic H-bond networks: from model proteins to protein complexes in cell signaling CHARMM-GUI input generator for NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM SImulations using the CHARMM36 additive force field Receptor recognition and cross-species infections of SARS coronavirus Conformational states of the severe acute respiratory syndrome coronavirus spike protein ectodomain Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2 The human coronavirus HCoV-229E S-protein structure and receptor binding Dynamic carboxylate/water networks on the surface of the PsbO subunit of photosystem II All-atom empirical potential for molecular modeling and dynamics studies of proteins Comparative protein structure modeling of genes and genomes Host cell entry of Middle East respiratory syndrome after two-step, furin-mediated activation of the spike protein Host cell proteases: Critical determinants of coronavirus tropism and pathogenesis Physiological and molecular triggers for SARS-CoV membrane fusion and entry into host cells Molecular mechanisms of phosphorylation-dependent arrestin activation Pamitoylation of the cysteine-rich ectodomain of the SARScoronavirus spike protein is important for spike-mediated cell fusion Different residues in the SARS-CoV spike protein determine cleavage and activation by the host cell protease TMPRSS2 EMBOSS: The European Molecular Biology Open Software Suite Deciphering key features in protein structures with the new ENDscript server A review of application of graph theory for network The PyMol Molecular Graphics System Structural asymmetry of AcrB trimer sugggests a peristaltic pump mechanism Structural basis of receptor recognition by SARS-CoV-2 Role of spike protein endodomains in regulating coronavirus entry A cluster of carboxylic groups in PsbO protein is involved in proton transfer from the water oxidizing complex of Photosystem II Bridge: A graph-based algorithm to analyze dynamic H-bond networks in membrane proteins Characterization of the receptor-binding domain (RBD) of the 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine ACE2 X-ray structures reveal a large hinge-bending motion important for inhibitor binding and catalysis Diverse GPCRs exhibit conserved water networks for stabilization and activation Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Tectonic conformatinal changes of a coronavirus spike glycoprotein promote membrae fusion Proc Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus Structural and functional basis of SARS-CoV-2 entry by using human ACE2 SARS-CoV-2 infects T lymphocites through its spike-protein-mediated membrane fusion Introduction to graph theory Upper Saddle River Cyo-EM structure of the 2019-nCoV spike in the prefusion conformation The SARS-CoV S glycoprotein: expression and functional characterization Crystal structure of Severe Acute Respiratory Syndrome coronavirus spike protein fusion core Structural basis for the recognition of the SARS-CoV-2 by full-length human ACE2