key: cord-346532-4xpnd93d
authors: Strömich, Léonie; Wu, Nan; Barahona, Mauricio; Yaliraki, Sophia N.
title: Allosteric Hotspots in the Main Protease of SARS-CoV-2
date: 2020-11-06
journal: bioRxiv
DOI: 10.1101/2020.11.06.369439
sha:
doc_id: 346532
cord_uid: 4xpnd93d

Inhibiting the main protease of SARS-CoV-2 is of great interest in tackling the COVID-19 pandemic caused by the virus. Most efforts have been centred on inhibiting the binding site of the enzyme. However, considering allosteric sites, distant from the active or orthosteric site, broadens the search space for drug candidates and confers the advantages of allosteric drug targeting. Here, we report the allosteric communication pathways in the main protease dimer by using two novel fully atomistic graph theoretical methods: Bond-to-bond propensity analysis, which has previously been successful in identifying allosteric sites without a priori knowledge in benchmark data sets, and Markov transient analysis, which has previously aided in finding novel drug targets in catalytic protein families. We further score the highest ranking sites against random sites at similar distances through statistical bootstrapping and identify four statistically significant putative allosteric sites as good candidates for alternative drug targeting.

for allosteric regulation of the main protease. By providing guidance for allosteric drug design we hope to open a new chapter for drug targeting efforts to combat COVID-19.

2 Results

The first step in our graph analysis approach is the construction of an atomistic graph from a Protein Data Bank (PDB) [47] structure. This process takes into account strong and weak interactions such as hydrogen bonds, electrostatic interactions and hydrophobic interactions (see Methods and Fig. 4). Additionally, we can incorporate water molecules, which in the case of the Mpro are catalytically important and known to expand the catalytic dyad to a triad [11] (see Fig. S1B).
In sites) or form a communication pathway [41]. By applying quantile regression we are able to quantitatively rank all bonds, atoms and subsequently residues. This allows us to score the hotspots we identified and to assess their statistical significance.

Table S2) reveal two main areas of interest in the Mpro. The hotspot on the back of the monomer opposite to the active site (Fig. 1A) is described in more detail in the paragraphs below. Hotspot two is located in the dimer interface and contains four residues which form salt bridges between the two monomers. Serine 1 and arginine 4 from one monomer connect to histidine 172 and glutamate 290 from the other one, respectively. Interestingly, these bonds have been found to be essential for dimer formation, which in turn is required for Mpro activity [49, 27]. To further clarify the interactions between the dimer halves (

Hence, we chose these residues as source when looking into protease dimer connectivity in comparison between SARS-CoV-2 and SARS-CoV.

in SARS-CoV, this closer dimer packing led to an increased activ-

Figure 1: Bond-to-bond propensities of Mpro sourced from the orthosteric sites. The source sites have been chosen as the catalytically active residues His41 and Cys145 in both chains of the homodimer and are shown in green (A: front view; B: top view). All other residues are coloured by quantile score as shown in the legend and reveal two main areas of interest with important residues labelled. C) The propensity of each residue, Π_R, is plotted against the residue distance from the orthosteric site. The dashed line indicates the quantile regression estimate of the 0.95 quantile cutoff used for identifying relevant residues.

atomistic level here, we assume that studying the dimer interface residues in a systematic manner would help elucidate the link between domain III and the catalytic activity of the Mpro.
Bond-to-bond propensities have been shown to successfully detect allosteric sites on proteins [43], and we here present the results for the SARS-CoV-2 Mpro to that effect. By choosing the active site residues histidine 41 and cysteine 145 as source, we can detect areas of strong connectivity towards the active centre, which allows us to reveal putative allosteric sites. We detected two hotspots on the protease which might be targetable for allosteric regulation (Fig. 2). Most of the residues present in the two putative sites are amongst the highest scoring residues, which are listed in Table S2. Site 1 (Fig. 2A, shown in yellow) is located on the back of the monomer with respect to the active site and is formed by nine residues from domains I and II (full list in Table S4). The second hotspot identified with Bond-to-bond propensities is located in the dimer interface and contains six residues (Tab. S5) which are located on both monomers (Fig. 2B, shown in pink). Two of these residues, Glu290 and Arg4 of the respective second monomer, form a salt bridge which is essential for dimerisation [27]. Quantile regression allows us to rank all residues in the protein, and thus we can score both sites with an average residue quantile score as listed in Table 2. Sites 1 and 2 have high scores of 0.97 and 0.96, respectively, well above the score of a randomly sampled site: 0.53 (95% CI: 0.53-0.54) for a site of the size of site 1 and 0.52 (95% CI: 0.51-0.53) for a site of the size of site 2.

Our methodology further allows a reverse analysis to assess the connectivity of the predicted allosteric sites. For this purpose, we defined the source as all residues within the respective identified sites (Tables S4 and S5).
After a full Bond-to-bond propensity analysis and quantile regression to rank all residues, we are able to score the active site to obtain a measure for the connectivity towards the catalytic centre (Tab. S8). For site 1 the active site score is 0.64, which is above a randomly sampled site score of 0.47 (95% CI: 0.47-0.48). However, for site 2 the active site score is 0.49, which is only marginally above a randomly sampled site score of 0.48 (95% CI: 0.47-0.48). As site 2 is located in the dimer interface, this is in line with the suggestion described above that the allosteric effect is not directly conferred from the dimer interface towards the catalytic centre. Nonetheless, this site might provide scope for inhibiting the Mpro by disrupting the dimer formation at these sites.

Figure 2: Putative allosteric sites identified by Bond-to-bond propensities. Surface representation of the Mpro dimer coloured by quantile score (as shown in the legend). A) Rotated front view with site 1 (yellow), which is located opposite the orthosteric site (coloured in green). B) Top view with site 2 (pink), located in the dimer interface. A detailed view of both sites is provided with important residues labelled.

Overall, this missing bidirectional connectivity hints at a more complex communication pattern in the protein and gave us reason to utilize another tool which has been shown to be effective in catalytic frameworks [41] like the protease.

Figure 3A and a full list can be found in Table S3. In the SARS-CoV-2 Mpro, this analysis subsequently led to the discovery of two more putative sites as shown in Figure 3C. Both hotspots are located on the back of the monomer in relation to the active site. Site 3 (shown in turquoise in Figure 3C) is located solely in domain II and consists of ten residues as listed in Table S6.
One of these is a cysteine at position 156, which might provide a suitable anchor point for covalent drug design. Site 4 (orange in Figure 3C) is located further down the protein in domain I, with 11 residues as listed in Table S7. Both sites were scored as described above and in the Methods section.

Following the same thought process as described for sites 1 and 2, we can investigate the protein connectivity from the opposite side by sourcing our runs from the residues in sites 3 and 4. We then score the active site to measure the impact

in multimeric proteins this might be due to another structural or dynamic factor which we have not yet uncovered between site 4 and the active site.

Overall, we see a similar pattern of hot and cold spots in the SARS-CoV Mpro (results not shown). We find a high overlap for the four identified sites, which gives us confidence that a potential drug effort would find applications in

where B is the n × m incidence matrix for the atomistic protein graph with n nodes and m edges; W = diag(w_ij) is an

we define the bond propensity as:

and then calculate the residue propensity of a residue R:

Markov Transient Analysis (MTA). A complementary, node-based method, Markov Transient Analysis (MTA) identifies areas of the protein that are significantly connected to a site of interest, the source, such as the active site, and obtains the signal propagation that connects the two sites at the atomistic level. The method has been introduced and discussed in detail in Ref. [41] and has successfully identified allosteric hotspots and pathways without any a priori knowledge [41, 46]. Importantly, it captures all paths that connect the two sites. The contribution of each atom in the

where t provides models for conditional quantile functions.
This is significant here because it allows us to identify not the "average" atom or bond but those that are outliers among all those found at the same distance from the active site, and because we are looking at the tails of highly non-normal distributions. As the distribution of propensities over distance follows an exponential decay, we use a linear function of the logarithm

propensities can be found in Ref. [43] and for Markov Transient Analysis in Ref. [67].

Site scoring with structural bootstrap sampling. To allow an assessment of the statistical significance of a site of interest, we score the site against 1000 randomly sampled sites of the same size. For this purpose, the average residue quantile score of the site of interest is calculated. After sampling 1000 random sites on the protein, their average residue quantile scores are calculated. By performing a bootstrap with 10,000 resamples with replacement on the average residue quantile scores of the random sites, we obtain a confidence interval with which to assess the statistical significance of the site of interest score in relation to the random site score.

investigation as shown in Table 3. For each of these fragment-bound structures, we performed Bond-to-bond propensity and Markov transient analyses to evaluate the connectivity to the active site. The active site was scored as described above.
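The site-scoring procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `scores` mapping from residue identifier to quantile score, and the percentile-based confidence interval are assumptions made for illustration. In particular, random sites are drawn here uniformly from all residues, without any structural (spatial) constraint that the sampling in the paper may impose.

```python
import random
import statistics


def site_score(scores, site):
    """Average residue quantile score over the residues of a site.

    `scores` is a hypothetical dict: residue id -> quantile score in [0, 1].
    """
    return statistics.fmean(scores[res] for res in site)


def random_site_scores(scores, site_size, n_sites=1000, rng=None):
    """Scores of `n_sites` random sites of `site_size` residues each.

    Assumption: residues are drawn uniformly without replacement,
    ignoring spatial proximity.
    """
    rng = rng or random.Random()
    residues = list(scores)
    return [
        site_score(scores, rng.sample(residues, site_size))
        for _ in range(n_sites)
    ]


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, rng=None):
    """Mean of `values` with a percentile bootstrap confidence interval.

    Each resample draws len(values) items with replacement.
    """
    rng = rng or random.Random()
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper
```

In use, one would compare `site_score(scores, site)` for the site of interest with the mean and interval returned by `bootstrap_ci(random_site_scores(scores, len(site)))`. A site score far above the upper bound corresponds to the pattern reported for sites 1 and 2.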
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A new coronavirus associated with human respiratory disease in China
A novel coronavirus from patients with pneumonia in China
The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
The severe acute respiratory syndrome
A decade after SARS: strategies for controlling emerging coronaviruses
Dissection study on the severe acute respiratory syndrome 3C-like protease reveals
Structure-based prediction of protein allostery
Allosteric Modulator Discovery: From Serendipity to Structure-Based Design
Activation pathway of Src kinase reveals intermediate states as targets for drug design
Perturbation-Response Scanning Reveals Key Residues for Allosteric Control in Hsp70
Exploiting protein flexibility to predict the location of allosteric sites
PARS: a web server for the prediction of Protein Allosteric and Regulatory Sites
AlloPred: prediction of allosteric pockets on proteins using normal mode perturbation analysis
Improved Method for the Identification and Validation of Allosteric Sites
Structure-Based Statistical Mechanical Model Accounts for the Causality and Energetics of Allosteric Communication
Reversing allosteric communication: From detecting allosteric sites to inducing and tuning targeted allosteric response
Mapping allosteric communications within individual proteins
Protein multi-scale organization through graph partitioning and robustness analysis: application to the myosin-myosin light chain interaction
Uncovering allosteric pathways in caspase-1 using Markov transient analysis and multiscale community detection
BagPyPe: A Python package for the construction of atomistic, energy-weighted graphs from biomolecular structures
Prediction of allosteric sites and mediating interactions through bond-to-bond propensities
Allostery and cooperativity in multimeric proteins: bond-to-bond propensities in ATCase
The origin of allosteric functional modulation: multiple pre-existing pathways
Abstract 1775: Targeting RSK4 prevents both chemoresistance and metastasis in lung cancer
The Protein Data Bank
SARS-CoV 3CL protease cleaves its C-terminal autoprocessing site by novel subsite cooperativity
Quaternary structure of the severe acute respiratory syndrome (SARS) coronavirus main protease
Crystallographic and electrophilic fragment screening of the SARS-CoV-2 main protease
Potential anti-viral activity of approved repurposed drug against main protease of SARS-CoV-2: an in silico based approach
In Silico Evaluation of the Effectivity of Approved Protease Inhibitors against the Main Protease of the Novel SARS-CoV-2 Virus
Targeting the Dimerization of the Main Protease of Coronaviruses: A Potential Broad-Spectrum Therapeutic Strategy
Targeting Non-Catalytic Cysteine Residues Through Structure-Guided Drug Discovery
Inference of Macromolecular Assemblies from Crystalline State
ProteinLens: a web-based application for the analysis of allosteric signalling on atomistic graphs of biomolecules
Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation
Inorganic chemistry: principles of structure and reactivity
DREIDING: A generic force field for molecular simulations
Automated design of the surface positions of protein helices
Hydrophobic Potential of Mean Force as a Solvation Function for
Structure of complex networks: Quantifying edge-to-edge relations by failure-induced flow redistribution
Algebraic graph theory
Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks
Quantile Regression
quantreg: Quantile Regression. R package version 5
Exploring allostery in proteins with graph theory
Open-source foundation of the user-sponsored PyMOL molecular visualization system