key: cord-329504-91te3nu8
authors: Croll, Tristan; Diederichs, Kay; Fischer, Florens; Fyfe, Cameron; Gao, Yunyun; Horrell, Sam; Joseph, Agnel Praveen; Kandler, Luise; Kippes, Oliver; Kirsten, Ferdinand; Müller, Konstantin; Nolte, Kristoper; Payne, Alex; Reeves, Matthew G.; Richardson, Jane; Santoni, Gianluca; Stäb, Sabrina; Tronrud, Dale; Williams, Christopher; Thorn, Andrea
title: Making the invisible enemy visible
date: 2020-10-07
journal: bioRxiv
DOI: 10.1101/2020.10.07.307546
sha:
doc_id: 329504
cord_uid: 91te3nu8

During the COVID-19 pandemic, structural biologists have rushed to solve the structures of the 28 proteins encoded by the SARS-CoV-2 genome in order to understand the viral life cycle and enable structure-based drug design. In addition to the 200 structures from SARS-CoV previously solved, 367 structures covering 16 of the viral proteins have been released in the span of only 6 months. These structural models serve as basis for research worldwide to understand how the virus hijacks human cells, for structure-based drug design and to aid in the development of vaccines. However, errors often occur in even the most careful structure determination - and are even more common among these structures, which were solved under immense pressure. From the beginning of the pandemic, the Coronavirus Structural Taskforce has categorized, evaluated and reviewed all of these experimental protein structures in order to help downstream users and original authors. Our website also offers improved models for many key structures, which have been used by Folding@Home, OpenPandemics, the EU JEDI COVID-19 challenge, and others. Here, we describe our work for the first time, give an overview of common problems, and describe a few of these structures that have since acquired better versions in the worldwide Protein Data Bank, either from new data or as depositor re-versions using our suggested changes.
Introduction

SARS-CoV-2, the coronavirus responsible for COVID-19, has a single-stranded RNA genome that encodes 28 proteins. These macromolecules fulfil essential roles in the viral life cycle, enabling SARS-CoV-2 to infect, replicate, and suppress the immune system of its host. For example, the characteristic spikes that protrude from its envelope and allow it to bind to host cells are a trimer of the surface glycoprotein (Fig. 1). Knowing the atomic structures of these macromolecules is vital for understanding the life cycle of the virus and for helping to design specific pharmaceutical compounds that bind to them and inhibit their functions, with the goal of stopping the cycle of infection. Since the COVID-19 pandemic hit at the beginning of this year, the structural biology community has swung into action very efficiently and is now strongly engaged in establishing the atomic structures of these macromolecules as fast as possible using nuclear magnetic resonance (NMR), cryo-electron microscopy (Cryo-EM) and crystallographic methods (1). All of these methods require an interpretation of the measured and processed data with a structural model and cannot be fully automated. The resulting structures are made freely and publicly available in the World Wide Protein Data Bank (wwPDB), structural biology's archive of record (2). Unfortunately, the fit between model and data is never perfect, and errors from measurement, post-processing and modelling are a given. Structures solved in a hurry to address a pressing medical and societal need are even more prone to mistakes. However, as these structures are used for biological interpretation, small errors can have severe consequences, in particular in structure-based drug discovery, structural bioinformatics, and computational chemistry, which are a current focus of SARS-CoV-2 research across the world.
As of the writing of this publication, 524 macromolecular structures from SARS-CoV and SARS-CoV-2 have been deposited, covering parts of 16 of the 28 proteins. In this time of crisis, it is therefore vital to ensure that the structural data made available to the wider research community are the best they can be in every regard.

Pushing the methods to the limit

The wwPDB (3) is an invaluable tool, but once released, a structure in the databank can only be reversioned by the original depositors, who, after any associated papers are published, may have little or no motivation to correct their structures. Third parties may only deposit new models based on others' deposited data under new accession IDs when accompanied by peer-reviewed publications. In such cases, there is no explicit link from the original entry to the new one. Importantly, 99% of structure downloads from the PDB are not by experimentalists, but by scientists from other fields (2). As a consequence, errors can lead to a large waste of resources and time for those making use of the data obtained from the PDB, and may even be misinterpreted as biologically and pharmaceutically relevant information. We, the authors of this manuscript, develop computational methods for the solution of macromolecular structures. As expert users of our own software tools, we were well placed to help in this unprecedented situation. This is why, since the first signs of a possible pandemic arose at the beginning of this year, we have joined forces to assess and, where necessary, improve upon the published macromolecular structures from SARS-CoV and SARS-CoV-2. In cases where we believe we have significantly improved the macromolecular models, we offered them back to the original authors and the scientific community. Raw data are not deposited in the wwPDB and are not mandatory for publication; as a consequence, they are difficult to obtain.
Their absence from the public record is detrimental to reanalysis and validation, and to the development of new methods. In an effort to make the most of the experimental results, we invited authors to send us their raw experimental data, which are key to validating the entire structure determination process from start to finish. We also offered to help authors deposit these data in a non-PDB public repository if they wished to do so. After we began our validation efforts for macromolecular structures from SARS-CoV and SARS-CoV-2, we were approached by our colleagues in in-silico drug screening, Folding@Home (4, 5), OpenPandemics (6), and the EU Joint European Disruptive Initiative (JEDI) (7). These initiatives needed the best structures they could get for studies of the virus and had already lost much computing time and resources to suboptimal structure solutions. All macromolecular structures from SARS-CoV and SARS-CoV-2 in the PDB are downloaded into our repository and assessed automatically in the first 24 hours after release. For crystallographic and Cryo-EM structures, we check the quality of the deposited merged data and how well the model fits these data. An automatic evaluation of NMR data is forthcoming. Then, all structures are checked against chemical prior knowledge.

Evaluation specific to crystallographic data and structure solutions

As crystal structures make up 79.0% of our data, these are evaluated most thoroughly. Crystal diffraction can, for example, stem from more than one crystal lattice (twinning), be contaminated by ice crystal diffraction (ice rings) or be incomplete due to radiation damage or a suboptimal measurement strategy. These issues cannot be resolved after data collection, but treating the data accordingly can yield a better structural model. Deducing such problems from the deposited structure factors (mandatory in the wwPDB) can be difficult; raw data allow a much more complete analysis of the experiment.
Another source of errors is data processing (integration and scaling), which nowadays is often done automatically. Assuming the wrong crystal lattice symmetry or including, for example, diffraction spots obscured by the beam stop can lead to lower-quality or even unsolvable structures. If raw data are available, they can be re-processed and these problems can be resolved manually. To analyse crystallographic data for twinning, completeness and overall diffraction quality, we used phenix.xtriage (8); furthermore, we ran AUSPEX (9), which automatically identifies ice rings and produces plots from which several other pathologies, like a "bad" beam stop mask, can be recognized quickly. The completeness of most datasets is satisfactory, with only 7 out of 415 datasets below 80%. All the datasets have an acceptable strength, with intensity/sigma(intensity) above 3. Ice rings were detected in 61 datasets and problems with the beam stop masking in 46; 49 crystal structures were indicated as potentially resulting from twinned crystals. A general indication of how well the atomic model fits the measurement data can be obtained by comparing the deposited R-factors to results from PDB-REDO (10) (including Whatcheck (11)) to determine the overall density fit as well as many other diagnostics. While the deposited structures are often improved by PDB-REDO, the results need to be checked and should not be viewed as "more correct" purely on the basis of a lower R value. In addition, a high R value does not indicate a single type of error and hence should be used with caution. Only two structures in the repository present an alarmingly high Rfree value above 35%, although problems can be found in other structures by looking in more detail. PDB-REDO improves the Rfree for most of the structures, and the only cases where we found a large degradation pointed to major issues with the PDB entry; this was especially true for older SARS-CoV structures.
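As a reminder of what the R values discussed above measure, the following Python sketch computes the conventional crystallographic R-factor, R = sum(||Fobs| - k|Fcalc||) / sum(|Fobs|), separately for working and free reflections. It is an illustration only: the function names, the single overall scale factor k and the toy amplitudes are assumptions made here, and this is not the procedure used by PDB-REDO or by any refinement program, which apply more elaborate scaling.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Conventional R-factor: sum(||Fobs| - k|Fcalc||) / sum(|Fobs|)."""
    if len(f_obs) != len(f_calc) or len(f_obs) == 0:
        raise ValueError("need two non-empty amplitude lists of equal length")
    numerator = sum(abs(abs(o) - scale * abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / sum(abs(o) for o in f_obs)


def r_work_free(f_obs, f_calc, is_free):
    """Return (R_work, R_free) given a per-reflection free-set flag.

    Assumption: one overall scale k, derived from the working set only.
    """
    work = [(o, c) for o, c, flag in zip(f_obs, f_calc, is_free) if not flag]
    free = [(o, c) for o, c, flag in zip(f_obs, f_calc, is_free) if flag]
    if not work or not free:
        raise ValueError("both working and free sets must be non-empty")
    scale = sum(abs(o) for o, _ in work) / sum(abs(c) for _, c in work)
    r_work = r_factor([o for o, _ in work], [c for _, c in work], scale)
    r_free = r_factor([o for o, _ in free], [c for _, c in free], scale)
    return r_work, r_free


# Toy amplitudes, made up for illustration (not from any deposited entry).
demo_f_obs = [10.0, 10.0, 20.0, 20.0]
demo_f_calc = [12.0, 8.0, 22.0, 20.0]
demo_is_free = [False, False, True, True]
demo_r_work, demo_r_free = r_work_free(demo_f_obs, demo_f_calc, demo_is_free)
print(f"R_work = {demo_r_work:.3f}, R_free = {demo_r_free:.3f}")
# The text treats an R_free above 35% as alarmingly high.
print("alarming" if demo_r_free > 0.35 else "not flagged by this criterion alone")
```

Because R_free is computed from reflections excluded from refinement, a large gap between R_work and R_free can hint at overfitting, but, as noted above, neither value identifies a specific type of error.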
Evaluation specific to structures from single-particle Cryo-EM

Cryo-EM structures make up 15.0% of our data. As with crystallographic structures, raw data are not available from the wwPDB, but the three-dimensional map reconstructed from the microscopic single-particle images is deposited, allowing the calculation of the fit between model and map in the form of a Fourier Shell Correlation (FSC). The model-map FSC is plotted as a curve, which estimates the agreement between features resolvable at different resolutions. For a well-fitted model, a model-map FSC of 0.5 roughly corresponds to the cryo-EM map resolution (which is determined as the point where the FSC between two half-maps drops below 0.143). To calculate FSCs, we use the CCP-EM (12) model validation task, which utilizes REFMAC5 (13) and calculates the real-space Cross-Correlation Coefficient (CCC), Mutual Information (MI) and Segment Manders' Overlap Coefficient (SMOC) (14). While MI is a single-value score that evaluates how well model and map agree, the SMOC score evaluates the fit of each modelled residue individually and can help to find regions where the model disagrees with the map. Z-scores highlight residues with a low score relative to their neighbours and point to potential misfits. Out of 81 structures, six had an average model-FSC below 0.4 and seven had an MI score below 0.4, indicating poor overall agreement between map and model and potential for further improvement. The SMOC score indicates for twelve structures that more than 5% of the residues fit poorly with the map, while the other 85% of structures had a relatively good density fit. However, most modelling errors could only be corrected manually (see below). In addition to this validation, we run Haruspex (15), a neural network that annotates reconstruction maps, to evaluate which secondary structures can be recognized automatically in the map.
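The two FSC thresholds quoted above (0.5 for the model-map curve, 0.143 for the half-map curve) are read off a tabulated curve. The Python sketch below shows one simple way to do this, by linear interpolation between the two points that bracket the threshold. The function name and the toy curve are assumptions for illustration; this is not the CCP-EM implementation.

```python
def resolution_at_threshold(spatial_freq, fsc, threshold):
    """Resolution (in Å) at which an FSC curve first drops below `threshold`.

    spatial_freq: increasing spatial frequencies in 1/Å.
    fsc:          FSC value at each spatial frequency.
    Returns None if the curve never crosses the threshold.
    """
    if len(spatial_freq) != len(fsc):
        raise ValueError("spatial_freq and fsc must have the same length")
    for i in range(1, len(fsc)):
        f0, f1 = fsc[i - 1], fsc[i]
        if f0 >= threshold > f1:
            s0, s1 = spatial_freq[i - 1], spatial_freq[i]
            # Linear interpolation between the two bracketing points.
            s = s0 + (f0 - threshold) * (s1 - s0) / (f0 - f1)
            return 1.0 / s if s > 0 else None
    return None


# Toy curve, made up for illustration only.
demo_freq = [0.1, 0.2, 0.3, 0.4]
demo_fsc = [1.0, 0.8, 0.2, 0.0]
print(resolution_at_threshold(demo_freq, demo_fsc, 0.5))    # model-map criterion
print(resolution_at_threshold(demo_freq, demo_fsc, 0.143))  # half-map criterion
```

Comparing the two resolutions obtained this way gives a quick, coarse indication of whether a model explains the map down to the resolution the map itself supports.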
Molecular geometry is constrained by the nature of its chemical bonds and steric hindrance between the atoms. In order to evaluate model quality with respect to chemical prior knowledge, we run MolProbity (16, 17), which checks covalent geometry, conformational parameters of protein and RNA, and steric clashes. Unfortunately, some of these traditional indicators of model quality can be used as additional restraints during refinement, which invalidates them to a certain degree; we therefore also used the MolProbity CaBLAM score (18), which can pinpoint local errors at 3-4 Å resolution even if traditional criteria have been used as restraints. CaBLAM outlier scores above 2% indicated that 163 of the structures have many incorrect backbone conformations. During the crisis, the MolProbity webservice has been pushed to the limit of its capacity, as many different drug developers have screened the very same coronavirus structures many times. We have developed a bespoke MolProbity pipeline to make these results available online and to decrease the workload on the webservice. In addition, the sequence of each structure is aligned and checked against the known genome to highlight misidentified residues. Every Wednesday, after the new PDB structures have been released, an automatic pipeline runs to organize the new structures according to the genetic information and then to assess the quality of the models and the experimental results. These results, along with the original structures, are immediately available from our online repository, which is accessible via our website insidecorona.net. To facilitate access and to provide an overview of the structures, we supply an SQL database of key statistics and quality indicators along with the results.

As a community, for decades, we have aspired to automate structural biology as much as possible.
However, neither structure solution nor validation has been fully automated, due to the complexity of interpreting low-quality maps with a poor fit between experimental data and structural models. This task requires detailed knowledge of macromolecular and small-molecule structure and of chemical interactions. Even with state-of-the-art automatic methods at hand, experienced human inspection residue-by-residue remains the best way to judge the quality of a structure, highlighting the continuing need for expert structure solvers. Given the flood of new SARS-CoV-2 structures, resources have not permitted us to check all structures manually. Therefore, we have selected representative structures. Certain errors were surprisingly common, such as peptide bond flips, rotamer outliers and misidentification of small molecules, such as water as magnesium, chloride as zinc, and a multi-zinc site modelled as polyethylene glycols. Zinc plays an important role in many viral infections and is coordinated by many SARS-CoV-2 proteins. We also found a large number of Cys-Zn sites that were mismodelled, with the zinc ion missing or pushed out of density, and/or erroneous disulphide bonds between the coordinating cysteine residues. Many coronavirus proteins are linked on certain asparagine residues ("N-linked") to carbohydrate chains called glycans. Their exact composition depends on the host cell in question, and their main function is to deter the host immune system. In some structures, where the sample was produced in eukaryotic cells, the "stem" sugars of these N-linked glycans were evident in the map. However, in many cases these sugars were flipped approximately 180 degrees from their correct orientation around the N-glycosidic bond. Out of the structures we checked manually, we were able to significantly improve 31, which are available from insidecorona.net.
In the following, we give two examples.

Once SARS-CoV-2 infects a cell, the first protein produced is a long polypeptide chain which is cleaved into 16 functional proteins, the non-structural proteins (NSPs) (19). These are essential for the production of new viruses in the host cell. NSP3 is a large protein molecule by any measure, with 1945 amino acid residues in total. Its 15 segments have a variety of functions, among them the papain-like protease domain, which cleaves the first five NSPs from the polypeptide chain (19). Without cleavage of the polyprotein, the virus cannot replicate and infection is halted. Hence, the papain-like protease domain represents an important potential drug target (20). The first SARS-CoV-2 structure of this domain was PDB 6W9C (released 1st April 2020). It was immediately used as the basis for structure-based drug design around the world. The resolution was 2.7 Å and Rwork/Rfree were 23.9% / 30.9%. However, the overall completeness of the measured data was only 57.1%. Why were just over half of all reflections that could have been measured at this resolution recorded? The raw data for this entry are available from proteindiffraction.org (21). They revealed that the diffraction data had been measured with a very high X-ray intensity, which led to a swift deterioration in diffracting power due to radiation damage, something which could not have been learned from the data deposited in the PDB and which underlines the importance of the availability of raw data. Typically, crystallographers aim for a dose of 5 MGy or lower. Here, we estimated a diffraction-weighted dose of 5.5 MGy with RADDOSE-3D, with a maximum dose value of 21 MGy, which likely completely destroyed parts of the crystal (22). We based this calculation on the assumption of a 100 × 100 × 10 μm³ plate, given that the crystals were described as such.
In addition, the measurement covered 30° of sample rotation followed by an additional 60° rotation starting from the same angular position, covering the first 30° twice, which further increased the dose while recording little additional information. The angular range per image, 0.5°, was also surprisingly wide for the high-throughput pixel detector used (a Dectris Pilatus3 6M), and the diffraction was highly anisotropic. We re-processed the images using XDS (23), omitting the final 10° of the first sweep and the final 20° of the second, where the radiation damage affected the data too severely. An elliptical resolution cut-off was applied with Staraniso (24) to account for anisotropy. Careful manual intervention could improve the resolution to 2.6 Å with better data quality overall. The revised ellipsoidal completeness was 44.5%. The structure has 3-fold non-crystallographic rotational symmetry, with three monomers coordinating a central zinc ion within the asymmetric unit. The second, functionally important, zinc ion is far removed from the three-fold rotation axis and coordinated by four cysteines. This zinc finger domain is essential for activity, in addition to the papain-like cysteine-histidine-aspartate catalytic triad (25), but it is poorly resolved and incompletely and differently modelled in each of the three monomers of this structure. This disorder may be the result of radiation damage. Only two of the three zinc sites were modelled, and here, the bond lengths between each of the four sulphur atoms and the zinc varied from 2.4 Å to 2.7 Å, and the Cβ-SG-Zn angles between 70° and 132°. The third site had no zinc (Fig. 2), instead being modelled as a disulphide bond. Prior knowledge about coordination chemistry dictates that the bond lengths between Cys and Zn should all be approximately 2.3 Å and the angles about 107°.
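The geometric check described above can be sketched in a few lines of Python. The sketch computes the SG-Zn distance and the Cβ-SG-Zn angle from Cartesian coordinates and compares them with the ideal values quoted in the text (2.3 Å and 107°). The function names, the tolerances (0.15 Å and 10°) and the coordinates are assumptions chosen for illustration; this is not how MolProbity or any refinement program applies restraints.

```python
import math


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with the vertex at b."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    cos_angle = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def check_cys_zn(cb, sg, zn, ideal_bond=2.3, ideal_angle=107.0,
                 bond_tol=0.15, angle_tol=10.0):
    """Compare one Cys-Zn contact with ideal values.

    Ideal values come from the text; the tolerances are assumptions.
    """
    bond = math.dist(sg, zn)
    angle = angle_deg(cb, sg, zn)
    return {
        "bond": bond,
        "angle": angle,
        "bond_ok": abs(bond - ideal_bond) <= bond_tol,
        "angle_ok": abs(angle - ideal_angle) <= angle_tol,
    }


# Made-up coordinates (Å): SG at the origin, Zn 2.7 Å away, CB at 90 degrees.
report = check_cys_zn(cb=(0.0, 1.8, 0.0), sg=(0.0, 0.0, 0.0), zn=(2.7, 0.0, 0.0))
print(f"SG-Zn = {report['bond']:.2f} Å, ok = {report['bond_ok']}")
print(f"CB-SG-Zn = {report['angle']:.1f}°, ok = {report['angle_ok']}")
```

With these tolerances, both extremes reported for the deposited model (a 2.7 Å bond, and angles of 70° or 132°) would be flagged.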
Adding zincs to all sites and restraining the bond lengths and angles to these expected values, adding non-crystallographic symmetry restraints (requiring the 3 copies to look similar), applying an overall higher weighting of ideal geometry, and reassessing side chains and water molecules improved the electron density maps and lowered the R values to 20.2% / 25.4% at 2.6 Å resolution. This example shows the importance of an optimised data collection strategy, data processing and model building, whose quality is interconnected. In this case, even though the data were radiation damaged, the structure could be drastically improved by adjusting the data processing to take this into account, and by modifying the model refinement to include stronger restraints and to take full advantage of the non-crystallographic symmetry. A new structure of the C111S mutant of the same protein (PDB entry 6WRH) came out a month later, in which the zinc site was clearly resolved. By this time, however, the original structure had already been widely used as a target in in-silico drug design: for example, 20% of participants in the EU JEDI COVID-19 challenge used this structure to design potential drugs. The availability of a better structure a month earlier would not only have increased their chances of success but also saved much computing time and many man-hours in computer-aided drug development.

When SARS-CoV-2 replicates, its single-stranded RNA genome needs to be copied. This is achieved by a macromolecular complex of RNA-dependent RNA polymerase (NSP12, RdRp), NSP7 and NSP8 (26). Coronaviruses, including SARS-CoV-2, have some of the largest genomes among RNA viruses (approximately 30 kilobases), suggesting that their polymerase complexes possess proof-reading functionality. This sets coronaviruses apart from other RNA viruses (26). The first structure of SARS-CoV RNA polymerase (PDB entry 6NUS) was solved in 2019 by Cryo-EM (27), before the pandemic began.
In this structure, a loop close to the C-terminus (residues 892-906) was not resolved in the reconstruction map and hence not modelled. Following this loop, the polymerase has an irregular helix followed by a flexible tail. Density for this helix was poorly resolved; coupled with its short length and the lack of any information from the preceding and following loops, this led to difficulty in assigning the register (the identity of the amino acid at each site). Nevertheless, the overall validation statistics (clashscore, Ramachandran outliers, sidechain outliers) provided by the wwPDB for this model appeared exceptionally good. We inspected one of the first available structures of the SARS-CoV-2 RNA polymerase complex (7BTF) using ISOLDE (28), a program used for interactively visualising and remodelling proteins in their experimental density. The higher resolution at this C-terminal tail of the structure made it clear that the C-terminal helix was severely incorrect, with the assigned sequence being nine residues upstream of the correct residues for this site (see Fig. 3). This error was present in all the structures of this complex from both SARS-CoV and SARS-CoV-2, presumably propagated because each subsequent structure used the previous models as the starting point for modelling, as is standard practice. For each affected RdRp structure, we immediately contacted the original authors. Seven of the nine SARS-CoV-2 RNA polymerase complexes in the wwPDB now have the corrected sequence alignment at the C-terminus, and those also include many of our other changes described below. These re-versioned PDB entries allow modelling efforts for drugs against SARS-CoV-2 RNA polymerase to start from a much better model. Notably, the authors of a later cryo-EM structure of the RNA polymerase/RNA complex (PDB entry 6YYT) used one of our corrected models as the starting point for their new model (29).
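A register shift such as the one described above can be quantified once an original and a corrected model are both available and superposed in the same frame. The Python sketch below does this by asking, for each Cα site in one model, which residue number occupies (nearly) the same position in the other model, and then reporting the most common difference. This is an illustrative assumption about how one might measure a shift after the fact; the error itself was found by inspecting the density in ISOLDE, not by this procedure. The function name, the 2.0 Å cutoff, the residue numbers and the coordinates are all hypothetical.

```python
import math
from collections import Counter


def register_shift(model_a, model_b, cutoff=2.0):
    """Most common residue-number offset (b - a) between spatially matched Cα atoms.

    model_a, model_b: dicts mapping residue number -> (x, y, z) of the Cα atom,
    both in the same coordinate frame. Returns None if no atoms match.
    """
    offsets = Counter()
    for res_a, xyz_a in model_a.items():
        best_res, best_dist = None, cutoff
        for res_b, xyz_b in model_b.items():
            d = math.dist(xyz_a, xyz_b)
            if d <= best_dist:
                best_res, best_dist = res_b, d
        if best_res is not None:
            offsets[best_res - res_a] += 1
    if not offsets:
        return None
    return offsets.most_common(1)[0][0]


# Hypothetical example: ten sites spaced 3.8 Å apart along x (not a real helix).
# The "original" model numbers the sites 900-909; the "corrected" model places
# residues 909-918 at the same sites.
original = {900 + i: (3.8 * i, 0.0, 0.0) for i in range(10)}
corrected = {909 + i: (3.8 * i, 0.1, 0.0) for i in range(10)}
print(register_shift(original, corrected))
```

A result of zero over a whole chain indicates that both models agree on the register; a consistent non-zero offset confined to one segment localises the disagreement.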
The structure of SARS-CoV-2 RNA polymerase in the pre-translocation state, bound to template-primer RNA and Remdesivir (PDB entry 7BV2) (30), represents a useful basis for investigating the inhibitory effect of Remdesivir and for the rational design of other nucleoside triphosphate (NTP) analogues (31). However, we found that this structure has some issues, which may provide misleading information to people who are conducting such studies. Apart from the register shift described above, there are three magnesium ions modelled in the active site, a number that contradicts what is commonly known about this class of proteins. Magnesium ions play an essential role in catalysis in RNA polymerase (binding the incoming NTP, positioning the NTP for incorporation and stabilizing the leaving group after catalysis) (30, 32). One of the magnesium ions is shown coordinated by a pyrophosphate, which implies that pyrophosphate release in SARS-CoV-2 RdRp is relatively slow and may even be coupled with translocation (33). However, all three magnesium ions, as well as the pyrophosphate, are poorly supported by the map reconstructed from the experimental data and by local geometry. If these ions were included as fixed components of the binding site, this could have severely impacted in-silico docking and drug design studies. In addition to the above, we corrected the conformations of three RNA residues close to the Remdesivir site, including an adenosine base (T18) modelled "backwards", fixed "backwards" peptides flagged by CaBLAM, added several residues and water molecules with good density and geometry, and corrected two proline residues that had been erroneously flipped from cis to trans. Our remodelled structure offers a valuable structural basis for future studies, such as in-silico docking and drug design targeting SARS-CoV-2 RdRp (34), as well as for computational modelling or simulations to investigate the molecular mechanism of viral replication (31, 35, 36).
Many specialists in structural biology and in silico design are now tackling SARS-CoV-2 research, but may not be familiar with the wider body of coronavirus research. In addition to improved models and evaluation results, we also supply context on insidecorona.net. This covers literature reviews centred on the structural aspects of the viral life cycle, host interaction partners, illustrations, and evaluation criteria for selecting the best starting models for in silico projects. Furthermore, we added entries about the SARS-CoV-2 proteins to Proteopedia (17) and MolSSI (37, 38), and 3D-Bionotes (18) deep-links into our database. Finally, as SARS-CoV-2 has had an unprecedented impact on the world at large, we have also tried to make our research on the topic, and that of others, accessible to the general public. This has included a number of posts on our homepage aimed at non-scientists and live streaming the reprocessing of data on Twitch, as well as the design, production, and public release of an accurate 3D-printed model of SARS-CoV-2 based on deposited structures, for use as a prop in outreach activities. In the last five months, we have carried out a weekly automatic post-analysis as well as a manual re-processing and re-modelling of representative structures from each of the 16 structurally known macromolecules of SARS-CoV or SARS-CoV-2. In this global crisis, where the community aims to get structures out as fast as possible, we aim to ensure that the structure interpretations available to downstream users are as solid as possible. We provide these results as a free resource to the community in order to aid the hunt for a vaccine or anti-viral treatment. Our results are constantly updated and can be found online at insidecorona.net. New contributors to this effort are very welcome.
In the last 40 years, structural biology has become highly automated, and methods have advanced to the point that it is now feasible to solve a new structure from start to finish in a matter of months with little specialist knowledge. The extremely rapid and timely solution of these structures is a remarkable achievement during this crisis and, despite some shortcomings, these structures have enabled downstream work on therapeutics to progress rapidly. The downside is that errors at all points during a structure determination are not only common, but can also remain undetected, and if they are detected, this is usually seen as an individual failure. However, no individual researcher is fully conversant with all the details of structure determination, protein and nucleic acid structure, chemical properties of interacting groups, catalytic mechanisms, and the viral life cycle. The result is that the first draft of a molecular model often contains errors like the ones pointed out above. While any molecular model could benefit from an examination by multiple experts, during this time it is important to bring such inspection to coronavirus-related structures as quickly as possible. We believe that, as a community, we need to change how we all see, address and document errors in structures to achieve the best possible structures from our experiments. We are scientists: in the end, truth should always win.
Visualizing an unseen enemy; mobilizing structural biology to counter COVID-19
RCSB Protein Data Bank: Sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education
Announcing the worldwide Protein Data Bank
Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation
AUSPEX: a graphical tool for X-ray diffraction data analysis
The PDB_REDO Server for Macromolecular Structure Model Optimization
Errors in protein structures
Recent developments in the CCP-EM software suite
REFMAC5 for the refinement of macromolecular crystal structures
Refinement of atomic models in high resolution EM reconstructions using Flex-EM and local assessment
Haruspex: A Neural Network for the Automatic Identification of Oligonucleotides and Protein Secondary Structure in Cryo-Electron Microscopy Maps
MolProbity: all-atom structure validation for macromolecular crystallography
MolProbity: More and better reference data for improved all-atom structure validation
New tools in MolProbity validation: CaBLAM for CryoEM backbone, UnDowser to rethink "waters," and NGL Viewer to recapture online 3D graphics
Nsp3 of Coronaviruses: Structures and Functions of a Large Multi-Domain Protein
Identification of Severe Acute Respiratory Syndrome Coronavirus Replicase Products and Characterization of Papain-Like Protease Activity
A public database of macromolecular diffraction experiments
Estimate your dose: RADDOSE-3D. Protein Science
STARANISO (Global Phasing Ltd)
The Papain-Like Protease of Severe Acute Respiratory Syndrome Coronavirus Has Deubiquitinating Activity
Implications of altered replication fidelity on the evolution and pathogenesis of coronaviruses. Current Opinion in Virology
Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors
ISOLDE: a physically realistic environment for model building into low-resolution electron-density maps
Structure of replicating SARS-CoV-2 polymerase
Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir
A mechanism for all polymerases
The Structural Mechanism of Translocation and Helicase Activity in T7 RNA Polymerase
Structural Basis of the Potential Binding Mechanism of Remdesivir to SARS-CoV-2 RNA-Dependent RNA Polymerase
Remdesivir and SARS-CoV-2: Structural requirements at both nsp12 RdRp and nsp14 Exonuclease active-sites
Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science
COVID-19 Molecular Structure and Therapeutics Hub