key: cord-1050248-htg71twz
authors: Burley, Stephen K; Bhikadiya, Charmi; Bi, Chunxiao; Bittrich, Sebastian; Chen, Li; Crichlow, Gregg V; Christie, Cole H; Dalenberg, Kenneth; Di Costanzo, Luigi; Duarte, Jose M; Dutta, Shuchismita; Feng, Zukang; Ganesan, Sai; Goodsell, David S; Ghosh, Sutapa; Green, Rachel Kramer; Guranović, Vladimir; Guzenko, Dmytro; Hudson, Brian P; Lawson, Catherine L; Liang, Yuhe; Lowe, Robert; Namkoong, Harry; Peisach, Ezra; Persikova, Irina; Randle, Chris; Rose, Alexander; Rose, Yana; Sali, Andrej; Segura, Joan; Sekharan, Monica; Shao, Chenghua; Tao, Yi-Ping; Voigt, Maria; Westbrook, John D; Young, Jasmine Y; Zardecki, Christine; Zhuravleva, Marina
title: RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences
date: 2020-11-19
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkaa1038
sha: 901d6eb8b665305efc36770aaf4852eaa07589c9
doc_id: 1050248
cord_uid: htg71twz

The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), the US data center for the global PDB archive and a founding member of the Worldwide Protein Data Bank partnership, serves tens of thousands of data depositors in the Americas and Oceania and makes 3D macromolecular structure data available at no charge and without restrictions to millions of RCSB.org users around the world, including >660 000 educators, students and members of the curious public using PDB101.RCSB.org. PDB data depositors include structural biologists using macromolecular crystallography, nuclear magnetic resonance spectroscopy, 3D electron microscopy and micro-electron diffraction. PDB data consumers accessing our web portals include researchers, educators and students studying fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. During the past 2 years, the research-focused RCSB PDB web portal (RCSB.org) has undergone a complete redesign, enabling improved searching with full Boolean operator logic and more facile access to PDB data integrated with >40 external biodata resources. New features and resources are described in detail using examples that showcase recently released structures of SARS-CoV-2 proteins and host cell proteins relevant to understanding and addressing the COVID-19 global pandemic.

Since 1999, the Research Collaboratory for Structural Bioinformatics Protein Data Bank [RCSB PDB; rcsb.org (1, 2) ] has been continuously funded by the National Science Foundation, the National Institutes of Health and the US Department of Energy to safeguard and nurture the PDB core archive and provide open access to PDB data. Efforts are organized around four user-oriented 'services', spanning data deposition, archive management and integration, data delivery and exploration, and outreach and education.

• Service 1--deposition, validation and biocuration: RCSB PDB and other members of the Worldwide Protein Data Bank (wwPDB) partnership [wwpdb.org (3)] support >40 000 data depositors around the world, ensuring completeness and accuracy of the ever-growing corpus of 3D biostructure data. A single, global system, OneDep (4), supports deposition of macromolecular crystallography (MX), nuclear magnetic resonance (NMR), 3D electron microscopy (3DEM) and micro-electron diffraction (ED) structures, experimental data and related metadata. Every structure is validated using community-established standards and quality metrics, reported in the wwPDB Validation Report (5), and expertly biocurated (6).
• Service 2--archive management and access: The wwPDB partners are jointly responsible for managing the PDB archive according to the FAIR principles (7). As the wwPDB archive keeper, RCSB PDB safeguards the archive and maintains the PDBx/mmCIF data dictionary (8, 9) that enables organizing and searching of archived data. Primary PDB data are stored on redundant, enterprise-grade storage capable of supporting growth in data size and complexity, and are backed up regularly. Programmatic access to PDB data is available via FTP and application programming interfaces (APIs). 3D structural information is integrated with >40 highly regarded, external scientific data resources.
• Service 3--data exploration: Tools for data searching, browsing, visualization, custom report generation and analysis are freely available on RCSB.org to many millions of data consumers worldwide. All features are supported by modern browsers without requiring additional software to download.
• Service 4--outreach and education: RCSB PDB develops outreach and educational resources focused on structural biology and its impact across the sciences. Features are updated regularly and made freely available to educators and their students on PDB101.RCSB.org.

The Customer Service Help Desk provides ongoing support to PDB data depositors and data consumers around the world. Targeted online surveys help identify user needs and serve as the 'Voice of the Customer'.

The Infrastructure Team works to ensure continued >99% service availability, 24 × 7 × 365. Status of RCSB PDB servers and APIs is monitored by the NS1 traffic management system (NS1.com). Continuously updated status information is publicly available at https://status.rcsb.org.

The enduring commitment of the RCSB PDB and its US funders reflects the critical importance of 3D biostructure data to basic and applied research in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. A significant software development project was undertaken to overhaul the information management services supporting RCSB.org since our last Nucleic Acids Research (NAR) Database Issue publication (10) . In this comprehensive redesign, we have taken greater advantage of our extensive metadata representation (8, 9) to provide a deeper and more semantically consistent view of content spanning the data life cycle from data deposition to data delivery.

The software overhaul involved decomposition of a mature, largely monolithic web application into an architecture composed of small services, each with a single, well-defined responsibility. Back-end search services include text and attribute search (https://www.elastic.co), sequence similarity (11), sequence motif search, structure similarity (12) and chemical similarity (eyesopen.com). A separate aggregation service is responsible for combining results from these multiple search modes. Data access services are provided through a new GraphQL (graphql.org) API. In addition to implementing a service-oriented back-end architecture, the website front end has adopted a modern and extensible front-end web framework (reactjs.org), while retaining the familiar look and feel of the RCSB PDB resource. The new website front end and the external programmatic users consume the same search (REST) and data access (GraphQL) services. Among the benefits of the architectural redesign are operational efficiencies, improved deployment scalability, reduced time for rollout of new features and bug fixes, and enabling of more proactive monitoring of service health. These architectural improvements will also allow for more economical future deployments using public cloud resources.

3D biostructure data are central to discovery and development of new drugs and vaccines to combat the COVID-19 global pandemic (2, 13, 14). The first structure of a SARS-CoV-2 protein (Nsp5, non-structural protein 5 or main protease), determined by Zihe Rao and Haitao Yang's research team at ShanghaiTech University, was publicly released on 5 February 2020, <1 month after the viral genome sequence was made available (15). More than 350 SARS-CoV-2 structures were released in the following 7 months (5 February to 31 August 2020; see http://RCSB.org/COVID19). Rapid access without cost or restrictions on usage to detailed molecular portraits of promising COVID-19 drug targets is facilitating small-molecule drug discovery efforts, including those targeting the main protease [Nsp5, PDB ID 6LU7 (15) 6WVN (20) ]. These and many more structures of COVID-19 drug discovery targets with bound inhibitors, among others, are being used by research teams in the biopharmaceutical industry and academe around the world. Equally important is open access to SARS-CoV-2 structures that are informing design of vaccines and passive immunization treatment strategies, including the virion surface glycoprotein spike protein [S-protein, 6VYB (21)] and its complexes with Fab fragments of neutralizing antibodies [e.g. 6W41 (22)]. Structures of other SARS-CoV-2 proteins [e.g. nucleocapsid N-protein, 6VYO (23) and 6YUN (24)] and their complexes with host factors [e.g. spike protein with angiotensin converting enzyme 2, 6M17 (25)] help explain the biological and biochemical mechanisms central to the pathogenicity of the virus. More generally, SARS-CoV-2 structural biology of the pandemic underpins, complements and synergizes with other types of studies, such as mapping of interactions between human and viral proteins (26).

As a comprehensive 3D biostructure data archive, the PDB contains other valuable clues to fighting the COVID-19 pandemic in the guise of structures of proteins from related coronaviruses. The 2003 outbreak of the severe acute respiratory syndrome was rapidly followed by structures of the SARS-CoV Nsp5 main protease [e.g. 1Q2W (27)]. As of 31 August 2020, >800 structures of SARS-CoV-2, SARS-CoV and other coronavirus proteins were freely available from the PDB archive. Structural comparisons of these viral proteins in 3D could be vital in furthering our understanding of coronaviruses as human pathogens, thereby facilitating discovery and development of new treatments and vaccines to contain the current pandemic and manage other coronavirus outbreak(s) that are likely to threaten humanity in the future.

During the first 8 months of 2020, the RCSB PDB processed a total of 4721 depositions from the Americas and Oceania and released 9137 new PDB structures from around the world into the public domain, increasing the total number of PDB structures to 168 093 as of 31 August 2020. Open access to these data, integrated with information from >40 external resources, empowers PDB data consumers to make far-reaching breakthroughs in basic and applied research and education across fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences (28, 29) .

The Basic search. The top of every RCSB.org website page displays the Basic search box, which provides simple searches of the millions of PDB data items indexed using ElasticSearch (https://www.elastic.co) and updated weekly. The simplest way to use this feature is to type in a four-character PDB ID (e.g. 6LU7) and hit return or click the 'magnifying glass' icon. Doing so will take the user directly to the corresponding Structure Summary Page for that entry (see below). When more than one PDB ID is entered (e.g. 1Q2W, 6LU7; each separated by a space or comma plus space), the system returns a List of the matched PDB IDs, each illustrated with a static Mol* (32) structure image (Figure 1A). Search results can also be rendered using several alternative views including Gallery, PDB IDs or Tabular Reports.
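The routing of Basic search input described above can be illustrated with a minimal Python sketch. It is not the RCSB.org implementation; the function names and return labels are hypothetical, and the regular expression assumes the conventional PDB ID form (a digit from 1 to 9 followed by three alphanumeric characters), which is narrower than the "four-character" description given here.

```python
import re
from typing import List

# Assumption: a PDB ID is a digit 1-9 followed by three alphanumerics.
# The article itself only states that PDB IDs have four characters.
PDB_ID = re.compile(r"[1-9][0-9A-Za-z]{3}")


def parse_pdb_ids(text: str) -> List[str]:
    """Return upper-cased PDB IDs if every token looks like one, else []."""
    # IDs may be separated by a space or by a comma plus space.
    tokens = [tok for tok in re.split(r"[,\s]+", text.strip()) if tok]
    if tokens and all(PDB_ID.fullmatch(tok) for tok in tokens):
        return [tok.upper() for tok in tokens]
    return []


def classify_basic_search(text: str) -> str:
    """Hypothetical routing: one ID, several IDs, or free text."""
    ids = parse_pdb_ids(text)
    if len(ids) == 1:
        return "structure_summary"  # go straight to the Structure Summary Page
    if len(ids) > 1:
        return "id_list"            # show a List of the matched PDB IDs
    return "free_text"              # fall back to a text search


if __name__ == "__main__":
    for query in ("6LU7", "1Q2W, 6LU7", "SARS-CoV-2"):
        print(query, "->", classify_basic_search(query))
```

Requiring a leading digit keeps four-letter words entered as free text from being mistaken for identifiers.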

Free text strings can be entered into the top search box. Two pointers: First, for a full text search of an entire phrase, use double quotes and hit return or click the magnifying glass icon. Otherwise, structures containing any appearance of any text will be returned and may include false positives. Second, enter the search term and wait for the drop-down menu to appear (instead of immediately pressing return). This drop-down menu of related search options is updated weekly via ElasticSearch indexing. In the example shown in Figure 1B , entering SARS-CoV-2 into the search box yields access to entries that include the input search string in labeled fields, e.g. in Additional Structure Keywords, in Structure Title, in Structure Description, in Polymer Entity Title, in Source Organism Taxonomy Name, in Primary Citation Title and in Citation Title. Clicking on Source Organism Taxonomy Name yields the outcome shown in Figure 1C , providing access in this example to all currently available SARS-CoV-2 protein structures in the PDB archive listed in descending order of Score, which is a measure between 0 and 1 intended to reflect the degree of relevance with which the listed structure matches the input search term. The summary list returned by the website search system can also be reordered (ascending or descending) according to Release Date, PDB ID and Resolution (only for MX, 3DEM and ED structures).

Immediately adjacent to the summary list, the user is presented with opportunities to refine the outcome of the initial basic search by clicking one or more checkboxes to select menu items under Refinements (Figure 1A). The user can also click on the MyPDB Login button (see below), which allows for saving and retrieving searches.

Data organization hierarchy. All data stored in the PDB archive conform to the PDBx/mmCIF data dictionary (8, 9), from which two significant advantages accrue. First, data provenance and quality information from the originating data resource are faithfully preserved. Second, wherever possible, integrated data are indexed with respect to reference protein sequences maintained by either UniProtKB [https://www.uniprot.org (33)] or NCBI/RefSeq [https://www.ncbi.nlm.nih.gov/refseq/ (34)] at the level of individual amino acid residues. The practical benefit of these design features is the ability to perform complex searches across the informatics platform that simultaneously scrutinize plain text, complementary annotations, drug data, 1D sequence and 3D structures at the level of individual amino acid residues. Before describing additional features of the new RCSB.org website, we introduce the following definitions relevant to the way the atomic coordinates, experimental data and metadata are organized for each PDB structure:

• Entry: All data pertaining to a particular structure deposited in the PDB constitute an archival Entry, designated with a four-character alphanumeric identifier (PDB ID; e.g. 1Q2W).
• Entity: Each chemically unique molecule in the Entry is defined as an Entity. Entities may be polymers, branched or non-polymers. Every Entity is labeled with a unique Entity ID (numeric).
  • Polymer entities are composed of smaller chemical building blocks linked together by covalent bonds. Polymers may be proteins or polypeptides, DNA or polydeoxyribonucleotide, RNA or polyribonucleotide--identified by individually numbered amino acids and nucleotides covalently linked in the order defined by the polymer sequence.
  • Branched entities are either linear or branched carbohydrates and are composed of saccharide units covalently linked via one or more glycosidic bonds.
  • Non-polymer entities are small chemicals (enzyme cofactors, ligands, water molecules, etc.). Every non-polymer Entity is labeled with a wwPDB Chemical Component Dictionary (CCD) ID (35) (one- to three-character alphanumeric).
  Note: Every Entry in the PDB contains at least one polymer Entity or one branched Entity (either linear or branched oligosaccharides).
• Instance: There can be multiple Instances of a given Entity. Each Instance or 'copy' of a polymer Entity or a branched Entity is given a unique Chain ID (one or multiple alphanumeric characters, e.g. A, AA, ...). Non-polymer entities are identified by the Chain ID of the closest polymer Entity neighbor and their instances are distinguished with unique numbering.
• Assembly: Polymer Entity Instances or Chains frequently occur in nature as components of larger macromolecular Assemblies, ranging in size and complexity from simple protein homodimers [e.g. 1Q2W (27)] to whole ribosomes [e.g. 4V51 (36)] to the HIV nucleocapsid [e.g. 3J3Q (37)] to the faustovirus [e.g. 5J7V (38)]. Each assembly is assigned a unique Assembly ID (numeric; e.g. 1, 2, ...).

3D Structural Motif searching and Chemical Substructure searching developed by RCSB PDB (39) will be added as an Advanced Search capability in 2020.
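The Entry, Entity, Instance and Assembly definitions given above can be made concrete with a small Python sketch. The class and field names are hypothetical and do not reproduce the PDBx/mmCIF schema. The Chain IDs A and B used for the 1Q2W homodimer are illustrative assumptions, and the `PDBID.CHAIN` label follows the 6VXX.A style used elsewhere in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    entity_id: int   # numeric Entity ID
    kind: str        # "polymer", "branched" or "non-polymer"
    # One Chain ID per Instance ("copy") of this Entity.
    chain_ids: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    assembly_id: int       # numeric Assembly ID
    chain_ids: List[str]   # Instances that make up the Assembly


@dataclass
class Entry:
    pdb_id: str  # four-character alphanumeric PDB ID
    entities: List[Entity]
    assemblies: List[Assembly] = field(default_factory=list)

    def validate(self) -> None:
        """Check the two rules stated in the definitions above."""
        if len(self.pdb_id) != 4 or not self.pdb_id.isalnum():
            raise ValueError("PDB ID must be four alphanumeric characters")
        if not any(e.kind in ("polymer", "branched") for e in self.entities):
            raise ValueError("Entry needs a polymer or branched Entity")

    def instance_labels(self) -> List[str]:
        """Labels such as '1Q2W.A' for polymer and branched Instances."""
        return [
            f"{self.pdb_id}.{chain}"
            for entity in self.entities
            if entity.kind != "non-polymer"
            for chain in entity.chain_ids
        ]


if __name__ == "__main__":
    # Illustrative homodimer: one polymer Entity present as two Instances.
    dimer = Entry(
        pdb_id="1Q2W",
        entities=[Entity(entity_id=1, kind="polymer", chain_ids=["A", "B"])],
        assemblies=[Assembly(assembly_id=1, chain_ids=["A", "B"])],
    )
    dimer.validate()
    print(dimer.instance_labels())
```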

An example of using the Advanced Search Query Builder that combines Attribute, Structure Similarity, Sequence Similarity and Chemical searching is depicted in Figure 2. Attribute searching detected 803 structures with Source Organism Taxonomy Name = Coronaviridae. Sequence Similarity searching detected 297 structures that are ≥50% identical to the sequence of PDB ID 1Q2W (SARS-CoV Nsp5). Structure Similarity searching detected 3042 structures similar in structure (relaxed) to PDB ID 6LU7 (SARS-CoV-2 Nsp5). Chemical searching (Graph Relaxed Stereo) detected three structures matching the SMILES string for a small-molecule inhibitor designated 7J (wwPDB CCD Identifier QYS). Employing the Boolean operator AND yielded only three structures matching all criteria, including co-crystal structures with the desired bound inhibitor (7J/QYS) for SARS-CoV-2 Nsp5 [6XMK (40)], SARS-CoV Nsp5 [6W2A (40)] and MERS-CoV Nsp5 [6VH3 (40)]. Raising the sequence identity cutoff from 50% to 80% yields only two structures matching all criteria (MERS-CoV Nsp5 is distantly related to its SARS-CoV and SARS-CoV-2 homologs). Advanced Search results can be displayed as Structures (or Entities), individual polymer Entities, Assemblies or non-polymer Entities.
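For programmatic users, a combined query of this kind could be expressed as a tree of search nodes joined by a Boolean operator. The Python sketch below only builds such a request body and does not send it. The JSON layout, service names, attribute name and parameter keys are assumptions based on recollection of the public RCSB PDB Search API documentation and should be checked against the current documentation. The choice of Assembly ID 1 for 6LU7 is also an assumption, and the sequence and SMILES values are placeholders supplied by the caller.

```python
import json
from typing import Any, Dict


def terminal(service: str, **parameters: Any) -> Dict[str, Any]:
    """A leaf node: one search mode with its own parameters."""
    return {"type": "terminal", "service": service, "parameters": parameters}


def group(operator: str, *nodes: Dict[str, Any]) -> Dict[str, Any]:
    """An inner node joining child nodes with a Boolean operator."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": operator, "nodes": list(nodes)}


def build_query(sequence: str, smiles: str,
                identity_cutoff: float = 0.5) -> Dict[str, Any]:
    """Combine the four searches from the worked example with AND.

    All parameter names below are assumptions, not taken from the article.
    """
    nodes = [
        terminal("text",
                 attribute="rcsb_entity_source_organism.taxonomy_lineage.name",
                 operator="exact_match", value="Coronaviridae"),
        terminal("sequence", value=sequence, sequence_type="protein",
                 identity_cutoff=identity_cutoff),
        terminal("structure",
                 value={"entry_id": "6LU7", "assembly_id": "1"},
                 operator="relaxed_shape_match"),
        terminal("chemical", value=smiles, type="descriptor",
                 descriptor_type="SMILES", match_type="graph-relaxed-stereo"),
    ]
    return {"query": group("and", *nodes), "return_type": "entry"}


if __name__ == "__main__":
    # Placeholders: pass the real query sequence and inhibitor SMILES here.
    body = build_query(sequence="<protein sequence>", smiles="<SMILES>")
    print(json.dumps(body, indent=2))
```

Raising `identity_cutoff` from 0.5 to 0.8 corresponds to the stricter sequence identity threshold discussed above.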

The top of the Advanced Search page also presents additional tabs providing access to the current session search History, a tool to Browse Annotations mapped to PDB structures, the MyPDB service and Help documentation.

History. The Search History page displays up to 50 searches, beginning with the most recent. The Search History will persist for as long as the current browser tab remains open. To permanently save a search query, after logging in to MyPDB (see below), click the 'Save to MyPDB' button.

Browse annotations. This browser system enriches the user experience by offering access to PDB structures (updated weekly), organized by annotations integrated from external data resources (identified with an orange banner) or from RCSB PDB (i.e. Protein Symmetry, identified with a blue banner). The entire hierarchy of annotation categories and subcategories populating a particular Browse Annotations tab can also be accessed by entering a word or phrase in the search box positioned immediately above the major categories. This feature can also be reached from any RCSB.org web page by clicking on Browse Annotations immediately below the top search box.

MyPDB. This long-standing RCSB PDB feature enables users to store PDB searches for re-use. It also supports an automated query service, wherein users receive regular emails when structures that match customized queries are publicly released into the PDB archive. Generation of Tabular Reports of search results is also possible, including a list of PDB IDs, various predefined reports and user-generated custom reports. Finally, Download Selected Files is offered as a single button click option for bulk download of structure data files (in both legacy PDB and PDBx/mmCIF formats) and experimental data files (MX and limited NMR data only). (Note: Because of limitations in the legacy PDB format, some Entries are only available in the newer PDBx/mmCIF format. Users are strongly encouraged to download and use PDBx/mmCIF files instead of relying on the legacy PDB file format.)

Many RCSB PDB website pages contain 'query-by-example' links to the Search Results page. For example, on the Structure Summary Page (below), which provides detailed information about a specific PDB structure, each listed author name is a link that launches a search for all structures for which that author is listed as a PDB deposition author.

Structure Summary Page. Once a structure of interest has been identified using either Basic or Advanced search, clicking on an individual PDB ID takes the user to the redesigned RCSB PDB Structure Summary Page for that particular structure (Figure 3).

• The top-line summary (Figure 3A) provides the title of the PDB Entry (a.k.a. structure) with a wwPDB digital object identifier (DOI) that serves as a machine-readable citation of each PDB ID (e.g. DOI: 10.2210/pdb6VXX/pdb for 6VXX), while providing access to the atomic coordinates, experimental data and various metadata items with deposition and depositor information. Immediately below the top-line summary, there is a summary of the experiment used to determine the structure and a graphical summary of the wwPDB Validation Report. The validation 'slider' graphic visually displays percentile scores that compare the validated structure to the entire PDB archive. For each metric, two percentile ranks (tick marks) are calculated: an absolute rank with respect to the entire PDB archive and a relative rank. Tick marks in the blue side of the scale are considered 'better' than those on the red side (worse). These images link to the full details report in PDF or XML, and can also be mapped in the 3D viewer Mol*. On the top right-hand portion of the top-line summary, two means of accessing PDB data files are provided. Display Files gives direct views of the FASTA Sequence and the PDB Format (legacy) and mmCIF Format atomic coordinate files (both Header and full File coordinate contents).
• For most PDB structures, the Literature box (Figure 3A) provides the Primary Citation information for the structure with the PubMed Abstract and opportunities to Download Primary Citation (Mendeley format). When no primary literature publication is available (∼20% of PDB structures), users are encouraged to cite the structure using the wwPDB DOI provided in the top-line summary.
• The Macromolecules box ( Figure 3B) grated from external resources or a means of accessing the external resource, such as UniProtKB.) Immediately below these three buttons, users can see the Protein Feature View for the Entity. Two options enable display of Reference Sequence numbering conforming to that present in the deposited structure or that present in the UniProtKB reference sequence when available (6VXX_1 versus P0DTC2 in Figure 3B). Within the Protein Feature View, there is an active area that enables zooming in to examine the polymer Entity sequence and traversing its entire length [mouse over the information (i) icon for instructions regarding using mouse or trackpad]. Whenever relevant, ARTIFACT, MUTATIONS, MODIFIED MONOMER and/or UNOBSERVED rows appear below the Entity sequence, respectively, indicating the presence of a cloning artifact, mutated or modified residues, or segments of the polymer sequence that are not represented in the atomic coordinates for each Instance of the Entity labeled with Chain ID (e.g. 6VXX.A in Figure 3B).

Additional tabs present at the top of each Structure Summary Page are described below:

• 3D View tab, which launches the web-native Mol* 3D molecular viewer described below (32) (47)). The origin of the information presented in each row is color coded by the bar to the right of the row label as either being computed by RCSB PDB from PDB data (blue) or integrated from an external resource (orange).

Mousing over each symbol in any row provides provenance information on the top right of the view.

The new RCSB.org website utilizes three tools that enable execution of highly complex searches across the PDB archive in real time.

Sequence Similarity searching across the PDB archive. Previous versions of the RCSB.org website utilized the well-known BLAST method (48) for identifying PDB entries containing similar protein and nucleic acid sequences. With ∼10% year-on-year growth of the PDB archive, this option for sequence searching across the PDB archive became too slow. RCSB.org now uses the MMseqs2 method (11), which achieves ∼11 times faster performance at comparable levels of sensitivity based on testing with the PDB archive. Rapid searches (specifying E-value or sequence identity cutoffs) can be performed in two ways:

• By PDB ID and Entity: Typing a PDB ID into the PDB ID text box and selecting the desired Entity from the pulldown menu yields all PDB structures containing polymer Entities similar in sequence.
• By Sequence: Paste the sequence in one-letter code format into the Sequence text box, after removing all extraneous information such as FASTA headers. Note: Sequences must be longer than 20 residues. Shorter sequences (e.g. purification tags and antibody epitopes) can be identified using Sequence Motif searching (see below).
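The input preparation described for the By Sequence option (stripping FASTA headers and requiring more than 20 residues) can be sketched in Python as follows. The function name is hypothetical, and the toy sequence in the usage example is simply the 20 standard one-letter codes followed by five more, not a real protein.

```python
def clean_sequence(text: str, min_length: int = 21) -> str:
    """Strip FASTA headers and whitespace; enforce the length rule.

    'Longer than 20 residues' is implemented as a minimum of 21.
    """
    lines = [line.strip() for line in text.splitlines()]
    # Drop header lines (starting with '>') and empty lines.
    body = "".join(line for line in lines if line and not line.startswith(">"))
    residues = "".join(body.split()).upper()
    if not residues.isalpha():
        raise ValueError("sequence must contain only one-letter codes")
    if len(residues) < min_length:
        raise ValueError(
            "sequence too short; use Sequence Motif searching instead"
        )
    return residues


if __name__ == "__main__":
    fasta = ">toy example\nACDEFGHIKL\nMNPQRSTVWY\nACDEF\n"
    print(clean_sequence(fasta))
```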

'Display Results as' can be set to 'Polymer Entities' to display Sequence Identity, E-value and matched Region, and viewed as an interactive alignment of the matched Region (corresponding to PDBx/mmCIF file numbering). The interactive alignment can be adjusted between Query (showing Query sequence on top), Subject (showing matched sequence on top) and Pairwise (showing both sequences). The blue bar designating the matched Region is shaded for sequence identity from light blue (higher) to dark blue (lower), with pink dots indicating sequence differences.

Sequence Motif searching across the PDB archive. Entries containing similar short sequence patterns can be identified with the Sequence Motif search. Searches can be launched by selecting one of the following modes:

• Simple: Sequence queries using one-letter amino acid codes (e.g. MQTIF) plus 'X' to indicate any amino acid at a position (e.g. use XPPXP to search for SH3 domain recognition sites corresponding to polyproline type II helices, where X is any residue and P is proline).
• PROSITE: Complex queries that include ambiguities, exempt amino acid residues, repetition and/or positioning at N- or C-terminus can be expressed using PROSITE (49) patterns (e.g. [AC]-X-V-X(4)-{ED}).
• RegEx: Highly complex queries can be built using so-called regular expressions (https://en.wikipedia.org/wiki/Regular_expression).

'Display Results as' can be set to 'Polymer Entities' to view the numbering for the sequential sequence match region (corresponding to PDBx/mmCIF file numbering).
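The relationship between the three motif modes can be illustrated by translating Simple and PROSITE patterns into regular expressions. The Python sketch below handles only a subset of PROSITE syntax (wildcards, allowed sets, excluded sets, fixed or ranged repetition and terminal anchors). It is an illustration under those assumptions and not the RCSB.org implementation.

```python
import re

# One PROSITE element: [ABC], {ABC} or a single letter, with optional (n) or (n,m).
_ELEMENT = re.compile(
    r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?"
)


def simple_to_regex(motif: str) -> str:
    """Simple mode: 'X' stands for any residue."""
    return motif.upper().replace("X", ".")


def prosite_to_regex(pattern: str) -> str:
    """Translate a (subset of) PROSITE pattern into a regular expression."""
    body = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if body.startswith("<"):   # anchored at the N-terminus
        prefix, body = "^", body[1:]
    if body.endswith(">"):     # anchored at the C-terminus
        suffix, body = "$", body[:-1]
    parts = []
    for element in body.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        core = core.upper()
        if core == "X":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return prefix + "".join(parts) + suffix


if __name__ == "__main__":
    print(simple_to_regex("XPPXP"))
    print(prosite_to_regex("[AC]-X-V-X(4)-{ED}"))
```

With these translations, a single regular expression engine can serve all three modes, which is one plausible reason to offer RegEx as the most general option.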

Owing to the size and complexity of the PDB archive, previous versions of the RCSB.org website supported Structure Similarity searches using an approach that limited the actual search process to an ensemble of representative structures each extracted from precomputed clusters of similar structures. RCSB PDB recently developed a computationally efficient method based on Zernike polynomials for Structure Similarity searching that supports real-time searches across the entire PDB archive (12). The new system assesses global 3D shape similarity using BioZernike descriptors that capture the global volumetric shape of the protein. This feature can be used to:

• search for polymeric chains that are similar to a given polymer Entity Instance (i.e. Chain ID), or
• search for Assemblies that are similar to a given Assembly (i.e. Assembly ID).

For either type of search, it is possible to choose between two modes of matching:

• Strict: Appropriate for ensuring that all matches are relevant (at the risk of missing some more distant matches).
• Relaxed: Appropriate for identifying all similar matches (at the risk of including some false positives in the search results).

PDB data are used extensively by millions of researchers, educators and students around the world. Structure data files downloaded directly from the PDB archive totaled ∼840 million in 2019 (∼2.3 million file downloads/day). This statistic underestimates PDB data utilization, because many PDB data consumers access archival data through third parties. Review of the 2020 NAR Database Issue revealed that 13 of the 59 newly reported databases (∼22%) utilize PDB data. Those additions bring the total number of external databases utilizing PDB data to 449 databases out of a total 1637 (∼27%) reported by NAR. Finally, these metrics also fail to take into account that all major biopharmaceutical companies worldwide maintain copies of the PDB archive inside their firewalls for use with proprietary structure information generated within the company.

Mol* is a web-native 3D molecular viewer developed by an open-source software development collaboration involving RCSB PDB, Protein Data Bank in Europe (PDBe, EMBL-EBI, Hinxton, UK) and the Central European Institute of Technology (Brno, Czech Republic) (32) . This new viewer enables rapid visualization of macromolecular structures and their corresponding data, together with high-quality rendering within the browser window. It does so without the need to download and periodically update external software.

The speed of Mol*, enabling visualization of even the largest PDB structures in modern browsers on laptop/desktop computers and mobile devices, is achieved through the use of binary CIF files (54), which are available as static files or delivered from the ModelServer and VolumeServer. This compressed format, delivering only the data that are required for image rendering, ensures fast loading of model and map data from PDB structures and Electron Microscopy Data Bank maps (https://www.ebi.ac.uk/pdbe/emdb/). In addition to its speed, Mol* has a powerful rendering engine, enabling high-quality visualization of molecular structures in various representations. Mol* is now used for software rendering of static images throughout the RCSB.org website.

Newly developed Mol* features include alignment and display of superimposed structures, viewing of ligand-binding environments, measuring distances within a structure, highlighting particular structural regions using the sequence display, changing the representation of particular residues, displaying symmetry-related molecules and displaying electron density or 3DEM maps to help visualize how well coordinate data fit the underlying density. A new Mol* User Guide is now available on RCSB.org (https://www.rcsb.org/3d-view/molstar/help).

PDB-101, the educational arm of the RCSB PDB (PDB101.RCSB.org), hosted 665 958 users in 2019. The primary feature of PDB-101, the Molecule of the Month series (30, 31) , received nearly 1 million page views during that time period. While topical features such as Zika virus and opioid receptors drew large audiences over the short term, articles related to topics commonly addressed in high school classrooms such as hemoglobin and catalase continue to be frequently accessed every year. 2020 marks the 20th anniversary of this popular feature, which was launched with a feature on myoglobin. The series continues to broaden in scope and increase in impact by building on the flood of new biomolecular topics recently made available in the PDB through technological advances such as 3DEM and X-ray free electron lasers.

More recently, PDB-101 has developed materials to disseminate COVID-19 information beyond the research community. As noted by an RCSB PDB Twitter follower, 'SARS-CoV2 is not an invisible enemy but rather one that needs special tools to see by overcoming the limitations of the human eye' (55). These PDB-101 videos, paintings, curricula, activities and other related features (http://pdb101.rcsb.org/browse/coronavirus) help students, educators and members of the curious public better visualize and understand the virus. In this work, we are constantly exploring new modalities for reaching non-technical communities. For example, PDB-101 hosted a Coronavirus Challenge, wherein visitors were tasked with creating accurate or artistic representations of SARS-CoV-2 using the digital cell painting program CellPaint (56). A coloring activity presented early in the pandemic was widely used by children and adults to explore and understand the structure of the virus as it was making headlines around the world.

The PDB was founded as the first open-access digital data resource in all of biology in 1971 (57) following a landmark Cold Spring Harbor Symposium (58). The structural biology community came together and decided that unfettered data sharing would accelerate technology developments and broaden the impact of the field on researchers, educators and students around the world. Starting with just 7 X-ray structures, the PDB archive now holds ∼170 000 structures determined by MX, NMR, 3DEM and ED. In many respects, sheer growth in numbers has been eclipsed by the pivotal roles that PDB structures are playing today in research and education across fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. The PDB archive is recognized as being comprehensive, authoritative and of high scientific quality. It has been certified by the CoreTrustSeal (https://www.coretrustseal.org) as a Trustworthy Data Repository and is regarded as a gold-standard exemplar of open-access archive for biological data as a public good.

In recognition of the importance of long-term preservation of biostructure data, the wwPDB partnership was established in 2003 (59). Current members include the RCSB PDB, PDBe (60), Protein Data Bank Japan (PDBj) (61) and Biological Magnetic Resonance Data Bank (62). wwPDB partners have organized a series of scientific meetings around the world to celebrate the golden jubilee of the PDB (Table 1). Details, including speakers and registration, will be updated at https://foundation.wwpdb.org. PDB data depositors and data consumers wishing to provide financial support for these PDB50 celebration meetings are encouraged to make tax-deductible donations to the Worldwide Protein Data Bank Foundation (https://foundation.wwpdb.org/donations.html).

Release of new RCSB.org tools for searching and exploring PDB data necessitated discontinuation of certain RCSB PDB Services. Legacy programmatic web services, implemented within our older monolithic architecture, are being retired. These services used XML as the data exchange format and were documented in free text. They have been replaced by new web services that use the simpler JSON (http://www.json.org) data exchange format, with REST APIs documented using the OpenAPI standard (https://data.rcsb.org/redoc/index.html). The legacy REST data access services have been replaced by the more versatile GraphQL API, which provides flexible access to the full RCSB PDB data schema (https://www.rcsb.org/pages/webservices).
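As a rough illustration of how a client might address the JSON-based services described above, the following Python sketch builds a REST URL for a single entry and a GraphQL request body for several entries. The host data.rcsb.org appears in the text; the paths `/rest/v1/core/entry/` and `/graphql`, the GraphQL field names (`entries`, `rcsb_id`, `struct.title`) and all helper function names are assumptions made for this example and should be checked against the OpenAPI and GraphQL documentation linked above.

```python
import json
from typing import Dict, List
from urllib.parse import quote
from urllib.request import Request, urlopen

# Host taken from the text; the two paths below are ASSUMPTIONS for illustration.
DATA_API_BASE = "https://data.rcsb.org"
REST_ENTRY_PATH = "/rest/v1/core/entry/"  # assumed REST path
GRAPHQL_PATH = "/graphql"                 # assumed GraphQL path


def entry_url(entry_id: str) -> str:
    """Return the (assumed) REST URL for one PDB entry; IDs are upper-cased."""
    return DATA_API_BASE + REST_ENTRY_PATH + quote(entry_id.strip().upper())


def graphql_payload(entry_ids: List[str]) -> Dict[str, str]:
    """Build a GraphQL request body asking only for each entry's ID and title."""
    ids = [entry_id.strip().upper() for entry_id in entry_ids]
    # json.dumps renders the ID list as a double-quoted list literal.
    query = "{ entries(entry_ids: %s) { rcsb_id struct { title } } }" % json.dumps(ids)
    return {"query": query}


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body and decode the JSON reply (performs network I/O)."""
    body = json.dumps(payload).encode("utf-8")
    request = Request(url, data=body, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def titles_from_reply(reply: dict) -> Dict[str, str]:
    """Map entry ID to title, skipping null items for unknown IDs."""
    entries = (reply.get("data") or {}).get("entries") or []
    return {e["rcsb_id"]: e["struct"]["title"] for e in entries if e}
```

Under these assumptions, a caller would pass `DATA_API_BASE + GRAPHQL_PATH` and the result of `graphql_payload(...)` to `post_json`, then hand the reply to `titles_from_reply`. Requesting only the fields that are needed is the main practical difference from fetching a complete entry document over REST.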

Recently introduced Search and Data APIs offer comprehensive functionality and high performance (https://www.rcsb.org/pages/webservices). Legacy RCSB PDB APIs (REST search and fetch) will be discontinued in late 2020. Importantly, the new RCSB PDB APIs enable access to the remediated carbohydrate data (released in July 2020). Legacy APIs do not support access to these remediated data. Users of the legacy APIs are strongly encouraged to migrate to the new APIs as soon as possible. Please contact RCSB PDB Customer Service with any questions about the new and improved APIs (email: info@rcsb.org).
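For users migrating from the legacy search service, a Boolean query for the new Search API can be assembled as a nested JSON document. The sketch below is one possible way to do so in Python; it only builds the request body and does not contact any server. The node layout (`terminal` and `group` nodes with a `logical_operator`), the attribute names and the operator names are assumptions based on our reading of the public Search API documentation, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable

Node = Dict[str, Any]


def terminal(attribute: str, operator: str, value: Any, service: str = "text") -> Node:
    """One search condition on a single attribute (assumed node layout)."""
    return {
        "type": "terminal",
        "service": service,
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def group(logical_operator: str, nodes: Iterable[Node]) -> Node:
    """Combine conditions or sub-groups with Boolean AND / OR."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


def search_request(query: Node, return_type: str = "entry") -> Node:
    """Wrap a query tree in a request body that asks for entry-level hits."""
    return {"query": query, "return_type": return_type}


# Illustrative query: SARS-CoV-2 structures solved by X-ray OR electron microscopy.
# Attribute and operator names below are ASSUMPTIONS, not taken from the article.
organism = terminal(
    "rcsb_entity_source_organism.taxonomy_lineage.name",
    "exact_match",
    "Severe acute respiratory syndrome coronavirus 2",
)
method = group("or", [
    terminal("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
    terminal("exptl.method", "exact_match", "ELECTRON MICROSCOPY"),
])
request_body = search_request(group("and", [organism, method]))
request_json = json.dumps(request_body)  # string to send to the Search API
```

Because groups can contain other groups, arbitrarily nested Boolean expressions can be composed from these two helpers. The resulting `request_json` string would be sent to the Search API endpoint given in the documentation linked above.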

REST.rcsb.org services were deprecated in May 2020. These services have been replaced by more powerful and comprehensive APIs that drive the new and improved RCSB.org.

Java-based tools Protein Workshop and Ligand Explorer were retired in June 2020. The 3D viewer Mol* supports features previously available from these tools.

RCSB PDB has evolved considerably since its first NAR Database Issue publication more than two decades ago (1). Throughout this process, the organization has been guided by the needs of a relentlessly growing and ever more diverse user community that today numbers many millions worldwide. Subsequent RCSB PDB NAR publications have described our journey (10,63-71). Others have summarized the results of various studies of the impact of the RCSB PDB and the PDB archive on research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences (2,13,14,28,29). Continuing in this vein, a recent RCSB PDB study documented that structural biologists and the PDB archive facilitated discovery and development of ∼90% of new small-molecule and biologic drugs approved by the US Food and Drug Administration (FDA) across all therapeutic areas during 2010-2016 (14). Structure-guided drug discovery, typically jump-started by open access to PDB structures of drug discovery targets deposited by academic researchers, played a particularly important role in producing small-molecule anticancer agents approved by the US FDA during 2010-2018 (13). Looking ahead, the PDB archive houses structures of many human proteins that represent drug targets of tomorrow (2), thereby ensuring that structural biologists and structure-guided approaches will continue to facilitate discovery and development of life-changing medicines for patients and their families. In the midst of the COVID-19 pandemic, the RCSB PDB is providing open access to hundreds of SARS-CoV-2 protein structures and an even larger number of other coronavirus proteins that provide valuable information regarding target druggability and starting points for wet- and dry-laboratory structure-guided drug discovery efforts, antiviral antibody engineering and vaccine design.

As the PDB archive enters its 50th year of operations, there is broad agreement that advances in basic and applied research depend critically on open access to the research findings of the scientific community, most of which were in fact bought and paid for with public and private philanthropic monies. Promulgation of the FAIR principles (7) and the work of non-governmental organizations such as the CoreTrustSeal (https://www.coretrustseal.org) are helping to raise awareness of the value of open sharing of data (72). Equally important going forward will be sustainable funding for open-access data resources such as the PDB archive that is commensurate with the central roles they play in our global biological and biomedical research and educational ecosystems (73,74).

RCSB PDB services are publicly available from http://RCSB.org.

The Protein Data Bank

RCSB Protein Data Bank: enabling biomedical research and drug discovery

Protein Data Bank: the single global archive for 3D macromolecular structure data

OneDep: unified wwPDB system for deposition, biocuration, and validation of macromolecular structures in the PDB archive

Validation of structures in the Protein Data Bank

Worldwide Protein Data Bank biocuration supporting open access to high-quality 3D structural biology data

The FAIR Guiding Principles for scientific data management and stewardship

3.6.2 The Protein Data Bank exchange data dictionary in International Tables for Crystallography

Definition and Exchange of Crystallographic Data

RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Real time structural search of the Protein Data Bank

Impact of Protein Data Bank on anti-neoplastic approvals

How structural biologists and the Protein Data Bank contributed to recent FDA new drug approvals

Structure of M(pro) from SARS-CoV-2 and discovery of its inhibitors

The crystal structure of papain-like protease of SARS CoV-2

Structure of the RNA-dependent RNA polymerase from COVID-19 virus

Delicate structural coordination of the severe acute respiratory syndrome coronavirus Nsp13 upon ATP hydrolysis

Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2

Center for Structural Genomics of Infectious Diseases (CSGID). (2020) Crystal structure of Nsp16-Nsp10 from SARS-CoV-2 in complex with 7-methyl-GpppA and S-adenosylmethionine

Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein

A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV

Crystal structure of RNA binding domain of nucleocapsid phosphoprotein from SARS coronavirus 2

1.45 angstrom resolution crystal structure of C-terminal dimerization domain of nucleocapsid phosphoprotein from SARS-CoV-2

Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Company says it mapped part of SARS virus. The New York Times

Analysis of impact metrics for the Protein Data Bank

Impact of the Protein Data Bank across scientific disciplines

Insights from 20 years of the molecule of the month

The RCSB PDB 'Molecule of the Month': inspiring a molecular view of biology

Mol*: towards a common library and tools for web molecular graphics

UniProt: a worldwide hub of protein knowledge

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

The chemical component dictionary: complete descriptions of constituent molecules in experimentally determined 3D macromolecules in the Protein Data Bank

Structure of the 70S ribosome complexed with mRNA and tRNA

Mature HIV-1 capsid structure by cryo-electron microscopy and all-atom molecular dynamics

Structure of faustovirus, a large dsDNA virus

Real-time structural motif searching in proteins using an inverted index strategy

3C-like protease inhibitors block coronavirus replication in vitro and improve survival in MERS-CoV-infected mice

Implementation of GlycanBuilder to draw a wide variety of ambiguous glycans

SCOP2 prototype: a new approach to protein structure mining

SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures

CATH: an expanded resource to predict protein function through structure and sequence

The Pfam protein families database: towards a more sustainable future

The Gene Ontology Resource: 20 years and still GOing strong

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation

Basic local alignment search tool

PROSITE: a documented database using patterns and profiles as motif descriptors

The GTEx Consortium atlas of genetic regulatory effects across human tissues

A conditional knockout resource for the genome-wide study of mouse gene function

Pharos: collating protein information to shed light on the druggable genome

DrugBank 5.0: a major update to the DrugBank database

BinaryCIF and CIFTools--lightweight, efficient and extensible macromolecular data management

Integrative illustration for coronavirus outreach

CellPAINT: interactive illustration of dynamic mesoscale cellular environments

Crystallography: Protein Data Bank

Cold Spring Harbor Symposia on Quantitative Biology

Announcing the worldwide Protein Data Bank

PDBe: towards reusable data delivery infrastructure at Protein Data Bank in Europe

New tools and functions in data-out activities at Protein Data Bank Japan (PDBj)

BioMagResBank (BMRB) as a partner in the Worldwide Protein Data Bank (wwPDB): new policies affecting biomolecular NMR depositions

The PDB data uniformity project

The Protein Data Bank: unifying the archive

The Protein Data Bank and structural genomics

The distribution and query systems of the RCSB Protein Data Bank

The RCSB PDB information portal for structural genomics

The RCSB Protein Data Bank: redesigned web site and web services

The RCSB Protein Data Bank: new resources for research and education

The RCSB Protein Data Bank: views of structural biology for basic and applied research and education

The RCSB Protein Data Bank: integrative view of protein, gene and 3D structural information

Landscape of innovation for cardiovascular pharmaceuticals: from basic science to new molecular entities

Data management: a global coalition to sustain core data

Towards coordinated international support of core data resources for the life sciences

RCSB PDB is a member of the Worldwide Protein Data Bank (wwPDB.org). We thank the 40 000+ PDB data depositors who have contributed to the growth of the PDB archive since 1971 and gratefully acknowledge contributions from past members of the RCSB PDB team and our Worldwide Protein Data Bank partners. We appreciate ongoing guidance from our Advisory Committee and from our Director Emerita Helen M. Berman.