key: cord-0001060-s90geszi authors: Lang, Dorothy M.; Zemla, A. T.; Zhou, C. L. Ecale title: Highly similar structural frames link the template tunnel and NTP entry tunnel to the exterior surface in RNA-dependent RNA polymerases date: 2012-12-25 journal: Nucleic Acids Res DOI: 10.1093/nar/gks1251 sha: 8dc8c40ee8b61c5660b15c9e81e928eb3dc44ee5 doc_id: 1060 cord_uid: s90geszi RNA-dependent RNA polymerase (RdRp) is essential to viral replication and is therefore one of the primary targets of countermeasures against these dangerous infectious agents. Development of broad-spectrum therapeutics targeting polymerases has been hampered by the extreme sequence variability of these sequences. RdRps range in length from 400–800 residues, yet contain only ∼20 residues that are conserved in most species. In this study, we made structure-based comparisons that are independent of sequence composition using a recently developed algorithm. We identified residue-to-residue correspondences of multiple protein structures and created (two-dimensional) structure-based alignment maps of 37 polymerase structures that provide both sequence and structure details. Using these maps, we determined that ∼75% of each polymerase species consists of seven protein segments, each of which has high structural similarity to segments in other species, though they are widely divergent in sequence composition and order. We define each of these segments as a ‘homomorph’, and each includes (though most are much larger than) the well-known conserved polymerase motifs. All homomorphs contact the template tunnel or nucleoside triphosphate (NTP) entry tunnel and the exterior of the protein, suggesting they constitute a structural and functional skeleton common among the polymerases. The polymerase protein family has been studied extensively for >40 years. This interest has been motivated by their unique function-to replicate all forms of life, and confounded by their sequence diversity. 
As more tertiary structures of polymerases were solved, it became apparent that widely diverse sequences form highly similar structures. Until recently, there was no time-effective computational method for making detailed comparisons of these observations. The objective of this study was to clarify the relationship between structure and sequence in a group of RNA-dependent RNA polymerases (RdRps) that replicate many of the viruses that represent significant threats to life throughout the world. We selected well-studied species in order to maximize the amount of experimental data that could be used to evaluate the association of functional residues and structure (Table 1). We used the StralSV algorithm (1) to perform structure comparisons between all of the selected species. We created maps of residue-to-residue (R2R) correspondence from which we determined the boundaries of structurally similar segments, which we named 'homomorphs'. In contrast to the relatively short lengths of previously described motifs, we found that most homomorphs are long, and each provides a structural connection between the template tunnel or NTP entry tunnel and the exterior of the protein. The tertiary structure of the replicative unit of most RdRps is highly conserved (2). It resembles that of a right-handed palm, with finger-like folds curved inward to form a tunnel that encircles the template that is being processed (3). Most single-unit polymerases are 400-800 residues in length. Early polymerase studies found that ∼22 residues are highly conserved in all polymerases, and in most species they are in the same sequential order (4). Some are clustered, with two to four highly conserved residues within a segment ∼10 residues in length. The sequence segment that includes each of the highly conserved residues or clusters has been described as a motif. The motifs are arranged in most species in the order: G-F1-F2-F3-A-B-C-D-E. 
The birnaviruses (IBDV and IPNV) differ from this scheme due to a transversion involving the C Motif (C-A-B) (5). The references for each of the motifs and species that have been studied are listed in Table 2; these references were selected because they included either alignments of one or more motifs with several species, or alignments for a particular motif not found elsewhere. Apart from these conserved motifs, the RdRp sequences are highly variable. An extensive study of picornaviruses by Koonin et al. (6) illustrates this variability. The basis of Koonin et al.'s study was an alignment of 64 species using the algorithm multiple sequence comparison by log expectation (MUSCLE) (7) and a manual adjustment. The alignment included four species that are also in our sample group, and these had a total of 32 conserved residues within sequences ∼550 residues in length. The analysis that we present in the following pages demonstrates that highly similar structures can be formed by very different sequences. Structure comparison, rather than sequence comparison, enabled us to readily recognize functionally significant segments of similarity and difference between sequences.
*To whom correspondence should be addressed. Tel: +1 209 839 1880; Fax: +1 925 423 6437; Email: dorothylang@gmail.com. Correspondence may also be addressed to A. T. Zemla. Tel: +1 925 423 5571; Fax: +1 925 423 6437; Email: zemla1@llnl.gov
In total, 18 well-studied viral species with solved RNA polymerase structures and four viral species with solved DNA polymerase structures were selected for analysis. We used the StralSV algorithm (http://proteinmodel.org/), described in detail previously (1), to perform the analyses. StralSV compares the R2R (structural) correspondence of each sequence in a set of reference sequences to a specified query sequence, beginning at the start of the sequence and continuing to the end, by evaluating successive overlapping segments of a user-selected length. 
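As a rough illustration of the scanning scheme just described (successive overlapping segments of a user-selected length), the sketch below enumerates window positions along a query. It is not StralSV code: the function name `overlapping_windows` and the `step` parameter are hypothetical, and the source states only that the segments overlap and that their length is chosen by the user.

```python
from typing import List, Tuple


def overlapping_windows(seq_len: int, window: int, step: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) positions of successive windows.

    `step` is an assumed parameter: any step smaller than `window`
    yields overlapping segments.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # The last window must end at or before seq_len.
    return [(start, start + window - 1)
            for start in range(1, seq_len - window + 2, step)]


# Toy query of 10 residues scanned with 4-residue windows, moving 2 residues at a time.
print(overlapping_windows(10, 4, step=2))  # [(1, 4), (3, 6), (5, 8), (7, 10)]
```

A query shorter than the window yields no segments in this sketch; how the actual program treats that case is not stated in the text.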
Each of the selected structures was used as a query to all structures available in the Protein Data Bank (PDB release 2011_01_25; number of chains 176 365) (8) . The results were filtered for structural segments of at least 55% LGA_S structure similarity (1) to at least one query segment of 90 amino acids in length (size cutoff for the structural context) from which R2R correspondences were extracted from local tightly superimposed spans: continuous segments of the minimum length of five amino acids. These parameters together contributed to the identification of common regions of structure similarity, which were used to distinguish regions of conservation (structure matches) from regions in which structure deviates (non-matches). The StralSV comparison of each initial query to all structures in PDB resulted in the identification of a final set of 37 PDB structures that were used as a reference polymerase structure set in this study. One representative of each species within the set of 37 PDB structures was used to create an all-against-all structure comparison using StralSV. The species and PDB identities of this reference set are summarized in Table 1 . The output from the all-against-all structure comparisons was parsed to extract R2R correspondences for each query/template pair. In each comparison the full query sequence was represented. At some positions in the template sequences, gaps occurred either due to structure deviation exceeding the alignment cutoff (5 Å ) or because the template contained additional residues (e.g. a loop) without correspondence in the query structure. A structure map was created for each set of R2R correspondences derived from each query/template-set StralSV comparison by combining the data for each binary alignment in an Excel spreadsheet. 
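The span-extraction step described above (continuous segments of at least five residues, with deviations beyond the 5 Å cutoff treated as gaps) can be sketched as follows. This is a simplified illustration under stated assumptions, not the StralSV implementation: the input is a hypothetical list of per-residue distances after superposition, with `None` marking a residue that has no counterpart, and the LGA_S similarity filter is not reproduced.

```python
from typing import List, Optional, Tuple


def tight_spans(distances: List[Optional[float]],
                cutoff: float = 5.0,
                min_len: int = 5) -> List[Tuple[int, int]]:
    """Return 0-based, inclusive (start, end) index pairs of runs of
    consecutive residues whose distance is within `cutoff`, keeping
    only runs of at least `min_len` residues."""
    spans = []
    run_start = None
    for i, d in enumerate(distances):
        matched = d is not None and d <= cutoff
        if matched and run_start is None:
            run_start = i                      # a new run begins here
        elif not matched and run_start is not None:
            if i - run_start >= min_len:       # run ended at i - 1
                spans.append((run_start, i - 1))
            run_start = None
    # Close a run that reaches the end of the sequence.
    if run_start is not None and len(distances) - run_start >= min_len:
        spans.append((run_start, len(distances) - 1))
    return spans


# Invented distances (in Å) for a 15-residue toy alignment.
example = [1.2, 0.8, 2.0, 1.1, 0.9, 6.3, 1.0, 1.4, None,
           0.7, 0.6, 0.9, 1.3, 2.2, 4.9]
print(tight_spans(example))  # [(0, 4), (9, 14)]
```

In the toy input, the two-residue run at indices 6-7 is discarded because it is shorter than the five-residue minimum.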
In this article, we report the structure map using poliovirus RdRp as the primary query (Figures 2-8), although in most cases we include structure maps with other species as query (Figures 2 and 4-8). Alternative query species were used when structure similarity of some of the templates was not identified with poliovirus as query (e.g. Motif G in WNV and DENV was identified based on DENV as the query). The query that contributes to non-poliovirus matches is indicated on each alignment ('q' following the species abbreviation).
[Table 2: references for each motif, by species (including HRV, FMDV, NV, RHDV, HCV, HIV and REOV), drawn from Ferrer-Orta et al. (2), Poch et al., Pan et al., Love et al., Gorbalenya et al., Choi et al., Bressanelli et al. (36) and Butcher et al.]
The structure maps were used as the basis for the structure alignments described in this article. On all structure maps, we identified the Motifs A-F as described by Gong and Peersen (9) by coloring the background of the columns matching the residues of the motif orange and the background of the columns matching the highly conserved residues yellow. A similar coloring scheme was used for Motif G, except that it depicts the residues identified by Pan et al. (5) because Motif G was not specified in the Gong and Peersen study (9). 
On all structure maps, we colored the residues of the picornaviruses blue; the caliciviruses green; HCV and BVDV (flaviviruses) black; WN and DENV (flaviviruses) red; PHI6 black; REOV brown; ROTAV, IBDV, IPNV black; HIV purple; TERT, T7RNAP, N4 black; TAQ turquoise and T7DNAP black. The segment of conserved structure adjacent to each motif was determined from the StralSV maps. For each of the queries and each of the motifs, the location at which structure conservation of most species became discontinuous was noted. We defined the boundaries of a homomorph as the position at which the structural segment shared by all representatives in a set became discontinuous in more than two species. In all structure maps, the conserved segments that we identified based on StralSV R2R correspondences are colored light blue. We defined the homomorph of each motif as the segment consisting of the conserved motif plus the adjacent structurally conserved segments. The length of each homomorph varied somewhat depending on the query. For each query species, the start and end of the homomorphic segment of each motif were recorded, and a 20 × 20 matrix for each motif was generated to compile the data from each query (data not shown). This matrix was used to identify the minimum start and maximum end of each homomorph, and these values are summarized in Supplementary Table S1. These values are plotted in Figure 1, which illustrates the maximal expanse of the homomorphic segments that include each of the polymerase motifs. All of the tertiary structures were illustrated using the Cn3D program (10). Structural examination of the sequence motif regions yielded extended regions of structural conservation. We named each of these regions a 'homomorph', defined as a sequence segment that shares a highly similar tertiary structure with other species, independent of the sequence composition. We found that most of the homomorphs were at least twice as long as the corresponding sequence motif. 
The extent of this expansion is illustrated in Figure 1 . The length of each homomorph was determined separately for each species, using a single structure for each species as the query in a StralSV analysis. The identity of the start and end of each homomorph depends on the structural similarity to a given query. Therefore, there is some minor query-specific variability of the location of the ends of homomorphs that can be observed in Figure 1 . Within the homomorphs, non-matching residues can be used to identify minor differences between species. In most single-stranded RdRp (ss-RdRp) species (PV, COXS, HRV, FMDV, NV, RHDV, SAPV, HCV, BVDV, WN and DENV), the homomorphs of Motif G are the largest (median of 53 residues), followed by A (49), B (46), E (37), F3 (28), D (23), C (17), F1 (10) and F2 (8) . In double-stranded RdRps (ds-RdRps) (PHI6, REOV, ROTAV, IBDV and IPNV), the homomorphs of Motif G are relatively short (median of 12 residues). Several other homomorphs are also shorter in ds-RdRps: B (48), A (41), E (31), F3 (28), D (16), C (12), F1 (11) and F2 (2) . The lengths and occurrences of homomorphs of polymerases that are associated with DNA (HIV, TERT, TAQ, T7 DNAP, T7 RNAP and N4) are variable and will be discussed in the sections describing each motif. The homomorphs of all species are similarly distributed over the length of the polymerase ( Figure 1 ). The ss-RdRps (PV, COXS, HRV, FMDV, NV, RHDV, SAPV, HCV, BVDV, KUNJ and DENV) are most similar to each other. The spacing between homomorphs is more variable in the ds-RdRps (PHI6, REOV, ROTAV, IBDV and IPNV), and in general larger than in the ss-RdRps. The homomorph of Motif C (hmC) is identified in the birnaviruses despite a sequence inversion that places it before Motif A (5) . Relatively large segments between homomorphs occur in PHI6 between F1 and F3, and in KUNJ, DENV and PHI6 between B and C. The spacing between motifs is notably reduced in HIV and TERT (RdDps). 
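The boundary rule given in the Methods (a homomorph ends where the shared structural segment becomes discontinuous in more than two species) and the compilation of per-query results into a minimum start and maximum end can be sketched as below. This is a minimal illustration, not the authors' procedure, which was carried out on spreadsheet structure maps. The match matrix, the query labels and the names `homomorph_bounds` and `maximal_extent` are all hypothetical.

```python
from typing import Dict, List, Tuple


def homomorph_bounds(matches: Dict[str, List[bool]],
                     motif_start: int,
                     motif_end: int,
                     max_broken: int = 2) -> Tuple[int, int]:
    """Extend outward from a motif (0-based, inclusive columns) while at most
    `max_broken` species lack an R2R match in the next column."""
    n_cols = len(next(iter(matches.values())))

    def broken(col: int) -> int:
        # Number of species without a structural match at this column.
        return sum(1 for row in matches.values() if not row[col])

    start = motif_start
    while start > 0 and broken(start - 1) <= max_broken:
        start -= 1
    end = motif_end
    while end < n_cols - 1 and broken(end + 1) <= max_broken:
        end += 1
    return start, end


def maximal_extent(per_query: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Minimum start and maximum end over the boundaries found with each query."""
    return (min(s for s, _ in per_query.values()),
            max(e for _, e in per_query.values()))


# Invented map: 'x' marks a matched column, '.' an unmatched one.
toy = {
    "species_1": "xxxxxxxxxx",
    "species_2": ".xxxxxxxx.",
    "species_3": "..xxxxxx..",
    "species_4": "..xxxxxxx.",
}
matrix = {name: [c == "x" for c in row] for name, row in toy.items()}
print(homomorph_bounds(matrix, motif_start=4, motif_end=5))        # (1, 8)
print(maximal_extent({"query_a": (1, 8), "query_b": (2, 9)}))      # (1, 9)
```

Column 1 is kept because exactly two species are unmatched there, whereas columns 0 and 9 are excluded because three species are unmatched.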
In birnaviruses (IBDV, IPNV), the homomorphs of C and A are only three residues apart, and the distance between the homomorphs of Motif F3 and C is greater than the typical F3-B distance. Most homomorphs are separated from each other by a segment that contains a turn (secondary structure), or there is a turn at the beginning or end of the homomorph. Within all RdRps, all motifs occur within a length of 375 residues. In T7-DdRp (T7 RNAP) and N4, the motifs are spread out over approximately 600 residues. The amount of R2R correspondence for most of the RdRps, determined from the minimum and maximum values of all homomorphs, is ∼75% over the span from Motif G through Motif E (Supplementary Table S1). The structurally aligned sequences that comprise hmG are summarized in Figure 2A. The R2R correspondences of WN and DENV could not be evaluated for the Motif G region (approximately PV 101-121), as the structural configuration of the segments of these viruses that would be expected to match the homomorph of Motif G (hmG) segment had not been determined. Within the homomorph, most of the ss-RdRps were highly similar (Figure 2A, top). BVDV is similar to the other ss-RdRps in the N-terminal segment, but no longer matches them at the C-terminal segment. Only Motif G, and not a homomorph, was identified in PHI6, IBDV and IPNV. In the region of Motif G, StralSV did not identify R2R correspondences between any of the RNA polymerases and REOV, ROTAV, HIV, TERT or DNA-dependent polymerases (TAQ, T7 DNAP, T7 RNAP and N4). There were structural discontinuities within the homomorph (noted by x in Figure 2A) and similar discontinuities within the motif. These minor discontinuities identify species-specific differences within a segment that is otherwise highly continuous in several species. For example, FMDV has 3 AA between S87 and T90 (PV numbering) and therefore does not match the structure of PV 88-LD-89, which has only 2 AA. 
In contrast, WN and DENV have the same number of residues for the gaps from A388-R392 and G385-R389, respectively, but these segments were not structurally aligned by StralSV within the parameters used in this study. The segment PV-Y102 to A109 is a β-hairpin unique to picornavirus RdRps (11). The numbering on NV and HCV clarifies that these regions are continuous in these species (and the other caliciviruses, RHDV and SAPV, though not numbered). Figure 2B and C illustrate the tertiary structure of the homomorph using a poliovirus structure (PDB:1RA6). Most of the N-terminal segment is a single helix that extends over nearly half of the surface of the protein. Both ends of the homomorph terminate at the exterior surface of the protein. The distance between the homomorphs of Motifs G and F1 was 20-37 residues in the ss-RdRps (in all species where both were present) and longer in the ds-RdRps (median 47 residues) (Figure 1).
Figure 1. The number of residues from the start of the polymerase structure to the start of the first homomorph is identified for each species at the left of the chart. The PDB structures, and consequently, sequence position numbers for KUNJ, DENV and TAQ, which are used throughout this article, do not begin at the polymerase; therefore, for this figure, the distance from the start of the polymerase is shown after the slash. For species lacking Motif G, the first identified homomorph is indicated at the left of the start position. The length of the polymerase of each species is listed at the right of the chart.
Three components of Motif F have been recognized: F1, F2 and F3 (2, 5). In some species there are sequence segments between these motifs. In all species in our sample set that have R2R correspondence within Motif F, except PHI6, the three F motifs are continuous; therefore, we have combined them, and the adjacent structurally aligned segments, into a single homomorph. The structurally aligned sequences that comprised homomorph of Motif F (hmF) for RdRps and HIV are summarized in Figure 3A. HmF extended five residues upstream from the N-terminal edge of Motif F1 [as defined by Gong and Peersen (9)] and ∼20 residues downstream from Motif F3 [as defined by Gong and Peersen (9)]. HmF1 was found in all RdRp species except WN and DENV; it was not possible to evaluate R2R correspondence for this segment of WN and DENV as the structure of this segment has not been resolved. Motifs F1 and F2 are always continuous if F2 is present, and Motif F2 is present in most species. Motif F2 is represented by a single residue in PHI6, REOV and ROTAV (dsRNA), two residues in HIV (RdDp), 15 residues in BVDV and 10 residues in HCV (two of which, in HCV, are structurally aligned to the other RdRps). Motif F2 varied in length from 6 to 15 residues. In PHI6, there was a 61-residue segment between F1 and F3. HmF3 was present in all RdRp species. Figure 3B and C illustrate the tertiary position of hmF. Most of the structure is hairpin-like, with some residues of Motif F2 at the apex, which is located at the exterior surface of the protein. HmF1 and hmF3 are approximately parallel for several residues. HmF3 then independently extends to the surface of the protein approximately opposite the Motif F2 site. Figure 3D shows the N- and C-terminal residues and some residues of the C-segment of hmF3 at the surface of the protein. Figure 3E shows the position of Motif F2 relative to the template tunnel. The segments between hmF and hmA are 8-17 residues in ss-RdRps and PHI6, 28-30 residues in ds-RdRps (REOV, ROTAV, IBDV and IPNV), 30-40 residues in RdDps (HIV and TERT) and DdRps (T7 RNAP and N4) and 102-131 in DdDps (TAQ and T7 DNAP) (Figure 1). The structurally aligned sequences that comprise homomorph of Motif A (hmA) are summarized in Figure 4A. and IPNV (ds-RdRps) have fewer aligned residues. 
REOV (ds-RdRp) and HIV and TERT (RdDps) do not have R2R correspondence with the ss-RdRps. The DdRps (T7 RNAP and N4) and DdDps (TAQ and T7 DNAP) share a homomorphic structure within the N-terminal segment, but it is substantially different from the RdRp structure and therefore is not included in the homomorph or Figure 4A. Within the motif, HIV corresponds only to NV and SAPV (only found using an HIV query), indicating a significant structural difference from other species; HIV also lacks R2R correspondence beyond the motif and therefore is not included in the homomorph. At the C-terminal segment of the homomorph, most species in the sample set, except HIV, have a homologous structure. At some sequence positions within hmA, a particular residue composition is conserved throughout a viral family (e.g. picornavirus), and a different residue composition is conserved in another viral family at the same position. This within-family sequence conservation (≥75%) occurs at the following sequence positions (shown in Figure 4A, PV numbering): 214, 234, 237-240, 245 and 249. Within the N-terminal side of the homomorph, at the edge of the motif (PV 226-227), there is a minor discontinuity in structure homology (Figure 4A). The distance between the discontinuities in each species is provided in a column within the figure (white) that indicates the entire span over which discontinuity exists for each species. However, the loop represented by this discontinuity varies in length by only one to four amino acids. Figure 4B illustrates the tertiary structure of the hmA. Each end of the homomorph is at the exterior surface of the protein (Figure 4C), and its center, the conserved Motif A, is at the surface of the template tunnel. The overall configuration of the homomorph is spring-like (Figure 4D). The species-specific loop within the homomorph is located at the exterior of the protein. 
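The within-family conservation noted above for hmA can be checked one alignment column at a time. The sketch below assumes that the 75% threshold means "at least 75% of a family's members share the residue", which is an interpretation rather than something the text spells out. The species labels, family labels and residues are invented placeholders, and `family_conserved` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict


def family_conserved(column: Dict[str, str],
                     families: Dict[str, str],
                     threshold: float = 0.75) -> Dict[str, str]:
    """For one alignment column (species -> residue, '-' for a gap), return
    family -> residue for each family whose most common residue is shared
    by at least `threshold` of its members."""
    by_family: Dict[str, list] = {}
    for species, residue in column.items():
        by_family.setdefault(families[species], []).append(residue)

    conserved = {}
    for family, residues in by_family.items():
        residue, count = Counter(residues).most_common(1)[0]
        if residue != "-" and count / len(residues) >= threshold:
            conserved[family] = residue
    return conserved


# Placeholder families and residues, not taken from Figure 4A.
families = {"a1": "family_a", "a2": "family_a", "a3": "family_a", "a4": "family_a",
            "b1": "family_b", "b2": "family_b", "b3": "family_b"}
column = {"a1": "L", "a2": "L", "a3": "L", "a4": "M",
          "b1": "F", "b2": "F", "b3": "Y"}
print(family_conserved(column, families))  # {'family_a': 'L'}
```

Here family_a passes at 3 of 4 members (75%), while family_b fails at 2 of 3 (about 67%).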
Figure 4. In cells with a light blue background filled with a number, the number is the sequence position of the adjacent matches for each species; numbers in the white column between them summarize the length of sequence that the non-matched sequence represents in each species. In this segment there are more residues in each species than between the corresponding residues in PV, indicating that this region is a loop that is absent in PV, and the loop length varies by species. At the left of the alignment (209-214, uncolored), there is a structure common to several species, but too few to qualify the region as part of the homomorph. (B) In this figure of poliovirus (PDB:1RA6), the N-terminal segment of the homomorph is blue and the C-terminal segment is brown. The terminal residues of hmA are at the exterior surface of the protein. Motif A is centered within the homomorph at the wall of the template tunnel. (C) The terminal residues of the homomorph and the helix adjacent to each are constituents of the protein surface. (D) In PV, an insertion (red) at the C-terminal edge of the motif is lethal: L241-i-S242 (42). A species-specific loop (green) affects the catalytic rate (in PV) (37).
The sequence segment between the homomorphs of Motif A and Motif B (hmB) is ∼4-20 residues in the RdRps, and mostly greater than 20 residues in the DNA-dependent polymerases. It is relatively long in REOV (41), T7 RNAP (81) and N4 (98). In the birnaviruses (IBDV, IPNV), Motif C precedes Motif A in sequence; this sequence inversion is described in a later section of this article, which describes Motif C. Motif B is a component of the largest homomorph identified in the RdRps. The homomorph begins 21 residues upstream from Motif B and extends 10 residues downstream. The motif is 15 residues long. The size of the homomorph is consistent in most species. The structurally aligned sequences that comprise the homomorph are shown in the top section of Figure 5A. They include all the RdRps in the sample set plus TERT (RdDp). Each of these species matched a poliovirus query, indicating there is greater structural similarity than in other homomorphs and motifs. The N-terminal segment of the homomorph contains some discontinuities that are resolved by using R2R matches for alternative queries (Figure 5A, lower section). The C-terminal segment of the homomorph is well represented in all RNA polymerases and TERT. No R2R correspondence was found between the residues comprising hmB in the RNA polymerases and residues in the DNA-dependent polymerases (T7 RNAP, N4, TAQ and T7 DNAP). The lower section of Figure 5A illustrates the dependence of the R2R correspondence on the query sequence. These differences make it possible to identify fine details between structures. Our definition of each of the homomorphs, however, is based on the inclusion of all R2R alignments using all queries in the sample set. The position of the hmB within the tertiary structure of PV is illustrated in Figure 5B. The N-terminal residue is at the exterior surface of the protein. The N-terminal segment is a classical β-hairpin protein structure that is folded back on itself and is almost entirely exposed on a surface nearly perpendicular to the face of the protein that contains the N-terminal residue (Figure 5C). The base of the loop transitions to Motif B at the template tunnel. The C-terminal side of the homomorph extends from the tunnel to the exterior surface of the protein (Figure 5D). The distance between the homomorphs of Motifs B and C (hmC) (Figure 1) is <6-17 in all RdRps except KUNJ and DENV, which are 36 and 35 residues, respectively. In the DNA-dependent polymerases, this distance is between 98 (TAQ) and 258 (N4) residues. In IBDV and IPNV, the segments between the homomorphs of Motifs B and D are 16 and 11 residues, respectively. The structurally aligned sequences that comprise hmC are shown in Figure 6A. 
Motif C is the only RdRp motif that is not a component of a larger homomorphic structure. The segments immediately adjacent to both flanks of Motif C do not even cluster into subgroups. Motif C is short (12 residues in most RdRps) and folds sharply back on itself (Figure 6B). The highly conserved residues (labeled Motif C) are at the surface of the template tunnel and both the N-terminal and C-terminal residues are at the exterior surface of the protein (Figure 6C).
Figure 6. (A) The high number of species that align to PV indicates that the structure of Motif C is highly conserved. Although a T7 DNAP query was required, matches for the N4-TAQ-T7 RNAP species could be identified. HmC is the only homomorph for which there is R2R correspondence in all species of the study group. (B) Motif C (gold) is the only motif in the RdRps that is not a component of a larger structure. Motif C [illustrated using poliovirus (PDB:1RA6)] is tightly folded upon itself in a manner that places the highly conserved residues (yellow) at the tunnel wall, whereas the N-terminal segment of the motif (blue) and C-terminal segment of the motif (brown) are parallel to each other and penetrate the protein. (C) The terminal residues of both the N- and C-terminal segments are at the surface of the protein.
In the birnaviruses IBDV and IPNV, there is a sequence inversion that results in the relocation of Motif C to a position immediately preceding Motif A. Figure 7A shows an alignment that documents this inversion. The top and bottom segments of Figure 7A illustrate that all species are well aligned upstream of Motif C (IPNV positions 365-372) and within Motif A (IPNV positions 399-409). RHDV, SAPV and BVDV are not well aligned within Motif C using the IPNV query, and therefore are missing from the middle section of Figure 7A (IPNV positions 382-393). The numbering of IPNV and IBDV is sequential, indicating that Motif C precedes Motif A in these species. The numberings of NV and HCV indicate there are R2R matches with IPNV at Motif C, but that over this segment the match is not in sequential order. Using a PV query, however, all of these species have R2R matches over this segment (shown in Figure 6A). The IPNV query indicates that the structure of Motif C of the birnaviruses more closely matches NV and HCV than the others in the sample set. The difference in linear order that results from the sequence inversion is compensated by a modified structure that maintains the motifs within a tertiary position that is similar to all other RdRps (Figure 7B and C). The distance between the hmC and hmD (homomorph of Motif D) is <10 residues in the RdRps. It is relatively large in PHI6 (25 residues) and is indeterminate in the DdDps, as neither Motif D nor its homomorph is within the PDB structures included in this study. The structurally aligned sequences that comprise the hmD are shown in Figure 8A. The homomorph is 21 residues long and consists of a 10-residue extension from the N-terminal edge of the motif plus the motif itself. The structure of the N-terminal segment is more highly conserved (i.e. has more R2R matches) than the motif. Various query sequences were tested with the expectation that they would capture additional alignments. The middle section of Figure 8A illustrates that this produced some improvement. For example, using an HCV query, there are R2R matches to TERT, TAQ and T7 DNAP. The C-terminal edge of the motif has some R2R correspondence, suggesting that the structure of the motif is moderately conserved. Using T7 DNAP as a query (lowest segment of the figure), only a small portion of the C-terminal edge of Motif D and a few species have similar structures. There is no alignment of PHI6 within the N-terminal segment of the homomorph, because in this region PHI6 consists of a 24-residue loop between the end of Motif C and the start of Motif D. 
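The birnavirus sequence inversion described in the Motif C section appears in R2R data as matches that are "not in sequential order": walking along the query, the matched template positions jump backwards. A minimal sketch of detecting such order breaks follows. The position pairs are invented for illustration and `order_breaks` is a hypothetical helper, not part of StralSV.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (query position, template position)


def order_breaks(pairs: List[Pair]) -> List[Tuple[Pair, Pair]]:
    """Return consecutive R2R pairs (ordered by query position) at which the
    template position fails to increase, i.e. where sequential order breaks."""
    ordered = sorted(pairs)
    return [(prev, curr)
            for prev, curr in zip(ordered, ordered[1:])
            if curr[1] <= prev[1]]


# Invented example: the first three query residues match a template segment
# that lies downstream of the segment matched by the next three residues.
inverted = [(1, 10), (2, 11), (3, 12), (4, 3), (5, 4), (6, 5)]
sequential = [(1, 3), (2, 4), (3, 5), (4, 10), (5, 11), (6, 12)]

print(order_breaks(inverted))    # [((3, 12), (4, 3))]
print(order_breaks(sequential))  # []
```

A single break separating two internally ordered blocks is the signature expected for a segment swap such as C-A-B versus A-B-C. Scattered breaks would instead suggest noisy matches.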
The tertiary structure of the hmD is illustrated in Figure 8B and C. This homomorph lies mostly at the exterior surface of the protein. The motif lines the wall of the polymerase tunnel. The segment between the hmD and homomorph of Motif E (hmE) is <15 residues in all structures in the sample set, except in IBDV and IPNV in which it is 28 and 40 residues, respectively. The structurally aligned sequences that comprise hmE are summarized in Figure 9A. HmE is large and in most of the ss-RdRps (PV, COXS, HRV, FMDV, NV, RHDV, SAPV, HCV, BVDV, WN and DENV) it is highly conserved. The motif is near the N-terminal edge and a loop region is located near the middle of the homomorph. The sequences vary in length due to the loop region. The length of hmE in the caliciviruses (30-34 residues) is shorter than that in the picornaviruses (36-37 residues); HCV and BVDV loops are 37 and 35 residues, respectively, and the loops of WN and DENV are the longest at 38 and 39 residues, respectively. There is strain-specific amino acid variability in this segment of HRV. HmE is well represented by all RdRps. No R2R correspondence was found with HIV or TERT. These species, however, are structurally matched to each other (Figure 9A, middle section). There is considerable sequence similarity between PV and DENV within this homomorph; this is illustrated in the bottom section of Figure 9A by the shaded conserved residues. DdRps and DdDps are not included in the analysis of this region because the region is missing from the structures in our sample group. The tertiary structure of the hmE is illustrated in Figure 9B and C. Most of the homomorph is at the exterior of the protein near the NTP entry tunnel. Although it has extensive surface exposure, each terminus of the homomorph appears to be anchored by residues that are not part of the homomorph; as a result, the terminal residue at each end of the homomorph is exposed as a single residue at the exterior surface of the protein. 
Motif E is located near the N-terminal edge of the homomorph and contacts the surface of the NTP entry tunnel (2). The C-terminal segment of the homomorph is folded back on itself in a manner that places the species-specific loop at the surface of the protein (Figure 8C). The homomorph forms a double strand through PV_M392, at which point the remainder of the homomorph is a single-stranded helix that emerges at the exterior surface of the protein. In PV, the C-terminal residue of hmE (R402) is exposed at the surface of the protein and surrounded by the segment 28-SAFHYVFEG-36. Structure-based sequence alignment using the StralSV algorithm (1) enabled us to identify seven distinct homologous structures in most of the polymerases in our collection of 22 species. In the RdRps, the combined regions of structural homology represent ∼75% of the sequence from the start of homomorph of Motif G (hmG) through the end of hmE in each species (∼375 residues). There is <10% conservation of sequence composition among these species. Each of the homomorphs includes a sequence motif consisting of characteristic highly conserved functional residues that are essential to replication. The tertiary position of each of the homomorphs includes at least one residue (and sometimes more) in contact with the exterior surface of the protein and one or more highly conserved functional residues located within or at the wall of the template tunnel. We defined the boundaries of a homomorph as the position where the structural segment shared by all representatives in a set became discontinuous in more than two species. For many queries, this position could be confidently identified. However, these positions sometimes varied by one or two residues, depending on the query sequence. Query-dependent differences in R2R matches were also observed within the motifs themselves, where minor differences in structure resulted in a lack of R2R matches for short segments of some queries. 
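Two quantities from this part of the discussion lend themselves to a short sketch: the fraction of the hmG-to-hmE span covered by homomorphs, and the way a boundary position can differ between queries. The code below is illustrative only. The intervals and boundary positions are invented, the helper names are hypothetical, and the tie-breaking rule in `boundary_summary` (smallest position wins) is an arbitrary assumption.

```python
from collections import Counter
from typing import List, Tuple


def coverage(homomorphs: List[Tuple[int, int]], span_start: int, span_end: int) -> float:
    """Fraction of positions in [span_start, span_end] (inclusive) that fall
    inside at least one homomorph interval; overlaps are counted once."""
    covered = set()
    for start, end in homomorphs:
        covered.update(range(max(start, span_start), min(end, span_end) + 1))
    return len(covered) / (span_end - span_start + 1)


def boundary_summary(positions: List[int]) -> Tuple[int, int]:
    """Return (most frequently reported boundary, spread between the extreme
    reports) for one boundary evaluated with several queries."""
    counts = Counter(positions)
    top = max(counts.values())
    most_common = min(p for p, c in counts.items() if c == top)  # assumed tie-break
    return most_common, max(positions) - min(positions)


# Invented intervals over a 100-residue span: 30 + 20 + 25 covered residues.
print(coverage([(1, 30), (41, 60), (71, 95)], 1, 100))   # 0.75
# Invented boundary positions reported by five different queries.
print(boundary_summary([203, 203, 204, 203, 202]))        # (203, 2)
```

The toy numbers are chosen so that the coverage equals 0.75 and the boundary spread equals two residues, mirroring the magnitudes quoted in the text rather than reproducing any value from Supplementary Table S1.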
Our approach was to set the boundary at the position where most queries were in agreement, but to keep in mind that these edges might vary by one or two residues. Poliovirus had R2R correspondence with other species in the sample set more often than did any other structure. In almost all instances, we were able to map functional features of other proteins to a structurally similar segment of poliovirus. This property of centrality makes it a useful template for polymerase structure properties. HmG is shared by picornaviruses, caliciviruses and flaviviruses, although the structures of each of these groups begin to diverge within the C-terminal segment of Motif G. Motif G is characterized by the conserved motif [T/S-x(1-2)-G], which is located near the outer edge of the template tunnel. The motif may enforce the correct orientation of essential residues and a primer (35). Each flank of the homomorph contains amino acid residues that significantly affect the life cycle of the species. In PV, mutations at the N-terminal residue of the homomorph (D71A/E72A) are lethal (37). Mutations located outside the N-terminal edge of the motif (PV D105A/E108A) result in small plaques (37). Downstream from the C-terminal edge of the motif, there is a nuclear localization signal (NLS) in the picornaviruses and caliciviruses. The NLS is located two residues from the C-terminus of the homomorph and mutations in the NLS (K125A/K126A/K127A and K127A/R128A/D129A) are lethal to PV (37). Previous research found that Motif F occurs in all RdRps (38), that it recognizes the incoming NTP (39), serves as the primary fidelity checkpoint for RdRp and reorients the proper triphosphate into a position for efficient catalysis (40). HmF is an extensive structure with surface exposure at both ends and near its mid-section at Motif F2 (Figure 3A-D).
Motif F2 (Figure 3E) is analogous to the loop in hmG that varies in composition and length; it is upstream of a highly conserved motif and is species-specific. The large size of this homomorph and its positioning that transects the protein while maintaining contact with the template tunnel is consistent with its established role in transcription, which requires both fine-scale stability and large-scale mobility. Motif F3 consists of mostly basic amino acid residues and forms the roof of the NTP entry tunnel (41); the characteristic conserved arg residue is essential to nucleotide binding (38). The required orientation of the F motifs would be stabilized by the loop formed by hmF and the double-stranded segment formed by the extension of the homomorph beyond the motifs. Both the N-terminal and C-terminal residues of the homomorph are exposed at the exterior surface of the protein. In PV, mutations of residues adjacent to the N-terminus are lethal: G149-i-I150 (42) and H149A/K150A (37). The conserved residues of Motif A (in PV, D233 and D238) control the function of the metal ions at the active site (41,43), which perform the phosphotransfer essential to polymerase activity (41). D233 is a ligand to the metal (44). D238 is essential to NTP binding (3). Similar functions for the residues of Motif A have been identified for HCV (45), HRV (45) and FMDV (2). Motif A is centered within a spring-like homomorph (Figure 4B-D). Each end terminates at the exterior surface of the protein, and the beginning and end of the homomorph terminate nearly opposite each other. Mutations in the N-terminal segment of the homomorph (in PV at E226A/E227A) result in small plaques (37), suggesting that these residues influence the rate of catalysis. This is the region where species-specific structures protrude from the homomorph (PV L224-L229). This position, relative to the conserved motif, is analogous to a similar structure in hmG and Motif F2.
All these structures contain a segment that varies in length and composition by species and is located upstream from a highly conserved motif, essential to replication. An insertion at the C-terminal edge of Motif A (L241-i-S242) is lethal (Figure 4A) (42). The structure of the homomorph is highly conserved in this region, suggesting that the structural consequences of an insertion are not tolerated. The position of this lethal insertion is similar to the position of lethal mutations in Motif G, although the major effect in hmG may be the loss of the nuclear localization signal. Mutations near the C-terminal residue of the Motif A homomorph (PV G257) affect function: E254A/K255A is lethal (37), and the insertion I256-ile-G257 results in temperature sensitivity (46). HmA provides a structural connection between the functional residues at the template tunnel and the exterior surface of the protein. Sequence residues immediately adjacent to the motif have a high degree of functionality. It is possible that the orientation of the conserved segments that comprise the homomorph would be affected by changes in the orientation of residues at the edges of the motif. N-terminal and C-terminal segments of the homomorph are helices, which are likely to be relatively rigid. Motif B is near the center of a very large homomorph that contacts the exterior surface at nearly opposite positions. As stated by Bruenn (38), Motif B forms the base of the template-entry channel and may function in guiding the template entry into the active site. Choi et al. (17) observed that the highly conserved asn (N414 in BVDV) is conserved in all picornaviruses. Hansen et al. (47) found that in HRV, N297 is involved in positioning NTP for recognition. Ferrer-Orta et al. (2) determined that the equivalent FMDV-N307 and D245 (Motif A) together are involved in ribonucleoside triphosphate (rNTP) selection. Tao et al. (21) and Butcher et al.
(20) proposed that Motif B interacts with the 2′-OH group on the incoming nucleotide. Korneeva and Cameron (48) determined that FMDV-N307 interacts with the 3′-OH in the uridylylation complex, but with the 2′-OH in the elongation complex. The role of Motif B in the mechanisms of active site closure has recently been described in detail by Gong and Peersen (9). These experiments document the role of the highly conserved asn in the motif in multiple species and suggest that structural alignment may be useful for the identification of potential functionally equivalent residues in structures that have R2R correspondences. StralSV structure analysis indicates that the structure of Motif B is highly conserved in all RdRps, unlike some of the other motifs that have unmatched R2R correspondences. This highly conserved structure is consistent with its role in NTP recognition. An insertion in Motif B at PV C290-s-S291 is lethal (42). Within the N-terminal segment of the hmB, the mutation in PV-K276L results in small plaques (49). The structural position of this mutation (within the homomorph and upstream from the motif) is similar to that of rate-affecting mutations in the homomorphs of Motifs G and A. At the C-terminal end of the homomorph, in BVDV (BVDV-F426), mutation of residues C427, S428 and R447 to ala reduces primer-dependent RNA elongation and abolishes de novo synthesis (17). Motif C is not a component of a larger structurally conserved segment, but has the same key features as the other homomorphs. It is folded in a manner that places the apex of the fold at the wall of the template tunnel, and both the N-terminus and C-terminus at the exterior surface of the protein (Figure 6B and C). Therefore, Motif C as defined in the literature comprises the homomorph. The absence of R2R correspondence adjacent to hmC indicates that the structures of the adjacent sequence segments are highly specific to each species.
HmC is highly conserved in the RdRps and highly similar to the DNA-dependent polymerases. Although there is a sequence inversion in the birnaviruses (Motif C precedes Motif A), Figure 7B and C illustrate that despite the difference in sequence order, the homomorphs occupy a similar tertiary position. StralSV analysis indicates that the structure of Motif C is highly conserved in the RNA-dependent polymerases, though slightly different in the DNA-dependent polymerases (Figure 6A). Motif C is part of the classic 'RRM-fold' that forms the core of the palm domain of all these polymerases (together with that part of Motif A that forms a β-sheet with Motif C). Experimental studies have demonstrated that several residues within Motif C are sensitive to the position and composition of mutants. The highly conserved residues, GDD, occur near the center of the motif. The primary function of these residues is to coordinate the metal ions associated with the incoming rNTP (43,45). In PV, mutation of D to E in either or both positions (D328 or D329) is lethal (50). In HCV, mutation of G317A is also lethal (51). However, in birnaviruses the highly conserved residues are ADN, rather than GDD, and mutation to GDD increases RNA synthesis activity (5). Certain mutations immediately upstream from the GDD motif are lethal in PV: Y326[CHIMS] (50). This is similar to the effect of the L241-i-S242 insertion at the downstream edge of Motif A in PV. Near the N-terminal end of Motif C in HCV (HCV T312), the mutation D311A characterizes chronic hepatitis (52). Mutation at the edge of a highly conserved structure seems to have a substantial effect on the viral life cycle. The R2R comparisons summarized in these structure maps identify the types of sequence variability that can occur while maintaining the same spatial structure (Figure 6A) and the variations in composition that can be tolerated even within a key functional motif.
The R2R correspondence of other RdRps with birnaviruses in Motif C (Figure 7A), despite the sequence inversion in birnaviruses (CAB), supports the premise that conservation of structure is a significant, if not dominant, factor in evolution. HmD is different from the other homomorphs in that it lies mostly on the surface of the protein (Figure 8C). Like the others, however, its terminal residues are located at a distinctive surface (Figure 8D). In the case of hmD, they come from an opposite surface rather than the interior of the protein. The N-terminal segment of hmD is more conserved than the motif itself, which forms the C-terminal segment. The motion of Motif D in the active state has not been captured by the existing structures in PDB (40). Therefore, the lack of R2R correspondence in the motif may be a reflection of the limitations of the available structures. Residues within hmD perform varied functions. In PV, for example, polymerases form an extensive lattice system by polymerase-polymerase interactions; L342 and D349, located within hmD, contribute to interface I of this lattice system (53). The most highly conserved residue within the homomorph is a gly (PV G351) at the N-terminal edge of the motif and central to the homomorph; gly in this position would facilitate the folding of the homomorph, and is consistent with Cameron et al.'s (40) hypothesis that Motif D may be the most dynamic structural element of RdRps and RTs. Another conserved residue is a lys near the C-terminal edge (PV-K359). Residues equivalent to PV-K359 supply a proton to the nucleotidyl transfer reaction that increases the rate constant for nucleotide addition by 50- to 1000-fold (40). In PV, within the motif, the insertion T353-t-M354 results in small plaques, likely due to delayed RNA synthesis (54). In other homomorphs, mutations that affect the rate of synthesis occur more commonly outside of the motifs.
Immediately downstream from the homomorph, PV-T362I is an attenuating mutation for the Sabin vaccine (40). HmE is ∼36 amino acids in length (Figure 9A), and well represented by all RNA-dependent members of the sample set, except that no correspondence was found with HIV or TERT. These two species, however, are structurally matched to each other. There is considerable sequence similarity between PV and DENV within this homomorph, shown in the bottom segment of Figure 9A. Appleby et al. (55) determined that Motif E is unique to RNA polymerases. HmE forms part of the NTP entry tunnel and has a considerable amount of exposure on the surface of the protein. The species-specific loop within hmE is at the outermost edge of the protein, a feature found in other homomorphs (G, F2 and A; Figures 2A, 3A and 4A, respectively). Huang et al. (25) found that the Motif E loop region acts as a pivot point for thumb subdomain movement upon template-primer binding. Motif E may also function in the proper positioning of the thumb relative to the palm (5). The turn of the loop projects into the active site cavity where it has been implicated in helping to position the 3′ end of the primer strand for attachment to the α-phosphate of the NTP during phosphoryl transfer (56). Motif E in HCV plays a role in binding the priming nucleotide (not the incoming nucleotides) (38); HCV has a longer loop (Figure 9A), possibly related to this function. In PV, the C-terminus of hmE (R402) emerges from the protein into the segment 28-SAFHYVFEG-36. This segment contains residues F30 and F34, which interact with W403 to maintain the polymerase structure (39). Comparisons of the tertiary structures of the RdRps of 18 viral species indicated that most of the highly conserved residues essential to polymerase function are embedded in large sequence segments that are highly conserved structurally, yet disparate in composition.
We have named these conserved segments 'homomorphs' and have identified the composition and length of each homomorph that includes previously recognized polymerase motifs (Table 2). We have demonstrated that the RNA polymerases have structural skeletons (frames) that are highly conserved, with flexible segments between them, and that extensive segments of structure similarity can be identified by the methods we have described. These methods are applicable to the studies of other groups of proteins, and we anticipate that by accessing structure similarity independent of sequence composition, skeletal frameworks will be found in other groups of proteins. Additionally, after structure similarity is identified, differences between members of the group become readily apparent. All of the homomorphs included residues that connect the template tunnel or the NTP entry tunnel with the outer surface of the protein. Although some of the surface residues within these homomorphs have specific functional roles, as reported in the literature (see citations in previous paragraphs), we anticipate that they may all be important for polymerase function; the consistent occurrence of homomorphs embedding motifs (even when a defined sequence motif is small in size) suggests a structure-function relationship between the motif and its structurally conserved flanking regions. It would be interesting to explore the possibility that interactions at the surface of the protein (e.g. protein-protein contact at surface homomorph residues) may subtly affect function buried deep beneath, within the tunnel. Furthermore, each homomorph is either divided by or is separated from another homomorph by a flexible secondary structure. Identification of the span of each homomorph and the terminal residue enables us to identify specific residues on the surface that would not, in many cases, be otherwise noticed.
By comparing experimental data with the surface location of the ends of the homomorphs, we have found that these are often the sites of key functional interactions of the protein. A paper describing these sites is in preparation. We have compared the effects of currently recognized mutations within the motifs and within the homomorphs. Most mutations within the motifs are function-specific, related to either a change in charge or size, and in most cases the mutations are lethal [mA (42), mB (42), mC (50,51)]. Mutations outside of the motif (but within the homomorph) are more often rate-related and located in a segment that bulges from the homomorph by an amount that varies by species (Figures 2A, 3A and 4A). These differences support the hypothesis that residues actively involved in template processing are essential to viability, and most of them are components of a consistent, stable structure that places and/or maintains them in their appropriate functional position. However, the practice of mutating residues to ala has resulted in a somewhat 'all or nothing' perspective of mutations. StralSV analysis can facilitate informed selection of alternative residues of various compositions, which could possibly affect replication rates to different extents. Experiments involving this type of testing would enhance predictive models and may provide new insights for the design and development of medical countermeasures. The extension of all homomorphs from the template tunnel to the exterior of the protein was an unexpected finding. Its universality in the polymerase family suggests a functional significance. Residues within the homomorphs that were localized to the surface often had species-specific loops.
The most likely reason these features have not been identified previously is the limitations of existing sequence and structure comparison tools, in particular the lack of the ability to perform multi-species comparisons of structures using overlapping windows of a size determined by the user, and to select the criteria for R2R matches. The homomorphs as defined in this work add structural clarity and context to sequence-based functional motifs previously observed by numerous authors performing comparative studies among polymerases. The structure maps created from the R2R correspondences identified by the StralSV algorithm provided a unique and informative perspective of structure and function in RdRps. They readily identify unique regions of each species and those shared by proteins within a family. These are features that would be useful for studies of any protein family. Based on the results of this study, it may be possible to define characteristic homomorphs for many other protein families, despite considerable sequence variation. It may be feasible to classify homomorphs in a manner analogous to the SCOP database, and in doing so provide new insight into protein evolution. The StralSV algorithm simultaneously, rapidly and quantitatively identifies the similarities and differences of the structural components of multiple species and provides an output that facilitates the comparison of three-dimensional structure information. StralSV enabled us to cluster protein segments that have the same tertiary structure, independent of sequence variability. In a sense, it is an analog of BLAST, although based on structure rather than sequence. The precision of StralSV makes it easy to identify small differences between and within species. The ability to process multiple species at the same time can rapidly accelerate our understanding of differences between them.
The identified structural associations may also facilitate the transfer of structure-related functional information among proteins. The traditional perspective of the relationship between the amino acid sequence of a protein and its tertiary structure has been that sequence determines structure. Under this premise, sequence-based evolutionary studies and phylogeny would inherently incorporate structure. In this study of RdRps, we demonstrated that structure accommodates substantial sequence variability, and that highly diverse sequences can generate highly similar tertiary structures. Structure-based phylogeny may provide new perspectives of protein evolution.

1. StralSV: assessment of sequence variability within similar 3D structures and application to polio RNA-dependent RNA polymerase
2. Structure of Foot-and-Mouth Disease Virus RNA-dependent RNA polymerase and its complex with a template-primer RNA
3. Structural basis for proteolysis-dependent activation of the poliovirus RNA-dependent RNA polymerase
4. Identification of four conserved motifs among the RNA-dependent polymerase encoding elements
5. The structure of a birnavirus polymerase reveals a distinct active site topology
6. The Big Bang of picorna-like virus evolution antedates the radiation of eukaryotic supergroups
7. MUSCLE: multiple sequence alignment with high accuracy and high throughput
8. The Protein Data Bank
9. Structural basis for active site closure by the poliovirus RNA-dependent RNA polymerase
10. Cn3D: sequence and structure views for Entrez
11. Crystal structure of coxsackievirus B3 3Dpol highlights the functional importance of residue 5 in picornavirus polymerases
12. The crystal structure of the RNA-dependent RNA polymerase from human rhinovirus: a dual function target for common cold antiviral therapy
13. Crystal structure of norwalk virus polymerase reveals the carboxyl terminus in the active site cleft
14. Crystal structures of active and inactive conformations of a caliciviral RNA-dependent RNA polymerase
15. The 2.3 Å resolution structure of the sapporo virus RNA dependant RNA polymerase
16. Substrate complexes of hepatitis C virus RNA polymerase (HC-J4): structural evidence for nucleotide import and De-novo initiation
17. The structure of the RNA-dependent RNA polymerase from bovine viral diarrhea virus establishes the role of GTP in de novo initiation
18. Crystal structure of the RNA polymerase domain of the West Nile virus non-structural protein 5
19. Crystal structure of the Dengue virus RNA-dependent RNA polymerase catalytic domain at 1.85-angstrom resolution
20. A mechanism for initiating RNA-dependent RNA polymerization
21. RNA synthesis in a cage: structural studies of reovirus polymerase lambda3
22. Mechanism for coordinated RNA packaging and genome replication by rotavirus polymerase VP1
23. The N-terminus of the RNA polymerase from infectious pancreatic necrosis virus is the determinant of genome attachment
24. Complexes of HIV-1 reverse transcriptase with inhibitors of the HEPT series reveal conformational changes relevant to the design of potent non-nucleoside inhibitors
25. Structure of a covalently trapped catalytic complex of HIV-1 reverse transcriptase: implications for drug resistance
26. Structure and functional Implications of the polymerase active site region in a complex of HIV-1 RT with a double-stranded DNA template-primer and an antibody Fab fragment at 2.8 Å resolution
27. Structure of the Tribolium castaneum telomerase catalytic subunit TERT
28. The structural mechanism of translocation and helicase activity in T7 RNA polymerase
29. Structure of replicative DNA polymerase provides insights into the mechanisms for processivity, frameshifting and editing
30. Structural basis for the transition from initiation to elongation transcription in T7 RNA Polymerase
31. Structural basis for DNA-hairpin promoter recognition by the bacteriophage N4 virion RNA polymerase
32. Crystal structures of open and closed forms of binary and ternary complexes of the large fragment of Thermus aquaticus DNA polymerase I: structural basis for nucleotide incorporation
33. Crystal structure of a bacteriophage T7 DNA replication complex at 2.2 Å resolution
34. Molecular model of SARS coronavirus polymerase: implications for biochemical functions and drug design
35. The palm subdomain-based active site is internally permuted in viral RNA-dependent RNA polymerases of an ancient lineage
36. Crystal structure of the RNA-dependent RNA polymerase of hepatitis C virus
37. Clustered charged-to-alanine mutagenesis of poliovirus RNA dependent RNA polymerase yields multiple temperature-sensitive mutants defective in RNA synthesis
38. A structural and primary sequence comparison of the viral RNA-dependent RNA polymerases
39. Stabilization of poliovirus polymerase by NTP binding and fingers-thumb interactions
40. Dynamics: the missing link between structure and function of the viral RNA-dependent RNA polymerase?
41. Structure-function relationships among RNA-dependent RNA polymerases
42. Effects of mutations in poliovirus 3Dpol on RNA polymerase activity and on polyprotein cleavage
43. A mechanism for all polymerases
44. Remote site control of an active site fidelity checkpoint in a viral RNA-dependent RNA Polymerase
45. Structural basis for the 3′-5′ exonuclease activity of Escherichia coli DNA polymerase I: a two metal ion mechanism
46. trans rescue of a mutant poliovirus RNA polymerase function
47. Structure of the RNA-dependent RNA polymerase of poliovirus
48. Structure-function relationships of the viral RNA-dependent RNA polymerase: fidelity, replication speed, and initiation mechanism determined by a residue in the ribose-binding pocket
49. Intramolecular and intermolecular uridylylation by poliovirus RNA-dependent RNA polymerase
50. Mutation of the aspartic acid residues of the GDD sequence motif of poliovirus RNA-dependent RNA polymerase results in enzymes with altered metal ion requirements for activity
51. Biochemical properties of Hepatitis C virus NS5B RNA-dependent RNA polymerase and identification of amino acid sequence motifs essential for enzymatic activity
52. Effect of mutation in the hepatitis C virus nonstructural 5B region on HCV replication
53. Oligomeric structures of poliovirus polymerase are important for function
54. Genetic complementation among poliovirus mutants derived from an infectious cDNA clone
55. Crystal structure of complete rhinovirus RNA polymerase suggests front loading of protein primer
56. Crystal structure of human immunodeficiency virus type 1 reverse transcriptase complexed with double-stranded DNA at 3.0 Å resolution shows bent DNA

A.Z. designed and developed the StralSV algorithm and performed calculations for all viral species in the study. C.Z. and D.L. wrote code and developed methods for post-processing of StralSV results, and performed literature searches for interpretation of biological significance of various residue positions. D.L. performed the greater part of the detailed sequence comparisons and prepared the manuscript, with contributions from C.Z. and A.Z. All authors participated in the discussions and shaped the ideas that led to the experimental design and results of this work. All authors read and approved the manuscript.

Conflict of interest statement. None declared.