key: cord-305143-mqd4ioj4
authors: Zmasek, Christian M.; Knipe, David M.; Pellett, Philip E.; Scheuermann, Richard H.
title: Classification of human Herpesviridae proteins using Domain-architecture Aware Inference of Orthologs (DAIO)
date: 2019-01-06
journal: Virology
DOI: 10.1016/j.virol.2019.01.005
sha: 
doc_id: 305143
cord_uid: mqd4ioj4

We developed a computational approach called Domain-architecture Aware Inference of Orthologs (DAIO) for the analysis of protein orthology by combining phylogenetic and protein domain-architecture information. Using DAIO, we performed a systematic study of the proteomes of all human Herpesviridae species to define Strict Ortholog Groups (SOGs). In addition to assessing the taxonomic distribution for each protein based on sequence similarity, we performed a protein domain-architecture analysis for every protein family and computationally inferred gene duplication events. While many herpesvirus proteins have evolved without any detectable gene duplications or domain rearrangements, numerous herpesvirus protein families do exhibit complex evolutionary histories. Some proteins acquired additional domains (e.g., DNA polymerase), whereas others show a combination of domain acquisition and gene duplication (e.g., betaherpesvirus US22 family), with possible functional implications. This novel classification system of SOGs for human Herpesviridae proteins is available through the Virus Pathogen Resource (ViPR, www.viprbrc.org).

Herpesviruses comprise a large and diverse order (Herpesvirales) of double stranded DNA viruses that infect humans and a wide range of other hosts (Pellet and Roizman, 2007; Virus Taxonomy: The Classification and Nomenclature of Viruses The Online 10th Report of the ICTV, 2017). Human diseases caused by herpesviruses range from vesicular rashes to cancer. The order Herpesvirales is subdivided into three families, including the Herpesviridae, which is further subdivided into three subfamilies, the Alpha-, Beta-, and Gammaherpesvirinae. Within subfamilies, groups of related herpesvirus species are classified into genera. The nine species of human herpesviruses are distributed across the three subfamilies and several genera (Table 1) ; these viruses are the main focus of this work. Prior studies found that the Beta-and Gammaherpesvirinae are more closely related to each other than to Alphaherpesvirinae (Montague and Hutchison, 2000) . In contrast to some other human viruses, the human herpesviruses have a long evolutionary history, with evidence suggesting that the primordial herpesvirus diverged into the Alpha-, Beta, and Gammaherpesvirinae approximately 180 million to 220 million years ago (McGeoch et al., 1995) . Coupled with their genome complexity and the availability of numerous complete genome sequences, this deep evolutionary history makes herpesviruses a tractable and informative model to study virus genome evolution at the levels of gene duplication and protein domain rearrangement. Nehrt et al., 2011; Rogozin et al., 2014) , due to its importance for computational sequence functional analysis (Eisen, 1998; Zmasek and Eddy, 2002) and the significance of gene duplications for biological evolution (Zhang, 2003) .

Orthologs (or groups/clusters of orthologs) have often been inferred by indirect methods based on (reciprocal) pairwise highest similarities [e.g. (Remm et al., 2001; Tatusov et al., 1997) ]. In this work, we used explicit phylogenetic inference combined with comparison to a trusted species tree for orthology inference, as this approach is likely to yield more accurate results Eddy, 2002, 2001) .

Many eukaryotic proteins, and by extension, proteins of eukaryotic viruses, are composed of multiple domains, components that can each have their own evolutionary history and functional implications. The architecture of a protein is a product of the ordered arrangement of its several domains and their overall tertiary structure. Evolutionarily, individual domains can combine with other partner domains, enabling formation of a vast number of domain combinations, even within the same species (Moore et al., 2008) . Assembling multiple domains into a single protein creates a distinct entity that can be more than the sum of its constituent parts. The emergence of proteins with novel combinations of duplicated and then diverged domains is considered to be a major mechanism for rapid evolution of new functionality in eukaryotic genomes (Itoh et al., 2007; Peisajovich et al., 2010) . It is especially important in the evolution of pathways, where novel linkages between existing domains may result in the rearrangement of pathways and their behaviors in the cell (Peisajovich et al., 2010) . The modular structure of eukaryotic proteins provides a mechanism that enables evolutionarilyrapid differentiation and emergence of a multitude of novel protein functions from an initially limited array of functional domains. Proteins can gain (or lose) new domains via genome rearrangements, creating (or removing) domain combinations, in addition to modification of domains themselves by small-scale mutations (Patthy, 2003; Ye and Godzik, 2004) .

Here we present a systematic classification of proteins catalogued in the NCBI RefSeq entries for each of the nine human herpesviruses plus selected comparisons with homologs from non-human herpesviruses based on phylogenetic inferencing and domain architecture analysis using Domain-Architecture Aware Inference of Orthologs (DAIO). This analysis resulted in the classification of proteins into "Strict Ortholog Groups" (SOGs), in which all proteins are orthologous to each other (related by speciation events) and exhibit the same domain architecture. The SOG classification also enabled the development of an informative name convention for each SOG that includes information about the protein's function (if known) and a suffix indicating the taxonomic distribution of the protein. For example, an "aBG" suffix would indicate that proteins of this group are found in some (but not all) human Alphaherpesvirinae species (lowercase "a"), and all human Beta-and Gammaherpesvirinae species (uppercase "B" and "G"). Such suffixes allow for the quick understanding of presumed conserved protein function and minimal common genome across the Herpesviridae family. The SOG classification results have been made publicly available through the Virus Pathogen Resource (ViPR) (Pickett et al., 2012) at https://www.viprbrc.org.

For this analysis, we developed a rational, phylogeny-and domain architecture-aware classification approach for human herpesvirus proteins, the Domain-architecture Aware Inference of Orthologs (DAIO) method, which produces Strict Ortholog Groups (SOGs) of proteins. Before we present genome-wide findings, we show results for a few instructive SOG examples, including protein groups that have evolved in a "simple" manner, recapitulating the Herpesviridae evolutionary tree without gene duplications or domain rearrangements, and protein groups in which domain rearrangements (domain gains) and/or gene duplications have occurred. Table 2 lists the 23 SOGs common to all nine human herpesviruses. For every SOG, a suggested name is provided, composed of a protein names and a suffix indicating the taxonomic distribution (A, B, G: present in all human members of the Alpha-, Beta-, Gammaherpesvirinae, respectively; a, b, g: present in some but not all human members of the Alpha-, Beta-, Gammaherpesvirinae, respectively). Gene names/symbols (a forward slash is either part of the accepted gene name or is used to separate multiple gene names) and Pfam domain architecture names are also included. The table is organized into three sections. The first section lists protein families that have apparently evolved without gene duplication or domain rearrangements [e.g., uracil DNA glycosylase and the capsid scaffolding protein protease (CSPP)]; the second section lists proteins that have evolved with domain rearrangements and or duplications [e.g., glycoprotein B (gB), DNA polymerase, and multifunctional regulator of expression proteins (mRE)], and the third section lists proteins that share some function (and even genome region) but have been formed from distantly or unrelated domains (e.g., gL, gN, and DNA polymerase processivity factor).

Uracil DNA glycosylases catalyze the first step -removal of the RNA base uracil from DNA -in base excision repair, the mechanism by which damaged bases in DNA are removed and replaced (Krusong et al., 2006) . Uracil DNA glycosylases are found in eukaryotes, bacteria, and archaea, as well as in herpesviruses and poxviruses (Chen et al., 2002) . Our phylogenomic analysis shows that for all nine human herpesviruses, uracil DNA glycosylase is well conserved and contains one Pfam domain, UDG (uracil DNA glycosylase superfamily). In addition, the gene tree for human herpesvirus uracil DNA glycosylases ( Fig. 1B ) precisely recapitulates the herpesvirus species tree (Fig. 1A) ; therefore, this protein family can be inferred to have evolved from a single common ancestor and without any gene duplications or domain rearrangements (see Table 2 for virus-specific gene names).

Capsid scaffolding protein proteases are essential for herpesvirus capsid assembly and maturation, and have an essential serine protease activity (Liu and Roizman, 1993) . These proteins contain one Pfam domain, Peptidase_S21. In contrast to uracil DNA glycosylases, currently available data indicate that protease-scaffolding proteins with a Peptidase_S21 domain are unique to Herpesvirales. Like uracil DNA glycosylases, CSPP evolved without domain architecture rearrangements or gene duplications (Fig. 1C , Table 2 ).

Other examples of Herpesviridae genes that have evolved without any domain architecture rearrangements or gene duplications are listed in the first section of Table 2 .

Herpesvirus virions have an envelope that consists of an outer lipid bilayer studded with 12 or more surface glycoproteins (originally defined in HSV). After virion glycoprotein engagement with cell surface receptors, the envelope fuses with the plasma membrane -a process which, for herpes simplex virus 1 (HSV-1), requires four of its 12 envelope glycoproteins, namely glycoproteins gB, gD, gH, and gL (Cai et al., 1988; Forrester et al., 1992; Ligas and Johnson, 1988; Roop et al., 1993; Spear and Longnecker, 2003) . In contrast, for other herpesviruses, only glycoproteins gB, gH, and gL have been reported to be required for membrane fusion (AlHajri et al., 2017) .

gB and gH are highly conserved across all nine human herpesviruses (Table 2) . A protein annotated as gL is also present in all nine human herpesviruses, yet its occurrences in members of the Alpha-, Beta-and Gammaherpesvirinae are homologous within, but not between subfamilies. gLs from different subfamilies contain unrelated protein domains (Pfam: Herpes_UL1, Cytomega_gL, and Phage_glycop_gL). gL is discussed in more detail below.

Detailed phylogenetic analysis of the human herpesvirus gB family ( Fig. 2A) , including proteins from selected non-human members of the Herpesviridae, shows a picture of a protein that has evolved without gene duplications (or, at the very least, duplicated genes have not been retained) and with nearly completely conserved domain architectures.

The one exception to this is that human cytomegalovirus (HCMV) glycoprotein B (gB) has a short region of about 40 amino acids near its Nterminus that comes in two forms that differ by approximately 50% at the amino acid level. This sequence variant was identified in HCMV strains isolated from Chinese patients (Shiu et al., 1994) and is identified in Pfam as "HCMVantigenic_N domain". In our global hmmscan analysis (applying the same threshold of E = 10 −6 for every Pfam domain) Evalue support for presence of this domain in some strains is strong (E < 10 -22 ) and matching over the entire Pfam model while other HCMV strains do not exhibit significant sequence similarity with this domain. It has been suggested that this domain polymorphism may be implicated in HCMV-induced immunopathogenesis, as well as in strainspecific behaviors, such as tissue-tropism and the ability to establish persistent or latent infections (Pignatelli et al., 2004) . In our new systemic naming approach (see below) we term the SOG of the protein with HCMVantigenic_N domain "Glycoprotein B_ ABG.b", whereas all other proteins fall into the "Glycoprotein B_ ABG.AbG" SOG. 

All members of the Herpesviridae encode six conserved proteins that play essential roles at the replication fork during viral DNA replication: a single-strand DNA binding protein (major DNA binding protein), a DNA polymerase composed of two independently coded subunits (the catalytic DNA polymerase subunit and a DNA polymerase processivity factor encoded by three distantly related genes in members of the Alpha-, Beta-, and Gammaherpesvirinae, see below), and a three subunit helicase/primase complex (DNA replication helicase, DNA helicase primase complex associated protein, and DNA primase) (Pellet and Roizman, 2007) .

Our analysis shows that the catalytic DNA polymerase subunits of all members of the Herpesviridae contain two domains: an N-terminal DNA polymerase family B exonuclease domain, and a C-terminal polymerase domain from DNA polymerase family B (Fig. 2B ). Cellular family B DNA polymerases are the main polymerases involved with nuclear DNA replication and repair in eukaryotes and prokaryotes, and include DNA polymerases II and B, and polymerases α, δ, and ε (Garcia- Diaz and Bebenek, 2007) . Family B DNA polymerases are also found in other dsDNA viruses, such as the insect Ascoviridae, and members of the Iridoviridae (e.g., fish lymphocystis disease virus) and Phycoviridae (e.g., chlorella virus) (Villarreal and DeFilippis, 2000) . In addition to these two large and ubiquitous domains, Simplexvirus (which include human simplex virus 1 and 2) and Mardivirus also possess a small C-terminal domain, called the DNA polymerase catalytic subunit Pol (DNAPoly-mera_Pol) domain in Pfam (Zuccola et al., 2000) , and are longer by about 45 aa on average than DNA polymerase proteins from other Herpesviridae. According to currently available genomic data, DNAPo-lymera_Pol is found in members of the Simplexvirus genus of the Alphaherpesvirinae. While varicella-zoster virus (Human herpesvirus 3) and other members of the Varicellovirus genus of the Alphaherpesvirinae possesses DNA polymerases that also tend to be longer, similarity of these protein regions to the DNAPolymera_Pol domain is low, using the current Pfam model for DNAPolymera_Pol (Pfam version 31.0). The function of this third domain is to mediate interaction between DNA polymerase and its cognate processivity factor (Bridges et al., 2000; Loregian et al., 2000) based on the observation that a peptide corresponding to the 27 C-terminal amino acids of HSV-1 DNA polymerase has been shown to inhibit viral replication by disrupting the interaction between DNA polymerase and UL42 (Digard et al., 1995; Loregian et al., 1999) . In this context, it is interesting to note that the DNA polymerase processivity factors are only distantly-related across the Alpha-, Beta-, and Gammaherpesvirinae (see below). It is therefore conceivable that the interactions of Beta-, and Gammaherpesvirinae DNA polymerase processivity factors with their corresponding DNA polymerases (which lack a DNAPolymera_Pol domain) is different in nature than for Alphaherpesvirinae. As for Varicellovirus it is unclear whether they possess a functional DNAPolymera_Pol domain, and a definitive answer will require similar biochemical assays as have been performed for HSV-1.

Phylogenetic analysis of human herpesvirus DNA polymerase proteins, plus related proteins from selected mammalian herpesviruses, shows that, similar to the glycoprotein B family, DNA polymerases of the Herpesviride evolved without gene duplication. Nonetheless, in contrast to gB, DNA polymerases acquired a new domain early in Alphaherpesvirinae evolution. This domain might have been lost again, or underwent significant mutations, during Varicellovirus evolution. The presence of the longer domain in Varicelloviruses suggests that the longer domain emerged prior to the Varicellovirus/Simplexvirus split.

Multifunctional regulator of expression (mRE; also known as immediate-early protein IE63, infected cell protein 27, ICP27, and α27) is a protein with homologs in all human herpesviruses (for gene names see Table 2 ). Multifunctional regulator of expression is a regulatory protein that plays a role in the prevention of apoptosis during HSV1 infection (Aubert and Blaho, 1999) . Multifunctional regulator of expression interacts directly with a number of proteins in performing its many roles. In particular, multifunctional regulator of expression protein contributes to host shut-off by inhibiting pre-mRNA splicing by interacting with essential splicing factors, termed SR proteins, and affecting their phosphorylation (Sciabica et al., 2003) . Furthermore, the mRE protein C.M. Zmasek et al. Virology 529 (2019) [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] has been shown to associate with cellular RNA polymerase II holoenzyme in a DNA-and RNA-independent manner and to recruit RNA polymerase II to viral transcription/replication sites (Dai-Ju et al., 2006; Zhou and Knipe, 2002) . mRE also competes with some transport receptors, resulting in the inhibition of host pathways while supporting mRNA export factor-mediated transport of HSV-1 mRNAs (Malik et al., 2012) . All of the multifunctional regulator of expression proteins analyzed here have a single copy of a Pfam "Herpesvirus transcriptional regulator family" (Herpes_UL69) domain that is specific to members of the Herpesviridae. In addition to the Herpes_UL69 domain, human Simplexvirus mRE have an additional N-terminal domain, the "Herpes viral adaptor-to-host cellular mRNA binding domain" (HHV-1_VABD) (Tunnicliffe et al., 2011) . Besides human Simplexvirus, architectures with C-terminal HHV-1_VABD and N-terminal Herpes_UL69 domains are also found in Chimpanzee herpesviruses (e.g. NCBI Reference Sequence: YP_009011042 (Severini et al., 2013) ), while other non-human Simplexviruses lack the HHV-1_VABD domain. Using currently available genomic data, we were unable to detect HHV-1_VABD domains outside of the Simplexvirus genus.

Phylogenetic analysis of human herpesvirus mRE proteins, including proteins from selected herpesviruses of other mammals, shows that multifunctional regulator of expression proteins evolved without observable gene duplications (since this gene tree recapitulates the herpesvirus species tree).

Nine groups of human herpesviruses are annotated as performing the same, or very similar function, in the absence of discernable protein sequence similarity (Table 2, Fig. 3 ).

As mentioned above, DNA polymerase processivity factor is one of the six proteins that play essential roles at the replication fork during viral DNA replication. Processivity factors, also called clamp proteins, help to overcome the tendency of DNA polymerase to dissociate from the template DNA, and thus greatly enhance DNA polymerase processivity (Weisshart et al., 1999; Zhuang and Ai, 2010) . In contrast to the protein families discussed so far, DNA polymerase processivity factors are only distantly-related across the Alpha-, Beta-, and Gammaherpesvirinae. In the Alphaherpesvirinae, the protein is composed of two tandem Herpes_UL42 domains; Betaherpesvirinae have a single Herpe-s_PAP domain; Gammaherpesvirinae have a single Herpes_DNAp_acc domain (Fig. 3A, B, C) . These three domains are very distant homologs and are members of the DNA clamp superfamily (Pfam clan CL0060). gL (Fig. 3D , E, F) is another example of a protein function performed by different, probably non-homologous domains present in different Herpesviridae subfamilies (Pfam domains Herpes_UL1, GlyL_C, Cyto-mega_gL, and Phage_glycop_gL). Interestingly, the open reading frames for these seemingly unrelated proteins are located in analogous conserved genomic contexts, including open reading frame sizes and orientations relative to the surrounding conserved coding regions.

The remaining seven groups with these characteristics are: cytoplasmic egress tegument protein, cytoplasmic egress facilitator-1, cytoplasmic egress facilitator-2, encapsidation chaperone protein, glycoprotein N Pfam clan Herpes_glyco, CL0146), LTP binding protein, and small capsid protein (Table 1 and Supplementary Table 1) . 

In contrast to the protein families discussed so far, the evolutionary history of human Herpesviridae proteins with 7-transmembrane receptor domains is more complex (Fig. 4) (Spiess et al., 2015) . By comparing this gene tree with a species tree for human Herpesviridae (Fig. 1A) , we can infer three gene duplication events (marked as red squares in Fig. 4) , resulting in four groups of orthologous genes: UL33/U12, US27, U51/ORF74, and US28. In our new nomenclature (see below), we call the first group "Gprotein coupled receptor homolog UL33/U12_B" because it is found in all four human Betaherpesvirinae species (uppercase B suffix). The second group is called "G-protein coupled receptor homolog US27_b" as it is found in some human Betaherpesvirinae (lowercase b suffix). The third group is called "G-protein coupled receptor homolog U51/ORF74_bg" because it found in some human Betaherpesvirinae and in some human Gammaherpesvirinae (lowercase "bg" suffix). The fourth group is called "Envelope protein US28_b". No orthologous genes were found in the human Alphaherpesvirinae. Whenever available, we base our names preferably on (Mocarski, 2007) or the "Recommended name" (under "Protein names") from the UniProtKB database (Bateman et al., 2017) . For reasons of consistency and objectivity, we used an automated approach to root all trees by mid-point rooting. It is possible, that the true root for the 7-transmembrane domain proteins tree is at the base of the U51-ORF74 subtree. In this case there would be only two duplications in the tree, but still the same four ortholog groups: U51/ORF74, US28, US27, UL33/U12. Functionally, all these proteins appear to be hijacked human proteins that are being used by the virus to modulate the host immune system. In particular, many of them appear to act as chemokine (orphan) receptors (Casarosa et al., 2003 (Casarosa et al., , 2001 Isegawa et al., 1998; Murphy, 2001; Zhen et al., 2005) (Fig. 5) .

Proteins with US22 domains have the most complex evolutionary history of all Herpesviridae proteins, even though among the human herpesviruses, the US22 domain has been found only in C.M. Zmasek et al. Virology 529 (2019) 29-42 betaherpesviruses (Hanson et al., 1999) . US22 domain proteins are also present in Gallid herpesvirus 2 (a member of the Alphaherpesvirinae), in members of the Alloherpesviridae family, in other dsDNA viruses (e.g., Poxviridae and Iridoviridae), and in some animal species. Most proteins with US22 domains carry two copies of the domain. US22 is a member of a large group of distantly homologous proteins (the SUKH superfamily, Pfam clan CL0526), which, for example include bacterial Syd proteins. It has been suggested that a function of the US22 family is to act against various anti-viral responses by interacting with specific host proteins (Zhang et al., 2011) .

Here we summarize the results of our phylogenetic analysis of US22 domain proteins of the human bataherpesviruses. Unfortunately, the phylogenetic signal across this group of protens is weak, thus some support values are low. Two groups of US22 orthologs span all four human betaherpesviruses: CMV tegument protein UL23 is likely to have orthologs in HHV-6A, HHV-6B, HHV-7 (Roseolovirus) Protein U3 ("Tegument protein UL23/Protein U3_B"). Similarly, CMV Tegument protein UL43 is likely to be orthologous to HHV-6A, HHV-6B, HHV-7 (Roseolovirus) Protein U25 ("Tegument protein UL43/Protein U25_B"). U3 and U25 are paralogous towards each other, as they are connected by a gene duplication, as are HCMV UL23 and 43. Four groups of orthologs specific to Roseolovirus are Tegument protein DR1, Tegument protein DR6, Protein U7, and Protein U17/U16. In U17/U16 proteins, it is unclear whether they possess a second US22 domain, as the similarity to this domain is weak to the point of insignificance. In contrast, U7 proteins possess at least three US22 domains and an additional C-terminal Herpes_U5 domain. Proteins U7 are most closely related to CMV UL29, but differ in their domain architecture (lack of Herpes_U5 domain). Thus CMV UL29 forms its own species-specific group of orthologs. Numerous proteins with US22 domains are specific to CMV (and thus all paralogous to each other) given current data: apoptosis inhibitor UL38, early nuclear protein HWLF1, tegument protein UL26, US24, protein UL24, UL29, UL36, US23, US26, protein IRS1, and protein TRS1. C.M. Zmasek et al. Virology 529 (2019) 29-42 2.8. The inferred minimal proteomes of the human herpesviruses

As described above, we classified viral proteins into "strict ortholog groups," requiring that all proteins exhibit the same domain architecture and are orthologous to each other. We attempted to give an informative name for each of these groups including a suffix that indicates the taxonomic distribution of a protein. For example, an "aG" suffix would indicate that proteins of this group are found in some (but not all) members of human alphaherpesvirus species (lowercase "a"), and members of both human gammaherpesvirus species (uppercase "G").

Families which have a (some) domain(s) in common but differ in their domain architectures, are more difficult to rationally name (we found 17 of these cases). An example of such a family is DNA polymerase. In such cases, the suffix is split by a period into two parts. The first part indicates overall presence of common domain(s) for all members of this SOG, the second part (after the period) relates to specific domain architectures. Thus, "DNA polymerase_ ABG.aBG" refers to the simpler DNA_pol_B_exo1--DNA_pol_B domain architecture present in nearly all Alphaherpesvirinae species. "DNA polymerase_ ABG.a" refers to the DNA_pol_B_exo1--DNA_pol_B-DNAPolymera_Pol DA that is present in a smaller subset of Alphaherpesvirinae species.

The rationale behind this approach for labeling members of protein families that have different domain architectures is that it gives users a choice between "traditional" ortholog groups, which do not consider domain architectures (by ignoring the part after the period), and SOGs (taking the full name into account).

In total, we were able to establish 169 SOGs (Supplementary Table 1 ). Of these, 40 (23 +8 +9) functionally similar groups (Table 2) are present in all 9 human Herpesviridae species and represent the core proteins of human herpesviruses.

Besides proteins with clearly defined Pfam domains, we found 29 protein families for which Pfam domains have not been defined. Classification of these proteins was based on manual BLAST searches. An example of such a family is the virion host shutoff protein UL41.

Another unusual case is the HSV1 UL13 serine threonine protein kinase. All nine human herpesviruses have homologs of this protein, but its associated Pfam domain UL97 only matches sequences in betaherpesviruses. Extension of the family to alpha-and gammaherpesviruses is thus based on manual BLAST searches.

Finally, two protein families could not be classified due to lack of phylogenetic signal: protein B8 of HHV-6A and HHV-6B (associated gene names U92, U93, HN1, HN92D, B8) and protein UL28/UL29/U8 of HHV-6A, HHV-6B, and HHV-7.

Proteins which are species or strain specific are listed in Supplementary Table 2.

In order to make the results of DAIO classification available to all Herpesvirus researchers for experimental hypothesis testing, we incorporated SOG data into the Virus Pathogen Resource (ViPR) at https://www.viprbrc.org (Pickett et al., 2012) . Through ViPR, scientists can search, sort, and download SOG names (including taxonomic distribution), Pfam domain architecture data, and individual protein sequences belonging to selected SOGs. Fig. 6A shows an example of a search result table, which includes data for some of the protein families discussed above, namely glycoprotein B family members (associated with two distinct SOGs: "Glycoprotein B_ ABG.b" and "Glycoprotein B_ ABG.AbG"), DNA polymerase ("DNA polymerase_ ABG.a" and "DNA polymerase_ ABG.aBG"), and multifunctional regulator of expression ("Multifunctional regulator of expression_ABG.a" and "Multifunctional regulator of expression_ABG.aBG"). By clicking on the "Total # of Proteins" table entries, users can view and download the individual protein sequences belonging to a given SOG. Fig. 6B shows how SOG data, including domain architecture information, is part of protein annotations in ViPR (Simplexvirus "DNA polymerase_ABG.a" example). As new genome sequence data become available, the SOG data in ViPR is continuously updated in order to keep current with the ever expanding universe of Herpesvirus protein sequences. In addition, SOG annotations in ViPR will be expanded to include non-human Herpesviruses in the future. SOG data is also available for Pox-and Coronaviruses in ViPR, and will be applied to other virus families in the future.

In this work, we used Domain-architecture Aware Inference of Orthologs (DAIO) to provide a classification for proteins of human herpesviruses, based on domain architecture and phylogenetic history. While the work presented here is limited to human herpesviruses, and thus does not take full advantage of all the sequence data that is currently available, we plan to extend our DAIO approach to all herpesviruses with a known phylogenetic history.

A major contribution of our classification system to herpesvirus biology is that it provides a series of testable hypotheses for further experimental investigations. For example, it informs experimental reconstruction of minimal genome viruses. Such synthesized minimal genomes could prove useful for identification of genes responsible for pathogenic and other biological differences between viruses.

Of particular interest in the field of molecular biology is the relationship between domain architecture and protein function. The detailed analysis of domain architectures presented here suggests studies that investigate the functional effects of removing or swapping domains in viral multidomain protein architectures The fact that Simplexvirus DNA polymerases contain the extra DNAPolymera_Pol domain and that this domain architecture is conserved among Simplexvirus isolates suggests that it may provide some unique function necessary for efficient replication of Simplexviruses. This hypothesis could be explored experimentally. Similarly, what would be the consequence of adding a Cterminal GlyL_C domain to the gL protein of VZV (which contains one Herpes_UL1 domain), and so making it similar to the gL protein found in HSV-1 and HSV-2 (which has a Herpes_UL1--GlyL_C architecture)?

Interestingly, while it has been noted that domain loss is an important mechanism in eukaryote evolution (probably equally-and possibly even more-important than domain gain) (Zmasek and Godzik, 2011) ; and references therein), in herpesvirus evolution domain loss seems to play a lesser role, as most of the events we were able to detect are domain gains (according to the parsimony principle).

Another implication of this work relates to the observation that in some cases proteins that share the same name are composed of either unrelated (e.g. gL) or very distantly related domains (e.g. DNA polymerase processivity factor) in different herpesvirus species. This raises the question -are such share named truly justified for proteins composed of unrelated domains? And to what extent has their putative shared function been experimentally validated.

Our approach is also expected to facilitate the detection and subsequent experimental study of species-(and strain-) specific proteins (listed in Supplementary Table 2). Whereas HSV1 and HSV2 do not have any species specific proteins given current data, VZV has six, and CMV has by far the most with 130 proteins which are not found in any other species. Interestingly, many of these 130 proteins are specific to one strain (or isolate) of CMV. Unsurprisingly, many of these species-and strain-specific protein do not yet have a Pfam domain (and thus were analyzed by manual BLAST searches in this work). An example of such a protein is the ORF45 protein of KSHV (Zhu and Yuan, 2003) . Our automated approach provides a starting point for the systematic computational and experimental study of these species-and strain-specific proteins-studies, which eventually will provide answers to such questions as: Are these species-and strain-specitic proteins essential under certain conditions? Do they result in altered pathology or clinical symptoms? Do they function in host interaction? Do they possess as of yet undiscovered, but shared protein domains?

In summary, we developed a computational approach called Domain-architecture Aware Inference of Orthologs (DAIO) for the classification of viral proteins into groups of orthologous proteins with identical domain architecures (SOGs). In addition, we established a nomenclature for SOGs that provides the user with information about the biological function and taxonomic distribution for the member proteins of a SOG. We applied this classification and nomenclature to the proteomes of all human Herpesviridae species and made the results publicly accessible via the ViPR database. The acquisition and retention of novel domain architectures suggests that some Herpesviridae proteins may have acquired novel functional characteristics, which can now be explored experimentally.

We developed a semi automated software pipeline to analyze amino acid sequences for their protein domain based architectures and to infer multiple sequence alignments and phylogenetic trees for the molecular sequences corresponding to these architectures, followed by gene duplication inference. This pipeline contains the following five major steps: (1) sequence retrival; (2) domain architecture anlysis, including the inference of the taxonomic distributions of domain architectureseach of which corresponding to one preliminary SOG, and manual naming of domain architecures/preliminary SOGs (to be automated in future versions of this pipeline); (3) extraction of molecular sequences corresponding to domain architectures/preliminary SOGs; (4) multiple sequence alignment and phylogenetic inference; (5) gene duplication inference, to determine which preliminary SOGs contain sequences related by gene duplications and thus need to divided in multiple, final SOGs. Links to all custom software programs developed for this work are available here: https://sites.google.com/site/cmzmasek/home/ software/forester/daio. In the following the tools and methods used are described in more detail.

Individual protein sequences were downloaded from the ViPR database (Pickett et al., 2012) , while entire proteomes were downloaded from UniProtKB (Bateman et al., 2017) .

Multiple sequence alignments were calculated using MAFFT version 7.313 (with "localpair" and "maxiterate 1000" options) (Katoh and Standley, 2013; Kuraku et al., 2013) . Prior to phylogenetic inference, multiple sequence alignment columns with more than 50% gaps were deleted. For comparison we also performed the analyses based on alignments for which we only deleted columns with more than 90% gaps.

Protein domains were analyzed using hmmscan from HMMER v3.1b2 (Eddy, 2011) and the Pfam 31.0 database (Finn et al., 2016) . C.M. Zmasek et al. Virology 529 (2019) 29-42 

Phylogenetic trees were calculated for individual domain architectures (not full-length sequences) except for US22 domain proteins, because US22 domain alignments lack phylogeneticly sufficient signal. Distance-based minimal evolution trees were inferred by FastME 2.0 (Desper and Gascuel, 2002) (with balanced tree swapping and "GME" initial tree options) based on pairwise distances calculated by TREE-PUZZLE 5.2 (Schmidt et al., 2002) using the WAG substitution model (Whelan and Goldman, 2001 ), a uniform model of rate heterogeneity, estimation of amino acid frequencies from the dataset, and approximate parameter estimation using a Neighbor-Joining tree. For maximum likelihood approaches, we employed RAxML version 8.2.9 (Stamatakis et al., 2005) (using 100 bootstrapped data sets and the WAG substitution model). Tree and domain composition diagrams were drawn using Archaeopteryx

[https://sites.google.com/site/cmzmasek/home/ software/forester]. Rooting was performed by the midpoint rooting method. Unless otherwise noted, Pfam domains are displayed ith a E = 10 −6 cutoff. Gene duplication inferences were performed using the SDI and RIO methods Eddy, 2002, 2001) . Automated genome wide domain composition analysis was performed using a specialized software tool, Surfacing version 2.002 [Zmasek CM (2012) , a tool for the functional analysis of domainome/genome evolution [available at https://sites.google.com/site/cmzmasek/home/software/ forester/surfacing]. All conclusions presented in this work are robust relative to the alignment methods, the alignment processing, the phylogeny reconstruction methods, and the parameters used. All sequence, alignment, and phylogeny files are available upon request.

The processes for defining and naming strict ortholog groups were formalized into a set of "rules" and then implemented into a semi-automatic domain-centric phyloinformatics pipeline. Any unique arrangement of single or multiple Pfam domains is considered a domain architecture (DA) Godzik, 2012, 2011) . Most proteins of members of the Herpesviridae have DAs consisting of only a single domain. For example, the UDG domain of uracil DNA glycosylase is a single domain DA, whereas the combination of N-terminal DNA_-pol_B_exo1 and C-terminal DNA_pol_B (denoted as DNA_pol_B_ex-o1--DNA_pol_B) of DNA polymerases is a DA with two domains.

In this analysis, we consider a given DA "present" in a given Herpesviridae species S if the DA is present under a set of thresholds in at least one strain of the species S. The rationale for this is that it is possible to miss a DA in a genome, due to incomplete or erroneous sequences, erroneous assembly and gene-predication (false negatives), and even recent, actual gene loss. The opposite (false positive), on the other hand, is far less likely. For this work, we used two thresholds: a minimal domain length of 40% of the length set forth in the Pfam database (domain fragments are unlikely to be functionally equivalent to full length domains) and a hmmscan E-value cutoff of E = 10 −6 .

For every domain architecture, a set of bootstrap resampled phylogenetic trees (gene trees) was calculated by RAxML (Stamatakis et al., 2005) using protein sequences from one representative for each of the nine human Herpesviridae species. For comparison and validation, we also calculated phylogenetic trees that included non-human hosted Herpesviridae. For illustrations, gene duplications were inferred by comparing the consensus gene trees to the species tree ( Fig. 1) for Herpesviridae using the SDI (Speciation Duplication Inference) algorithm (Zmasek and Eddy, 2001) . To obtain confidence values on orthology assignments (bootstrap support values), we employed the RIO approach (Resampled Inference of Orthologs) to compare sets of bootstrap resampled phylogenetic trees with the species tree for Herpesviridae (Zmasek and Eddy, 2002) .

In this work, we define a strict ortholog group (SOG) as sequences related by speciation events and exhibiting the same domain architecture (based on Pfam domains from Pfam 31.0, a length threshold of 40%, and E-value cutoff of E = 10 −6 ). Based on this approach for defining SOGs, we developed the following naming syntax.

For protein families such as uracil DNA glycosylase, which exhibit the same DA in all nine human Herpesviridae, and which are related by speciation events only, we base our names on (Mocarski and Edward, 2007) as the base name and add a case-sensitive suffix that indicates the taxonomic distribution -"ABG" in this case, since uracil DNA glycosylase appears in each human Alpha-, Beta-, and Gammaherpesvirinae species. Therefore, the full name is "uracil DNA glycosylase_ABG". To indicate presence in some, but not all members of a subfamily, we use lower-case suffixes. "Replication origin-binding protein_ Ab" implies that members of this SOG are present in all human Alphaherpesvirinae species ("A"), and in some (but not all) Betaherpesvirinae ("b").

While most of the human Herpesviridae protein families fall into these basic cases, families which have a (some) domain(s) in common but differ in their DA, are more difficult to rationally name. An example of such a family is glycoprotein B described above. Because members of this family have different DAs, namely "Glycoprotein_B" and "HCMVantigenic_N-Glycoprotein_B", it is composed of two SOGs (named "Glycoprotein B_ ABG.AbG" and "Glycoprotein B_ ABG.b"). In such cases, we split the suffix into two parts, separated by a period. The first part ("ABG") indicates overall presence of common domain(s) for all members of this SOG, Glycoprotein_B in this case. The second part (after the period) relates to entire DAs. ". AbG" of "Glycoprotein B_ ABG.AbG" means that the Glycoprotein_B DA is present in all human Alpha-and Gamma-, and some Betaherpesvirinae. ".b" of "Glycoprotein B_ ABG.b" implies that the "HCMVantigenic_N-Glycoprotein_B" DA is present in some Betaherpesvirinae.

Ovine herpesvirus 2 glycoproteins B, H, and L are sufficient for, and viral glycoprotein Ov8 can enhance, cell-cell membrane fusion

Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs

The herpes simplex virus type 1 regulatory protein ICP27 is required for the prevention of apoptosis in infected human cells

UniProt: the universal protein knowledgebase

Secondary structure and structure-activity relationships of peptides corresponding to the subunit interface of herpes simplex virus DNA polymerase

Role of glycoprotein B of herpes simplex virus type 1 in viral entry and cell fusion

Constitutive signaling of the human cytomegalovirus-encoded chemokine receptor US28

Constitutive signaling of the human cytomegalovirus-encoded receptor UL33 differs from that of its rat cytomegalovirus homolog R33 by promiscuous activation of G proteins of the Gq, Gi, and GsClasses

Roles of uracil-DNA glycosylase and dUTPase in virus replication

The ortholog conjecture is untestable by the current gene ontology but is supported by RNA sequencing data

ICP27 interacts with the Cterminal domain of RNA polymerase II and facilitates its recruitment to herpes simplex virus 1 transcription sites, where it undergoes proteasomal degradation during infection

Herpesvirus systematics

Evolution of the herpesviruses

Fast and accurate phylogeny minimum-evolution principle

Specific inhibition of herpes simplex virus DNA polymerase by helical peptides corresponding to the subunit interface

Accelerated profile HMM searches

Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis

The Pfam protein families database: towards a more sustainable future

Distinguishing homologous from analogous proteins

Construction and properties of a mutant of herpes simplex virus type 1 with glycoprotein H coding sequences deleted

Multiple functions of DNA polymerases

Transcriptional analysis of the murine cytomegalovirus HindIII-I region: identification of a novel immediate-early gene region

Human herpesvirus 6 open reading frame U12 encodes a functional beta-chemokine receptor

Evolutionary history and functional implications of protein domains and their combinations in eukaryotes

Orthologs and paralogs -we need to get it right

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

A comparative study of uracil-DNA glycosylases from human and herpes simplex virus type 1

aLeaves facilitates on-demand exploration of metazoan gene family trees on MAFFT sequence alignment server with enhanced interactivity

A herpes simplex virus mutant in which glycoprotein D sequences are replaced by beta-galactosidase sequences binds to but is unable to penetrate into cells

Characterization of the protease and other products of aminoterminus-proximal cleavage of the herpes simplex virus 1 UL26 protein

Intranuclear delivery of an antiviral peptide mediated by the B subunit of Escherichia coli heatlabile enterotoxin

The catalytic subunit of herpes simplex virus type 1 DNA polymerase contains a nuclear localization signal in the UL42-binding region

Herpes simplex virus ICP27 protein directly interacts with the nuclear pore complex through Nup62, inhibiting host nucleocytoplasmic transport pathways

Molecular phylogeny and evolutionary timescale for the family of Mammalian Herpesviruses

Toward a comprehensive phylogeny for Mammalian and Avian Herpesviruses

Comparative analysis of herpesvirus-common proteins

Gene content phylogeny of herpesviruses

Arrangements in the modular evolution of proteins

Viral exploitation and subversion of the immune system through chemokine mimicry

Testing the ortholog conjecture with comparative functional genomic data from mammals

Modular assembly of genes and the evolution of new functions

Rapid diversification of cell signaling phenotypes by modular domain recombination. Science. (80-.)

Herpesviridae: a brief introduction

ViPR: an open bioinformatics database and analysis resource for virology research

Genetic polymorphisms among human cytomegalovirus (HCMV) wild-type strains

Automatic clustering of orthologs and in-paralogs from pairwise species comparisons

Gene family level comparative analysis of gene expression n mammals validates the ortholog conjecture

A mutant herpes simplex virus type 1 unable to express glycoprotein L cannot enter cells, and its particles lack glycoprotein H

TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing

ICP27 interacts with SRPK1 to mediate HSV splicing inhibition by altering SR protein phosphorylation

Genome sequence of a chimpanzee herpesvirus and its relation to other primate alphaherpesviruses

Sequence variation of the amino-terminal antigenic domains of glycoprotein B of human cytomegalovirus strains isolated from Chinese patients

Herpesvirus entry: an update

Identification and functional comparison of Seven-transmembrane G-protein-coupled BILF1 receptors in recently discovered Nonhuman primate Lymphocryptoviruses

RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees

A genomic perspective on protein families. Science (80-.)

Structural basis for the recognition of cellular mRNA export factor REF by herpes viral proteins HSV-1 ICP27 and HVS ORF57

A hypothesis for DNA viruses as the origin of eukaryotic replication proteins

The Classification and Nomenclature of Viruses The Online (10th) Report of the ICTV

Herpes simplex virus processivity factor UL42 imparts increased DNA-binding specificity to the viral DNA polymerase and decreased dissociation from primer-template without reducing the elongation rate

A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach

Comparative analysis of protein domain organization

A novel immunity system for bacterial nucleic acid degrading toxins and its recruitment in various eukaryotic and DNA viral systems

Evolution by gene duplication: an update

The human herpesvirus 6 G protein-coupled receptor homolog U51 positively regulates virus replication and enhances cell-cell fusion in vitro

Association of herpes simplex virus type 1 ICP8 and ICP27 proteins with cellular RNA polymerase II holoenzyme

The ORF45 protein of Kaposi's sarcoma-associated herpesvirus is associated with purified virions

Processivity factor of DNA polymerase and its expanding role in normal and translesion DNA synthesis

RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs

A simple algorithm to infer gene duplication and speciation events on a gene tree

This Déjà Vu feeling-analysis of multidomain protein evolution in Eukaryotic genomes

Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires

The crystal structure of an unusual processivity factor, herpes simplex virus UL42, bound to the C terminus of its cognate polymerase

The authors thank Sanjay Vashee for critical review of the manuscript. We also thank the primary data providers for sharing their data in public archives, including ViPR and UniProtKB. This work was funded by the National Institute of Allergy and Infectious Diseases (NIH/DHHS) under Contract no. HHSN272201400028C to RHS.

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.virol.2019.01.005.