key: cord-0723792-e5l1z2hr authors: Forni, Diego; Sironi, Manuela; Cagliani, Rachele title: Evolutionary history of type II transmembrane serine proteases involved in viral priming date: 2022-02-05 journal: Hum Genet DOI: 10.1007/s00439-022-02435-y sha: 8380c3fed9c11c9db21ebfa8c8417a8a07bbc997 doc_id: 723792 cord_uid: e5l1z2hr Type II transmembrane serine proteases (TTSPs) are a family of trypsin-like membrane-anchored serine proteases that play key roles in the regulation of some crucial processes in physiological conditions, including cardiac function, digestion, cellular iron homeostasis, epidermal differentiation, and immune responses. However, some of them, in particular TTSPs expressed in the human airways, were identified as host factors that promote the proteolytic activation and spread of respiratory viruses such as influenza virus, human metapneumovirus, and coronaviruses, including SARS-CoV-2. Given their involvement in viral priming, we hypothesized that members of the TTSP family may represent targets of positive selection, possibly as the result of virus-driven pressure. Thus, we investigated the evolutionary history of sixteen TTSP genes in mammals. Evolutionary analyses indicate that most of the TTSP genes that have a verified role in viral proteolytic activation present signals of pervasive positive selection, suggesting that viral infections represent a selective pressure driving the evolution of these proteases. We also evaluated genetic diversity in human populations and we identified targets of balancing selection in TMPRSS2 and TMPRSS4. This scenario may be the result of an ancestral and still ongoing host–pathogen arms race. Overall, our results provide evolutionary information about candidate functional sites and polymorphic positions in TTSP genes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00439-022-02435-y. In order to initiate a successful infection, viruses must deliver their genome to the host cell cytosol. To achieve this, enveloped viruses trigger a sequence of events that lead to the fusion of their membrane with that of the target host cell, either by direct fusion with the plasma membrane or by an internalization process into endosomes (Barrett and Dutch 2020; Harrison 2008) . Usually, this process is mainly mediated by the interaction of specific viral attachment proteins expressed on the surface of the virion (envelope glycoproteins) with viral receptors. These latter are cell surface molecules that perform specific host cellular functions but are hijacked by viruses to assist infection. Multiple types of molecules, including proteins, carbohydrates, and lipids, can be used as viral receptors. However, proteins are considered to be more suitable receptors because of their stronger affinity and higher specificity for viral attachment, resulting in an increase in the efficiency of viral entry and in an expansion of the viral host range and tropism (Zhang et al. 2019a, b; Maginnis 2018) . Viral receptors favor the recognition and attachment of the virus to the host cell, but the activation of the membrane fusion machinery often requires the involvement of additional host factors. In fact, most viral fusion proteins are synthesized as a precursor and require processing by host cell proteases to trigger the fusion of the viral lipid envelope with the cellular membrane. Several cellular proteases, including proprotein convertases (i.e., furin), trypsin, cathepsins, and type II transmembrane serine proteases (TTSPs), have been described to participate in the proteolytic activation of viral proteins. TTSPs are a family of trypsin-like membrane-anchored serine proteases that play a key role in the regulation of some crucial processes, including cardiac function, digestion, cellular iron homeostasis, epidermal differentiation, and immune responses (Bugge et al. 2009 ). Consequently, deregulation of these enzymes is an important feature of several pathological conditions (List et al. 2002; Du et al. 2008; Scott et al. 2001) , including tumorigenesis (Choi et al. 2009 ). TTSPs expressed in the human airways were identified as host factors that promote the proteolytic activation and spread of respiratory viruses such as influenza virus, human metapneumovirus, paramyxoviruses, human parainfluenza viruses, Sendai virus, orthoreoviruses, Group A rotaviruses and coronaviruses (Shirogane et al. 2008; Bertram et al. 2010 Bertram et al. , 2013 Shirato et al. 2013; Sakai et al. 2014; Sasaki et al. 2021; Nygaard et al. 2012) , including SARS-CoV-2, the causative agent of the current COVID-19 pandemic (Zang et al. 2020; Hoffmann et al. 2020b) . For a successful infection, influenza viruses depend on the proteolytic activation of their entry-mediating surface protein, hemagglutinin (HA). HA may be activated by diverse host trypsin-like proteases in vitro or in vivo. In particular, influenza viruses (e.g. H1N1 and H7N9 strains) hijack transmembrane serine proteases (including TMPRSS2, TMPRSS4, TMPRSS13, TMPRSS11D, TMPRSS11E, and ST14) (Laporte and Naesens 2017) , that cleave HA into subunits HA1 and HA2, a prerequisite for conformational changes that trigger fusion. In coronaviruses, it is the spike (S) protein that promotes cell entry via membrane fusion (Shang et al. 2020; Li 2016) . After receptor engagement, the coronavirus S protein is enzymatically processed by cellular proteases, mediating efficient membrane fusion with host cells (Bertram et al. 2013; Shirato et al. 2013; Hoffmann et al. 2020a, b; Matsuyama et al. 2020; Simmons et al. 2005) . Human coronaviruses can use a variety of cellular proteases to prime their S protein, including TMPRSS2 and other members of the TTSP family (e.g., TMPRSS4 and TMPRSS11) (Zang et al. 2020; Millet and Whittaker 2015) . Likewise, coronaviruses can utilize different receptors to promote their entry into the host cells: both SARS-CoV and SARS-CoV-2 use angiotensin-converting enzyme 2 (ACE2), whereas MERS-CoV and HCoV-229E exploit DPP4 (dipeptidyl peptidase 4) and ANPEP (aminopeptidase N or CD13), respectively (Hoffmann et al. 2020b; Raj et al. 2013; Yeager et al. 1992; Forni et al. 2017) . Furthermore, previous studies have shown that SARS-CoV and other coronaviruses interact with multiple cellsurface molecules, such as CD209 (DC-SIGN), CLEC4M (L-SIGN/CD299), CLECG (LSECtin), and basigin (BSG or CD147) (Yang et al. 2004; Marzi et al. 2004; Gramberg et al. 2005; Vankadari and Wilce 2020; Wang et al. 2020a) . Interestingly, many of these receptors are targets of positive selection (Cui et al. 2013; Bibiana et al. 2020; Guo et al. 2020; Damas et al. 2020; Ortiz et al. 2008; Enard et al. 2016; Sironi et al. 2015; Wang et al. 2020b) . In line with its role in the activation of viral glycoproteins, TMPRSS2 variants were associated with a higher risk of severe disease and increased susceptibility to human influenza (Cheng et al. 2015) . Variants in the same gene they may also play a significant role in the interindividual differences, particularly in the gender-related bias, of COVID-19 susceptibility and severity (Alshahawey et al. 2020) . Moreover, TMPRSS2 polymorphisms may explain the higher mortality rate among Italians and the association to SARS-CoV-2 infection susceptibility in different populations (Asselta et al. 2020; Irham et al. 2020; Piva et al. 2021; Vargas-Alarcón et al. 2020; Hou et al. 2020) . Given their involvement in viral priming, we reasoned that members of the TTSP family may also represent targets of positive selection, possibly as the result of virus-driven pressure. To test this hypothesis, we systematically analyzed the evolutionary patterns of TTSP genes in a large phylogeny of mammals. We also evaluated their genetic diversity in human populations. Coding sequences for TTSPs were retrieved from the UCSC server (http:// genome. ucsc. edu/). Mammalian homologs of human genes were included only if they represented 1-to-1 orthologs, as reported in the EnsemblCompara GeneTrees (Vilella et al. 2009 ). Sequences that carry stop codons or have a sequence coverage of less than 50% were excluded. The list of species analyzed for each gene is reported in Supplementary Table S1 . The RevTrans 2.0 utility was used to generate multiple sequence alignments (http:// www. cbs. dtu. dk/ servi ces/ RevTr ans/, MAFFT v6.240 as an aligner) (Wernersson and Pedersen 2003) . Phylogenetic trees were reconstructed using the phyML program with gamma-distributed rates, 4 substitution rate categories, and estimation of transition/transversion ratio and proportion of invariable sites (Guindon et al. 2009 ). All alignments were screened for the presence of recombination using genetic algorithm recombination detection (GARD) (Kosakovsky Pond et al. 2006) . GARD is a genetic algorithm implemented in the HYPHY suite, which uses phylogenetic incongruence among segments in the alignment to detect the best-fit number and location of recombination breakpoints. To obtain an overview of the selective patterns of mammalian TTSPs coding genes, the average nonsynonymous substitution (dN)/synonymous substitution (dS) rate ratio was calculated using the single-likelihood ancestor counting (SLAC) method (Kosakovsky Pond and Frost 2005) . To detect positive selection, the codon-based codeml program implemented in the phylogenetic analysis by maximum likelihood (PAML) suite was applied (Yang 2007) . Specifically, a model (M8, positive selection model) that allows a class of sites to evolve with dN/dS > 1 was compared to two models (M7 and M8a, neutral models) that do not allow dN/dS > 1. To assess statistical significance, twice the difference of the likelihood (ΔlnL) for the models (M8a vs M8 and M7 vs M8) is compared to a χ 2 distribution (1 or 2 degrees of freedom for M8a vs M8 and M7 vs M8 comparisons, respectively). To assure reliability, two different codon substitution models were used (F3 × 4 and F61). Positively selected sites were identified using the following methods: (1) the Bayes empirical Bayes (BEB) analysis (with a cutoff of 0.90), which calculates the posterior probability that each codon is from the site class of positive selection (under model M8) (Anisimova et al. 2002) ; (2) fast unbiased Bayesian AppRoximation (FUBAR) (Murrell et al. 2013) , an approximate hierarchical Bayesian method that generates an unconstrained distribution of selection parameters to estimate the posterior probability of positive diversifying selection at each site in a given alignment (with a cutoff ≥ 0.90); (3) the fixed effects likelihood (FEL) (Kosakovsky Pond and Frost 2005) , a maximum-likelihood (ML) approach to infer dN/dS on a per-site basis, assuming that the selection pressure for each site is constant along the entire phylogeny (with a cutoff of ≥ 50). To be conservative, a site was considered under positive selection if it was detected by at least two methods. Protein structures were visualized with the PyMOL software (Schrödinger 2017) , which was also used to create 3D protein figures. GARD, FEL, FUBAR, and SLAC analyses were run locally through the HyPhy suite . In order to identify specific branches with a proportion of sites evolving with dN/dS > 1, the adaptive Branch-Site Random Effects Likelihood method (aBS-REL) was used. This method applies sequential likelihood ratio tests to identify branches under positive selection without a priori knowledge about which lineages are of interest (Smith et al. 2015) ; branches identified using this approach were crossvalidated using the branch-site likelihood ratio tests from PAML (models MA and MA1). Data from the 1000 Genomes Phase 1 Project (1000 Genomes Project Consortium et al. 2010 for individuals with East Asian (AS; Japanese plus Chinese) or European ancestry (CEU), and Yoruba subjects from Nigeria (YRI), were used to calculate nucleotide diversity and site frequency spectrum statistics for seven TTPS genes. Nucleotide diversity over the whole gene was measured as the average number of pairwise sequence nucleotide differences, π (Nei and Li 1979) and the expected per site heterozygosity, θ W (Watterson 1975) . The same statistics were also calculated in 5 kb sliding windows moving with a step of 250 bp. Three site frequency spectrum (SFS) statistics (Tajima's D, DT) (Tajima 1989) , Fu and Li's D* and F* (Fu and Li 1993) ] plus nucleotide diversity indicators were calculated for gene regions showing high level of diversity in the sliding-windows approach. In order to obtain a statistical significance for these values, the same tests were performed for a control set of ~ 2000 genes, as previously described (Forni et al. 2014 (Forni et al. , 2013 . Briefly, ~ 2000 genes were randomly selected, with the only limitations to be longer than 5000 bp and to have more than 80% human-outgroup (chimpanzee or orangutan genomes) aligning bases (Forni et al. 2014 (Forni et al. , 2013 . This approach allowed us to generate a distribution of the above statistics and the calculation of empirical percentiles. We considered as statistically significant those values with a percentile rank higher than the 95th. For the gene region analysis, we selected 5 kb independent gene regions from the 2000 random genes and we used them to generate SFS distributions. Orthologous genomic regions in chimpanzee, orangutan or macaque were retrieved using the liftOver tool (Kuhn et al. 2013 ). In humans, the TTSP family comprises 17 known members that share a number of common structural features (Fig. 1A ). These include a short N-terminal cytoplasmic domain, a transmembrane domain, a variable stem region, and a C-terminal serine protease domain with a catalytic triad (Ser-His-Asp) (Böttcher-Friebertshäuser 2018). The stem region contains an assortment of modular structural domains that contribute to TTSP activation and interaction with substrates (Hooper et al. 2001; Antalis et al. 2010) . Based on the composition of the extracellular protein portion and similarity in domain structure, chromosomal localization, and phylogenetic analysis of the serine protease domains, TTSPs are grouped in four subfamilies known as HAT/DESC, hepsin/ TMPRSS (transmembrane protease/serine S1), matriptase, and corin (Böttcher-Friebertshäuser 2018; Szabo and Bugge 2008) (Fig. 1A) . TTSPs expressed in the human airways have been identified as host cell factors that may support the proteolytic activation and spread of respiratory viruses, including influenza viruses and coronaviruses. In particular, in vitro studies have demonstrated the involvement of 8 TTSP proteases (TMPRSS11D, TMPRSS11E, TMPRSS11A, TMPRSS2, TMPRSS4, TMPRSS13, TMPRSS15 , and ST14) in the activation of HA and of spike proteins (Zang et al. 2020; Hoffmann et al. 2020b; Böttcher-Friebertshäuser 2018; Hayashi et al. 2018) . Other viruses also require TTSPs (at least in vitro) for entry into the host cell. Indeed mammalian orthoreovirus, paramyxoviruses, human metapneumovirus, and human parainfluenza viruses require proteolytic activation through TMPRSS2 and/or TMPRSS11D at the viral entry step (Nygaard et al. 2012; Böttcher-Friebertshäuser 2018) . More recently, Sasaki and colleagues demonstrated that TMPRSS2 and TMPRSS11D also mediate proteolytic activation and trypsin-independent infection in group A rotavirus ) (Supplemetary Table S2 , Fig. 1A ). To comprehensively analyze the selective pressure acting on TTSPs and to obtain an estimate of the extent of constraint acting on these genes, we retrieved coding sequence information for all available mammalian species in the UCSC database (Supplementary Table 1 ). In particular, we retrieved sequences for at least 46 species for 16 TTSP genes. We excluded TMPRSS11B from the analysis, as the existence of a human pseudogene (TMPRSS11BNL) could affect evolutionary inference. Concerning TMPRSS13, we excluded from the analysis the region encoding the first 89 amino acids, as it showed limited similarity across species. Because recombination can generate false positive inferences of positive selection (Sironi et al. 2015; Anisimova et al. 2003) , all alignments were screened for the presence of recombination using genetic algorithm recombination detection (GARD) (Kosakovsky Pond et al. 2006 ). Evidence of recombination was detected only for the CORIN gene. The CORIN coding alignment was thus split on the basis of the recombination breakpoints in three sub-regions (CORIN-reg1-3) and then used for the analyses. For coding genes, the strength of selection can be quantified by comparing the rate of non-synonymous nucleotide substitutions per non-synonymous site (dN) with that of synonymous substitution per synonymous site (dS) (Hughes and Nei 1989) . We thus calculated the average dN/dS ratio (also referred to as ω) using the single-likelihood ancestor counting (SLAC) method (Kosakovsky Pond and Frost 2005) . dN/dS values greater than 1 imply positive (diversifying) selection, whereas ratios lower than 1 indicate purifying selection (functional constraint). The expected dN/dS under selective neutrality is 1. In the 16 TTSPs, the average dN/dS ratio varied from 0.092 to 0.408 (Fig. 1B) . Thus, as is the case of most mammalian genes (Sironi et al. 2015) , purifying selection is the major force acting on these proteases. However, comparison of dN/dS values between TTSPs not involved (as of present knowledge) in the activation of viral glycoproteins and those with a verified role in this process indicated that the latter show generally faster evolutionary rates based on overall dN/dS (Wilcoxon test, p value = 0.02) (Fig. 1B) . This is consistent with the idea that positive selection may act on specific sites in proteins that are otherwise selectively constrained. To test this hypothesis in TTSP genes, we applied the "site models" implemented in the codeml tool from the PAML suite (see methods). Using likelihood ratio tests (LRTs), codeml compares models of gene evolution that allow (model M8) or disallow (models M8a and M7) a class of codons to evolve with dN/dS > 1. Specifically, M8 represents the positive selection model that is tested against the neutral M8a and M7 models. To assess statistical significance, twice the difference of the likelihood (ΔlnL) for the models (M7 vs M8 and M8a vs M8) was compared to a χ 2 distribution (2 degrees of freedom for the M7 vs M8 comparison, 1 degree of freedom for M8a vs M8). To be conservative and to obtain robust results we applied two different codon frequency models (F3 × 4 and F61) and we called a gene as a target of positive selection only if it was detected by both comparisons (M7 versus M8, M8a versus M8) with both models. Applying these criteria, we found seven TTSP genes (TMPRSS11D, TMPRSS11E, TMPRSS11A, TMPRSS15, TMPRSS2, TMPRSS4, to be targets of positive selection in mammals (Table 1) . To identify specific sites targeted by positive selection, we again applied a conservative strategy and we called a site as positively selected only if it was detected by at least two of the following methods: BEB analysis from the M8 model, FUBAR and FEL methods (Kosakovsky Pond and Frost 2005; Murrell et al. 2013) . Positively selected sites were identified in each of the seven genes and, as evident in Fig. 1A , they were not clustered, but rather distributed along the protein sequences. The only exception was TMPRSS11D, where a preponderance of selected sites was observed in the protease S1 domain. We next extended our analysis to explore possible variations in selective pressure across lineages. In particular, we aimed to assess whether specific internal branches in the mammalian phylogenies evolved under episodic positive selection. Because we did not want to make any a priori assumption about which lineages were more likely to have experienced adaptive evolution, we applied the branch siterandom effects likelihood (aBS-REL) method (Smith et al. (Szabo and Bugge. 2008) The name and the localization of the domains in the protein sequences are taken from the Uniprot database (https:// www. unipr ot. org/). LDL-receptor class A: Low-density lipoprotein receptor domain class A; SRCR: Scavenger receptor Cys-rich; SEA: sea urchin sperm protein, enterokinase, agrin; CUB: C1r, C1s, uEGF, and bone morphogenetic protein; MAM: domain in meprin, A5, receptor protein tyrosine phosphatase mu; SRCR: Scavenger receptor Cys-rich; FZ: Frizzled. Red arrow heads indicate the position of positively selected sites (see Table 1 ). The CORIN regions obtained on the basis of the recombination breakpoints are also reported. Viruses target of proteolytic activation by TTSPs are also reported (see Supplementary Table S2 Fig. S1 ). In particular, evidence of selection was detected for the branch leading to Eptesicus fuscus and the Myotis clade for TMPRSS11D, to the myomorpha suborder of Rodentia for TMPRSS11A, and to some lineages of the carnivora order for TMPRSS2. However, these results were not confirmed using the codeml branch-site LRT models (Zhang et al. 2005) , suggesting that positive selection in TTSP genes is mainly pervasive rather than episodic. As mentioned above, an involvement of TTSP genes in the activation and spread of influenza viruses and/or coronaviruses was verified for 8 TTSP genes (Böttcher-Friebertshäuser 2018; Hayashi et al. 2018) . Notably, six of the seven positively selected genes (TMPRSS11D, TMPRSS11E, TMPRSS11A, TMPRSS15, TMPRSS2, and TMPRSS4) are involved in the processing of viral glycoproteins, strongly suggesting that viral infections could represent the selective pressure driving the evolution of these TTSPs. Conversely, proteases with no known role in viral cell entry displayed no positive selection signals. The only exception was the CORIN gene, in which positive selection signals were identified in two of the 3 analyzed sub-regions. Corin is a cardiac membrane serine protease and plays a key role in maintaining cardiac function and regulating blood pressure (Yan et al. 2000) . Corin deficiency causes spontaneous hypertension and cardiac hypertrophy in mice (Chan et al. 2005) , and variants in the human CORIN gene can alter corin protein conformation, inhibit corin zymogen activation, and contribute to hypertension and cardiovascular diseases (Zhou and Wu 2014; Peng et al. 2015) . The N-terminal cytoplasmic tail and the transmembrane domain are unnecessary for corin expression and catalytic activity (Knappe et al. 2003) . However truncation of the cytoplasmic tail can create a new N-terminus that inhibits corin trafficking in the Golgi (He et al. 2020) . Interestingly, three out of the 16 positively selected sites localized to the N-cytoplasmic tail. In particular, positively selected site Cys13 is polymorphic in human populations (rs2289433, p.Cys13Tyr) and it has been associated with serum corin levels (Zhang et al. 2019a, b) . Taking into consideration the physiological functions of this protein and that, up to date, the involvement of corin in viral activation and spread has not been reported, the signal of positive selection is most likely driven by nonviral factors. For the other six positively selected genes, their role in the processing of viral glycoproteins has been widely described (Böttcher-Friebertshäuser 2018; Hayashi et al. 2018; Zmora et al. 2018; Kishimoto et al. 2021; Zang et al. 2020) , supporting the hypothesis that viral infections represented the main selective force in the evolutionary history of these genes. For instance, in swine and mouse, TMPRSS2 can activate HA with monobasic cleavage site, indicating that homologous proteases are involved in HA cleavage in different host species (Tarnow et al. 2014; Peitsch et al. 2014) . In addition to the canonical activator TMPRSS2, other TTSPs have been reported to function as entry cofactors of different viral fusion proteins, at least in vitro. For example, it was recently reported that TMPRSS4 works synergistically with TMPRSS2 to promote SARS-CoV-2 infection of small intestinal enterocytes , and three members of the HAT/DESC subfamily (TMPRSS11D, TMPRSS11A, and TMPRSS11E) are able to activate the SARS-CoV, SARS-CoV-2, and MERS-CoV spike proteins, The redundancy of function of the different proteases may appear to be in contrast with the presence of positive selection signals in the TTSP genes. However, we wish to mention that several lines of evidence suggest that this might not be the case. First, not all TTSPs have the same tissue/cell expression and not all of them may have the same cleavage efficiency. Second, protease activation may occur through different mechanisms in cells, as recently suggested by Liu and colleagues based on proteomic interaction data (Liu et al. 2021) . Third, most data suggesting redundancy of TTSP function in terms of viral protein cleavage have been obtained in vitro and do not necessarily translate to the situation in vivo or in natural infections. Finally, since more proteases are able to cleave proteins of different viruses and the same viral protein can be cleaved by more proteases, different infectious agents may have exerted distinct selection pressures on the same gene, and this could explain, at least in part, the presence of selected sites distributed in different domains. As stated above, positively selected sites in TTSPs genes are not clustered, but rather distributed along the protein sequences. However, all TTSPs involved in the processing of viral glycoproteins have at least one selected site in the peptidase domain (Fig. 1A) . Although such sites are not located near the active site (catalytic triad) of the protease domain (Fig. 2) , we cannot exclude that they may have a role in modulating the proteolytic activity of these proteins. Likewise, we cannot rule out the possibility that the sites identified in the protease domains, as well as in other protein regions, are involved in direct interactions with the viral target proteins (e.g., spike protein in coronavirus, HA in influenza viruses, VP4 protein in group A rotaviruses). It is currently unknown whether proteases are able to establish direct interactions with their substrates. In summary, evolutionary analyses indicate that most of the TTSP genes that have a verified role in proteolytic activation and spread of influenza viruses and/or coronaviruses present pervasive positive selection signals. This evidence strongly suggests that viral infections represent a major selective pressure driving the evolution of these type II transmembrane serine proteases in mammals. Nonetheless, we cannot exclude that the selection signals are due, at least partially, to other selective pressures associated with the physiological functions of these proteins. Future studies are required to clarify this aspect and to evaluate the functional relevance and selective advantages associated with these genetic changes in TTSP proteins during evolution. Infectious diseases have exerted strong selective pressures throughout human history (Haldane 1932; Fumagalli et al. 2011) . Indeed, a number of human loci have been shown to evolve in response to such pressures; the best documented example is the major histocompatibility complex (MHC) (Bernatchez and Landry. 2003) , which is highly polymorphic in several species as a consequence of balancing selection. This is a process whereby genetic variability is maintained in populations due to some selective pressure. Molecular signals of balancing selection have also been described for several genes involved in immune defense, including tolllike receptors (Fumagalli et al. 2009; Ferrer-Admetlla et al. 2008) , various cytokine genes (Fumagalli et al. 2009; Ferrer-Admetlla et al. 2008) , antimicrobial peptides (AMPs) (Cagliani et al. 2008; Hollox and Armour. 2008) , and TRIM5 (Cagliani et al. 2010) . Thus, we addressed the role of natural selection in the shaping of genetic diversity at the seven positively selected TTSP genes in human populations. To this aim, we calculated θ W (an estimate of the expected per site heterozygosity) (Watterson 1975) and π (the average number of pairwise sequence nucleotide differences between haplotypes) (Nei and Li 1979) in individuals with European ancestry (CEU), in Yoruba subjects from Nigeria (YRI), and in East Asian subjects (AS; Japanese plus Chinese). We then compared the values with the distribution obtained from 2000 randomly selected genes (see Methods for details). As reported in Supplementary Table S3 , only TMPRSS2 and TMPRSS4 showed evidence of high nucleotide diversity in all populations, although the rank of θ W and π did not reach the 95th percentile in all of them. High nucleotide diversity might be suggestive of balancing selection. Yet, as a result of mutations and recombination, balancing selection signatures tend to extend over relatively short genomic intervals (Wiuf et al. 2004; Charlesworth. 2006 ). Thus, in order to map the region(s) responsible for high nucleotide diversity in TMPRSS2 and TMPRSS4, we calculated θ W and π in 5 kb sliding-windows moving along these genes. In order to analyze potential regulatory regions, we extended the analysis to 20 kb upstream and downstream each gene. Overall, the same trend of θ W and π was evident in all populations for both genes (Figs. 3, 4) . In TMPRSS2, sliding-windows analysis highlighted 3 regions with high nucleotide diversity in at least two populations ( Fig. 3 and Table 2 ): one located in the 3′ region of TMPRSS2 (TMPRSS2-reg1) and falling in the MX1 gene, one in TMPRSS2 intron 8 (TMPRSS2-reg2), and one covering the promoter region (TMPRSS2-reg3). In the case of TMPRSS4, only two intragenic regions were identified ( Fig. 4 and Table 2 ): one covering the 2 kb surrounding exon 2 (TMPRSS4-reg1) and one covering the entire 3′UTR (TMPRSS4-reg2). An effect of balancing selection is a distortion of the site frequency spectrum (SFS) toward intermediate frequency alleles. Deviation in the SFS is identified by neutrality tests, including Tajima's D (DT) (Tajima 1989) and Fu and Li's D* and F* (Fu and Li. 1993) . DT tests the departure from neutrality by comparing θ W and π, and positive values indicate an excess of intermediate frequency variants. Fu and Li's F* and D* are also based on SNP frequency spectra and differ from DT in that they also take into account whether mutations occur in external or internal branches of a genealogy. For TMPRSS2-reg1 and TMPRSS2-reg2, tests indicated departure from neutrality with significantly (Table 2) . Conversely, for TMPRSS2-reg3, neutrality tests showed more significant values in the CEU population. With respect to TMPRSS4, significantly positive values for most statistics were observed only in the region covering the 3′-UTR (Table 2) . Overall, these analyses indicate that Fig. 3 Human population genetic analysis of TMPRSS2. A Screenshot from the UCSC genome browser (http:// genome. ucsc. edu/, GRCh37/hg19). The panel shows the analyzed genomic region (chr21:42816232-42900085), polymorphic variants associated with COVID-19 and with a potential functional effects (Asselta et al. 2020; Piva et al. 2021; Vargas-Alarcón et al. 2020) , as well as a track with genetic variants likely affecting proximal gene expression in human tissues from the genotype-tissue expression (GTEx, V6 data release). The eQTL item color indicates the effect size attributed to the eQTL: red, high positive; light red, moderate positive; light blue, moderate negative; blue, high negative. B Nucleotide diversity analysis in human populations. Data from 1000 genomes Project were used to calculate θ W and π in sliding windows of 5 kb moving with a step of 250 bp. Horizontal lines represent the 95th percentiles in the distribution of θ W (blue) and π (red). Vertical dashed lines indicate the start and the end of TMPRSS2. Regions with θ W and π greater than the 95th percentile in at least two populations are shaded in gray both TMPRSS2 and TMPRSS4 may be targets of ongoing balancing selection in human populations. The action of natural selection obviously implies the presence of a functional variant with an effect on fitness and a selective pressure to act on it. Recently, variants in TMPRSS2 with a potential functional role and/or associated to viral infection susceptibility have been identified (Asselta et al. 2020; Irham et al. 2020; Piva et al. 2021; Vargas-Alarcón et al. 2020) . Specifically, two variants in TMPRSS2-reg2 (rs55964536 and rs7364083) define two distinct haplotypes with different frequencies among populations and were predicted to modulate TMPRSS2 expression (Asselta et al. 2020) (Fig. 3) . The rs55964536 polymorphism was reported as an eQTL in lung tissue (genotype-tissue Fig. 4 Human population genetic analysis of TMPRSS4. A Screenshot from the UCSC genome browser (http:// genome. ucsc. edu/, GRCh37/hg19). The panel shows the analyzed genomic region (chr11:117927793-118012605) and a track with genetic variants likely affecting proximal gene expression in human tissues from the genotype-tissue expression (GTEx, V6 data release). The eQTL item color indicates the effect size attributed to the eQTL: red, high positive; light red, moderate positive; light blue, moderate negative; blue, high negative. B Nucleotide diversity analysis in human populations. Data from 1000 genomes Project were used to calculate θ W and π in sliding windows of 5 kb moving with a step of 250 bp. Horizontal lines represent the 95th percentiles in the distribution of θ W (blue) and π (red). Vertical dashed lines indicate the start and the end of TMPRSS4. Regions with θ W and π greater than the 95th percentile in at least two populations are shaded in gray expression, GTEx) (GTEx Consortium et al. 2017 ) and the haplotype to which rs55964536 belongs is functionally linked to another eQTL variant (rs8134378) located at a known androgen-responsive enhancer for TMPRSS2 (Asselta et al. 2020) . Based on these observations, Asselta and colleagues suggested that upregulation of TMPRSS2 expression level through androgens could explain the sex bias in COVID-19 severity. On the other hand, rs7364083 belongs to the second haplotype. This latter is also defined by a variant (rs2070788) associated with increased gene expression, a higher risk of severe disease with the 2009 H1N1 pandemic influenza, and increased susceptibility to human H7N9 flu (Cheng et al. 2015) . In TMPRSS2-reg2, another variant (rs2298661) in association with both COVID-19 disease susceptibility and with increased risk of worse disease outcomes was identified through a trans-ancestry meta-analysis (Horowitz et al. 2021) . The rs2298661 polymorphism is in tight linkage disequilibrium (LD R 2 > 0.8) with a non-synonymous variant (rs12329760, p.V160M). V160M falls in an exonic splicing enhancer site (Srp40) and is also predicted to decrease protein stability in structural models, with a plausible role in viral entry (Vishnubhotla et al. 2020) . Taken together, these evidences suggest that the balancing selection signal in TMPRSS2-reg2 could be secondary to the presence of variants with a potential functional effect, most likely on gene expression regulation. Overall, these variants with different allele frequencies in different populations could play an important role in viral entry, helping to explain the heterogeneity observed in different ethnic groups with regard to incidence, severity, progression, and therapeutic response to COVID-19 (Alshahawey et al. 2020) . As reported in Fig. 3 , other eQTL variants fall in TMPRSS2-reg1 and -reg3. In particular, a cluster of eQTLs modulating expression levels of TMPRSS2 and MX1 in lung and tyroid falls in TMPRSS2-reg1, whereas whole-blood MX2 eQTLs are present in TMPRSS2-reg3. MX1 and MX2 are involved in antiviral response. In particular, MX1 can restrict a wide variety of viruses, including orthomyxoviruses (e.g., influenza and Thogoto), bunyaviruses (e.g., La Crosse and rift valley viruses), and hepatitis B virus (Verhelst et al. 2013) . Conversely, MX2 is known to counteract retroviruses (Kane et al. 2013) , and to exert a weak effect against vescicular stomatitis virus infection (Liu et al. 2012) . Moreover, MX1 expression is higher in COVID-19 patients compared to non-infected subjects and tends to decrease with age (Bizzotto et al. 2020) . Variants in TMPRSS2-reg1 might thus contribute to modulate both TMPRSS2 and MX1 expression, acting on the ability to infect and to trigger an effective antiviral response. As stated above, TMPRSS2-reg1 falls within the MX1 gene, hence deviation from neutrality with significantly positive values for most SFS statistics in human populations may be attributable to a specific selection signal for the MX1 gene rather than TMPRSS2. However, within this region, there are eQTL variants for TMPRSS2; these variants may act to determine a different expression of the protease and, consequently, a different ability to activate the viral proteins that are TMPRSS2 protease targets. With respect to TMPRSS4, information on its regulation is more limited. However, eQTLs are present in TMPRSS4-reg2. Specifically, different eQTLs for the esophagous mucosa map to this genomic region. In summary, high nucleotide diversity and a shift in the SFS toward intermediate-frequency alleles suggest that the natural selection signals in these TMPRSS2 and TMPRSS4 regions are driven by the presence of polymorphic variants with a functional role that could contribute to infection with and spread of viruses, and/or to the modulation of disease clinical course. Type II transmembrane serine proteases (TTSPs) are involved in the proteolitic activation of different viruses. In particular, the cleavage of surface glycoproteins by some proteases has been reported for influenza viruses and coronaviruses, but also for other respiratory viruses. These observations led us to hypothesize that the members of this family may be targets of virus-driven selective pressure. It is known that natural selection events acted on the receptors used by viruses to promote their entry into the cell during the early stages of infection. From this point of view, coronavirus receptors, such as ACE2, DPP4, and ANPEP have been extensively studied. However, until now, systematic evolutionary studies had never been carried out on TTSP gene family. In this study, the application of specific molecular evolution tests showed that 7 TTSP genes evolved adaptively in mammals. Six of these are involved in the proteolytic activation of different viral proteins, suggesting that host-pathogen arms races drive the evolution of these genes. Since more proteases are able to prime proteins of different viruses, we may speculate that they play an important role in determining viral tropism and disease progression, explaining, at least partially, the apparent redundancy of the function of these proteases. Regarding the positively selected sites, we observed that all the TTSP genes have at least one site in the protease domain, that these sites are not in close proximity to the catalytic domain and that in general the selected sites are distributed along the gene. Taken together, these results suggest that different infectious agents may have exerted distinct selection pressures on the same gene and that sites targeted by positive selection in the protease and in other domains could participate in the formation of bonds or in stabilizing the reciprocal position of the host and viral proteins. Of particular interest, especially at the present time, is TMPRSS2. This protease represents the canonical activator of SARS-CoV-2, although its role in priming has been confirmed for many other respiratory viruses. Notably, TMPRSS2 presents pervasive positive selection signals in the mammalian phylogeny and could be target of ongoing balancing selection in human populations. In the latter case, the signals of natural selection could be driven by the presence of polymorphic variants with a functional role that may contribute to the infection with and the spread of viruses, and/or to the modulation of the clinical course of the disease. As the world continues to wrestle with the COVID-19 pandemic, research on the virus aims to develop therapies to control the health and socio-economic emergency. Different vaccine platforms are being used for vaccine development (Kyriakidis et al. 2021) , most of them aiming to induce neutralizing antibodies against the spike protein (S). Some of these are already used in vaccination campaigns worldwide, others are close to receiving approval or authorization for large-scale vaccinations. However, the emergence of new lineages of SARS-CoV-2 carrying variants in the spike protein raises concerns about a reduction in the efficacy of the vaccines currently in use. Furthermore, vaccination during the pandemic could exert selective pressures that might favor the expansion of vaccine-resistant SARS-CoV-2 lineages. In this context, it is important to combine vaccinations with alternative therapies aimed at impairing the virus capacity to spread through the host cells. Thus, several studies have identified serine proteases as potential drug targets for the treatment of SARS-CoV-2 infection (Baughn et al. 2020; Qiao et al. 2020; Tang et al. 2020; Zarubin et al. 2020; Deng et al. 2021) . TMPRSS2 is a well-characterized TTSP and represents one of the essential host factors in viral activation and cell entry of influenza viruses, such as pandemic influenza H1N1, and different coronaviruses, including SARS-CoV-2. TMPRSS2 is dispensable for development and homeostasis in mice (Kim et al. 2006 ). Thus, this protease is potentially one of the most promising drug targets for COVID-19 therapy. Therapeutics targeting TMPRSS2 expression and/or activity may be beneficial not only for coronavirus-infected patients but also for those infected with influenza viruses. Indeed, promising data have been obtained from the use of TMPRSS2 inhibitors such as camostat and nafamostat mesylate, which seem to have good efficacy against coronaviruses (Hoffmann et al. 2020b, c) . The design of selective inhibitors will be facilitated by a better understanding of functional genetic diversity, both in humans and in other mammals that are used as model organisms for drug development. Data reported in this study indicate that, in analogy to other genes with a role in viral infections, TTSPs have evolved adaptively in mammals (Sironi et al. 2015) . In particular a continuum of selective pressure acting on TMPRSS2 and TMPRSS4 was observed as these genes also represent selection targets in human populations. This scenario may be the result of an ancestral and still ongoing host-pathogen arms race. Overall, our results provide evolutionary information about candidate functional sites that may shape infection outcomes across mammals and within humans. A map of human genome variation from population-scale sequencing Sex-mediated effects of ACE2 and TMPRSS2 on the incidence and severity of COVID-19; the need for genetic implementation Accuracy and power of bayes prediction of amino acid sites under positive selection Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites The cutting edge: membrane-anchored serine protease activities in the pericellular microenvironment ACE2 and TMPRSS2 variants and expression as candidates to sex and country differences in COVID-19 severity in Italy Viral membrane fusion and the transmembrane domain Targeting TMPRSS2 in SARS-CoV-2 infection MHC studies in nonmodel vertebrates: what have we learned about natural selection in 15 years? TMPRSS2 and TMPRSS4 facilitate trypsin-independent spread of influenza virus in Caco-2 cells Cleavage and activation of the severe acute respiratory syndrome coronavirus spike protein by human airway trypsin-like protease TMPRSS2 activates the human coronavirus 229E for cathepsin-independent host cell entry and is expressed in viral target cells in the respiratory epithelium ACE2 diversity in placental mammals reveals the evolutionary strategy of SARS-CoV-2 SARS-CoV-2 infection boosts MX1 antiviral effector in COVID-19 patients Membrane-anchored serine proteases: host cell factors in proteolytic activation of viral glycoproteins Type II transmembrane serine proteases The signature of long-standing balancing selection at the human defensin beta-1 promoter Long-term balancing selection maintains trans-specific polymorphisms in the human TRIM5 gene Hypertension in mice lacking the proatrial natriuretic peptide convertase corin Balancing selection and its effects on sequences in nearby genome regions Identification of TMPRSS2 as a susceptibility gene for severe 2009 pandemic A(H1N1) influenza and A(H7N9) influenza Type II transmembrane serine proteases in cancer and viral infections Adaptive evolution of bat dipeptidyl peptidase 4 (dpp4): implications for the origin and emergence of Middle East respiratory syndrome coronavirus Broad host range of SARS-CoV-2 predicted by comparative and structural analysis of ACE2 in vertebrates Targeting androgen regulation of TMPRSS2 and ACE2 as a therapeutic strategy to combat COVID-19 The serine protease TMPRSS6 is required to sense iron deficiency Viruses are a dominant driver of protein adaptation in mammals Balancing selection is the main force shaping the evolution of innate immunity genes A 175 million year history of T cell regulatory molecules reveals widespread selection, with adaptive evolution of disease alleles An evolutionary analysis of antigen processing and presentation across different timescales reveals pervasive selection Molecular evolution of human coronavirus genomes Statistical tests of neutrality of mutations Parasites represent a major selective force for interleukin genes and shape the genetic predisposition to autoimmune conditions Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution LSECtin interacts with filovirus glycoproteins and the spike protein of SARS coronavirus Biospecimen Collection Source Site-NDRI, Biospecimen Collection Source Site-RPCI, Biospecimen Core Resource-VARI Estimating maximum likelihood phylogenies with PhyML Evolutionary arms race between virus and host drives genetic diversity in bat severe acute respiratory syndrome-related coronavirus spike genes The causes of evolution Enterokinase enhances influenza A virus infection by activating trypsinogen in human cell lines A common CORIN variant in hypertension reduces corin intracellular trafficking by exposing an inhibitory N-terminus A Multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor Nafamostat mesylate blocks activation of SARS-CoV-2: new treatment option for COVID-19. Antimicrob Agents Chemother 64:e00754-e820 Directional and balancing selection in human beta-defensins Type II transmembrane serine proteases. Insights into an emerging class of cell surface proteolytic enzymes Common genetic variants identify targets for COVID-19 and individuals at high risk of severe disease New insights into genetic susceptibility of COVID-19: an ACE2 and TMPRSS2 polymorphism analysis Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection Genetic variants that influence SARS-CoV-2 receptor TMPRSS2 expression among population cohorts from multiple continents MX2 is an interferon-induced inhibitor of HIV-1 infection Phenotypic analysis of mice lacking the Tmprss2-encoded protease TMPRSS11D and TMPRSS13 activate the SARS-CoV-2 spike protein Functional analysis of the transmembrane domain and activation cleavage of human corin: design and characterization of a soluble corin Not so different after all: a comparison of methods for detecting amino acid sites under selection Automated phylogenetic detection of recombination using a genetic algorithm The UCSC genome browser and associated tools SARS-CoV-2 vaccines strategies: a comprehensive review of phase 3 candidates Airway proteases: an emerging drug target for influenza and other respiratory virus infections Structure, function, and evolution of coronavirus spike proteins Matriptase/ MT-SP1 is required for postnatal survival, epidermal barrier function, hair follicle development, and thymic homeostasis Systematic identification of type I and type II interferon-induced antiviral factors SARS-CoV-2-host proteome interactions for antiviral drug discovery Virus-receptor interactions: the key to cellular invasion DC-SIGN and DC-SIGNR interact with the glycoprotein of Marburg virus and the S protein of severe acute respiratory syndrome coronavirus Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells Host cell proteases: critical determinants of coronavirus tropism and pathogenesis FUBAR: a fast, unconstrained bayesian approximation for inferring selection Mathematical model for studying genetic variation in terms of restriction endonucleases Impact of host proteases on reovirus infection in the respiratory tract The evolutionary history of the CD209 (DC-SIGN) family in humans and non-human primates Activation of influenza A viruses by host proteases from swine airway epithelium Association between high serum soluble corin and hypertension: a cross-sectional study in a general population of China Expression and coexpression analyses of TMPRSS2 HyPhy: hypothesis testing using phylogenies Chinnaiyan AM (2020) Targeting transcriptional regulation of SARS-CoV-2 entry factors ACE2 and TMPRSS2 Dipeptidyl peptidase 4 is a functional receptor for the emerging human coronavirus-EMC The host protease TMPRSS2 plays a major role in in vivo replication of emerging H7N9 and seasonal influenza viruses Host serine proteases TMPRSS2 and TMPRSS11D mediate proteolytic activation and trypsin-independent infection in group A rotaviruses The PyMOL molecular graphics system, version 2.0 Schrödinger, LLC. Google Scholar there is no corresponding record for this reference Insertion of beta-satellite repeats identifies a transmembrane protease causing both congenital and childhood onset autosomal recessive deafness Cell entry mechanisms of SARS-CoV-2 Middle East respiratory syndrome coronavirus infection mediated by the transmembrane serine protease TMPRSS2 Efficient multiplication of human metapneumovirus in Vero cells expressing the transmembrane serine protease TMPRSS2 Inhibitors of cathepsin L prevent severe acute respiratory syndrome coronavirus entry Evolutionary insights into host-pathogen interactions from mammalian sequence data Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection Type II transmembrane serine proteases in development and disease Statistical method for testing the neutral mutation hypothesis by DNA polymorphism Coronavirus membrane fusion mechanism offers a potential target for antiviral development TMPRSS2 is a host factor that is essential for pneumotropism and pathogenicity of H7N9 influenza A virus in mice Emerging WuHan (COVID-19) coronavirus: glycan shield and structure prediction of spike glycoprotein and its interaction with human CD26 Variability in genes related to SARS-CoV-2 entry into host cells (ACE2, TMPRSS2, TMPRSS11A, ELANE, and CTSL) and its potential use in association studies Mx proteins: antiviral gatekeepers that restrain the uninvited EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates Genetic variants in TMPRSS2 and Structure of SARS-CoV-2 spike glycoprotein and TMPRSS2 complex CD147-spike protein is a novel route for SARS-CoV-2 infection to host cells Host-virus arms races drive elevated adaptive evolution in viral receptors On the number of segregating sites in genetical models without recombination RevTrans: multiple alignment of coding DNA from aligned amino acid sequences The probability and chromosomal extent of trans-specific polymorphism Corin, a transmembrane cardiac serine protease, acts as a pro-atrial natriuretic peptideconverting enzyme PAML 4: phylogenetic analysis by maximum likelihood pH-dependent entry of severe acute respiratory syndrome coronavirus is mediated by the spike glycoprotein and enhanced by dendritic cell transfer through DC-SIGN Human aminopeptidase N is a receptor for human coronavirus 229E Ding S (2020) TMPRSS2 and TMPRSS4 promote SARS-CoV-2 infection of human small intestinal enterocytes Structural variability, expression profile, and pharmacogenetic properties of TMPRSS2 gene as a potential target for COVID-19 therapy Evaluation of an improved branchsite likelihood method for detecting positive selection at the molecular level Associations between potentially functional CORIN SNPs and serum corin levels in the Chinese Han population Cell membrane proteins with high N-glycosylation, high expression and multiple interaction partners are preferred by mammalian viruses as receptors Intracellular autoactivation of TMPRSS11A, an airway epithelial transmembrane serine protease Corin in natriuretic peptide processing and hypertension TMPRSS11A activates the influenza A virus hemagglutinin and the MERS coronavirus spike protein and is insensitive against blockade by HAI-1