key: cord-0004567-6p55tzyj authors: Hänske, Jana; Hammacher, Tim; Grenkowitz, Franziska; Mansfeld, Martin; Dau, Tung Huy; Maksimov, Pavlo; Friedrich, Christin; Zimmermann, Wolfgang; Kammerer, Robert title: Natural selection supports escape from concerted evolution of a recently duplicated CEACAM1 paralog in the ruminant CEA gene family date: 2020-02-25 journal: Sci Rep DOI: 10.1038/s41598-020-60425-4 sha: 2e45e30b65fd34a5727e967dda22436bffc2b57e doc_id: 4567 cord_uid: 6p55tzyj Concerted evolution is often observed in multigene families such as the CEA gene family. As a result, sequence similarity of paralogous genes is significantly higher than expected from their evolutionary distance. Gene conversion, a “copy paste” DNA repair mechanism that transfers sequences from one gene to another and homologous recombination are drivers of concerted evolution. Nevertheless, some gene family members escape concerted evolution and acquire sufficient sequence differences that orthologous genes can be assigned in descendant species. Reasons why some gene family members can escape while others are captured by concerted evolution are poorly understood. By analyzing the entire CEA gene family in cattle (Bos taurus) we identified a member (CEACAM32) that was created by gene duplication and cooption of a unique transmembrane domain exon in the most recent ancestor of ruminants. CEACAM32 shows a unique, testis-specific expression pattern. Phylogenetic analysis indicated that CEACAM32 is not involved in concerted evolution of CEACAM1 paralogs in ruminants. However, analysis of gene conversion events revealed that CEACAM32 is subject to gene conversion but remarkably, these events are found in the leader exon and intron sequences but not in exons coding for the Ig-like domains. These findings suggest that natural selection hinders gene conversion affecting protein sequences of the mature protein and thereby support escape of CEACAM32 from concerted evolution. 
The bovine CEA gene family. To identify bovine CEACAMs, we performed similarity searches in three bovine genomic sequence databases (Bos taurus, Bos indicus and Bos grunniens) using the Blast and Blat algorithms as described previously 8. By comparing data from these genomes, we predicted the coding sequence (CDS) of eight CEACAM genes and two pseudogenes. We identified orthologs of human CEACAM1, CEACAM16, CEACAM18, CEACAM19 and CEACAM20, and three putatively functional CEACAM1 paralogs: CEACAM32, CEACAM33 and CEACAM35. The two putative pseudogenes have stop codons in multiple exons, indicating that they do not encode functional proteins. According to the current genome assembly in Ensembl (Cow (UMD3.1)), the chromosomal location of the CEA gene family is syntenic to that of the CEA gene family in other mammals, which is located in the extended leucocyte receptor complex, on chromosome 18 (Fig. 1A). CEACAM1 is the only member of the CEA gene family located between TGFB1 and LIPE. All CEACAM1 paralogs, including one pseudogene, are positioned in one cluster flanked by the CD79A and XRCC1 genes (Fig. 1B). The second pseudogene is a CEACAM18 paralog located next to the CEACAM18 gene within the SIGLEC cluster (Fig. 1B). The location of the orthologous genes CEACAM16, CEACAM19 and CEACAM20 is similar to that in other mammals 7, 8. The orthologous CEACAM genes CEACAM16, CEACAM18, CEACAM19 and CEACAM20 have the same exon arrangement as previously described in other mammals 8 (Fig. S1). In contrast, CEACAM1 in cattle does not contain the B domain exon (255 bp) commonly found in other mammalian CEACAM1 genes, as previously described 23. The CEACAM1 paralog CEACAM32 has a unique transmembrane domain which differs significantly from the transmembrane domains found in CEA gene family members with immune receptor tyrosine-based inhibitory motifs (ITIM) or immune receptor tyrosine-based activation-like motifs (ITAM-like).
CEACAM33 and CEACAM35 contain a transmembrane domain which is very similar to that of human CEACAM3, CEACAM4, and CEACAM21. Human CEACAM3 and CEACAM4 have a cytoplasmic tail with an ITAM-like motif encoded by four exons. Similar exons could be identified in the CEACAM33 and CEACAM35 genes. CEACAM35 is the only CEACAM1 paralog which contains a B domain exon in cattle (Fig. S1). CEACAM34, described in a previous paper 8, is no longer in the whole-genome shotgun contigs (wgs) database at NCBI, which is in accordance with our inability to amplify any fragment with CEACAM34-specific primers from cDNA of different tissues or genomic DNA from different cattle. These results suggest that CEACAM34 does not exist as a separate gene in the cattle genome or that the sequence information previously available was incomplete. A detailed depiction of the exon arrangement of all CEACAM genes is shown in Figs. 1C and S1A.

Differential expression of bovine CEACAM1 paralogs. First, we designed primers for the bovine housekeeping gene GAPDH. The primer pair was designed to amplify a specific fragment of GAPDH mRNA-derived cDNA and a fragment of genomic DNA from the GAPDH gene but not from a processed pseudogene. Both fragments differ in size and therefore could be used to detect and quantify genomic DNA contamination in the cDNA preparations (Table 1, Fig. 2A). Next, we designed CEACAM-specific primers (Table 1) complementary to individual leader exon and N domain exon sequences. Using these primers we screened the expression of bovine CEACAMs in different candidate tissues, including liver, lung, skin, kidney, rumen, small intestine, large intestine, spleen, udder, tonsils, lymph nodes, granulocytes and lymphocytes. CEACAM32 mRNA was detected in testis but not in any other tissue tested (Fig. 2B). Expression of CEACAM33 and CEACAM35 was detected in bovine granulocytes isolated from peripheral blood (Fig. 2C) and at low level in other tissues (data not shown).
To substantiate our prediction of the exon composition of the CEACAM32 gene, we performed 3′ RACE experiments starting from the N domain exon to identify the 3′ end of CEACAM32 cDNA. Sequencing of the cDNA obtained by 3′ RACE revealed that the CEACAM32 mRNA encodes a unique transmembrane domain, which contains a stop codon and a poly A signal, resulting in a very short 3′ untranslated region. Based on the 3′ sequence of CEACAM32 cDNA, we designed primers for the amplification of full-length CEACAM32 cDNA, which was then cloned and sequenced (Table 1). The CEACAM32 mRNA (GenBank accession no. MH684294) codes for a single isoform (Fig. 2D) composed of a leader sequence, an IgV-like domain (N domain), an IgC-like domain (A2 type domain), a transmembrane domain and an extremely short cytoplasmic domain consisting of four amino acids. Amplification and sequencing of the full-length CEACAM33 cDNA indicated that four different splice isoforms exist (Fig. 2D). However, we could only confirm two of these isoforms by cDNA cloning. One CEACAM33 splice variant contains the ITAM-like motif (GenBank accession no. MH684295), while another isoform (GenBank accession no. MH684296) lacks the transmembrane domain, indicating that it encodes a secreted protein (Fig. 2E). CEACAM35 (GenBank accession no. MH684297) exists as one isoform (Fig. 2D) composed of a leader exon, an N domain exon, one IgC-like domain (A1 type) exon, a transmembrane domain exon and four cytoplasmic domain exons. The four cytoplasmic domain exons code for an ITAM-like motif slightly different from the ones found in CEACAM33 cDNA. The B domain exon of the CEACAM35 gene, identified as a putatively spliceable exon in the genome (Fig. 1C), was not included in the analyzed mRNAs. The putative proteins encoded by bovine CEACAM1 and its paralogous genes are depicted in Fig. 2E.

Phylogenetic relationship of bovine CEACAMs. We compared the IgV-like (N domain) exon nucleotide sequences of all bovine CEACAMs (Fig.
3A). The N domain exon sequences of the CEACAM1 paralogs cluster together with those of the CEACAM1a and CEACAM1b alleles, while the N domain exon sequences of conserved CEACAMs are more distantly related. The N domain exon sequence of the inhibitory receptor CEACAM1 was most closely related to that of the activating receptor CEACAM33, followed by that of the second activating receptor, CEACAM35. The N domain exon sequence of CEACAM32 showed the greatest difference from the N domain exons of CEACAM1 and of all other CEACAM1 paralogs (Fig. 3A). We further determined the relationship of the IgC-like domain exons of bovine CEACAMs (Fig. S3). CEACAM1 and CEACAM33 contain two IgC-like domains of the A1 and A2 type. CEACAM35 is composed of an A1 and a B domain. The IgC-like domains of CEACAM32 and the CEACAMps are of the A2 type (Figs. 1C; S3). Next, we analyzed the relationship of the transmembrane exon nucleotide sequences of bovine CEACAM1 paralogs with those of other species (Fig. 3B). As previously observed, two main forms of transmembrane domain exons can be discerned: one cluster is composed of transmembrane domain exon sequences present in genes encoding ITIM motifs, and the other contains the transmembrane domain exons associated with exons which encode an ITAM-like signaling motif. The transmembrane domain exon sequence of CEACAM32 is more closely related to the CEACAM1-related transmembrane domain exon sequences but is clearly separated from the CEACAM1 transmembrane exon cluster (Fig. 3B). When we compared all identified N domain exons of CEACAM32 with N domain exons from CEACAM1 and closely related CEACAM1 paralogs of selected mammalian species, we found that all CEACAM32 N domain exons cluster together, while N domain exons of CEACAM1 and other bovine CEACAM1 paralogs formed a separate cluster (Fig. 3C).

Phylogenetic history of the bovine CEA gene family. We used the composition of mobile elements within the different bovine CEACAM genes to analyze the phylogenetic history of bovine CEACAMs (Figs.
S4, S5). CEACAM33 contains most of the elements at the same positions as in CEACAM35, indicating that CEACAM33 and CEACAM35 are the result of a duplication event. Interestingly, further modifications of CEACAM33 seem to have taken place, since the B domain exon is replaced by an A2 domain exon, which is most closely related to the A2 domain exon of CEACAM1 and/or CEACAMps (Fig. S3). The mobile elements around the N domain exons (i.e. in introns 1 and 2) suggest a close relationship between CEACAM1 and CEACAM32. On the other hand, a similar origin of the transmembrane domain exon of these two genes is not supported by the surrounding mobile elements (Fig. S4). Based on the mobile elements found in the CEACAM1-related pseudogene, it is more closely related to the ITAM-containing CEACAMs, CEACAM33 and CEACAM35, than to CEACAM1 and CEACAM32 (Fig. S4). Mobile elements of conserved CEACAMs indicate that the duplication of CEACAM18 was a rather recent event and that the other conserved CEACAMs evolved separately for quite a while (Fig. S5). The loss of the B domain exon in bovine CEACAM1, the integration of different artiodactyl-specific mobile elements (Fig. 4), and the cooption of the transmembrane domain exon of CEACAM32 into the ruminant CEA gene family are additional genetic markers which allowed further analysis of the evolutionary history of the bovine CEA gene family. Previously we have speculated that B domain exon loss is due to the insertion of mobile elements 23 (Fig. 4). However, when we analyzed CEACAM1 from other artiodactyls, i.e. pig (Sus scrofa), alpaca (Vicugna pacos), and two Old World camelids, the wild Bactrian camel (Camelus ferus) and the Arabian camel (Camelus dromedarius), we also did not find B domain exons in the CEACAM1 genes (Fig. 5A; [...] years ago (mya).
On the other hand, mobile elements similar to those found in the A1-A2 intron of bovine CEACAM1 were detected in CEACAM1 of goats (Capra hircus), sheep (Ovis aries) and giraffe (Giraffa camelopardalis tippelskirchi) but not in CEACAM1 of pigs and camelids (Fig. 5A). This suggests that the B domain exon loss occurred before mobile elements were integrated into the A1-A2 intron of the common ancestor of ruminant CEACAM1. Furthermore, in the database, CEACAM35 has a complete B domain exon while CEACAM33 contains only part of the B domain exon from the 5′ region (Fig. 5B). This indicates that the duplication event which created CEACAM35 took place before the CEACAM1-like ancestor lost its B domain. CEACAM33 seems to be younger than CEACAM35 and may have evolved by duplication of a CEACAM35 ancestor followed by further modifications.

A close relative of the transmembrane domain exon of bovine CEACAM32 was found in Bovidae, Cervidae and in some Giraffidae, i.e. in okapi but not in giraffe, and not in Suidae and Camelidae. These data suggest that the birth of CEACAM32 took place in the most recent ancestor of ruminants.

Two different CEACAM1 alleles exist in cattle, CEACAM1a (GenBank accession no. AY345127) and CEACAM1b (GenBank accession no. AY487416). The alleles differ mainly in their N exon sequences by a number of non-synonymous mutations and, most remarkably, by an in-frame 9-nucleotide deletion in the CEACAM1a allele. Interestingly, the CEACAM1a allele was found in the wgs database for Bos taurus and Bos indicus but not for Bos grunniens, Bison bison, and Bubalus bubalis. In contrast, the CEACAM1b allele was found in the database of all five bovine species. Taken together, this indicates that the CEACAM1b allele is the original bovine CEACAM1 allele and that the CEACAM1a allele first appeared in the ancestor of domestic cattle 0.2-1 mya 4.
[Figure panel: alignment of human tRNA-Glu with the CHR-1 consensus (type I).]

Concerted evolution and gene conversion between CEACAM1 paralogs. The phylogenetic analysis of the N domains of bovine CEACAM1 paralogs indicates that the N domains of CEACAM1, CEACAM33 and CEACAM35 exhibit concerted evolution while the N domain of CEACAM32 evolved independently. To better understand the mechanism that allows the independent evolution of CEACAM32, we analyzed recombination and gene conversion events that may have taken place between bovine CEACAM1 paralogs. First, we searched the protein-coding region of bovine CEACAM1 paralogs using GARD. Three breakpoints were identified: one in the leader sequence, one in the N domain exon and one in the A1 domain exon. When we compared the sequences between these breakpoints, we observed that CEACAM32 differs from CEACAM1, CEACAM33 and CEACAM35, in particular in the N and in the first IgC-like domain exons (Fig. 6A). Next, we used the PipMaker software to compare the whole sequences of the CEACAM1 paralogous genes. As shown in Fig. 6B, the CEACAM32 and CEACAM1 N exon sequences differ strongly. However, high similarity between CEACAM1 and CEACAM32 was found in the intron sequences, particularly around the N domain exon (Fig. 6B upper panel). In contrast, sequence similarities between CEACAM1 and CEACAM33 and CEACAM35 were most pronounced in the exon sequences (Fig. 6C middle and lower panels). Finally, we used GENECONV to detect putative gene conversion events between CEACAM1, CEACAM32, CEACAM33, CEACAM35 and CEACAMps1. In total, 23 gene conversion events were detected by GENECONV in the region starting at the leader exon and ending after ~1000 nucleotides of the intron following the N domain exon (Fig.
7 and Table 2). Gene conversion events were detected at the N domain exon for all CEACAM1 paralogs except CEACAM32 (Fig. 7). Remarkably, gene conversion events between CEACAM32 and other CEACAM1 paralogs were detected, but they were restricted to the leader exon and the intron sequence following the N domain exon (Fig. 7). For the analysis of gene conversion, we used only the region of the CEACAM1 paralogs where the sequence similarity was high enough to guarantee a high-quality alignment. All gene conversion events located around the N domain exon have high statistical support (Table 2). These results demonstrate that CEACAM32 is still involved in gene conversion and, therefore, in concerted evolution of CEACAM1 paralogs. However, since gene conversion only affects noncoding regions and the leader exon of CEACAM32, the mature CEACAM32 protein has escaped concerted evolution.

The nucleotide sequence of CEACAM1, starting 2000 bp upstream of exon 1 and including the exons encoding the extracellular part, was compared with that of the corresponding region of CEACAM1 paralogs. For contiguous stretches of nucleotides conserved between the gene pairs, the degree of identity was calculated using a sliding window and displayed as horizontal lines. The location of CEACAM1 exons is indicated by numbered boxes and highlighted by red lines. Note that the sequence similarity between CEACAM1 and CEACAM32 is highest in intron sequences, while the similarity of CEACAM1 sequences with those of CEACAM33 and CEACAM35 is highest for the exon sequences. The different repeat sequences are indicated by differently shaped forms.

Figure 7. Gene conversion of bovine CEACAM1 paralogs. Gene conversion was analyzed using GENECONV. Sequences of all bovine CEACAM1 paralogs, from the leader exon (exon 1) to the first ~1000 nucleotides of the intron between the N domain exon (exon 2) and the A domain exon (exon 3), were aligned using Muscle. Gene conversion was detected by GENECONV.
A total of 23 gene conversion events were detected using default parameters. Events were numbered from 1 to 23. Gene conversions between two CEACAM genes are depicted. The direction of gene conversion is not shown. Statistical support can be found in [...] substitutions was observed (Fig. 8A).

In addition, the MEME application was used to analyze the alignment of CEACAM32 sequences to search for episodic positive selection. Only one codon (codon 27) was found to be under episodic positive selection, i.e. selection for diversification, at a significance level of 0.1 (Fig. 8B). Additional sites were identified which were indicated to be under positive selection by the LRT (p-value > 0.1); four of them (sites 39, 40, 41, 42) are located in the region between positions 36 and 52 (Fig. 8B). Modeling the structure of the CEACAM32 N domain revealed that most of the codons putatively selected for diversification are placed in or near the CFG face (Fig. 8C). The CFG face is known to be the major ligand interaction area of CEACAM N domains. Together, these data suggest that natural selection has favored differentiation of the putative ligand-binding face of CEACAM32 from other bovine CEACAMs.

In the present study, we performed a comprehensive analysis of the bovine CEA gene family and tried to reconstruct its evolutionary history. In contrast to the equine CEA gene family, where we could not identify two of the conserved CEACAMs, namely CEACAM18 and CEACAM20 11, 19, in cattle all conserved CEACAMs, i.e. CEACAM1, CEACAM16, CEACAM18, CEACAM19 and CEACAM20, could be identified in the NCBI genome database. Interestingly, we also identified a CEACAM18 paralog in cattle; however, this paralog is most likely a pseudogene, since several stop codons were found in the coding sequence.
Thus, the gene duplication of CEACAM18 has led to pseudogenization of one paralog, putatively due to an unfavorable consequence of the enhanced gene dosage and/or the lack of a novel function upon gene duplication 2.

In cattle, CEACAM1 paralogs have not undergone a substantial expansion as seen in other species 8, 24, 25. Only four CEACAM1 paralogs were detected, and one of them seems to be a pseudogene. Two of the expressible paralogs contain an ITAM-like signaling motif in their cytoplasmic tails. Thus, they may form paired receptors with CEACAM1 as described before in other species 17, 26-28. These paired receptors are thought to be a countermeasure to the use of CEACAM1 as a cellular receptor by various bacterial pathogens 8, 17, 27. Interestingly, CEACAM1 paralogs with ITAM-like signaling motifs in cattle show a preferential expression in granulocytes, as does human CEACAM3, which was found to be an innate pathogen receptor that mediates uptake and destruction of bacterial pathogens 17, 27. This is in contrast to the finding in dogs, where CEACAMs with ITAM-like signaling motifs were more broadly expressed 26. Thus, bovine CEACAM33 and CEACAM35 may also be innate immune receptors for yet unknown pathogens. Why do two different ITAM-bearing bovine CEACAMs exist? The maintenance of two ITAM-bearing CEACAMs with a very similar expression pattern may imply that two different pathogens exist that use CEACAM1 as a receptor or that both activating CEACAMs bind to different epitopes of the same pathogen. In addition, the presence of two activating CEACAMs may allow the development of more than one inhibitory receptor without losing the "protection" by the activating receptor 8, 17, 27. Indeed, we had previously identified two different CEACAM1 alleles in cattle. We assumed that this is due to usurpation of bovine CEACAM1 by a virus as its cellular receptor.
However, such a virus has not yet been identified 23, 29. Here we show that the CEACAM1a and CEACAM1b alleles are found in certain Bos taurus and Bos indicus breeds, but CEACAM1a is absent from the genomes of Bos grunniens, Bison bison, and Bubalus bubalis. This finding strongly indicates that the three-amino-acid deletion in the CEACAM1a allele occurred in the common ancestor of Bos taurus and Bos indicus after separation from the ancestor of Bos grunniens and Bison bison about 150-400 kya 30. At that time, both ITAM-containing CEACAMs already existed, since their descendants could be found in the genomes of both Bos grunniens and Bison bison. Thus, it may be speculated that the presence of two ITAM-bearing receptors with different ligand-binding domains favors the evolution of distinct CEACAM1 alleles.

From an evolutionary point of view, it is remarkable that CEACAM35 is the only CEACAM1 paralog in cattle which contains a functional B domain exon. Since the B domain exon of CEACAM1 is replaced by a LINE/L1_Art and a CHR1 family SINE interspersed repeat element, duplication of CEACAM1 leading to CEACAM35 may have occurred before this event. Thus, CEACAM1 and CEACAM35 seem to be the original paired receptors of the CEA gene family in artiodactyls. In contrast, CEACAM33 does not have a B domain exon, indicating that CEACAM33 evolved either by duplication of CEACAM1 after the loss of the B domain exon or by duplication of CEACAM35 followed by an independent loss of the B domain exon. The latter possibility is supported by the fact that CEACAM33 contains a region that exhibits sequence similarity to the sequence comprising the end of the intron between the A1 and B domain exons and the first part of the B domain exon of CEACAM35. In addition, the presence of similar mobile elements in CEACAM35 and CEACAM33 suggests that the CEACAM35 ancestor gave rise to CEACAM33.
However, the presence of an A2 domain in CEACAM33 implies the presence of an A2 exon in the CEACAM35 ancestor or requires further modifications of an A2 exon-less CEACAM33 ancestor after gene duplication, possibly by recombination or gene conversion.

One member of the bovine CEA gene family is of particular interest, since it has a unique transmembrane domain which was not found previously in other species. Further searches in the NCBI database suggest that CEACAM32 is specific for ruminants. Thus, CEACAM32 is an example of a new CEACAM born in the most recent ancestor of all ruminants about 35-50 mya 31. Although the origin of the transmembrane domain of CEACAM32 is unknown, the novelty of this domain within the CEA gene family indicates that CEACAM32 is the result of a gene duplication and/or exon shuffling event together with the cooption of a novel transmembrane exon. This novel gene, encoding a transmembrane-anchored cell surface glycoprotein, has gained a unique expression pattern within the bovine CEA gene family. An exclusive expression in testis was previously also observed for CEACAM17, which is a muroid-specific CEACAM 7. Taking into consideration that CEACAMs could interact with a variety of ligands, we speculate that the expression in testis provides a new, very specific environment for these testis-specific CEACAMs, with a unique repertoire of putative ligands that may favor adaptation of the putative ligand-binding domain of the CEACAM to bind to a testis-specific ligand. Indeed, the relaxed or even positive selection at the CFG face fits well with the view that the ligand-binding face underwent affinity maturation to this putative new ligand or that adaptation to species-specific ligands took place. Nevertheless, we want to point out that it is well known that transcription is very permissive in testis and, therefore, duplicated genes are often transcribed in this organ 32, 33.
Thus, further analysis of the role of CEACAM32 is needed to support the view that CEACAM32 plays a role in testis.

Comparison of the N domains of bovine CEACAM1 paralogs showed that CEACAM1 and the two CEACAMs containing an ITAM-like signaling motif (CEACAM33 and CEACAM35) underwent concerted evolution. A similar finding was recently reported for human CEACAM1 paralogs 5. Furthermore, concerted evolution of the protein-coding sequence of CEACAM1, CEACAM33 and CEACAM35 seems to be due to gene conversion events that affect the exon sequences and only to a minor extent the intron sequences. In contrast, gene conversion events of CEACAM32 affecting non-protein-coding intron sequences are preferentially maintained during evolution. This indicates that natural selection favors maintenance of the protein sequence of the extracellular part of CEACAM1 and the ITAM-containing CEACAMs while it prevents homogenization of the extracellular part of CEACAM32 and CEACAM1. Gene conversion is supported by both small distance between genes and sequence similarity. Thus, gene conversion at certain places of the intron sequence between CEACAM1 and CEACAM32 may be preferred compared to CEACAM33, CEACAM35 or CEACAMps1. On the other hand, we have noted inversions of the non-CEACAM gene locus present between CEACAM1 and members with ITAM-like motif-encoding exons in cattle and mouse when compared with the human CEACAM gene locus 8. This intrachromosomal inversion can be explained by a recombination event between, e.g., CEACAM1 and an ITAM-like motif-encoding CEACAM1-like gene with a transcriptionally inverse orientation (which indeed all CEACAM1-like genes exhibit; see Fig. 1B), possibly using a looping-out mechanism comprising the non-CEACAM gene region. This mechanism could engage even genes separated by large physical distances, suggesting that the different distances between CEACAM1 and the individual CEACAM1 paralogs are probably of minor importance for the frequency of gene conversion events.
It is of particular importance that concerted evolution within bovine CEACAM1 paralogs is only relevant for the extracellular, ligand-binding part. Due to the contrary signaling capacity of bovine CEACAM1 and its paralogs, functional diversification of the duplicated genes had already taken place 1. Thus, their transmembrane and cytoplasmic parts are not under concerted evolution due to prominent nucleotide sequence differences. We hypothesize that the ligand-binding domains of inhibitory CEACAM1 and activating CEACAMs evolve in a concerted way in order to maintain the countermeasure function of activating CEACAMs against the use of the inhibitory CEACAMs by pathogens. Since the evolution of these CEACAMs is most likely driven by pathogens, which evolve very rapidly, the ligand-binding domains also evolve rather fast. To keep the domains that bind the ligands (probably pathogen adhesins) similar, these paired receptors evolve by concerted evolution mediated by gene conversion and homologous recombination. On the other hand, duplicated genes that have gained a novel expression pattern face a novel environment and may interact with novel putative ligands. Once they interact to a certain extent with these ligands, further optimization by natural selection may exclude concerted evolution with the original parental gene 34. According to the Red Queen hypothesis 35, not moving in concert with its paralogs means separation from them without prominent selection for diversification, as was observed in the current investigation for CEACAM32.

Taken together, analysis of the CEA gene family of cattle and other artiodactyls provided evidence that members of the CEA gene family can escape from concerted evolution by excluding protein-coding regions of the gene from gene conversion, most likely through natural selection. Thus, conserved CEA gene family members are expected to have ligands distinct from those of their founder gene CEACAM1.

Datasets and nomenclature of genes.
Sequence similarity searches were performed using the NCBI BLAST tool blastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi) and the Ensembl BLAST/BLAT search programs (http://www.ensembl.org/Multi/Tools/Blast?db=core) using default parameters. For identification of bovine CEACAM exons, exon and cDNA sequences from known CEACAM and PSG genes were used to search whole-genome shotgun contigs (wgs) databases limited to the organism group Bovidae. Hits were considered to be significant if the E-value was 50%. Once a wgs contig was identified that contained CEACAM-related sequences, we manually confirmed the presence of the complete exon by the number of nucleotides and the identification of CEACAM-typical splice site sequences. Only sequences which were considered to be complete exons were used for further analyses. In a second step, we used the identified exon sequences to search the database again in order to identify all existing paralogous CEACAM genes. Once we had identified individual exons, we predicted the gene structure based on the organization of known CEACAM genes. The location of different exons on the same contig was a prerequisite for considering that these exons belong to the same gene. Gene predictions were further supported by the identification of expressed sequence tags (EST) and/or predictions in genome builds at NCBI and Ensembl, if available. Short exons, like exons coding for cytoplasmic tails, were identified by alignments of downstream sequences of identified transmembrane exons with cytoplasmic exon sequences of human CEACAMs. Sequence alignments for exon identification were performed using ClustalW. The CEA gene family in cattle is not well annotated; therefore, we adopted the nomenclature previously used for the CEA gene family of other mammals, i.e. bovine CEACAM1 paralogs were numbered CEACAM32-CEACAM35 following the canine CEACAM numbers 8. Gene names and corresponding sequences are summarized in Additional File 1.
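The manual exon check described above (expected number of nucleotides plus typical splice site sequences) can be illustrated with a short script. The following is a minimal Python sketch, not the procedure actually used in this study: it assumes that a complete internal exon is flanked by the canonical AG acceptor and GT donor dinucleotides, and the function names, coordinates and toy contig are hypothetical.

```python
def has_canonical_splice_sites(contig: str, start: int, end: int) -> bool:
    """Check whether contig[start:end] is flanked by AG (acceptor) and GT (donor).

    Coordinates are 0-based and half-open. Assumption: canonical AG/GT
    dinucleotides mark the intron ends next to a complete internal exon.
    """
    contig = contig.upper()
    if start < 2 or end + 2 > len(contig) or start >= end:
        return False
    acceptor = contig[start - 2:start]  # last two bases of the upstream intron
    donor = contig[end:end + 2]         # first two bases of the downstream intron
    return acceptor == "AG" and donor == "GT"


def looks_like_complete_exon(contig, start, end, expected_length=None):
    """Combine the length check and the splice site check for one candidate exon."""
    if expected_length is not None and (end - start) != expected_length:
        return False
    return has_canonical_splice_sites(contig, start, end)


# Toy example: lower case = intron, upper case = a 6-nt "exon" (hypothetical data).
toy_contig = "ccttttcag" + "ATGGCC" + "gtaagt"
print(looks_like_complete_exon(toy_contig, 9, 15, expected_length=6))
```

In practice such a filter would only pre-screen candidates; leader and terminal exons lack one of the two splice sites and would still need the manual inspection described in the text.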
The following databases were used for gene loci analyses: UMD3.1 assembly and Btau_5.0.1.

Bovine peripheral blood lymphocytes and granulocytes were isolated from blood of healthy cattle by centrifugation through a Ficoll-Paque gradient of 1.077 g/ml (GE Healthcare, Chalfont St Giles, UK). Lymphocytes were taken from the interphase of the gradient and separated from monocytes by plastic adherence for one hour. Granulocytes were collected from the top of the red blood cell (RBC) pellet and further purified by RBC lysis with ammonium chloride. Different bovine tissue samples were collected from freshly slaughtered healthy cattle and stored in the RNA stabilization reagent RNAlater ® (Invitrogen, Carlsbad, USA) at 4 °C for 24 h or at −80 °C for long-term storage.

Scientific Reports (2020) 10:3404 | https://doi.org/10.1038/s41598-020-60425-4

Reverse transcription-polymerase chain reaction analysis. Total RNA was isolated with the RNeasy ® Mini Kit (Qiagen, Langen, Germany) or using the TRIzol ® reagent (Life Technologies, Karlsruhe, Germany). One µg of total RNA was used for cDNA synthesis by reverse transcription (RT) using the Reverse Transcription System ® (Promega, Mannheim, Germany). The RT product was amplified by polymerase chain reaction (PCR) with DreamTaq polymerase (Thermo Fisher Scientific Inc., Waltham, USA) and gene-specific primers (Metabion, Planegg-Martinsried, Germany) using standard conditions. Primers used are summarized in Table 1. Eight µl of each PCR were analyzed by electrophoresis on a 1.8% agarose gel and visualized by ethidium bromide staining.

Bovine CEACAM cDNA cloning and sequencing. Primers used for amplification of full-length cDNAs are shown in Table 1. For cDNA cloning, the RT product was amplified by PCR with Easy-A High-Fidelity PCR Cloning Enzyme (Agilent) and analyzed by agarose gel electrophoresis.
Specific bands were extracted from the agarose gel using the QIAEX II Gel Extraction Kit (Qiagen). The PCR products were cloned using the StrataClone PCR Cloning Kit (Agilent). Plasmid DNA isolated from various clones was analyzed by PCR and sequencing. Nucleotide sequencing was performed with the BigDye Terminator Cycle Sequencing Kit (PE Applied Biosystems, Weiterstadt, Germany). For the amplification of the 3′ end of CEACAM32 mRNA we used the 3′ RACE System for Rapid Amplification of cDNA Ends (Invitrogen) according to the standard protocol. The amplicons were isolated from an agarose gel using the QIAquick Gel Extraction Kit (Qiagen) and sequenced using the BigDye Terminator Cycle Sequencing Kit (PE Applied Biosystems). This sequence was used to design specific primers to amplify full-length CEACAM32 cDNA. The sequences of the PCR products were determined by direct nucleotide sequencing.

Phylogenetic analysis and bioinformatics. Phylogenetic analyses based on nucleotide and amino acid sequences were conducted using MEGA6 or MEGA7 36,37. Sequence alignments were performed using Muscle as implemented in MEGA7. Phylogenetic trees were constructed using the maximum likelihood (ML) method with bootstrap testing (500 replicates) and the Tamura-Nei substitution model. For comparing sequences by the diagonal plot method we used Dotlet (https://myhits.isb-sib.ch/cgi-bin/dotlet) 38. The program PipMaker (http://bio.cse.psu.edu/) was used to identify conserved contiguous stretches of nucleotides between gene pairs and to calculate the degree of identity, which is summarized as a 'percent identity plot' 39. For identification of mobile DNA elements we used RepeatMasker (http://www.repeatmasker.org/). Recombination between bovine CEACAM1 and its paralogs was detected using the genetic algorithm for recombination detection (GARD) software 40. Gene conversion was analyzed using the GENECONV program (version 1.81a) 41.
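To illustrate the diagonal plot method mentioned above, the following Python sketch marks every pair of sequence windows that share a minimum number of identical positions. It is a simplified stand-in and not the Dotlet implementation: Dotlet scores windows with a substitution matrix, whereas this sketch counts identities only, and the function name `dot_plot`, the window size and the threshold are assumptions chosen for the example.

```python
from typing import List, Tuple


def dot_plot(seq_a: str, seq_b: str, window: int = 11,
             min_matches: int = 9) -> List[Tuple[int, int]]:
    """Return (i, j) window start positions at which the windows of
    seq_a and seq_b share at least min_matches identical positions."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


# Comparing a toy sequence with itself yields dots on the main diagonal.
toy = "ACGTACGGTCA"
self_dots = dot_plot(toy, toy, window=5, min_matches=5)
```

In such a plot, a long uninterrupted diagonal between two paralogs indicates a stretch of high similarity, which is the kind of pattern expected where sequence has been homogenized between genes.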
In order to determine the selective pressure on the maintenance of the nucleotide sequences, the number of nonsynonymous nucleotide substitutions per nonsynonymous site (dN) and the number of synonymous substitutions per synonymous site (dS) were determined for N domain exons. Sequence gaps and insertions were first edited manually, guided by the amino acid sequences. The dN/dS ratios, as well as the cumulative synonymous and nonsynonymous substitutions along coding regions of N domain exons from orthologous genes, were then calculated with the SNAP program (Synonymous Nonsynonymous Analysis Program; http://www.hiv.lanl.gov/content/sequence/SNAP/SNAP.html), which uses the modified Nei-Gojobori model with Jukes-Cantor correction. We verified the results using the JCoDa software (http://www.tcnj.edu/nayaklab/jcoda). For the detection of individual sites under positive selection we used the mixed effects model of evolution (MEME) software 42. 3D modeling was performed with the Geno3D (release 2) homology modeling software 43 and visualized with the PyMOL software (Schrödinger Inc., New York, USA). The N domain of CEACAM32 was modeled using pdb2qsqA-0 and pdp116zA-0 as templates.

Ethics approval and consent to participate. Healthy cattle were slaughtered for meat production at the abattoir "LandWert Hof Sundhagen" and not as part of this study; the abattoir gave permission to use the tissues for the present study. Further tissue collection was approved by the animal use committee of the local authorities (Landesamt für Landwirtschaft, Lebensmittelsicherheit und Fischerei (LALLF) Rostock, Germany; 7221.3-2.1-011/13). All experiments were performed in accordance with relevant guidelines and regulations.

Nucleotide sequences from bovine CEACAMs are available at NCBI GenBank under accession numbers MH684294-MH684297.
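The Jukes-Cantor correction used in the dN/dS analysis of the N domain exons described above can be made concrete with a short Python sketch. It shows only the final correction step, d = −(3/4) ln(1 − 4p/3), applied to the proportions of synonymous and nonsynonymous differences. The counting of sites and differences is assumed to have been done beforehand (for instance by a program such as SNAP), and the function names and the input counts below are hypothetical.

```python
import math
from typing import Tuple


def jukes_cantor(p: float) -> float:
    """Convert an observed proportion of differences p into an estimated
    number of substitutions per site: d = -(3/4) * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def dn_ds(syn_diffs: float, syn_sites: float,
          nonsyn_diffs: float, nonsyn_sites: float) -> Tuple[float, float, float]:
    """Return (dN, dS, dN/dS) from counted differences and sites.
    The ratio is NaN when dS is zero."""
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ratio = d_n / d_s if d_s > 0 else float("nan")
    return d_n, d_s, ratio


# Hypothetical counts for one pairwise comparison of an N domain exon.
d_n, d_s, ratio = dn_ds(syn_diffs=8.0, syn_sites=80.0,
                        nonsyn_diffs=30.0, nonsyn_sites=240.0)
```

A ratio above 1 is conventionally read as a sign of positive selection and a ratio below 1 as a sign of purifying selection. The correction is undefined for p ≥ 0.75, which is why the sketch raises an error for saturated comparisons.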
1. The birth-and-death evolution of multigene families revisited.
2. Concerted and birth-and-death evolution of multigene families.
3. The life and death of gene families.
4. Gene conversion: mechanisms, evolution and human disease.
5. Gene conversions are under purifying selection in the carcinoembryonic antigen immunoglobulin gene families of primates.
6. Maternal-fetal conflict: rapidly evolving proteins in the rodent placenta.
7. Identification of a novel group of evolutionarily conserved members within the rapidly diverging murine Cea family.
8. Coevolution of activating and inhibitory receptors within mammalian carcinoembryonic antigen families.
9. Redefined nomenclature for members of the carcinoembryonic antigen family.
10. Pregnancy-specific glycoproteins: complex gene families regulating maternal-fetal interactions.
11. Convergent evolution of pregnancy-specific glycoproteins in human and horse.
12. Several carcinoembryonic antigens (CD66) serve as receptors for gonococcal opacity proteins.
13. Mouse hepatitis virus strain A59 and blocking antireceptor monoclonal antibody bind to the N-terminal domain of cellular receptor.
14. Carcinoembryonic antigen-related cell adhesion molecule (CEACAM)-binding recombinant polypeptide confers protection against infection by respiratory and urogenital pathogens.
15. Helicobacter pylori exploits human CEACAMs via HopQ for adherence and translocation of CagA.
16. CEACAM1 recognition by bacterial pathogens is species-specific.
17. Granulocyte CEACAM3 is a phagocytic receptor of the innate immune system that mediates recognition and elimination of human-specific pathogens.
18. The carcinoembryonic antigen (CEA) family: structures, suggested functions and expression in normal and malignant tissues.
19. Alternative splicing after gene duplication drives CEACAM1-paralog diversification in the horse.
20. Loss of mammal-specific tectorial membrane component carcinoembryonic antigen cell adhesion molecule 16 (CEACAM16) leads to hearing impairment at low and high frequencies.
21. Loss of the tectorial membrane protein CEACAM16 enhances spontaneous, stimulus-frequency, and transiently evoked otoacoustic emissions.
22. Carcinoembryonic antigen-related cell adhesion molecule 16 interacts with alpha-tectorin and is mutated in autosomal dominant hearing loss (DFNA4).
23. Identification of allelic variants of the bovine immune regulatory molecule CEACAM1 implies a pathogen-driven evolution.
24. Recent expansion and adaptive evolution of the carcinoembryonic antigen family in bats of the Yangochiroptera subgroup.
25. A comprehensive phylogenetic and structural analysis of the carcinoembryonic antigen (CEA) gene family.
26. Species-specific evolution of immune receptor tyrosine based activation motif-containing CEACAM1-related immune receptors in the dog.
27. Defining the roles of human carcinoembryonic antigen-related cellular adhesion molecules during neutrophil responses to Neisseria gonorrhoeae.
28. Coevolution of paired receptors in Xenopus carcinoembryonic antigen-related cell adhesion molecule families suggests appropriation as pathogen receptors.
29. Crystal structure of bovine coronavirus spike protein lectin domain.
30. Correlating Bayesian date estimates with climatic events and domestication using a bovine case study.
31. A complete estimate of the phylogenetic relationships in Ruminantia: a dated species-level supertree of the extant ruminants.
32. Transcriptional promiscuity in testes.
33. Cellular source and mechanisms of high transcriptome complexity in the mammalian testis.
34. Ohno's dilemma: evolution of new genes under continuous selection.
35. Running with the Red Queen: the role of biotic conflicts in evolution.
36. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets.
37. Molecular Evolutionary Genetics Analysis version 6.0. Molecular Biology and Evolution.
38. Dotlet: diagonal plots in a web browser.
39. PipMaker: a World Wide Web server for genomic sequence alignments.
40. GARD: a genetic algorithm for recombination detection.
41. Statistical tests for detecting gene conversion.
42. Detecting individual sites subject to episodic diversifying selection.
43. Geno3D: automatic comparative molecular modelling of protein.
44. Genealogy of families of SINEs in cetaceans and artiodactyls: the presence of a huge superfamily of tRNA(Glu)-derived families of SINEs.

We would like to thank Andrea Braun and Lisa Faust-Klueger for excellent technical assistance. This study was supported by GIZ (Contract no. 81170269; Project No. 13.1432.7-001.00) and DFG (HE 6249/4-1) to R.K. The funding sources had no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.

J.H., F.G., T.H., C.F., M.M., T.H.D. and P.M. performed experiments and/or contributed to data analysis. W.Z. performed data mining, contributed substantially to data interpretation and critically revised the manuscript. R.K. conceived the study, carried out data analysis and drafted the manuscript. All authors contributed to manuscript writing, and read and approved the final version.

The authors declare no competing interests.

Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-60425-4. Correspondence and requests for materials should be addressed to R.K. Reprints and permissions information is available at www.nature.com/reprints. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.