key: cord-0000023-yba7mdtb
authors: Dufraigne, Christine; Fertil, Bernard; Lespinats, Sylvain; Giron, Alain; Deschavanne, Patrick
title: Detection and characterization of horizontal transfers in prokaryotes using genomic signature
date: 2005-01-13
journal: Nucleic Acids Res
DOI: 10.1093/nar/gni004
sha: f1d1b9694aa43c837d9b758cb2d45d8a24d293e3
doc_id: 23
cord_uid: yba7mdtb

Horizontal DNA transfer is an important factor of evolution and participates in biological diversity. Unfortunately, the location and length of horizontal transfers (HTs) are known for very few species. The usage of short oligonucleotides in a sequence (the so-called genomic signature) has been shown to be species-specific even in DNA fragments as short as 1 kb. The genomic signature is therefore proposed as a tool to detect HTs. Since DNA transfers originate from species with a signature different from that of the recipient species, the analysis of local variations of signature along the recipient genome may allow exogenous DNA to be detected. The strategy consists of (i) scanning the genome with a sliding window and calculating the corresponding local signature, (ii) evaluating its deviation from the signature of the whole genome and (iii) looking for similar signatures in a database of genomic signatures. A total of 22 prokaryote genomes are analyzed in this way. Atypical regions make up ∼6% of each genome on average. Most of the claimed HTs as well as new ones are detected. The origin of putative DNA transfers is looked for among ∼12 000 species. Donor species are proposed and sometimes strongly suggested, considering similarity of signatures. Among the species studied, Bacillus subtilis, Haemophilus influenzae and Escherichia coli have been investigated by many authors and give the opportunity to perform a thorough comparison of most of the bioinformatics methods used to detect HTs.

It is now widely accepted that present-day genomes have a common ancestor (LUCA, Last Universal Common Ancestor). Their current diversity results from events that have modified genomes during evolution. While some of these events happen at the nucleotide level (point mutation, indel of a few nucleotides), others [strand inversion, duplications, repetitions, transpositions and horizontal transfers (HTs)] may concern significant parts of the genome. It has been postulated that HTs (exchange of genetic material between two different species) were very frequent during the first stages of evolution and nowadays persist essentially in prokaryotes (1) (2) (3) (4). As a consequence, the detection of HTs appears crucial to the understanding of evolutionary processes and to the qualitative and quantitative evaluation of the exchange rate between species (5) (6) (7) (8) (9).

The recent complete sequencing of several genomes allows a systematic search for DNA transfers in species, especially in prokaryotes where the probability of occurrence is higher (10) (11) (12) (13) (14). It has been reported in particular that (i) HTs in bacteria account for up to 25% of the genome (8, 14, 15, 16); (ii) archaebacteria and non-pathogenic bacteria are more prone to transfers than pathogenic bacteria (15, 16); and (iii) operational genes are more likely to be transferred than genes dealing with information management (15) (16) (17).

The HT concept was originally coined to explain the dramatic homologies between genes of unrelated species (18, 19). An 'unusual' match is subsequently the criterion for the detection of HTs (20, 21). While this approach allows detection of gene transfers with only a partial knowledge of genomes, it requires the sequencing of homologous genes in a number of species and consequently cannot be used for HT screening.

Genes from a given species are very similar to one another with respect to base composition, codon biases and short oligonucleotide composition (15, 16, 22, 23, 24). As a general rule, usage of oligonucleotides varies less along genomes than among genomes (24) (25) (26) (27). In addition, it has been observed that transferred DNA retains (at least for some time) characteristics from its species of origin (8, 14). These particularities are used alone or in conjunction to detect DNA transfers between species (8, 12, 13). Transferred DNA is consequently detected on the basis of some of its singularities with respect to the sequence characteristics of the recipient species. However, these techniques suffer from several drawbacks and weaknesses (28) (29) (30) that led us to consider generalizing the above approach for the screening of atypical regions in sequences. In fact, the genomic signature, which accounts for all possible biases in DNA sequences, has been shown to be species-specific (26, 27, 31, 32). The signature is approximately invariant along the genome, in such a way that the species of origin of DNA segments as small as 1 kb can be identified with a surprisingly high efficiency by means of their signatures (25, 27). As a consequence, the sequence signature may most often (at least in bacteria) be considered a valuable estimation of the genomic signature. Assuming that (i) transferred DNA fragments exhibit the signature of the species they come from and (ii) recipient and donor signatures are different, the screening of local variations of signature along genomes is expected to reveal regions of interest where HTs might be located. In addition, the status of HT is strongly suggested if the signatures of these regions of interest are found close to the signature of other species.

The sequence signature is defined as the frequencies of the whole set of short oligonucleotides observed in a sequence (26, 31). It can be easily obtained thanks to a very fast algorithm derived from the Chaos Game Representation (CGR) (33), which processes a 1 Mb sequence in a few seconds on a laptop computer. Signatures may be visualized as square images where the color (or gray level) of each pixel represents the frequency of a given oligonucleotide (hereafter called a word) (31) (for examples of signatures, see Supplementary Materials 2, 4 and 6).
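As a minimal sketch of the definition above, the signature can be obtained by direct word counting. This is not the CGR-derived algorithm cited in the text, only a straightforward equivalent for illustration; the function name `signature` and the lexicographic ordering of words are assumptions.

```python
from itertools import product


def signature(seq, k=4):
    """Return the frequencies of all 4**k words of length k.

    Words are listed in lexicographic order (AAAA, AAAC, ...), which is an
    arbitrary convention chosen for this sketch.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    # Slide over every overlapping word of length k.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # words with N or other ambiguity codes are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]
```

With k = 4 the result is a vector of 256 frequencies, i.e. the 256 dimensions referred to later when signatures are compared.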

DNA sequences are gathered from GenBank. The genomes of 22 prokaryotes are scanned for HTs, the B.subtilis, E.coli and H.influenzae genomes being given special attention to illustrate our approach. In particular, B.subtilis and E.coli provide a valuable benchmark thanks to the set of previous works addressing that very issue (12, 14, 16, 34, 35, 36, 37). Signatures of about 12 000 species are obtained from genomic sequences longer than 1.5 kb. Sequences derived from the same species are concatenated for accuracy purposes. Species from the three domains of life, archaea (∼260 species), bacteria (∼3950 species) and eukarya (∼6750 species) as well as viruses (∼1300 species), are represented for a total amount of 1.0 Gb.

The detection of atypical regions is based on the observation of deviation of local signatures (i.e. signatures of small fragments of DNA) from the genomic signature of the recipient species. Genomes are consequently sampled by means of a sliding window with an appropriate size. Ideally, windows should be as small as possible for the highest sampling accuracy. However, intra-genomic variability of the signature increases for small windows. In addition, variability depends on species and word length. Base composition (1-letter words) and 2- and 3-letter words are poorly species-specific: they do not allow a good discrimination between species (25, 27). As a general rule, the longer the words (up to 9 letters long), the higher the specificity of the signature (25, 27, 31). However, counts of long words in small windows are too low to allow a reliable estimation of the parameters. In our hands, the analysis of 4-letter words in a sliding window of 5 kb (with a 0.5 kb step) offers a good trade-off between reliability of counts, file size and computational load, whatever the species. In addition, a double-strand signature (hereafter called the local signature) is computed for each window to get rid of variations induced by strand asymmetry (38) (39) (40) (41) (42).
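The sampling scheme described above can be sketched as follows. The sketch assumes that the double-strand signature is obtained by pooling word counts from a window and from its reverse complement; the function names (`kmer_words`, `local_signatures`) and the 0-based window coordinates are illustrative choices, not taken from the original work.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_words(k):
    """All words of length k over ACGT, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def count_words(seq, k):
    """Count overlapping k-letter words, ignoring words with ambiguous bases."""
    counts = dict.fromkeys(kmer_words(k), 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return counts


def local_signatures(genome, window=5000, step=500, k=4):
    """Return (start, signature) pairs for each sliding window.

    Counts from both strands are pooled (assumed meaning of 'double-strand
    signature'), so a word and its reverse complement get equal frequency.
    """
    genome = genome.upper()
    words = kmer_words(k)
    result = []
    for start in range(0, len(genome) - window + 1, step):
        fragment = genome[start:start + window]
        forward = count_words(fragment, k)
        reverse = count_words(fragment.translate(COMPLEMENT)[::-1], k)
        total = sum(forward.values()) + sum(reverse.values())
        sig = [(forward[w] + reverse[w]) / total if total else 0.0 for w in words]
        result.append((start, sig))
    return result
```

The defaults reproduce the 5 kb window, 0.5 kb step and 4-letter words quoted in the text; the refinement pass described later would correspond to calling the same function with `window=500, step=50`.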

For illustration purposes, local signatures are developed as vertical vectors and stacked together in genome order to give an overall picture of word usage variations along each genome. In such plots, horizontal lines show the variation in frequency of words along the genome, whereas local changes in word usage appear as vertical breaks (Figure 1).

Figure 1. Signatures (4-letter words and 5 kb windows) along the genome for Clostridium acetobutylicum, Deinococcus radiodurans and Mycobacterium tuberculosis. In this kind of display, lines represent the frequency of words along the genome, and columns represent the signatures of windows.

Considering that the greatest part of the genome is species-typical, the signature of the recipient species might have been estimated from the analysis of the whole sequence. Although the vast majority of local signatures look mostly the same (they are believed to be instances of the recipient species signature), some of them may greatly differ. In order to avoid potential biases linked to these outliers, it was decided to select typical local signatures on the basis of their similarities, observed after clustering. The underlying idea is that typical local signatures aggregate in a few large groups, whereas outliers are found in small complementary groups at a great distance from the recipient genome signature. Groups were consequently determined with the K-means clustering tool, using every scheme of clusters between 3 and 8 for each species. Finally, the best scheme of clusters was obtained by a decision tree-based partition [CART algorithm (43)]. The purpose of the CART algorithm is to predict values of a categorical dependent variable (clusters of local signatures in this work, each signature being characterized by its distance to the estimated genomic signature) from one or more continuous and/or categorical predictor variables [the different clustering schemes (3-8 clusters) in this work]. The CART algorithm thus provides an optimal split between groups collecting signatures close to the estimated recipient genome signature and the other groups. For each species, a clustering scheme is selected (e.g. the 5-group clustering) and a partition offered (continued example: groups 2 and 3 on one side; 1, 4 and 5 on the other). The recipient species signature is subsequently calculated as the mean of the signatures of the groups belonging to the partition with the smallest distance to the estimated genomic signature.
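The selection of typical windows can be illustrated with a simplified sketch. It runs a plain K-means for a single number of clusters and then keeps the clusters whose centroids lie closest to the whole-sequence estimate until they cover a chosen share of the windows. This last rule is a stand-in for the CART-based partition used in the text, not a reimplementation of it; `min_share`, the farthest-point initialisation and all function names are assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=100):
    """Plain K-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(v, centroids[c]))
                  for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            updated.append(mean_vector(members) if members else centroids[c])
        if updated == centroids:  # assignments are stable
            break
        centroids = updated
    return labels, centroids


def recipient_signature(local_sigs, k=3, min_share=0.5):
    """Return (recipient signature, indices of windows deemed typical).

    Clusters are ranked by the distance of their centroid to the mean of all
    windows and kept until they hold at least `min_share` of the windows
    (hypothetical rule replacing the CART partition).
    """
    genome_estimate = mean_vector(local_sigs)
    labels, centroids = kmeans(local_sigs, k)
    order = sorted(range(k), key=lambda c: dist(centroids[c], genome_estimate))
    kept, covered = [], 0
    for c in order:
        size = labels.count(c)
        if size == 0:
            continue
        kept.append(c)
        covered += size
        if covered >= min_share * len(local_sigs):
            break
    typical = [i for i, lab in enumerate(labels) if lab in kept]
    return mean_vector([local_sigs[i] for i in typical]), typical
```

In the text, schemes from 3 to 8 clusters are all computed and compared; the sketch would have to be run once per value of `k` to mimic that step.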

Signatures are compared by means of a Euclidean metric, which accounts for differences in word usage. It must be pointed out that distances between signatures are calculated for high-dimensional data (256 dimensions corresponding to the 256 different 4-letter words) and are consequently subject to the so-called 'concentration of measure phenomenon' (44). All distances in a high-dimensional space seem to be comparable since they increase with the square root of the dimension of the space, whereas the variance of their distribution remains unchanged. In fact, the radius of the hypersphere holding 99% of the signatures of our database is only seven times the nearest neighbor distance (smallest distance between two species). Small differences in distance may consequently be considered highly significant.
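The concentration effect mentioned above can be observed numerically. The sketch below draws uniformly distributed random points (an assumption made purely for illustration, real signatures are not uniform) and compares the mean and spread of pairwise Euclidean distances in 2 and in 256 dimensions; the function names are hypothetical.

```python
import math
import random
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_spread(dim, n_points=50, seed=0):
    """Mean and standard deviation of pairwise distances between random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [euclidean(points[i], points[j])
                 for i in range(n_points) for j in range(i + 1, n_points)]
    return statistics.mean(distances), statistics.pstdev(distances)


mean_2d, sd_2d = distance_spread(2)
mean_256d, sd_256d = distance_spread(256)
# The mean distance grows with dimension while the spread stays of the same
# order, so the relative spread (sd / mean) shrinks.
print(mean_2d, sd_2d, mean_256d, sd_256d)
```

Because the relative spread shrinks in this way, a difference in distance that looks small in absolute terms can still separate two signatures clearly.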

For each species, a set of recipient-specific distances is obtained, every local signature belonging to the large clusters being given a distance to the host signature. In order to select outlying signatures, a cut-off distance is chosen on the basis of the distribution of distances observed for each species. The 99th percentile appears to offer a good trade-off between sensitivity and specificity for outlier detection (for the impact of the threshold on the detection of atypical regions, see Results). Most signatures from minority clusters are detected in this way. Isolated signatures are detected as well, while very few signatures from the recipient species clusters are selected (1%). Outliers together with their flanking regions on the genome are later reanalyzed with a smaller window and step (typically 1/10th of the original size) in order to determine their limits more accurately, when the signal-to-noise ratio allows it.
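A minimal sketch of the thresholding step follows. It assumes a nearest-rank definition of the percentile (the text does not say which estimator was used) and takes the distances to the host signature as already computed; the function names are illustrative.

```python
import math


def percentile_cutoff(typical_distances, q=99.0):
    """Nearest-rank q-th percentile of the distances of typical windows."""
    ordered = sorted(typical_distances)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def outlier_windows(all_distances, cutoff):
    """Indices of windows whose distance to the host exceeds the cut-off."""
    return [i for i, d in enumerate(all_distances) if d > cutoff]


# Toy example: 100 typical windows with distances 1..100.
cutoff = percentile_cutoff(range(1, 101))
flagged = outlier_windows([50, 99, 100, 250], cutoff)
```

By construction about 1% of the windows from the large clusters exceed the cut-off, which matches the figure quoted in the text.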

Finally, the gene content of all detected regions is analyzed with the help of species dedicated databases [Genome Information Broker, http://gib.genes.nig.ac.jp/]. A BlastN search (GenBank, default settings) is carried out for each atypical region in order to identify the origin of potential HTs if homology is high enough.

Search for the origin of atypical regions

About 12 000 species (including chromosomal, plasmidic, mitochondrial and chloroplastic DNA) from GenBank are found eligible for a genomic signature. Given the signature of an atypical DNA fragment, species with a close signature might be considered as potential donors. Such a screening is performed for every atypical region of the 22 species under consideration. The five nearest species are retained when their distance to the outlier is donor-compatible.
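The donor search amounts to a nearest-neighbour query in the signature database. The sketch below assumes the database is held as a dictionary mapping species names to signatures and that 'donor-compatible' means a distance not larger than a given `max_distance`; both the data layout and the function name are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_donors(region_sig, database, max_distance, top=5):
    """Return up to `top` (species, distance) pairs, nearest first.

    Only species whose signature lies within `max_distance` of the atypical
    region are kept.
    """
    scored = sorted((euclidean(region_sig, sig), name)
                    for name, sig in database.items())
    return [(name, d) for d, name in scored[:top] if d <= max_distance]


# Toy two-dimensional 'signatures' with placeholder species names.
toy_db = {"A": [0, 0], "B": [3, 4], "C": [6, 8], "D": [30, 40]}
print(candidate_donors([0, 0], toy_db, max_distance=12, top=3))
```

A linear scan over about 12 000 signatures of 256 values each is inexpensive, so no index structure is needed for a database of this size.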

A total of 22 genomes are screened for atypical regions (Table 1 and Supplementary Material 1). On the average, the 6-cluster scheme offers the best partition. However, in a single case (Aeropyrum pernix), nine clusters are required. In general, a single cluster is devoted to rRNA. The mean distance of windows to host varies over species from 121 to 145 (mean = 132, coefficient of variation = 3%). It is tightly correlated (P-value for the Pearson correlation coefficient <10⁻⁴) with the cut-off distance that varies from 178 to 289 (mean = 234, coefficient of variation = 14%). Such large variations can hardly be explained on the mere basis of statistical fluctuations. As already observed (31, 45, 46), variation of oligonucleotide usage along the genome depends on species and can consequently be considered as a species property.

Segmentation quality of atypical regions can be tested using rRNA genes. About 94% of rRNA is detected as atypical ( Table 1) . Borders of rRNA genes are accurate to within 130 nt (0.5 kb window and 50 bp step, threshold 99%). Meanwhile, adjacent tRNAs are identified as well. As a general rule, it can be concluded that rRNA has a specific signature that is consistently at variance with the host signature. In this context, it is worth noticing that rRNA and the remaining outliers lie at comparable distances from the species they belong to, but they are clearly different from one another, rRNAs being consistently found in their own cluster.

The percentage of RNA-free outliers (at the nucleotide level) varies from 1.3 to 13% as a function of species (threshold 99%, Table 1 ). B.subtilis shows the highest percentage of atypical regions, whereas Pyrococcus abyssi has the lowest. Percentages among species are found correlated with the cut-off distance: the higher the cut-off distance, the lower the percentage of outliers (P = 0.007). In fact, a high cutoff distance takes place in species that display a high intragenomic variability, also expressed by a high mean distance to the host (Table 1) . Whether the actual percentage of atypical DNA is an intrinsic property of the species or a mere consequence of the resolution power of nucleotide biases-based methods remains consequently an open question. In addition, as already observed (13, 14) , the percentage of outliers is significantly higher for longer genomes (P = 0.004), whereas the cut-off distance is not related to the length of the genome (P = 0.69).

The mean cut-off distance for the 22 species is 234 (Table 1). This value is chosen to select credible donors. About 50% of atypical regions are subsequently given credible donors (Supplementary Material 1). Each species has its own set of potential donors (Table 1). Many plasmids and viruses are also found, in agreement with the known molecular mechanisms of horizontal transfer (Table 1 and Supplementary Material 1).

A clustering with three classes allows assessing the signature of B.subtilis. The most populated class (collecting 84% of the segments) is chosen to represent B.subtilis. For this subpopulation, the mean distance (arbitrary unit) to the recipient (centroid of the class) and the cut-off distance are 126 and 204, respectively (Table 1). Runs of contiguous outlying windows sharing the same cluster are considered as single transfer events. As a consequence, 58 regions (Figure 2a and Supplementary Material 2) fall beyond the cut-off distance and are thus potential candidates for hosting foreign DNA (for a segmentation of the B.subtilis genome in terms of genes, see Supplementary Material 3). Figure 2b illustrates the accuracy of segmentation of an atypical region obtained by using a sliding window of 0.5 kb with a 50 bp step. rRNA genes make up ∼1.1% of the B.subtilis genome (Table 1). All rRNA genes are found in the outlier population. In addition, all windows containing rRNA are assigned to a specific cluster. In fact, it is known that rRNA has its own signature, which is at variance with the host signature (12). rRNA genes account for 7% of the outliers (tRNAs are not considered in this study, because their size is too small to generate a significant deviation from the host signature if they are isolated).
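The grouping of contiguous outlying windows into single events can be sketched as below. Windows are identified by their index along the genome, 'contiguous' is taken to mean consecutive indices, and coordinates are 0-based with an exclusive end; these conventions and the function name are assumptions.

```python
def merge_outlying_windows(outliers, window=5000, step=500):
    """Merge runs of consecutive outlying windows that share a cluster label.

    `outliers` is an iterable of (window_index, cluster_label) pairs.
    Returns a list of dicts with 'start', 'end' and 'cluster' keys.
    """
    regions = []
    previous_index = None
    for index, cluster in sorted(outliers):
        start = index * step
        end = start + window
        same_run = (
            regions
            and previous_index is not None
            and index == previous_index + 1
            and regions[-1]["cluster"] == cluster
        )
        if same_run:
            regions[-1]["end"] = end  # extend the current region
        else:
            regions.append({"start": start, "end": end, "cluster": cluster})
        previous_index = index
    return regions


# Three consecutive windows in cluster 2, one in cluster 4, one isolated.
print(merge_outlying_windows([(10, 2), (11, 2), (12, 2), (13, 4), (40, 2)]))
```

In this toy run the first three windows collapse into one region, while the change of cluster at window 13 and the gap before window 40 each start a new region.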

A total of 86% of the B.subtilis genome should be considered as B.subtilis typical (Table 1) . When looking for the origin of B.subtilis segments in the 12 000 signature database, B.subtilis appears in the 10 first potential donors for 84% of the whole set of 5 kb sequences that can be derived from its genome. This result confirms that segments having signatures belonging to the predominant clusters are good representatives of the recipient species signature.

The 49 rRNA-free atypical regions vary in size from 1.5 to 135 kb and make up 13% of the total genome (Table 1). About 50% of atypical regions are less than (or around) 6 kb long. Distances of outliers from the first potential donor often fall within the intra-genomic range. However, in some instances, the outlier-to-donor distance is too great to consider the 'closest' species as a potential donor. In contrast, unusually small values deserve specific attention. In particular, the very small distance between bacteriophage SPBc2 and the '2150751-2285750' atypical region (d = 2) makes it possible to locate the part of the B.subtilis genome where bacteriophage SPBc2 is incorporated (12, 47). Other regions in the genome are also found to be similar (in terms of signature) to bacteriophage SPBc2. Most of them correspond to bacteriophages embedded in the B.subtilis genome, whose free forms are not sequenced (12, 47). Observed similarities with SPBc2 are, however, expected since signatures of phages usually share some characteristics with the species they infect (48). The SPBc2 sequence is the only foreign sequence identified in B.subtilis using homology as the criterion (BlastN, with parameters set to default). In fact, Blast analysis of B.subtilis outliers leads to contrasting results. Besides SPBc2 and 7 out of 9 prophages embedded in the genome, the only atypical regions identified are those containing the 30 rRNA genes coded in the B.subtilis genome. The few genes that are homologous to parts of atypical regions are found in species belonging to the Bacillus genus. It is interesting to note that no house-keeping genes (except rRNA) are detected in atypical regions. In fact, a great number of genes in atypical regions (except bacteriophage genes and rRNA) have no known function.

A clustering with five classes is required to determine the recipient species signature of H.influenzae. The three most populated classes (collecting 94% of the segments) are chosen to calculate the H.influenzae signature. The mean distance to host and the cut-off distance are subsequently found equal to 130 and 239, respectively (Table 1). Similarly to B.subtilis, one cluster (1.5% of the H.influenzae genome) is devoted to the 18 rRNA gene copies (Table 1). A total of 91% of rRNA is labeled atypical and accounts for 29% of the outliers.

Analysis of Table 1 shows that 95% of the H.influenzae genome should be considered as H.influenzae typical. In fact, H.influenzae is one of the 10 first potential donors for 92% of all 5 kb sequences that can be derived from its genome. As already observed for B.subtilis, the concordance of these two percentages corroborates the partition procedure used for the selection of typical/atypical fragments.

The 13 rRNA-free atypical regions vary in size from 1.5 to 19.5 kb and make up 3.3% of the genome (Table 1 , Annex 4 and Figure 3 , see Annex 5 for a segmentation of the H.influenzae genome in terms of genes). About 50% of atypical regions are less than (or around) 2.5 kb long. Numbers for H.influenzae are clearly at variance with those for B.subtilis: a smaller percentage of the genome qualifies as atypical and the average size of atypical regions is also smaller. This result is examined below in the context of intra-species signature variability (see Discussion).

A clustering with six classes is required to determine the recipient species signature of E.coli. The main features are summarized in Table 1. The potential donors of the 84 RNA-free atypical regions are given in Annex 6 (for a segmentation of the E.coli genome in terms of genes, see Annex 7). It is worth noticing that 56% of E.coli potential donors belong to the Enterobacteriales family. The analysis of this genome is particularly useful for the comparison with literature (see below).

Numerous approaches for detecting horizontal gene transfers have been proposed in the last two decades. Phylogenetic trees of protein or DNA sequences, unusual distribution of genes and nucleotide composition (including codon biases) are some of the HT features that are considered within the framework of these models (16, 34). Hidden Markov Models (HMMs) (12, 14, 35) and Factorial Correspondence Analysis (FCA) (37) are some of the criteria that are currently employed. Each of the resulting models has its own advantages and caveats (28) (29) (30). As has been recently pointed out by Ragan (49) and Lawrence and Ochman (50), each approach deals with a particular subset of HTs, being for example more efficient for detecting recent transfers, or more effective for the detection of ancient HTs. Our approach, which is clearly based on oligonucleotide composition, assumes that different species have different signatures but does not rely on any other assumption. It is not surprising, therefore, that the genomic signature approach provides results (in terms of % of DNA transferred) in reasonable agreement with those proposed by Garcia-Vallve (16) and Nakamura et al. (14) for the 22 species that were analyzed in common. Correlations between percentages of HTs found by these three methods are highly significant. Two species are extensively studied for HT content: B.subtilis (five methods including ours) and E.coli (six methods including ours). H.influenzae is also analyzed by Garcia-Vallve (16) and Nakamura (14). Comparisons of methods are presented in Tables 2-4 and detailed in Supplementary Materials 3, 5 and 7. A voting procedure (majority rule) has been implemented to determine the status of genes with respect to atypicality. For that task, our initial analysis is converted in terms of genes (Supplementary Materials 3, 5 and 7). The degree of agreement between methods is subsequently measured using the statistical Kappa coefficient (51).
Kappa measures the degree of agreement on a scale from −1 to 1. A Kappa of one indicates full agreement, a Kappa of zero indicates that there is no more agreement than expected by chance and negative values are observed if agreement is weaker than expected by chance (a very rare situation). (14, 13, 11, 13 and 15%, respectively). The number of detected genes per method is close, ranging from 457 for Nakamura (14) to 599 for this work (median 537). Detailed votes are given in Table 2. Among the 4100 genes of the B.subtilis genome, 1011 genes are detected by at least one method (about 25% of B.subtilis genes). The number of 'single vote' genes ranges from 116 for Garcia-Vallve (16) to 47 for Nicolas (12). A total of 470 genes make up the majority consensus set and we detected 453 of them, which is the best score of the five methods. The best agreement with the majority consensus (in terms of Kappas) is reached by Nicolas (12), followed by our method and Moszer (36) (Table 2). Our method gets the best agreement with Nicolas (12) and the worst with the other HMM method used by Nakamura (14) (pairwise Kappa comparison, Table 2 and Supplementary Material 3). In fact, the Nakamura approach is at variance with every other approach (14). It gets the lowest Kappa with the majority consensus or with any other method. From Table 2, the probable number of HT genes in B.subtilis would range from 230 to 1011 with a 'reasonable' estimation around 470 corresponding to the majority consensus. It is to be noted that our method is unable to find two genes that are detected by every other method (Supplementary Material 3). These genes are 338 and 236 nt long, respectively, as compared with 2500 nt, the median size of atypical regions detected by our method (Table 1). Clearly, our method is not appropriate for detecting short isolated atypical genes.
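The voting procedure and the agreement measure can be sketched as follows. The sketch assumes that each method is summarised by the set of genes it flags as atypical, that the majority rule means 'flagged by more than half of the methods', and that the Kappa coefficient is the two-rater Cohen form; the text does not spell out these details, and the function names are illustrative.

```python
from collections import Counter


def majority_consensus(votes):
    """Genes flagged by more than half of the methods.

    `votes` maps a method name to the collection of genes it flags.
    """
    counts = Counter(g for genes in votes.values() for g in set(genes))
    return {g for g, n in counts.items() if n > len(votes) / 2}


def cohen_kappa(flagged_a, flagged_b, all_genes):
    """Cohen's Kappa for two binary (atypical / typical) gene labelings."""
    n = len(all_genes)
    agree = sum(1 for g in all_genes if (g in flagged_a) == (g in flagged_b))
    p_observed = agree / n
    p_a = sum(1 for g in all_genes if g in flagged_a) / n
    p_b = sum(1 for g in all_genes if g in flagged_b) / n
    # Agreement expected by chance from the marginal flagging rates.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:  # both labelings are constant and identical
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with three hypothetical methods and ten genes.
votes = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {3, 5}}
consensus = majority_consensus(votes)
kappa = cohen_kappa({1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, range(1, 11))
```

The agreement of a method with the majority consensus is obtained by passing the consensus set as the second labeling.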

H.influenzae. Garcia-Vallve (16), Nakamura et al. (14) and we are the voters concerned with the analysis of the H.influenzae genome (Supplementary Material 5 and Table 3, H.influenzae). The originality of the results obtained by Nakamura (14) is the salient feature of this comparison. The number of detected HT genes is more than twice as high for Nakamura et al., whereas the part belonging to the majority consensus is the smallest (Table 3). Eleven genes are detected both by Garcia-Vallve and Nakamura (14, 16) but not by our method; however, the small number of voters precludes any specific comment in this respect. The probable number of HT genes in H.influenzae would range between 11 and 273, with a 'reasonable' estimation around 60 (majority consensus of 57) (Table 4).

The results obtained by Hayes and Borodovsky (35) are clearly at variance with the others (Table 4 ). Although the proportion of claimed outliers is within the range of published numbers for E.coli (14, 16, 24, 34, 35, 37) , 37% of them are method-specific, and the agreement with other methods is weak (Table 4 ). Hayes and Borodovsky have obviously developed an approach based on HMM dealing with specific outliers. Lawrence and Ochman (34) also get a poor rating especially because they detect about twice as many genes as the other authors do (Table 4) .

It is worth noting that if the cut-off distance for our method is lowered, i.e. 95% instead of 99% for instance, some of the 'single vote' genes are dug out (for details about the impact of the cut-off distance, see Supplementary Material 7). Meanwhile, the percentage of outliers as reported by our approach rises to 20% and the percentage of 'single vote' genes reaches 24%. As expected, a high cut-off distance provides few single vote genes at the risk of missing some potentially transferred genes. Lowering the cut-off increases the proportion of single vote genes with the advantage of detecting most of the potential transfers (Supplementary Material 7) . There is obviously a continuous grading in gene 'atypicality'. It is suggested to first consider most 'consensual' genes as potential HTs and then apply amelioration models to explain the grading.

It is difficult to assess the relevancy of proposed donors, because genes detected as potential HT have generally undergone amelioration (8). The comparison of recently diverged genomes (species or strains) provides the opportunity to find recent HTs, for which corresponding homologous genes in the donor species may be detected (52). Such a study is performed for five E.coli strains (two K12 strains: E.coli MG1655, E.coli W3110; one uropathogenic strain: E.coli CFT073; two enterohaemorrhagic strains: E.coli O157-H7 RIMD 0509952, E.coli O157-H7 EDL933) and two Shigella flexneri strains (S.flexneri 2a 2457T, S.flexneri 2a 301). These seven strains/species have recently diverged, genome sizes are different and the proportion of horizontally transferred genes varies from one strain/species to another (14, 52). For instance, only ∼40% of the non-redundant set of proteins is common to E.coli strains CFT073, O157-H7 EDL933 and MG1655 (53). These strains/species can be clustered in four groups with respect to phylogeny (Table 5).

Two criteria are used to search for 'recent horizontally transferred genes': atypical regions (window size 1 kb, step 0.5 kb) (i) must have a signature that differs greatly from that of the host [distance to host must be >325, i.e. 2.5 times the E.coli intrinsic mean distance (Table 1)] and (ii) must be present in a limited number of strains/species to ascertain their recentness. In fact, outliers meeting the first criterion generally aggregate into several heterogeneous clusters (K-means clustering) that usually include samples from each strain/species. In some instances, however, some strains/species were absent from the cluster. It was subsequently considered that the corresponding regions might have been recently acquired by the relevant strains/species. Table 5 shows a selection of potential recently transferred genes. Each cluster of atypical regions contains genes present in a specific set of strains. Some atypical genes are strain-specific, some are only absent in the non-pathogenic K12 strains and intermediate situations are also encountered.
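The two criteria can be written as a simple filter. In this sketch each atypical region (or cluster of regions) is assumed to be described by its distance to the host and by the set of strains in which it is found, and 'a limited number of strains' is read as 'absent from at least one strain', following the sentence on clusters from which some strains are missing. The data layout, the function name and the toy values are hypothetical; a mean distance of 130 is used because it is the value implied by the threshold of 325 quoted above (325 / 2.5).

```python
def recent_transfer_candidates(regions, total_strains, mean_distance, factor=2.5):
    """Regions far from the host signature and missing from some strains.

    `regions` is a list of dicts with 'name', 'distance' and 'strains' keys.
    """
    threshold = factor * mean_distance
    return [
        r["name"]
        for r in regions
        if r["distance"] > threshold and len(r["strains"]) < total_strains
    ]


# Toy data: seven strains in total, three candidate regions.
all_strains = {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}
toy_regions = [
    {"name": "r1", "distance": 400, "strains": {"s1", "s2"}},
    {"name": "r2", "distance": 400, "strains": all_strains},
    {"name": "r3", "distance": 300, "strains": {"s1"}},
]
print(recent_transfer_candidates(toy_regions, len(all_strains), mean_distance=130))
```

Region r2 is rejected because it occurs in every strain, and r3 because its distance does not exceed the threshold.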

FASTA and Blast searches confirm that these genes are absent from some of the tested strains, as already observed in the analysis of complete genomes (53) (54) (55). In a large number of cases, we are able to find a well-conserved homologous gene in another species (Table 5). It is interesting to note that some of the donors suggested by our 12 000 signature database are in agreement with the species found by alignment methods. When no homologous gene is found, the proposed donors give credit to the known mechanisms of gene transfer (bacteriophages or plasmids) (Table 5).

It is worth noticing that most of the selected genes that are absent in K12 strains are involved in the pathogenicity of the other strains (52). E.coli O157-H7 is the strain exhibiting the greatest number of genes absent in K12 strains [about 1400 (54)]. It has the greatest number of genes for which no homolog can be found (Table 5). Moreover, we are unable to propose a donor for a great part of these genes (Table 5). Many selected genes for E.coli O157-H7 lie in the Ter region of the genome (between positions 2 000 000 and 2 500 000), in agreement with the published results (56).

We have observed that most genomic regions are typical of the genome they belong to, using the signature as endpoint.

Considering that the genomic signature is species-specific, atypicality of a region in terms of oligonucleotide usage has been promoted as a criterion for the detection of HTs.

However, atypicality-based methods suffer several caveats that reduce their effectiveness in such a way that only a part of HTs can be detected. In fact, transfers between species with close signatures cannot be detected: significant differences between characteristics of transferred DNA and recipient species DNA are required. For similar reasons, HTs that were drastically ameliorated following their introduction cannot be detected either (8, 14) . The most stringent constraint, however, results from the size of the screening window. On the one hand, ideally, the best signal-to-noise ratio would be obtained when windows and HTs have a comparable size.

On the other hand, the window size must be large enough to provide significant word counts, a requirement that becomes more demanding as the size of the words under consideration and the intrinsic variability of the genomic signature along the genome increase. Altogether, the trade-off implemented in this paper allows atypical regions as small as 1 kb to be detected. Indeed, rRNA regions sharing this characteristic were consistently detected. It must be pointed out that smaller fragments may also be detected if their signatures are radically atypical. G+C% atypicality has often been considered as a criterion for detecting HTs (8, 24), but this approach suffers from several drawbacks (28-30). It should be noted that our signature-based method detects regions for which the G+C% lies within one standard deviation of the mean G+C% of the species (for instance, regions 2675251-2676250 in B.subtilis or 534751-535250 in H.influenzae; see also Supplementary Materials 2 and 4).
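
The window-size trade-off discussed above can be illustrated with a minimal Python sketch of a sliding-window scan. This is not the implementation used in this paper: the function names (`signature`, `scan_genome`), the default window, step and word length, and the use of a plain Euclidean distance between word-frequency vectors are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
import math


def signature(seq, k=4):
    """Frequencies of the 4**k words of length k (overlapping), in lexicographic order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing ambiguous symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def scan_genome(genome, window=5000, step=500, k=4):
    """Return (start, distance) pairs: deviation of each local signature
    from the signature of the whole sequence (hypothetical helper)."""
    reference = signature(genome, k)
    profile = []
    for start in range(0, len(genome) - window + 1, step):
        local = signature(genome[start:start + window], k)
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(local, reference)))
        profile.append((start, distance))
    return profile
```

With longer words, a window of a given size contains fewer occurrences of each word, so the local frequencies become noisier; this is the reason why the window cannot be shrunk arbitrarily to match short HTs.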

As already observed by Nicolas et al. (12) for B.subtilis, rRNA definitely has an atypical signature. It is systematically classified as an outlier, whatever the species (Table 1). Although transfer of rRNA from one species to another is unlikely (11, 57), it cannot be firmly ruled out. However, it is clear that the atypical signature of rRNA regions does not imply that they are horizontally transferred.

The signature approach has an interesting property that it shares with HMM (7, 12, 28): detection is not bound to any specific function in the genome. In contrast with most other methods, the signature approach detects not only genes but whole transferred regions as well, in agreement with the described mechanisms of DNA exchange between species. It should be noted that the method allows the detection of several atypical non-coding regions (Supplementary Materials 3, 5 and 7). One major difference between HMM and the signature method lies, beyond the time required for the learning process, in the few resources that HMM can mobilize to deal with a short 'one of its kind' HT. On the other hand, HTs shorter than 1 kb can hardly be detected by a signature-based approach. An innovative HT detector is likely to result from an adequate fusion of both methods.

Several factors contribute to the efficiency of the search for donors. Of course, the distance between putative HT and donor signatures is essential. The accuracy of signatures (linked to the length of available sequences), the density of signatures in the 'vicinity' of the HT and the amount of amelioration sustained by the HT during its presence in the host are also of importance [P. Deschavanne, S. Lespinats and B. Fertil, unpublished results; (25, 27, 31)]. The distance between the signature of a putative HT and that of the closest species varies to a large extent, but the shortest distances usually fall within the intra-genomic range (Table 1, Supplementary Materials 1, 2, 4 and 6). In some cases, the distance between the closest donor signature and the atypical segment signature is so great that no potential donor can be proposed (Supplementary Materials 1, 2, 4 and 6).
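
The donor search described above amounts to a nearest-neighbour query in signature space with a rejection threshold. The following Python sketch shows the idea under stated assumptions: the function name `nearest_donors`, the dictionary layout of the signature bank, the Euclidean distance and the `max_distance` cut-off (standing in for the intra-genomic range) are hypothetical and are not taken from the implementation used in this paper.

```python
import math


def nearest_donors(query, database, max_distance, top=3):
    """Rank candidate donors by distance to the signature of a putative HT.

    query        : signature of the atypical segment (sequence of frequencies)
    database     : dict mapping a species name to its signature (hypothetical layout)
    max_distance : cut-off beyond which no donor is proposed (assumed threshold)
    """
    ranked = sorted((math.dist(query, sig), name) for name, sig in database.items())
    # Keep only the closest candidates that lie within the accepted range.
    return [(name, d) for d, name in ranked[:top] if d <= max_distance]


# Toy example with made-up four-component signatures.
bank = {
    "species_A": [0.50, 0.50, 0.00, 0.00],
    "species_B": [0.25, 0.25, 0.25, 0.25],
    "species_C": [0.00, 0.00, 0.50, 0.50],
}
candidates = nearest_donors([0.45, 0.55, 0.00, 0.00], bank, max_distance=0.2)
```

An empty result corresponds to the situation in which the closest signature is too distant for any donor to be proposed.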

When strong similarities between a given DNA sequence and a foreign species are observed, the hypothesis of an underlying transfer is greatly strengthened. However, the 'true' donor has to be previously sequenced and included in our bank of signatures to allow such a situation to occur. Moreover, we must take into account the intrinsic variability of the signature of short DNA segments (which is a function of their size, but is also species-specific) when compared with the signature of a complete genome or any other large species sample (25, 27, 31). In its present state, our signature database is in no way representative of the diversity and richness of life. However, it must be noted that there is already an obvious structure (in terms of distances between signatures) expressing taxonomic relationships between species in our signature database (31, 58-61). Related species are often found close to one another. Clusters of potential donors may consequently provide pertinent information about the origin of HTs.

The diversity of signatures of putative HTs that can be observed for most of the species analyzed in this paper reveals the multiplicity of transfer events and donors (Supplementary Materials 2, 4 and 6). However, several outliers, not necessarily neighbors in the genome, are given the same set of potential donors (Table 1, Supplementary Materials 1, 2, 4 and 6). In general, the potential donors belong to a few sets of taxonomically close species (Table 1) and share the biotope of the host (Supplementary Materials 1, 2, 4 and 6). For instance, B.subtilis, H.influenzae and E.coli live in distinct biotopes; their potential donors do so as well. It is particularly encouraging to find that most of the potential donors that our approach has pointed out have had the opportunity to exchange DNA material with the recipient species.

Numerous viruses and plasmids qualify as potential donors (Tables 1 and 5, Supplementary Materials 1, 2, 4 and 6). This is not really surprising since they are known as HT vectors. They are often totally or partially inserted together with transferred genes in the host genome (14).

Some atypical DNA segments are particularly peculiar. They are isolated and have a specific signature (distances from neighbors are great), so that they cannot be given a credible set of donors (Supplementary Materials 1, 2, 4 and 6). Lack of data in the search domain, shift of signature features after a substantial amelioration process and structural constraints serving special functions or roles (14, 62) (as is the case for rRNA coding regions) are some of the avenues that remain to be explored in these circumstances.

It would be interesting to localize the region the transfer may come from when the complete genome of the donor is available. However, homology (at the DNA level) is not a pertinent criterion for the comparison of sequences as soon as amelioration has taken place (8, 14). In fact, homology is sometimes weak, e.g. between genes of Escherichia and Salmonella, although these species have 'recently' diverged (34). It is clear that a more powerful search for the origin of putative HTs would have to embody models of amelioration [such as the one designed by Lawrence and Ochman (8)].

When searching for very recent horizontally transferred genes, in different strains of a species for instance, it was possible to find strong homology between detected genes and some genes from other species (Table 5). In numerous cases, the selection of donors is consistent with FASTA results (Table 5). This confirms the pertinence, beyond the similarity of signature between putative HTs and donors, of the proposed method to retrieve the species of origin of a transferred region. It seems that the search for the origin of HTs on the basis of genomic signature is a powerful approach to understand some of the mechanisms of evolution (13, 63).

Oligonucleotide usage is known to be species-specific and to undergo only minor variations along the genome (25, 27). Considered together, these properties allow searching for atypical local signatures that may point out DNA transfers. Results obtained with the 22 genomes analyzed in this paper are found to be in good agreement with the literature (Tables 2-4, Supplementary Materials 3, 5 and 7) (12, 14-16, 24, 34, 35).

The species specificity of the signature allows searching for donor species. Quite often, sets of donor species with common taxonomic features are obtained. With the help of environmental considerations, it is subsequently possible to identify (or collect clues about) potential donors. The search for donors makes use of non-homologous sequences. Partially sequenced species consequently become eligible, provided that 1.5 kb of the genome is available (25, 27). Thanks to the exponential growth of nucleotide databanks, the search for donor species by means of the sequence signature will become more and more pertinent and fruitful in the future. In this context, it is worth noting that computational power is clearly not an issue, since the CGR algorithm described in this paper is fast and of linear order (calculation time is proportional to the number of nucleotides).
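
The linear cost of the CGR computation can be seen in the following Python sketch, which visits each nucleotide once and updates a 2^k x 2^k grid of word counts. It is an illustrative reconstruction, not the code used in this paper: the function name `cgr_signature`, the corner assignment (A, C, G, T at (0,0), (0,1), (1,1), (1,0)) and the integer (bit-shift) formulation of the "move halfway to the corner" step are assumptions.

```python
def cgr_signature(seq, k=4):
    """Word frequencies of length k laid out as a 2**k x 2**k CGR grid.

    One pass over the sequence: time is proportional to its length.
    """
    # Assumed corner assignment of the unit square.
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    x = y = 0
    valid = 0  # consecutive unambiguous nucleotides seen so far
    for base in seq.upper():
        if base not in corner:
            valid = 0  # restart after an ambiguous symbol such as N
            continue
        cx, cy = corner[base]
        # Moving halfway towards the corner, at a resolution of k bits:
        # shift the coordinates right and put the corner bit on top.
        x = (x >> 1) | (cx << (k - 1))
        y = (y >> 1) | (cy << (k - 1))
        valid += 1
        if valid >= k:  # the cell now encodes the last k nucleotides
            grid[y][x] += 1
    total = sum(map(sum, grid))
    if total == 0:
        return grid
    return [[count / total for count in row] for row in grid]
```

Integer coordinates are used instead of floating-point ones so that the cell index depends exactly on the last k nucleotides, whatever the length of the sequence.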

Several methods have been proposed to look for HTs. The signature method, based on different hypotheses, is complementary to those already described. It seems that each method preferentially detects certain types of HTs (49, 50). In agreement with many authors (1, 16, 49, 50, 64), it appears that the conjunction of several methods is required to obtain an overview of HT extent in a genome.

The signature method described in this paper generalizes many approaches that base the detection of outliers on the bias in oligonucleotide usage. The strong species specificity of the signature not only allows detecting various kinds of outliers but also provides clues about their possible origin. Obviously, the detection of HTs remains an open question; a consensus has still to emerge.

Additional materials and experimentation with the genomic signature are available from the GENSTYLE site (http://genstyle.imed.jussieu.fr).

Horizontal gene transfer among microbial genomes: new insights from complete genome analysis

Horizontal gene transfer contributes to the wide distribution and evolution of type II restriction-modification systems

Gene transfer is a major factor in bacterial evolution

Evolution by acquisition: the case for horizontal gene transfers

Horizontal gene transfer and bacterial diversity

Lateral genomics

A Hidden Markov Model approach to variation among sites in rate of evolution

Amelioration of bacterial genomes: rates of change and exchange

Horizontal gene transfer: evidence and possible consequences

Horizontal gene transfer and the origin of species: lessons from bacteria

Compositional biases of bacterial genomes and evolutionary implications

Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models

Lateral gene transfer and the nature of bacterial innovation

Biased biological functions of horizontally transferred genes in prokaryotic genomes

Horizontal gene transfer in bacterial and archaeal complete genomes

HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes

Horizontal gene transfer among genomes: the complexity hypothesis

Detecting recombination from gene trees

Escherichia coli molecular phylogeny using the incongruence length difference test

Evolution of aminoacyl-tRNA synthetases-analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events

Phylogenetic classification and the universal tree

Heterogeneity of genomes: measures and values

Detecting alien genes in bacterial genomes

Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes

Genomic signature is preserved in short DNA fragments

Dinucleotide relative abundance extremes: a genomic signature

Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier

Codon bias and base composition are poor indicators of horizontally transferred genes

Limitations of compositional approach to identifying horizontally transferred gene

Intragenomic base content variation is a potential source of biases when searching for horizontally transferred genes

Genomic signature: characterization and classification of species assessed by chaos game representation of sequences

An archaeal genomic signature

Chaos game representation of gene structure

Molecular archaeology of the Escherichia coli genome

How to interpret an anonymous bacterial genome: machine learning approach to gene identification

Codon usage and lateral gene transfer in Bacillus subtilis

Evidence for horizontal gene transfer in Escherichia coli speciation

Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons

Bacterial DNA strand compositional asymmetry

Asymmetric substitution patterns in the two DNA strands of bacteria

Asymmetric directional mutation pressures in bacteria

Strand compositional asymmetry in bacterial and large viral genomes

Classification and Regression Trees

The Concentration of Measure Phenomenon

Global dinucleotide signatures and analysis of genomic heterogeneity

Comparative DNA analysis across diverse genomes

The complete genome sequence of the gram-positive bacterium Bacillus subtilis

Similarities and dissimilarities of phage genomes

On surrogate methods for detecting lateral gene transfer

Reconciling the many faces of lateral gene transfer

Statistical Methods for Rates and Proportions

The source of laterally transferred genes in bacterial genomes

Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli

Genome sequence of enterohaemorrhagic Escherichia coli O157:H7

Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T

G+C3 structuring along the genome: a common feature in prokaryotes

Statistical significance of sequence patterns in proteins

Genomic signature: A global sequence analysis concept applied to phylogeny

A genomic schism in birds revealed by phylogenetic analysis of DNA strings

Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling

Evolutionary implications of microbial genome tetranucleotide frequency biases

Use and misuse of correspondence analysis in codon usage studies

Ancient horizontal gene transfer

Assessing evolutionary relationships among microbes from whole genome analysis

We thank Lawrence, Ochman, Hayes, Borodovsky, Ragan and Charlebois for kindly supplying their original data. This work was supported by grant contract N 120910 from the 'Action inter-EPST Bio-informatique 2001' of French Research Ministry. Funding to pay the Open Access publication charges for this article was provided by INSERM.

Supplementary Material is available at NAR Online.