key: cord-0005018-yv76yvy5 authors: Demers, G. William; Matunis, Michael J.; Hardison, Ross C. title: The L1 family of long interspersed repetitive DNA in rabbits: Sequence, copy number, conserved open reading frames, and similarity to keratin date: 1989 journal: J Mol Evol DOI: 10.1007/bf02106177 sha: ae5dbcbf679a830e8a0c021339476a42c8285a08 doc_id: 5018 cord_uid: yv76yvy5

The L1 family of long interspersed repetitive DNA in the rabbit genome (L1Oc) has been studied by determining the sequence of the five L1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other L1 repeats in the genome. L1Oc repeats have a common 3′ end that terminates in a poly A addition signal and an A-rich tract, but individual repeats have different 5′ ends, indicating a polar truncation from the 5′ end during their synthesis or propagation. As a result of the polar truncations, the 5′ end of L1Oc is present in about 11,000 copies per haploid genome, whereas the 3′ end is present in at least 66,000 copies per haploid genome. One type of L1Oc repeat has internal direct repeats of 78 bp in the 3′ untranslated region, whereas other L1Oc repeats have only one copy of this sequence. The longest repeat sequenced, L1Oc5, is 6.5 kb long, and genomic blot-hybridization data using probes from the 5′ end of L1Oc5 indicate that a full-length L1Oc repeat is about 7.5 kb long, extending about 1 kb 5′ to the sequenced region. The L1Oc5 sequence has long open reading frames (ORFs) that correspond to ORF-1 and ORF-2 described in the mouse L1 sequence. In contrast to the overlapping reading frames seen for mouse L1, ORF-1 and ORF-2 are in the same reading frame in rabbit and human L1s, resulting in a dicistronic structure. The region between the likely stop codon for ORF-1 and the proposed start codon for ORF-2 is not conserved in interspecies comparisons, which is further evidence that this short region does not encode part of a protein.
ORF-1 appears to be a hybrid of sequences, of which the 3′ half is unique to and conserved in mammalian L1 repeats. The 5′ half of ORF-1 is not conserved between mammalian L1 repeats, but this segment of L1Oc is related significantly to type II cytoskeletal keratin.

The repeated DNA sequences that are dispersed throughout eukaryotic genomes have been divided into two classes (reviewed by Weiner et al. 1986). Both classes appear to transpose by an RNA intermediate, and the insertion of either class of repeated DNA generates short flanking direct repeats at the target site, hallmarks of transposition first recognized in prokaryotes. One class of repeated DNA resembles retroviruses in that members of this class are flanked by long terminal repeats (Baltimore 1985). This class includes the yeast Ty-1 repeat, the Drosophila copia repeat, and the human THE1 repeat (Paulson et al. 1985). Another class of repeated sequences resembles processed pseudogenes and lacks long terminal repeats (LTRs). This second class of repeats has been termed retroposons (Rogers 1983), nonviral retroposons (Weiner et al. 1986), and non-LTR retrotransposons (Xiong and Eickbush 1988). In this paper, this second class of RNA-transposed repeats will be called retroposons. Two groups of retroposons have been identified based on their length: the short interspersed repeats, or SINEs, that are less than 500 bp long, and the long interspersed repeats, or LINEs, that are greater than 6000 bp long (Singer 1982).

Fig. 1. Repetitive DNA in the rabbit β-like globin gene cluster. The β-like globin genes ε, γ, δ, and β are shown as boxes along the 45-kb segment of cloned DNA (Lacy et al. 1979). Transcription of the active genes is from left to right. The location and orientation of L1 repeats are shown by the filled arrows. The L1 repeats are named L1Oc1-L1Oc5 (Demers et al. 1986). The location and orientation of C repeats, a rabbit SINE, are shown by the open arrows.
Although no precise sequence specificity has been observed at the insertion sites, SINEs and LINEs do have a regional preference for integration in the human genome, as shown by the enrichment of different chromosome bands for either LINEs or SINEs (Korenberg and Rykowski 1988). Although several different sequences have been dispersed as SINEs in mammals (reviewed in Weiner et al. 1986), only one sequence element, called L1, has been found to be dispersed as a LINE in mammals (reviewed in Singer and Skowronski 1985). The L1 sequence has been identified in a wide variety of species including primates (Lerman et al. 1983), mice (Brown and Dover 1981; Fanning 1982), rats (Economou-Pachnis et al. 1985; Soares et al. 1985; D'Ambrosio et al. 1986), dogs (Katzir et al. 1985), cats (Fanning and Singer 1987), and rabbits (Demers et al. 1986). Genomic blot-hybridization analysis indicates that the L1 sequence is present in all mammalian species at a frequency of about 10⁴ to 10⁵ copies per haploid genome (Burton et al. 1986).

Although the parent genes of SINEs are transcribed by RNA polymerase III, the L1 repeats appear to be derived from an RNA polymerase II transcript. The parent gene of L1 is proposed to be a protein-coding gene (reviewed in Singer and Skowronski 1985). Long open reading frames (ORFs) are found in the L1 sequences (Manuelidis 1982; Martin et al. 1984; Potter 1984), and sequenced members from the mouse genome have two overlapping ORFs of 1137 bp (ORF-1) and 3900 bp (ORF-2) (Shehee et al. 1987). The ORF-2 regions of primate and rabbit L1 are 65% similar, but the similarity ends abruptly at a conserved stop codon (Demers et al. 1986).

In previous studies on the L1 repeats from rabbits (L1Oc, for LINE 1 from Oryctolagus cuniculus), the B, E, and D repeats identified by Shen and Maniatis (1980) were shown to be parts of the L1Oc repeat.
The sequence of one truncated L1 repeat and part of another repeat were presented as a composite sequence, and the ORF (corresponding to ORF-2) and 3' untranslated region were identified (Demers et al. 1986). In this paper, the rabbit L1 repeats are characterized more thoroughly, and the similarities and differences of L1 sequences between species are explored further. Interspecies comparisons reinforce the conclusion that the L1 repeat has two ORFs that are conserved for their protein-coding capacity. However, the region between the two ORFs is not conserved among species, and this observation is used to indicate possible start and stop codons for the ORFs. ORF-1 encodes a composite protein, and the 5' half of ORF-1 from L1Oc is related to type II cytoskeletal keratin.

Subcloning and Sequencing of L1Oc Repeats. The sequenced members of the L1Oc family were from the rabbit β-like globin gene cluster isolated by Lacy et al. (1979). Interspersed repetitive DNA was identified by Shen and Maniatis (1980) by hybridization and heteroduplex mapping. The five L1 members (Demers et al. 1986) were sequenced by dideoxynucleotide chain termination reactions (Sanger et al. 1977) using subclones in M13 phages as templates (Messing 1983).

Analysis of DNA Sequences. Sequence matches were first identified by dot plots generated by the computer program MATRIX (Zweig 1984). This provides a graphical display of sequence similarity that plots matches (forward similarity) of 23 out of 30 bases. Similar sequences were then aligned by the computer program NUCALN (Wilbur and Lipman 1983) using the parameters K-tuple = 3, window size = 20, gap penalty = 7. The protein sequence databases at the Protein Identification Resource (National Biomedical Research Foundation) were searched using the FASTp program (Lipman and Pearson 1985).
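The dot-plot criterion used in the sequence analysis (a point is plotted where 23 of 30 bases match in the forward orientation) can be illustrated with a minimal Python sketch. This is not the MATRIX program; the function name `dot_plot_matches` and the brute-force scan over ungapped windows are assumptions made for illustration only.

```python
def dot_plot_matches(seq_a, seq_b, window=30, min_matches=23):
    """Return (i, j) window start positions where seq_a[i:i+window] and
    seq_b[j:j+window] share at least min_matches identical bases.

    Only forward (same-strand, ungapped) similarity is scored.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count position-by-position identities within the window.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences (hypothetical, not from the paper).
    a = "A" * 30
    b = "A" * 23 + "C" * 7
    print(dot_plot_matches(a, b))
```

Each returned pair would correspond to one point on the dot plot; runs of points along a diagonal indicate an extended region of similarity. The quadratic scan is adequate for short test sequences but would be slow for two 6-kb repeats.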
The statistical significance of the similarities found by FASTp was tested using the program RDF (National Biomedical Research Foundation); this program scrambles the target sequence (revealed by FASTp) into 20 shuffled sequences and computes the mean similarity score for the shuffled sequence with the test sequence (in this case, ORF-1 of L1Oc). The similarity score for the match between the true sequences is compared with the mean score for the shuffled sequences in terms of the number of standard deviations that separate them.

[...] conditions as in the Southern blot analysis. The ratio of percentage of plaques that hybridized to the percentage of the rabbit genome in one λ clone gives the approximate copy number of the region. The average size of an insert in this λ library is 17 kb (Maniatis et al. 1978). Thus, the fraction of the rabbit genome per phage is (17 × 10³)/(3 × 10⁹), or 5.7 × 10⁻⁴%. The fact that 96% of the phage in the library have rabbit DNA (Maniatis et al. 1978) was also taken into account.

Rodent and Human L1 Sequences. The mouse L1 sequence, L1MdA2, and the rat L1 sequence, L1Rn or LINE3 (D'Ambrosio et al. 1986), are randomly isolated L1 members from their respective genomes. The human L1 sequence, L1Hs-TBG41, is located 3.3 kb 3' to the human β-globin gene (Hattori et al. 1985). A consensus L1Hs sequence (Scott et al. 1987) was used in the analysis of ORF-1 in Fig. 8.

The interspersion of repetitive sequences among the rabbit β-like globin genes is shown in Fig. 1. The genes ε and γ (formerly β4 and β3) are expressed in embryonic development (Rohrbaugh and Hardison 1983), δ (ψβ2) is an inactive pseudogene (Lacy and Maniatis 1980), and β (β1) is expressed in fetal and adult life (Hardison et al. 1979; Rohrbaugh et al. 1985). The 5' to 3' orientations of the proposed RNA intermediates of the repetitive elements are indicated by the arrows in Fig. 1; the A-rich tracts are at the 3' ends.
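The copy-number estimate described in the plaque hybridization method can be written out as a short calculation. The sketch below is illustrative only: the function name `copy_number`, the example plaque counts, and the way the 96% correction is applied (dividing the plaque total by the fraction of phage carrying rabbit DNA) are assumptions, since the text does not state exactly how that correction was made.

```python
def copy_number(positive_plaques, total_plaques,
                insert_bp=17_000, genome_bp=3e9,
                fraction_with_insert=0.96):
    """Approximate copies per haploid genome from plaque hybridization.

    Copy number = (fraction of insert-bearing plaques that hybridize)
                  / (fraction of the genome carried by one clone).
    Assumes roughly one repeat copy per positive clone.
    """
    # Only phage that carry rabbit DNA can give a positive signal (assumed use
    # of the 96% figure).
    fraction_positive = positive_plaques / (total_plaques * fraction_with_insert)
    # 17 kb insert in a 3 x 10^9 bp haploid genome: about 5.7e-6 (5.7e-4 %).
    genome_fraction_per_clone = insert_bp / genome_bp
    return fraction_positive / genome_fraction_per_clone


if __name__ == "__main__":
    # Hypothetical counts: 96 positive plaques out of 1000 screened.
    print(round(copy_number(96, 1000)))
    # Fraction of repeats that could be full length, from the reported
    # estimates of 11,000 (5' end) and 66,000 (3' end) copies.
    print(round(100 * 11_000 / 66_000))
```

Because abundant repeats can occur more than once in a 17-kb insert, a ratio of this kind is best read as a lower bound, which is consistent with the 3' end being reported as present in "at least" 66,000 copies.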
The sequences of the five L1Oc repeats are presented in Fig. 2. L1Oc5 is adjacent to L1Oc4 (Fig. 1), so the last nucleotide in the L1Oc5 sequence is followed by the first nucleotide in the L1Oc4 sequence (Fig. 2) in the sequence of the gene cluster (Margot et al. 1989). The longest member of the rabbit L1 family in the β-like globin gene cluster is L1Oc5. The next longest member is L1Oc4; it has an internal deletion of 667 bp (Fig. 2). This is clearly a deletion from L1Oc4 and not an insertion in L1Oc5 because a similar sequence is present in both mouse and human L1s (Demers et al. 1986). L1Oc5 will be the prototypical rabbit L1 for further analysis because it is the longest and has no extensive internal deletions. The 5' end of L1Oc5 is also the end of the cloned region of the rabbit β-like globin gene cluster (see Fig. 1). Only two of the individual repeats, L1Oc4 and L1Oc5, contain sequences for the ORF region (Demers et al. 1986). The other three repeats contain part or all of the 3' untranslated region.

L1Oc5 and L1Oc1 have internal direct repeats of 78 bp in the 3' untranslated region. One copy of the repeat is at positions 6015-6092 and the other is at positions 6212-6289 (lower case letters in Fig. 2). L1Oc4 and L1Oc3 have only one copy of this 78-bp sequence, and they do not contain the sequence between the 78-bp direct repeats (present in L1Oc5 and L1Oc1). Thus, the class of L1Oc repeats containing one copy of the 78-bp sequence could be derived from the class containing two copies by a deletion between the two 78-bp sequences. Another example of a sequence rearrangement is the apparent insertion of 34 bp into L1Oc4 between positions 5701 and 5702 of L1Oc5.

Most members of the L1Oc family are flanked by short direct repeats. L1Oc1 and L1Oc2 are flanked by direct repeats of 9 bp and 5 bp, respectively (Fig. 2).

[Figure caption fragment: ... Shen and Maniatis (1980) are shown at the bottom of the diagram.]
The flanking direct repeats differ for the two individual L1 repeats, showing that they are not part of the L1 sequence. Such flanking direct repeats are often generated by insertion of transposable elements, presumably by repair of a staggered break at the target site. The flanking direct repeats for L1Oc4 and L1Oc5 cannot be identified with the available data. The 5' end of L1Oc5 has not been cloned. Because L1Oc5 is juxtaposed to L1Oc4, it is possible that L1Oc5 inserted into L1Oc4, in which case the 5' end of L1Oc4 is also not available. The only other L1 member, L1Oc3, does not have obvious flanking direct repeats generated by a duplication of the target site. The sequence GTTAAAAAAA found just 3' to the polyadenylation site (positions 6438-6447) is also found upstream from L1Oc3 (Margot et al. 1989). However, because the sequence GTT(A)7 (or a slight variation of it) is also found in all of the other L1 sequences just 3' to the polyadenylation signal, it is likely not to have been generated by a target site duplication around L1Oc3. This terminal repetition could be generated by insertion of a circular form of L1 by homologous recombination into a GTT(A)7 sequence at the target site.

The structural features revealed by the alignment and comparison of the L1 members from the rabbit β-like globin gene cluster are summarized in Fig. 3. The B, E, and D repeats identified by Shen and Maniatis (1980) are also aligned with their position in the L1Oc sequence. The D repeat is confined to the 3' untranslated region, whereas the B repeat and most of the E repeat are from the ORF region. L1Oc1 begins immediately after the conserved translation stop codon. Figure 3 also illustrates the internal sequence rearrangements described above.

The diagram of L1Oc repeats in Fig. 3 shows that they are truncated at a variable distance from the 5' end of the longest elements.
This truncation from the 5' ends is common in the whole population of L1 repeats, as demonstrated by using four regions of L1Oc5 as probes against the rabbit genomic DNA library in a plaque hybridization assay. By counting the number of plaques that hybridized to a given probe, the approximate copy number of each region of the L1Oc5 repeat was determined (see Materials and Methods). As shown in Fig. 4, the 5'-most region of L1Oc5 is represented about 11,000 times in the haploid genome of the rabbit, and regions of L1 located more 3' are found more frequently. The largest increase in copy number is seen in the region from positions 4351 to 6004 that includes the 3' untranslated region; this region is represented at least 66,000 times. However, the relationship between the length of the repeat and the copy number is not linear; only a gradual decrease in copy number is observed as probes going from position 4350 to position 1 are used (Fig. 4). Therefore, many of the L1 repeats detected with the probe from the 5' end may be full length, indicating that up to 17% of the population of L1Oc repeats (about 11,000 of at least 66,000 copies) could be full length. This difference in copy number at the 5' and 3' ends of L1Oc repeats is also observed when uncloned genomic DNA is hybridized with the different L1Oc probes (data not shown). Thus, the lower copy number at the 5' end is not a result of underrepresentation in the cloned genomic library.

Because the 5' end of L1Oc5 is at the end of the cloned portion of the rabbit β-like globin gene cluster, it is likely that the nucleotide sequence obtained from L1Oc5 is not that of a full-length L1 repeat. Therefore, cloned subfragments of L1Oc5 were used as probes against Southern (1975) blots of rabbit genomic DNA to determine the average structure of full-length rabbit L1 repeats. Discrete genomic restriction fragments detected with L1Oc5 probes were mapped by two strategies.
The portion of L1Oc contained within the genomic restriction fragment was determined by which probes from L1Oc5 hybridized to the fragment, and then the genomic restriction fragment was aligned with conserved restriction sites found in the cloned L1Oc DNA. This analysis is presented in detail in Demers (1987), and the portion relevant to the 5' end of L1Oc is summarized in Fig. 5. The longest restriction fragment extending 5' to the cloned end of L1Oc5 is the PstI 4.0-kb fragment that ends 1 kb 5' to the cloned region of L1Oc5 (Fig. 5). The ScaI 2.1-kb, SphI 1.9-kb, and XmnI 3.7-kb genomic fragments all have 5' ends between the conserved PstI site located outside L1Oc5 and the 5' end of L1Oc5 (Fig. 5). These data indicate that full-length L1Oc repeats will extend at least 1 kb further 5' than the sequenced portion of L1Oc5. Several clones from the rabbit genomic DNA library are currently being studied in order to determine the 5' end of L1Oc repeats.

The sequence of the rabbit L1 repeat was compared with the sequences of the mouse and human L1 repeats by dot plots and by sequence alignments. The dot-plot analyses in Fig. 6 show that the internal sequence of L1Oc is very similar to both L1Md (mouse) and L1Hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species. The internal region of sequence similarity of about 4.5 kb is divided into two parts, a short region of similarity of about 300 bp followed by a very long segment of similarity.

The long segments of internal similarity are in the portion of L1 that encodes open reading frames (ORFs). The ORFs found in the L1Oc5 sequence are shown in Fig. 7, along with a comparison of the ORFs from L1Md. The mouse L1MdA2 sequence contains two ORFs, one of 1137 nucleotides (top strand, N frame in Fig. 7, bottom panel) and one of 3900 nucleotides (top strand, N + 1 frame in Fig. 7), that overlap by 14 nucleotides.
Seven open reading blocks are in the rabbit L1Oc5 sequence in frames N, N + 1, and N + 2 (Fig. 7, top panel). The bar between the stop codon maps of each species shows the regions of similarity (Fig. 6) as filled boxes. It is apparent that the regions of L1 that are similar between species contain extensive ORFs, although the ORFs at the 5' end are not similar between species.

Rabbit L1 repeats have only two major ORFs. Although the data in Fig. 7 show that L1Oc5 has several ORFs, they are probably derived from longer reading frames in the ancestral L1 sequence. The long region of similarity corresponds to ORF-2 and the short region of similarity corresponds to the 3' portion of ORF-1. The two ORFs are overlapping in L1Md, and it is of interest to determine whether this feature is conserved in L1 repeats from other species. Also, ORF-1 appears to be a hybrid sequence because it is well conserved between species in the 3' half but it is not well conserved in the 5' half. Therefore, the sequence of ORF-1 and the region between the ORFs were aligned for the L1 repeats from rabbit, mouse, rat, and humans.

Fig. 9. Sequence similarities in the ORF-1 region. The L1Oc ORF-1 region is shown as a black box, numbered according to the codon positions in Fig. 8. The ORF-1 regions from L1Md and L1Hs are displayed as composite boxes. The darkness of the fill in each box is proportional to the extent of similarity to the L1Oc sequence. The percent identity of the encoded amino acids, compared to the L1Oc sequence, is given in the boxes. A box representing a portion of the type II cytoskeletal keratin sequence is aligned with the segment of the L1Oc sequence that matches it. The percent of amino acids identical to the L1Oc ORF-1 translated sequence is given in the boxes, and the amino acid positions in the keratin sequence are listed below the boxes. A gap penalty of -1 was assessed in calculating the percent identities.
Figure 8 shows both the aligned nucleotide sequences and the predicted amino acid sequences. Sequences that match well between species are in reverse text, whereas sequences that do not match well are in plain text.

Inspection of the aligned L1 sequences allows a tentative identification of the start and stop sites of the ORFs. This analysis reveals that no overlap between reading frames is seen in rabbit and human L1 repeats. The end of ORF-1 in L1Md is the TAA at positions 1163-1165 (boldface in Fig. 8). The same sequence is found in the rat L1 sequence (L1Rn), and in-phase terminators are found nearby in L1Oc and L1Hs (boldface TAAs in Fig. 8). ORF-2 in L1Md begins in a different reading frame at position 1149, and thus it overlaps with ORF-1 for 14 nucleotides. By aligning the sequences of the different L1s in the well-conserved ORF-2 region, it is apparent that an ATG is conserved in the rabbit and human sequences at positions 1235-1237. An in-frame ATG two codons upstream was previously identified as the start of ORFb in the L1Rn sequence (D'Ambrosio et al. 1986), and an ATG is also in frame in the L1Md sequence seven codons upstream. One can propose that the TAA close to position 1163 is the end of ORF-1 and the ATG at positions 1235-1237 is the start of ORF-2 in rabbit and human L1 repeats. In an independent analysis of several individual L1Hs repeats, these same codons were assigned as the end of ORF-1 and the start of ORF-2 in the consensus L1Hs sequence (Scott et al. 1987). As shown in Fig. 8, ORF-2 is in the same reading frame as ORF-1 in the L1Oc and L1Hs sequences. Thus, the overlap in reading frames seen for L1Md is not observed in L1Oc and L1Hs. ORF-2 in L1Rn is in a different reading frame than ORF-1, but the L1Rn sequence does have an ATG proposed as the start of ORF-2. Thus, L1Rn has overlapping reading frames, but the sequence in the overlap may not be used to encode a protein.
The region between ORF-1 and ORF-2 is not conserved between mammalian species. The sequence between the TAA that ends ORF-1 and the ATG proposed to be the start of ORF-2 is in a region that is quite dissimilar between rabbit and mouse and between rabbit and human (plain text region between positions 1121 and 1240 in Fig. 8). This is the region of no similarity previously seen in dot plots (Fig. 6). The sequence between the L1 ORFs is also not conserved in comparisons between the human and rodent sequences (Scott et al. 1987). Because this region is not conserved, whereas the sequences before and after it are conserved, probably for their capacity to encode a protein, it is unlikely that the inter-ORF region encodes a protein. This lack of conservation supports the proposed assignments for the start of ORF-2 in L1Oc and L1Hs. The mouse L1 sequence is ATA at positions 1235-1237; this same sequence is found in three sequenced members of the L1Md family (Shehee et al. 1987). Therefore, the overlap between reading frames 1 and 2 is conserved in mouse L1s, but the overlap is not seen in the rabbit and human L1 sequences.

The ORF-1 sequence is a composite of conserved and nonconserved regions. As shown diagrammatically in Fig. 9, codons 79-294 are highly related between species in different mammalian orders, and a long segment from codons 171 through 294 shows a 52-56% amino acid identity in these comparisons. A short region from codons 97 to 122 is not conserved, nor are the last 14 codons in the sequence, but in general the C-terminal two-thirds of ORF-1 is conserved between orders. A search through the databanks at the Protein Identification Resource (National Biomedical Research Foundation) did not identify any known proteins (besides the L1 proteins) that are related to the C-terminal half of the ORF-1 sequence.

In contrast, the N-terminal portion of ORF-1 is not highly conserved between mammalian orders. This region shows almost no similarity between rabbit and human (sequence between nucleotide positions 3 and 476 in Fig. 8; Fig. 9), and the comparison between rabbit and mouse shows only a short segment of matching sequence at the 5' end (Figs. 8 and 9). The dissimilarity of the sequences makes it difficult to assign a start point to ORF-1. However, an ATG is found in the rabbit, mouse, and rat sequences at positions 240-242 of Fig. 8 (shown in boldface). An ATG is found three codons downstream in the human L1 sequence. Other ATG codons are either immediately adjacent (mouse and rat) or are 20 codons upstream (rabbit, underlined in Fig. 8). The ATG at positions 240-242 has been tentatively assigned as the start of ORF-1, and the codons in Fig. 8 are numbered starting here. This is 71 codons into ORF-1 as defined by Loeb et al. (1986).

Although the N-terminal half of ORF-1 differs among rabbits, mouse, and humans, it is similar between the two rodents, mouse and rat. This region surrounds a 66-bp tandemly repeated sequence in L1Rn (Soares et al. 1985; D'Ambrosio et al. 1986) and contains several in-frame stop codons in L1Rn (Fig. 8). It is possible that the coding function of this region has been lost in L1Rn.

Fig. 10. [...] (Lipman and Pearson 1985) is shown starting at amino acid position 1 of ORF-1 from L1Oc5 (Fig. 8) and position 303 of the sequence of type II cytoskeletal keratin of humans (Johnson et al. 1985). The ORF-1 sequence of rabbit L1 is labeled L1, and the type II keratin sequence is labeled KII. Identical amino acids are indicated by colons, and similar amino acids are indicated by periods. The following groups of amino acids are considered similar: P, A, G, S, and T (neutral or weakly hydrophobic); Q, N, E, and D (acids and amides); H, K, and R (basic); L, I, V, and M (hydrophobic); F, Y, and W (aromatic); and C.

The N-terminal half of ORF-1 from the rabbit L1 sequence is related to type II cytoskeletal keratin.
Protein sequence databanks were searched using the FASTp program (Lipman and Pearson 1985), and a significant match was found with type II cytoskeletal keratin. The region of L1Oc ORF-1 that matches with keratin, along with the percent amino acid identity, is shown in Fig. 9, and the alignment with the human 67 kDa type II keratin is shown in Fig. 10. The sequences align over a 156-amino acid region, with an average of 20.5% identity. The segment between amino acid positions 95 and 126 of L1Oc ORF-1 is most similar to type II keratin; this segment contains identical amino acids at 32% of the positions.

The similarity between the N-terminal half of ORF-1 from L1Oc and type II cytoskeletal keratin is statistically significant. The sequence of the type II keratin was scrambled into 20 different sequences and aligned with the ORF-1 sequence to generate an average match score. The match score with the true keratin sequence is 13 standard deviations above the average match score with the scrambled sequences; a difference of 10 standard deviations in this test is an indicator of a significant evolutionary relationship (Lipman and Pearson 1985). Although statistical significance does not establish biological significance, it is helpful to compare this match with that of a part of ORF-2 with reverse transcriptases, whose similarity has been cited as significant in the past (Hattori et al. 1986; Loeb et al. 1986). The alignment between the L1Md ORF-2 sequence and the sequence of reverse transcriptase from Moloney murine leukemia virus shows 17.5% amino acid identity, whereas the alignment between L1Oc ORF-1 and type II keratin shows 20.5% identity. It is apparent that ORF-1 of the rabbit L1 contains a region related in sequence to type II cytoskeletal keratin.

The propagation of L1 repeats probably has occurred independently in different mammalian genomes.
Although the L1 repeats from lagomorphs, rodents, and primates are similar in size and sequence organization, the 5' and 3' ends are distinctive (summarized in Fig. 11). Also, the L1 repeats are located in different positions in orthologous regions of chromosomes, specifically the β-like globin gene cluster of rabbits and humans (Margot et al. 1989) and mice (Shehee et al. 1989). Because the contemporary β-like globin gene clusters are descended from a preexisting gene cluster in the last common ancestor, the presence of L1 repeats at different positions in different species indicates that the L1 repeats have integrated independently into these gene clusters (and probably the whole genome) in each species.

It is noteworthy, therefore, that the structure of the population of L1 repeats is quite similar in several mammals. Most members of the L1 repeat family in rabbits (this paper), mouse (Voliva et al. 1983), and monkeys (Grimaldi et al. 1984) are truncated from the 5' end, resulting in a higher frequency in the genome of the 3' end of L1 (about 50,000 copies) than the 5' end (about 10,000 copies). This similarity in copy number suggests that the time of onset and the rate of propagation of L1 repeats are similar in the different species. The rabbit, mouse, and monkey L1 repeats also show a similar pattern for the increase in copy number, in which the 5' regions increase gradually in copy number before a large increase in copy number at the very 3' end. This very large increase in copy number in the 3' region could indicate a strong stop for reverse transcriptase during the conversion of the L1 transcript to a DNA copy. Given this frequency of polar truncations of L1 in rabbits, humans, and mice, it is striking that most of the L1 repeats in rats are full length (D'Ambrosio et al. 1986).
Some aspect of the mechanism for synthesis and propagation of the L1s is apparently different in rats, e.g., to allow more full-length reverse transcripts or to select for these in the integration process.

Full-length L1 transcripts have been observed in teratocarcinoma cells (Skowronski and Singer 1985). Given the assignments of start and stop codons proposed in this paper, transcripts of the L1 repeat of rabbits and humans have the characteristics of a dicistronic RNA. Polycistronic mRNAs are common in bacteria, and a polycistronic arrangement of genes is found in the genomes of some RNA viruses that infect animals and plants, e.g., togaviruses, coronaviruses, and tobacco mosaic virus. In contrast, most mRNAs from eukaryotic cellular genes are monocistronic. Regardless of whether the ORFs are overlapping, as in L1Md, or are part of a dicistronic RNA, as in L1Oc and L1Hs, the structure of the L1 repeats resembles DNA copies of viral genomes more than conventional cellular transcription units. This suggests that the ancestor to L1 repeats may in fact be some type of animal virus rather than a normal cellular gene, as is often proposed (reviewed in Weiner et al. 1986). A viral ancestor with a wide host range would provide an explanation for the independent, and perhaps simultaneous, entry of the L1 element into different mammalian genomes.

The ORFs in the L1 repeat appear to encode hybrids of different types of proteins (Fig. 11). ORF-1 can be divided into two parts, the N-terminal portion that is not well conserved between species and the C-terminal portion that is well conserved. In the rabbit L1 repeat, a sequence similar to keratin has been fused to the conserved C-terminal portion of ORF-1. Although ORF-2 is conserved in L1s from different orders of mammals, it also seems to be a hybrid of sequences related to several proteins (Fig. 11). The middle portion of ORF-2 is related to reverse transcriptase (Hattori et al. 1986; Loeb et al. 1986).
Different parts of the C-terminal region are related to transferrin (Hattori et al. 1986) and to nucleic acid binding proteins with the cysteine structural motif, such as the binding proteins derived from retroviral gag genes (Fanning and Singer 1987). The cysteine structural motif is related to the zinc fingers characterized in TFIIIA and other nucleic acid binding proteins (Fanning and Singer 1987). This pastiche of similarities suggests that the L1 element is a fusion of several different sequences, some of which are derived from cellular genes, possibly by a viral vector.

Another fusion event may account for the variation in sizes and sequences of the 3' untranslated regions of L1 repeats in different mammals. The 3' untranslated regions of orthologous globin genes in mammals have retained obvious sequence similarities over the course of eutherian evolution (e.g., Hardies et al. 1984; Hardison 1984), so it is puzzling that no sequence similarity is seen in the 3' untranslated region of L1 repeats in comparisons between mammals (Fig. 11). Perhaps the conserved coding region was fused to a different 3' untranslated sequence in each species. It is noteworthy that the 5' end of L1Oc1 begins immediately after the conserved termination codon that ends ORF-2, suggesting that the sequence corresponding to the 3' untranslated region of L1Oc may exist as a distinct repetitive element in the rabbit genome in addition to its presence in the L1 sequence. If so, this would be an additional factor in explaining the large increase in copy number of L1 repeats in this region. A similar situation has been observed in Drosophila melanogaster, in which suffix, an element repeated about 300 times in the genome, is almost identical to the sequence of the 3' untranslated region (but not the coding region) of the F element that is present about 70 times in the genome (DiNocera and Casari 1987).
The mammalian L1 repeats show a clear similarity to the ingi repeat in the protozoan Trypanosoma brucei (Kimmel et al. 1987), the I factor of the I-R system of hybrid dysgenesis in D. melanogaster (Fawcett et al. 1986), F elements in D. melanogaster (DiNocera and Casari 1987), and the R1Bm (Xiong and Eickbush 1988) and R2Bm (Burke et al. 1987) insertion sequences in some rRNA genes of Bombyx mori (Fig. 11). The similarity has been recognized only in the region proposed to encode reverse transcriptase, and these sequences are more similar among themselves than to retroviral reverse transcriptases (DiNocera and Casari 1987; Xiong and Eickbush 1988). The mammalian L1s and these protozoan and insect repeats share other structural features, such as the absence of long terminal repeats, the presence of at least two ORFs (ORF-2 containing sequences similar to reverse transcriptase and either ORF-1 or ORF-2 encoding a cysteine motif), a length from 5 to 7.5 kb, and a 3' untranslated region with a sequence similar to AATAAA close to the 3' end. The dicistronic structure proposed for L1Oc and L1Hs may also be present in the I factor, the F element, and the R1Bm repeat (Fawcett et al. 1986; DiNocera and Casari 1987; Xiong and Eickbush 1988). Each type of repeated element also has some distinctive features, e.g., the specific insertion sites for R1Bm and R2Bm in the rRNA genes and the absence of A-rich tracts at the 3' ends of some of the insect repeats. However, at least parts of these repeats in mammals, insects, and a parasitic protozoan appear to be evolutionarily related. If this type of repeat is restricted to these groups of organisms, it may indicate that the genetic information was transferred among parasites, their mammalian hosts, and insect vectors (Kimmel et al. 1987). A viral progenitor, suggested by the dicistronic arrangement shown in this paper, would provide a means for the horizontal transmission of the L1 sequences.
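The structural features compared above (two ORFs that either overlap or lie in tandem, and an AATAAA-like signal near the 3' end) can be illustrated with a short Python sketch. This is not the procedure used in this paper; the function names, the ATG-to-stop definition of an ORF, the 50-base window, and the toy sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (frame, start, end) for each ATG-to-stop ORF on the given strand.

    start is the 0-based index of the ATG; end is the index just past the stop.
    Simplifying assumption: an ORF opens at the first ATG after the previous stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return sorted(orfs, key=lambda orf: orf[1])

def classify_pair(orf_a, orf_b):
    """Describe two ORFs (ordered 5' to 3') as overlapping or dicistronic."""
    frame_a, _, end_a = orf_a
    frame_b, start_b, _ = orf_b
    if start_b < end_a:
        return "overlapping"
    if frame_a == frame_b:
        return "dicistronic, same frame"
    return "dicistronic, different frames"

def polya_signal_near_3prime(seq, window=50):
    """True if AATAAA occurs within the last `window` bases (assumed cutoff)."""
    return "AATAAA" in seq.upper()[-window:]

# Hypothetical toy element: two short ORFs in one frame, then a 3' tail.
toy = ("ATG" + "GCT" * 4 + "TAA"      # stand-in for ORF-1
       + "CCC"                        # short in-frame spacer
       + "ATG" + "GAA" * 4 + "TGA"    # stand-in for ORF-2
       + "CCAATAAACAAAAAAAA")         # AATAAA followed by an A-rich tract

orfs = find_orfs(toy)
print(orfs)
print(classify_pair(orfs[0], orfs[1]))
print(polya_signal_near_3prime(toy))
```

On the toy element the two ORFs are reported in the same frame and separated by a stop codon, the arrangement described here as dicistronic. A pair in which the second ORF starts before the first one ends would be labeled overlapping, as described for L1Md.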
Retroviruses and retrotransposons: the role of reverse transcription in shaping the eukaryotic genome
Screening λgt recombinant clones by hybridization to single plaques in situ
Organization and evolutionary progress of a dispersed repetitive family of sequences in widely separated rodent genomes
The site-specific ribosomal insertion element type II of Bombyx mori (R2Bm) contains the coding sequence for a reverse transcriptase-like enzyme
Conservation throughout mammalia and extensive protein encoding capacity of the highly repeated DNA L1
Genomic sequencing
Structure of the highly repeated, long interspersed DNA family (LINE or L1Rn) of the rat
Long interspersed L1 repeats in rabbit DNA are homologous to L1 repeats of rodents and primates in an open-reading-frame region
Related polypeptides are encoded by Drosophila F elements, I factors, and mammalian L1 repeats
Insertion of long interspersed repeated elements at the Igh (immunoglobulin heavy chain) and Mlvi-2 (Moloney leukemia virus integration 2) loci of rats
Characterization of a highly repetitive family of DNA sequences in the mouse
The LINE-1 DNA sequences in four mammalian orders predict proteins that conserve homologies to retrovirus proteins
Transposable elements controlling I-R hybrid dysgenesis in D. melanogaster are similar to mammalian LINEs
Defining the beginning and end of KpnI family segments
Evolution of the mammalian β-globin gene cluster
Comparison of the β-like globin gene families of rabbits and humans indicates that the gene cluster 5'-ε-γ-δ-β-3' predates the mammalian radiation
Efstratiadis A (1979) The structure and transcription of four linked rabbit β-like globin genes
Sequence analysis of a KpnI family member near the 3' end of human β-globin gene
L1 family of repetitive DNA sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein
Structure of a gene for the human epidermal 67-kDa keratin
"Retroposon" insertion into the cellular oncogene c-myc in canine transmissible venereal tumor
Ingi, a 5.2-kb dispersed sequence element from Trypanosoma brucei that carries half of a smaller mobile element at either end and has homology with mammalian LINEs
Human genome organization: Alu, Lines, and the molecular structure of metaphase chromosome bands
The nucleotide sequence of a rabbit β-globin pseudogene
Linkage arrangement of four rabbit β-like globin genes
KpnI family of long interspersed repeated DNA sequences in primates: polymorphism of family members and evidence for transcription
Rapid and sensitive protein similarity searches
The sequence of a large L1Md element reveals a tandemly repeated 5' end and several features found in retrotransposons
The isolation of structural genes from libraries of eucaryotic DNA
Nucleotide sequence definition of a major human repeated DNA, the HindIII 1.9 kb family
Complete nucleotide sequence of the rabbit β-like globin gene cluster: analysis of intergenic sequences and comparison with human β-like globin gene cluster
A large interspersed repeat found in mouse DNA contains a long open reading frame that evolves as if it encodes a protein
New M13 vectors for cloning
A transposon-like element in human DNA
Rearranged sequence of a human KpnI element
Labeling deoxyribonucleic acid to high specific activity in vitro by nick translation with DNA polymerase I
Retroposons defined
Analysis of rabbit β-like globin gene transcripts during development
Transcription unit of rabbit β1-globin gene
DNA sequencing with chain-terminating inhibitors
Origin of the human L1 elements: proposed progenitor genes deduced from a consensus DNA sequence
Determination of a functional ancestral sequence and definition of the 5' end of A-type mouse L1 elements
The nucleotide sequence of the BALB/c mouse β-globin complex
The organization of repetitive sequences in a cluster of rabbit β-like globin genes
SINEs and LINEs: highly repeated short and long interspersed sequences in mammalian genomes
Making sense out of LINEs: long interspersed repeat sequences in mammalian genomes
Expression of a cytoplasmic LINE-1 transcript is regulated in a human teratocarcinoma cell line
Rat LINE1: the origin and evolution of a family of long interspersed middle repetitive DNA elements
Detection of specific sequences among DNA fragments separated by gel electrophoresis
The L1Md long interspersed repeat family in the mouse: almost all examples are truncated at one end
Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information
Rapid similarity searches of nucleic acid and protein data banks
The site-specific ribosomal DNA insertion element R1Bm belongs to a class of non-long-terminal-repeat retrotransposons
Analysis of large nucleic acid dot matrices on small computers

Acknowledgments. We