

Introduction

In higher eukaryotes, pre-mRNA splicing is carried out by at least two different spliceosomes, the major (U2) and the minor (U12). Introns spliced by the U12 spliceosome are rare (<0.5%) and are therefore commonly ignored by most gene prediction and annotation pipelines. However, some well-known disease-related genes, such as huntingtin and PTEN, contain one or more U12 introns, which makes determining their precise gene structure challenging. The slower rate of U12 spliceosome processing is thought to contribute to the regulation of gene expression.

The U12 spliceosome, composed of the U11, U12, U4atac, U5 and U6atac small nuclear ribonucleoproteins (snRNPs), surprisingly resembles the U2 spliceosome in structure and function; however, the two seem to evolve independently of each other. The U12 spliceosome was initially discovered to operate on AT-AC introns [1,2]. Later, it was shown that GT-AG introns are in fact its major substrate. Sequencing of the U11 and U12 snRNAs confirmed that the U12 donor (5'-[AG]TATCCTT) and U12 branch point (TCCTTAAC) consensus sequences are remarkably distinct from the relatively variable U2 splice sites.
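The two consensus sequences quoted above can be turned into a rough first-pass check. The sketch below is only an illustration: the function names, the 50-base search window and the mismatch counting are assumptions, and the poster's own classification relies on trained scoring matrices rather than on consensus matching.

```python
import re
from typing import Tuple

# Consensus motifs quoted in the text; everything else here is illustrative.
U12_DONOR = re.compile(r"^[AG]TATCCTT")
U12_BRANCH = "TCCTTAAC"


def has_u12_donor(intron: str) -> bool:
    """True if the intron starts with the U12 donor consensus [AG]TATCCTT."""
    return U12_DONOR.match(intron.upper()) is not None


def best_branch_match(intron: str, window: int = 50) -> Tuple[int, int]:
    """Return (mismatches, distance_to_acceptor) for the best TCCTTAAC match.

    Only the last `window` bases of the intron are scanned (an assumed
    search window). The distance is counted from the end of the motif to
    the 3' end of the intron. Ties keep the match furthest from the acceptor.
    """
    seq = intron.upper()
    k = len(U12_BRANCH)
    best = (k + 1, -1)
    for i in range(max(0, len(seq) - window), len(seq) - k + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + k], U12_BRANCH))
        if mismatches < best[0]:
            best = (mismatches, len(seq) - (i + k))
    return best


if __name__ == "__main__":
    # Hypothetical intron: consensus donor, exact branch point, AG acceptor.
    toy = "GTATCCTT" + "A" * 20 + "TCCTTAAC" + "T" * 10 + "AG"
    print(has_u12_donor(toy), best_branch_match(toy))
```

Real branch points rarely match the consensus exactly, which is one reason a matrix score with a threshold is a more natural fit than a strict pattern.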

The evolution of U12 and U2 introns represents an interesting case study with implications for all gene structures. Burge et al. [3] suggested that comparison of orthologous genes from different species could produce the following outcomes: intron conservation, GT-AG/AT-AC subtype conversion, U12/U2 intron conversion and intron loss. We focused our attention mostly on U12-type introns and also introduced the analysis of U12/U2 introns in paralogous genes.

We mapped all available human, mouse, chicken and zebrafish ESTs/cDNAs with high accuracy to the corresponding genomes using our new fast algorithm, implemented in ssahaEST, which allows refined splice site analysis of the genome structure. In this work we focused on the detection and evolution of U12 introns in the four eukaryotic genomes.

References

[1] Jackson IJ (1991) Nucleic Acids Res. 19:3795-8.
[2] Hall SL & Padgett RA (1994) J. Mol. Biol. 239:357-365.
[3] Burge C, Padgett RA & Sharp PA (1998) Mol. Cell 2:773-85.
[4] Ning Z, Cox AJ & Mullikin JC (2001) Genome Res. 11:1725-9.
[5] Levine A & Durbin R (2001) Nucleic Acids Res. 29:4006-13.
[6] Zhu W & Brendel V (2003) Nucleic Acids Res. 31:4561-72.
[7] Abril JF, Castelo R & Guigo R (2005) Genome Res. 15:111-9.

COMPARATIVE ANALYSIS of U12 INTRONS

Nikolai V. Ivanov, Zemin Ning and Richard Durbin

The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SA, UK

Discussion

I. Analysis of the overall splice site variation in four eukaryotic genomes

We used ssahaEST to map ~11.6 million ESTs/cDNAs from four organisms (human, mouse, chicken and zebrafish) to their corresponding genomes using U2 and U12 splice site models. Table 1 shows the outcome of this experiment. All intron counts represent non-redundant introns uniquely mapped to the genome, where only one occurrence of each intron start and end is taken into account. Not surprisingly, the vast majority of introns (>99%) are U2-type. We found no significant differences between the splice site matrices from one species to another. In all four sets, GT-AG introns were the dominant (~70%) U12 subtype and AT-AC introns were the minor (~30%) U12 subtype. Out of 404 human U12-type introns reported previously [5], we were able to identify 368 U12-type introns (260 GT-AG and all 108 AT-AC subtypes).
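The counting rule described above (each intron start/end pair counted once, then split by subtype) can be sketched as follows. The input tuple layout and the function names are hypothetical and are not taken from ssahaEST; only the 260 and 108 figures come from the text.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed record layout: (chromosome, intron_start, intron_end,
# first two intron bases, last two intron bases), one per spliced alignment.
Record = Tuple[str, int, int, str, str]


def nonredundant_introns(records: Iterable[Record]) -> Dict[tuple, Tuple[str, str]]:
    """Collapse introns supported by many ESTs to one entry per (chrom, start, end)."""
    unique: Dict[tuple, Tuple[str, str]] = {}
    for chrom, start, end, donor2, acceptor2 in records:
        unique.setdefault((chrom, start, end), (donor2.upper(), acceptor2.upper()))
    return unique


def subtype_counts(unique: Dict[tuple, Tuple[str, str]]) -> Counter:
    """Count introns by terminal dinucleotides, e.g. 'GT-AG' or 'AT-AC'."""
    return Counter(f"{d}-{a}" for d, a in unique.values())


def subtype_percent(count: int, total: int) -> float:
    """Share of one subtype in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    # Human figures quoted in the text: 260 GT-AG and 108 AT-AC out of 368.
    print(subtype_percent(260, 368), subtype_percent(108, 368))
```

Applied to the human figures, the split is about 70.7% GT-AG and 29.3% AT-AC, in line with the ~70%/~30% proportions reported for all four sets.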

II. Cross genome comparison of U12-type introns in four eukaryotic genomes

Mapping of 6883 homologous transcripts to the four eukaryotic genomes identified U12-type introns for which homologues can be examined: 90 in human and 115 in mouse, plus those in chicken and zebrafish. Comparison of human and mouse genes containing these introns is shown in Table 2. Approximately half (53) of the introns were conserved in position and remained U12-type. In this set we found no examples of GT-AG/AT-AC subtype conversion. Surprisingly, most of the examples listed as U12/U2-type conversion were not conserved at the position of the intron and could therefore be considered a loss of a U12-type intron and gain of a U2-type intron at a different position. We found some interesting examples of true U12/U2-type conversion in paralogous genes (Table 3). However, these cases are hard to quantify due to the lack of an appropriate database of paralogous genes.

Materials and Methods

Expressed sequence tags (ESTs) were downloaded from the NCBI dbEST database (July 8th 2005 release, ftp://ftp.ncbi.nih.gov/repository/dbEST/) for Homo sapiens (~6.1 x 10^6), Mus musculus (~4.3 x 10^6), Danio rerio (~0.63 x 10^6) and Gallus gallus (~0.55 x 10^6). Files containing large numbers of FASTA-formatted sequences were split into files of manageable size (~0.6 x 10^6 sequences). Alignment of the ESTs to the corresponding genomes of H. sapiens (NCBI35), M. musculus (NCBI_m34), D. rerio (WTSI Zv5) and G. gallus (WashU ver. 1) was performed using the newly developed ssahaEST program on an SGI Altix machine equipped with 16 IA-64 1.6 GHz processors.
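The file-splitting step can be sketched as below. This is not the authors' script: `split_fasta` is a hypothetical helper, and only the chunk size of roughly 0.6 x 10^6 sequences comes from the text.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], max_records: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most `max_records` records.

    A record starts at a line beginning with '>'; its sequence lines are
    kept together with the header in the same chunk.
    """
    chunk: List[str] = []
    n_records = 0
    for line in lines:
        if line.startswith(">"):
            if n_records == max_records:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Tiny in-memory example; a real run would stream a dbEST FASTA file
    # and use max_records=600_000 (the chunk size quoted in the text).
    demo = [">est1", "ACGT", ">est2", "GGCC", "TTAA", ">est3", "AT"]
    for number, part in enumerate(split_fasta(demo, max_records=2), start=1):
        print(number, part)
```

Streaming line by line keeps memory use independent of the input size, which matters for inputs holding millions of sequences.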

ssahaEST combines the fast k-mer positioning algorithm implemented in the SSAHA program [4] and an implementation of the banded Smith-Waterman-Gotoh algorithm from the phrap/cross_match package [Phil Green] with high-scoring pair (HSP) clustering and accurately trained splice site models for U2-type and U12-type introns. An intron was classified as U12-type based on thresholds for the individual donor, branch point and acceptor scores, as well as the branch-to-acceptor distance (<50), derived from training set 1. This set included introns that were experimentally confirmed to be spliced by the U12 spliceosome and orthologous genes from closely related genomes. Matrices for U2 and U12 splice sites were generated using the ML method with pseudocounts. The score and length thresholds were derived from training set 2, compiled from the 368 human U12 introns described by Levine & Durbin [5]; 36 U12-type introns were removed because they did not fit our splice site model for the U12-type intron and had patterns different from those in training set 1 (Figure 1), so we cannot be confident that they are true U12 introns. Similar U12-type intron definitions were described previously [3, 5, 6].
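A minimal sketch of a pseudocount-smoothed scoring matrix and the threshold rule described above is given below. The pseudocount of 1, the uniform background of 0.25, the log2 odds form, the threshold values and all function names are assumptions for illustration; only the three-score-plus-distance rule and the <50 cut-off come from the text.

```python
import math
from typing import Dict, List, Sequence

BASES = "ACGT"


def build_log_odds(sites: Sequence[str], pseudocount: float = 1.0,
                   background: float = 0.25) -> List[Dict[str, float]]:
    """Per-position log2 odds from aligned sites, smoothed with pseudocounts.

    Assumes all sites have equal length and contain only A, C, G or T.
    """
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix


def score(matrix: List[Dict[str, float]], site: str) -> float:
    """Sum the per-position log-odds for one candidate site."""
    return sum(column[base] for column, base in zip(matrix, site.upper()))


def is_u12(donor: float, branch: float, acceptor: float,
           branch_to_acceptor: int, thresholds: Dict[str, float]) -> bool:
    """Apply the rule from the text: three score thresholds plus distance < 50."""
    return (donor >= thresholds["donor"]
            and branch >= thresholds["branch"]
            and acceptor >= thresholds["acceptor"]
            and branch_to_acceptor < 50)


if __name__ == "__main__":
    # Hypothetical training sites built around the quoted donor consensus.
    donor_matrix = build_log_odds(["GTATCCTT", "ATATCCTT", "GTATCCTT"])
    print(score(donor_matrix, "GTATCCTT"), score(donor_matrix, "GTAAGAAA"))
```

Pseudocounts keep bases that never occur in a small training set from receiving a score of minus infinity, which matters when the training set is as small as the set of known U12 introns.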

For comparative studies of U12-type introns between the four eukaryotic genomes, we remapped 6883 EnsEMBL (ver. 32) homologous genes to the four corresponding genomes and analysed the introns in one genome that are homologous to the U12-type introns in another genome. We considered only those introns that were adjacent to conserved exons.
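The possible outcomes of such a pairwise comparison can be sketched as a small decision function. The `Intron` record, the use of an exact position match and the outcome labels are assumptions chosen for illustration; the poster does not describe how positions were compared.

```python
from typing import NamedTuple, Optional


class Intron(NamedTuple):
    position: int  # offset within the aligned coding sequence (assumed coordinate)
    kind: str      # "U12" or "U2"
    subtype: str   # "GT-AG" or "AT-AC"


def classify(reference: Intron, homologue: Optional[Intron]) -> str:
    """Label the fate of a reference U12-type intron in a homologous gene."""
    if homologue is None:
        return "intron loss"
    if homologue.position != reference.position:
        return "loss/gain at a different position"
    if homologue.kind != reference.kind:
        return "U12/U2 conversion"
    if homologue.subtype != reference.subtype:
        return "GT-AG/AT-AC subtype conversion"
    return "conserved"


if __name__ == "__main__":
    human = Intron(position=291, kind="U12", subtype="GT-AG")
    print(classify(human, Intron(291, "U12", "GT-AG")))
    print(classify(human, Intron(305, "U2", "GT-AG")))
```

Checking the position before the type is what separates a true conversion from a loss of one intron and gain of another nearby.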

Availability: http://www.sanger.ac.uk/Software/analysis/SSAHA2/

Results

1. We have developed a fast and accurate method for mapping ESTs/cDNAs in finished eukaryotic genomes and for studying gene structure.

2. We have found ~800 U12-type introns in human and mouse genomes and ~400 U12-type introns in chicken and zebrafish genomes. U12 introns seem to constitute ~0.3% of all introns.

3. Our study shows that U12/U2-type conversion between homologous introns of the four eukaryotic genomes most likely occurs by a loss/gain mechanism with a change in the position of the intron. True conversion was observed only in paralogous genes.

Our approach to splice site analysis differs from that of the previous work [5] in that we are now able to map all ESTs/cDNAs to the best unique location on the genome, avoiding potential ambiguity in splice site confirmation. It should be noted that, because U12-type introns occur at very low frequency, we had to make highly specific matrices for the different U12 subtypes, which potentially lowers the sensitivity of the method and leads to underestimation. Despite this, the number of U12-type introns in the human genome has doubled compared to the previous work [5], mainly because of the increase in the number of human ESTs and the improvement in the quality of the human genome assembly over the last four years.

Table 1 shows two major trends found in the first part of the analysis. One is that the total number of non-redundant introns correlates with the length of the sequenced portion of the genome. The other is that the fraction of U12-type introns is ~0.3% in all four species, although it is significantly larger in chicken and zebrafish than in the mammalian genomes, indicating that there is some intron type turnover.

Comparison of homologous genes containing U12-type introns between human and mouse showed that ~50% of U12 introns are being converted to a different type. Although this rate is significantly higher than the one described by Abril et al. [7], the conversion results in introns at close but different positions, indicating a potential loss/gain mechanism as opposed to replacement. True conversion was observed in a few cases of paralogues (Table 3); however, the study is hampered by the lack of a reliable database of paralogous genes.

Table 1. Number of introns in four eukaryotic genomes

Species     All introns   U2       U12   U12 (% of total)
Human       328322        327476   846   0.26
Mouse       246665        245949   716   0.29
Chicken     124605        124169   436   0.35
Zebrafish   126367        125946   421   0.33
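The percentages in Table 1 follow directly from the counts. The short check below recomputes them; the dictionary layout and function name are illustrative, while the numbers are copied from the table.

```python
# Counts copied from Table 1: (all introns, U2, U12).
TABLE1 = {
    "Human": (328322, 327476, 846),
    "Mouse": (246665, 245949, 716),
    "Chicken": (124605, 124169, 436),
    "Zebrafish": (126367, 125946, 421),
}


def u12_percent(all_introns: int, u12: int) -> float:
    """U12-type introns as a percentage of all introns, to two decimals."""
    return round(100 * u12 / all_introns, 2)


if __name__ == "__main__":
    for species, (total, u2, u12) in TABLE1.items():
        # U2 and U12 counts should add up to the total for each species.
        assert u2 + u12 == total
        print(f"{species:<10}{u12_percent(total, u12):.2f}")
```

In every row the U2 and U12 counts sum to the total, and the recomputed percentages match the last column of the table.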

[Figure 1 panels: U2 intron matrices (Donor, Branch point, Acceptor); U12 intron matrices (Donor, Branch point, Acceptor)]

Figure 1. Selection of thresholds for U2-type and U12-type donor site definitions.

Conclusion

Table 2. Comparative analysis of U12-type introns between human and mouse genomes

Species   Conservation   U12/U2 conversion   GT-AG/AT-AC conversion
Human     53             37                  0
Mouse     53             62                  0
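The proportions behind Table 2 can be recomputed from the counts, as in the sketch below. The data layout and function name are illustrative; the counts are copied from the table, and the row totals match the 90 human and 115 mouse introns quoted in the text.

```python
# Counts copied from Table 2: (conserved, U12/U2 conversion, subtype conversion).
TABLE2 = {
    "Human": (53, 37, 0),
    "Mouse": (53, 62, 0),
}


def shares(conserved: int, converted: int, subtype: int) -> dict:
    """Return the row total and the percentage of each outcome (one decimal)."""
    total = conserved + converted + subtype
    return {
        "total": total,
        "conserved_pct": round(100 * conserved / total, 1),
        "converted_pct": round(100 * converted / total, 1),
    }


if __name__ == "__main__":
    for species, row in TABLE2.items():
        print(species, shares(*row))
```

The conserved share is about 59% of the human set and about 46% of the mouse set, which is the sense in which roughly half of the introns are conserved.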

HUMAN ENSG00000001497     6(47) ACGgtaagaaagtgccctggacttggtg..........ctgatgggaccctctttgctggcagGTG  842 7(110) U2
MOUSE ENSMUSG00000057421  6(51) AAGGTTAgtatccttggtgcgatatgct..........ctgattggactctttttgctgtcagGTG  863 7(110) U12
CHICK ENSGALG00000004713  3(47) CAGgtaagtatagcctgatctgcttctc..........atgcttggatttttctttcactcagGTT 1354 4(110) U2
ZFISH ENSDARG00000009395  8(47) CAGgtagtgggaccaatctcgcactacg..........tccattgttttggtgtatttggcagACA 1431 9(110) U2
=====================================================================================================================
HUMAN ENST00000289041     3(97) CGTgtatcctttgcctgctggctgacca..........gaatgaccttaatctggggttctagCCA 1719 4(109) U12
MOUSE ENSMUST00000024866  3(97) CGTgtatcctttgcctgctgcctggtgc..........aaatggccttaatctgtggttctagTCA 1180 4(106) U12
CHICK ENSGALT00000014160  3(97) CCTgtatcctttgcagtctgaacccttc..........aaatgaccttaatctatcattttagCCA  314 4(109) U12
ZFISH ENSDART00000039772  3(97) TATgtatctttttacattttcagctttt..........ttatatccttgattctctcttgaagTCA 2952 4(109) U12

HUMAN ENST00000260930     3(97) AAGgtaccgtgcagcaaagtccagatat..........tgtgcttttcttttgcattctgaagGCA 2028 4(109) U2
MOUSE ENSMUST00000001027  3(97) CAGgtaggtgcagccaagtccagttagg..........tgtgcttctcttttgcattctgaagGCA 3496 4(109) U2
      ENSMUST00000040999  3(97) CAGgtacctgccctcaccagcaggcttg..........ttacttcaccacatgaactttgaagTCA  786 4(109) U2
      ENSMUST00000040442  3(97) CCAgtatcctttgcactgcctggctatg..........ttctctttacataagcccatcacagCCA 2457 4(109) ???
CHICK ENSGALT00000013325  3(97) ACGgtatccttaacaagaagtctggaaa..........ctttattcaccttaatgttccaaagACA 1199 4(109) U12?
ZFISH ENSDART00000043711  3(97) CAGgtattacagtcttcattttactcca..........agaagcattttttttctctgtttagTCA  301 4(109) U2
FUGU  SINFRUT00000175866  3(97) CACgtatccttgcaacagctggtggcct..........agcaccgaccttcactcagcattagACA  140 4(109) U12

Table 3. An example of U12/U2-type loss/gain and true U12/U2-type conversion in paralogues