Alternative splicing: A playground of evolution
Mikhail Gelfand
Institute for Information Transmission Problems, RAS
May 2004
Alternative splicing of human (and mouse) genes
5% Sharp, 1994 (Nobel lecture)
35% Mironov-Fickett-Gelfand, 1999
38% Brett-…-Bork, 2000 (ESTs/mRNA)
22% Croft et al., 2000 (ISIS database)
55% Kan et al., 2001 (11% AS patterns conserved in mouse ESTs)
42% Modrek et al., 2001 (HASDB)
~33% CELERA, 2001
59% Human Genome Consortium, 2001
28% Clark and Thanaraj, 2002
all? Kan et al., 2002 (17-28% with total minor isoform frequency > 5%)
41% (mouse) FANTOM & RIKEN, 2002
60% (mouse) Zavolan et al., 2003
• Alternative splicing of orthologous human and mouse genes
• Sequence divergence in alternative and constitutive regions
• Evolution of splicing sites
• Alternative splicing and protein structure
Data
• known alternative splicing
  – HASDB (human, ESTs+mRNAs)
  – ASMamDB (mouse, mRNAs+genes)
• additional variants
  – UniGene (human and mouse EST clusters)
• complete genes and genomic DNA
  – GenBank (full-length mouse genes)
  – human genome
Methods
• Direct comparison of EST-derived alternatives difficult because of uneven coverage.
• Instead, align alternative isoforms from one species to the genomic DNA of other species.
• If alignable (complete exon or part of exon, no significant loss of similarity, no in-frame stops, conserved splicing sites), then the isoform is considered conserved.
• This is an upper estimate on conservation: an isoform may be non-functional for other reasons (e.g. disruption of regulatory sites).
• Cannot analyze skipped exons.
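The conservation criteria listed above can be sketched as a simple filter. This is only an illustration, not the pipeline used in the talk: the `AlignedExon` record, the 0.7 identity cutoff, and the restriction to canonical GT/AG sites are assumptions made for the example (the slide says only "no significant loss of similarity").

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class AlignedExon:
    """Hypothetical record for one exon (or exon part) aligned to the other genome."""
    query: str     # exon sequence from the known isoform, starting in frame
    target: str    # aligned genomic region of the other species ('-' for gaps)
    acceptor: str  # last two intron bases upstream of the target region
    donor: str     # first two intron bases downstream of the target region


def identity(a: str, b: str) -> float:
    """Fraction of alignment columns with identical, non-gap characters."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / max(len(a), 1)


def has_inframe_stop(seq: str, frame: int = 0) -> bool:
    """True if a stop codon occurs in the given frame (gaps removed first)."""
    seq = seq.replace("-", "")
    return any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))


def is_conserved(exon: AlignedExon, min_identity: float = 0.7, frame: int = 0) -> bool:
    """Apply the three checks: similarity, no in-frame stops, splice sites present."""
    return (
        identity(exon.query, exon.target) >= min_identity  # assumed cutoff
        and not has_inframe_stop(exon.target, frame)
        and exon.acceptor == "AG"   # canonical sites only (simplification)
        and exon.donor == "GT"
    )


example = AlignedExon("ATGGCCAAG", "ATGGCTAAG", acceptor="AG", donor="GT")
print(is_conserved(example))
```

As the slide notes, passing such a filter gives an upper estimate: an isoform can satisfy all three checks and still be non-functional.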
Tools
• TBLASTN (initial identification of orthologs: mRNAs against genomic DNA)
• BLASTN (human mRNAs against genome)
• Pro-EST (spliced alignment, ESTs and mRNA against genomic DNA)
• Pro-Frame (spliced alignment, proteins against genomic DNA)
  – confirmation of orthology
    • same exon-intron structure
    • >70% identity over the entire protein length
  – analysis of conservation of alternative splicing
    • conservation of exons or parts of exons
    • conservation of sites
166 gene pairs
Known alternative splicing:
• human: 126 genes
• mouse: 124 genes
[Venn diagram: 42 human only, 84 both species, 40 mouse only]
Elementary alternatives
Cassette exon
Alternative donor site
Alternative acceptor site
Retained intron
Human genes
                     mRNA                 EST
                     cons.   non-cons.    cons.   non-cons.
Cassette exons       56      25           74      26
Alt. donors          18      7            16      10
Alt. acceptors       13      5            19      15
Retained introns     4       3            5       0
Total                96      30           114     51
Total genes          45      28           41      44
Conserved elementary alternatives: 69% (EST) - 76% (mRNA)
Genes with all isoforms conserved: 57 (45%)
Mouse genes
                     mRNA                 EST
                     cons.   non-cons.    cons.   non-cons.
Cassette exons       70      5            39      9
Alt. donors          24      6            17      6
Alt. acceptors       15      6            16      9
Retained introns     8       7            10      4
Total                117     24           82      28
Total genes          68      22           30      26
Conserved elementary alternatives: 75% (EST) - 83% (mRNA)
Genes with all isoforms conserved: 79 (64%)
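The percentages quoted for conserved elementary alternatives follow from the "Total" rows of the human and mouse tables. A minimal sketch of that arithmetic, with the counts copied from the slides (the dictionary layout and function name are just for illustration):

```python
# (conserved, non-conserved) elementary alternatives, from the "Total" rows
totals = {
    ("human", "mRNA"): (96, 30),
    ("human", "EST"): (114, 51),
    ("mouse", "mRNA"): (117, 24),
    ("mouse", "EST"): (82, 28),
}


def percent_conserved(conserved: int, non_conserved: int) -> float:
    """Share of elementary alternatives that are conserved, in percent."""
    return 100.0 * conserved / (conserved + non_conserved)


for (species, source), (cons, non_cons) in totals.items():
    print(f"{species:5s} {source:4s} {percent_conserved(cons, non_cons):.0f}%")
```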
Real or aberrant non-conserved AS?
• 24-31% human vs. 17-25% mouse elementary alternatives are not conserved
• 55% human vs 36% mouse genes have at least one non-conserved variant
• denser coverage of human genes by ESTs:
  – pick up rare (tissue- and stage-specific) => younger variants
  – pick up aberrant (non-functional) variants
• 17-24% mRNA-derived elementary alternatives are non-conserved (compared to 25-32% EST-derived ones)
smoothelin
[Figure: exon structure with human, common, and mouse isoforms]
human-specific donor site
mouse-specific cassette exon
autoimmune regulator
[Figure: exon structure with human, common, and mouse isoforms]
retained intron; downstream exons read in two frames
Na/K-ATPase gamma subunit (Fxyd2)
[Figure: exon structure with human, common, and mouse isoforms]
(deleted) intron
alternative acceptor site within (inserted) intron
MutS homolog (DNA mismatch repair)
[Figure: exon structure with human and common isoforms]
dual donor/acceptor site
Modrek and Lee, 2003:
• conserved skipped exons:
  – 98% constitutive
  – 98% major form
  – 28% minor form
• inclusion level:
  – highly correlated
  – good predictor of conservation
• Minor non-conserved form exons are not aberrant:
  – minor form exons are supported by multiple ESTs
  – 28% of minor form exons are upregulated in one specific tissue
  – 70% of tissue-specific exons are not conserved
Thanaraj et al., 2003:
61% (47-86%) alternative splice junctions are conserved
• Alternative splicing of orthologous human and mouse genes
• Sequence divergence in alternative and constitutive regions
• Evolution of splicing sites
• Alternative splicing and protein structure
Our preliminary observations: less synonymous, more non-synonymous divergence in alternative exons (human/mouse) => positive selection towards variability
“Contrary to our prediction, synonymous divergence between humans and non-human mammals was significantly higher in constitutive exons … Intriguingly, non-synonymous divergence was marginally significantly higher in alternative exons”
Iida and Akashi, 2000
279 proteins from SwissProt+TREMBL with “varsplic” features
                 constitutive    alternative    % alt. to all
length           199270          66054          25%
all SNPs         1126            368            25%
synonymous       576 (51%)       167 (45%)      22%
benign           401 (36%)       141 (38%)      26%
damaging         149 (13%)       60 (16%)       29%
Again, there is some evidence of positive selection towards diversity. This is not due to aberrant ESTs (only protein data are considered).
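One way to quantify the difference in SNP composition between constitutive and alternative regions is a Pearson chi-square statistic on the 2 x 3 table of counts. The slide does not say which test, if any, was applied, so the choice of statistic here is an assumption; the counts are those from the table above.

```python
snps = {
    "constitutive": {"synonymous": 576, "benign": 401, "damaging": 149},
    "alternative": {"synonymous": 167, "benign": 141, "damaging": 60},
}


def class_shares(counts: dict) -> dict:
    """Fraction of SNPs in each class within one region type."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def chi_square(table: dict) -> float:
    """Pearson chi-square statistic for a rows x columns contingency table."""
    rows = list(table)
    cols = list(table[rows[0]])
    row_tot = {r: sum(table[r].values()) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    return stat


for region, counts in snps.items():
    shares = {k: f"{100 * v:.0f}%" for k, v in class_shares(counts).items()}
    print(region, shares)
print("chi-square statistic:", round(chi_square(snps), 2))
```

The statistic would be compared against a chi-square distribution with 2 degrees of freedom, whose 5% critical value is about 5.99.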
• Alternative splicing of orthologous human and mouse genes
• Sequence divergence in alternative and constitutive regions
• Evolution of splicing sites
• Alternative splicing and protein structure
Alternative splicing in a multigene family: the MAGEA family of cancer/testis specific antigens
• A locus on the X chromosome containing eleven recently duplicated genes: two subfamilies of four genes each and three single genes
• One protein-coding exon, multiple different 5’-UTR exons
• Originates from retroposed spliced mRNA
• Mutations create new splicing sites or disrupt existing sites
Phylogenetic trees (protein-coding and upstream regions)
Expression data
• pooled by organ/tissue; maximum recorded expression level retained
• no data for MAGEA10; MAGEA3 and MAGEA6 likely non-distinguishable
• green: normal; brown: cancer
TISSUE (ORGAN) \ MAGEA: 8 9 10 11 4 1 5 3 6 2 12
[Table: empty cells were lost in extraction, so values are listed in row order without gene assignment; the last value in each row is the row sum]
testis: 0 1.5 3 3 2.5 3 3 3 3 (sum 22)
chronic myelogenous leukemia (K562): 2 3 3 3 3 3 3 (sum 20)
thymus (THY): 1.5 1.5 1.5 1.5 1.5 2.5 2.5 1.5 2.5 (sum 16.5)
placenta: 2 1.5 2 2 0 2.5 2.5 1.5 (sum 14)
ovary: 2 1.5 1.5 1.5 2 2 1.5 (sum 12)
pancreas: 1.5 1.5 1.5 1.5 1.5 2 (sum 9.5)
brain (fetal, cortex, amygdala, etc.): 1.5 1.5 0 2 2 1.5 (sum 8.5)
umbilical vein endothelium (HUVEC): 1.5 1.5 2.5 (sum 5.5)
10N: 2.5 2.5 (sum 5)
uterus: 1.5 1.5 1.5 (sum 4.5)
Burkitt's lymphoma (DAUDI): 1.5 2.5 (sum 4)
prostate cancer? (PC4, PC6, PC8): 2 2 (sum 4)
acute lymphoblastic leukemia (MOLT4): 2 2 (sum 4)
salivary gland: 1.5 1.5 (sum 3)
lung: 1.5 1.5 (sum 3)
heart: 2 (sum 2)
spleen: 2 (sum 2)
Simple genes with alternatives in exon 1 (MAGEA1, MAGEA5, MAGEA3/6)
[Figure: exon diagrams of MAGEA1, MAGEA5 (normal placenta), MAGEA3, and MAGEA6 (testis, brain/medulla, cancer); first-exon variants labeled 1, 1a, 1b]
Two more genes of subfamily B: multiple isoforms of MAGEA2 and a deletion in MAGEA12
[Figure: isoform diagrams of MAGEA2 and MAGEA12; exon labels include 1, 2a, 4d, 5]
Isoforms of subfamily A
[Figure: isoform diagrams of MAGEA8, MAGEA9 (testis, no cancers), MAGEA10, and MAGEA11; exon labels include 1, 2, 2d, 3, 4a, 4b, 4c]
Multiple duplications of the initial exon in MAGEA4
[Figure: isoform diagrams of MAGEA4 (testis and cancers; brain/medulla; also common 3’ ESTs in placenta)]
Chimaeric mRNAs (splicing of readthrough transcripts)
[Figure labels: initial exon of MAGEA10; exons of MAGEA5; exon in intergenic space]
[Figure labels: initial exon of MAGEA12; exons of BC013171; exon in intergenic space]
Other examples:
• galactose-1-phosphate uridylyltransferase + interleukin-11 receptor alpha chain (Magrangeas et al., 1998)
• P2Y11 [receptor] + SSF1 [nuclear protein] (Communi et al., 2001)
• PrP [Prion protein] + Dpl [prion-like protein Doppel] (Moore et al., 1999)
• cytochrome P450 3A: CYP3A7 + two exons of a downstream pseudogene read in a different frame (Finta & Zaphiropoulos, 2000)
• HHLA1 + OC90 [otoconin-90] (Kowalski et al., 1999)
• TRAX [translin-associated factor X] + DISC1 [candidate schizophrenia gene] (Millar et al., 2000)
• Kua + UEV1 [polyubiquitination coeffector] (Thomson et al., 2000)
• FR + GAP [Rho GTPase activating protein] (Romani et al., 2003) - ?
• methionyl tRNA synthetase + advillin (Romani et al., 2003) - ?
Birth of donor sites (new GT in alternative initial exon 5)
Ancestral gene: GCCAGGCACGCGGATCCTGACGTTCACATCTAGGGCT
MAGEA3          GCCAGGCACGTGAGTCCTGAGGTTCACATCTACGGCT
MAGEA6          GCCAGGCACGTGAGTCCTGAGGTTCACATCTACGGCT
MAGEA2          GCCAAGCACGCGGATCCTGACGTTCACATGTACGGCT
MAGEA12         GCCAAGCACGCGGATCCTGACGTTCACATCTGTGGCT
MAGEA1          GCCAGGCACTCGGATCTTGACGTCCCCATCCAGGGCT
MAGEA4          --CAGGCACTCGGATCTTGACATCCACATCGAGGGCT
MAGEA5          GACAGGCACACCCATTCTGACGTCCACATCCAGGGCT
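As an illustration of how newly arisen GT dinucleotides can be located in an alignment like this one, the sketch below compares a paralog with the reconstructed ancestral sequence column by column. The function names are made up for the example, and a GT is only a necessary ingredient of a donor site, so the output lists candidates rather than confirmed sites.

```python
def motif_positions(seq: str, motif: str = "GT") -> list:
    """0-based alignment columns at which `motif` starts."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]


def gained_motifs(ancestral: str, derived: str, motif: str = "GT") -> list:
    """Columns where `derived` has the motif but `ancestral` does not."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must come from the same alignment")
    old = set(motif_positions(ancestral, motif))
    return [i for i in motif_positions(derived, motif) if i not in old]


# Rows copied from the alignment on this slide
ANCESTRAL = "GCCAGGCACGCGGATCCTGACGTTCACATCTAGGGCT"
MAGEA3 = "GCCAGGCACGTGAGTCCTGAGGTTCACATCTACGGCT"
MAGEA2 = "GCCAAGCACGCGGATCCTGACGTTCACATGTACGGCT"

print("MAGEA3 new GT at columns:", gained_motifs(ANCESTRAL, MAGEA3))
print("MAGEA2 new GT at columns:", gained_motifs(ANCESTRAL, MAGEA2))
```

Passing `motif="AG"` gives the analogous scan for candidate acceptor dinucleotides.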
Birth of an acceptor site (new AG and polyY tract in MAGEA8-specific cassette exon 3)
MAGEA3  TTGAGGGTACC-----------CCTGGGA---CAGAATGCGGA
MAGEA6  TTGAGGGTACC-----------CCTGGGA---CAGAATGCGGA
MAGEA2  TTGAGGGTACT-----------CCTGGGC---CAGAATGCAGA
MAGEA12 TTGAGGGTACC-----------CCTGGGC---CAGAACGCTGA
MAGEA1  CTGAGGGTACC-----------CCAGGAC---CAGAACACTGA
MAGEA4  TTGAGGGTACC-----------ACAGGGC---CAGAACGCAGA
MAGEA5  TTGAGGGCACC-----------CTTGGGC---CAGAACACAGA
MAGEA8  TTGAGGGTACCCTCGATGGTTCTCCTAGCAGGCAAAAAACAGA
MAGEA9  TCGAGGGTACC-----------TCCAGGC---CAGAGAAACTC
MAGEA10 CTGAGGGTACC-----------CCCAGCC---CATAACACAGA
MAGEA11 TTGAGGGTTCC-----------TCCTGGC---CAGAACACAGA
Birth of an alternative donor site (enhanced match to the consensus (AG) in cassette exon 2)
Ancestral gene: GAGCTCCAGGAACmAGGCAGTGAGGCCTTGGTCTG
MAGEA3          GAGCTCCAGGAACAAGGCAGTGAGGACTTGGTCTG
MAGEA6          GAGCTCCAGGAACAAGGCAGTGAGGACTTGGTCTG
MAGEA2          GAGCTCCAGGAACCAGGCAGTGAGGCCTTGGTCTG
MAGEA12         GAGTTCCAAGAACAAGGCAGTGAGGCCTTGGTCTG
MAGEA1          GAGCTCCAGGAACCAGGCAGTGAGGCCTTGGTCTG
MAGEA4          GAGCTCCAGGAACAAGGCAGTGAGGCCTTGGTCTG
MAGEA5          GAGCTCCAGGAAACAGACACTGAGGCCTTGGTCTG
MAGEA8          GAGCTCCAGGAACCAGGCTGTGAGGTCTTGGTCTG
MAGEA9          GAGCTCCAGGAA----GCAGGCAGGCCTTGGTCTG
MAGEA10         GAGCTCCAGGGACTGTGAGGTGAGGCCTTGGTCTA
MAGEA11         AAGCTCCAAAAACTGAGCAGTGAGGCCTTGGTCTC
Birth of an alternative acceptor site (enhanced polyY tract in cassette exon 4)
Ancestral gene: AGGGGCCCCCATGTGGTCGACAGACACAGTGG
MAGEA3          AGGGGCCCCTATGTGGTGGACAGATGCAGTGG
MAGEA6          AGGGGCCCCTATGTGGTGGACAGATGCAGTGG
MAGEA2          AGGGGCCCCCATCTGGTCGACAGATGCAGTGG
MAGEA12         AGGGGCCCCCATGTAGTCGACAGACACAGTGG
MAGEA1          AGGGACCCCCATCTGGTCTAAAGACAGAGCGG
MAGEA4          AGGGACCCCCATCTGGTCTACAGACACAGTGG
MAGEA5          AGGGGCCCCCATCTGGTGGATAGACAGAGTGG
MAGEA8          AGGGACCCCCATGTGGGCAACAGACTCAGTGG
MAGEA9          AGGGAGGCCC-TGTGTTCGACAGACACAGTGG
MAGEA10         AGGGAACCCC-TCTTTTCTACAGACACAGTGG
MAGEA11         AAAGAGCCCCATATGGTCCACAACTACAGTGG
Inactivation of a donor site and birth of a new site (non-consensus G and new GT in major-isoform cassette exon 4)
Ancestral gene: GCCAAGmGTCCAGGTGAGGAACCGGAGGGAGGATTGAGGGTACC
MAGEA3          GCCAAGCATCCAGGTGAAGAGACTGAGGGAGGATTGAGGGTACC
MAGEA6          GCCAAGCATCCAGGTGAAGAGACTGAGGGAGGATTGAGGGTACC
MAGEA2          GCCAAGCATCCAGGTGGAGAGCCTGAGGTAGGATTGAGGGTACT
MAGEA12         ACCAAGCATCCAGGTGAGAAGCCTGAGGTAGGATTGAGGGTACC
MAGEA1          GCCATGCGTTCGGGTGAGGAACATGAGGGAGGACTGAGGGTACC
MAGEA4          GCCAAGAGTCCTGGTGAGGAATGTGAGGGAGGATTGAGGGTACC
MAGEA5          GTCAGTAGTTCCGGTGAGGAACATGAGGGACGATTGAGGGCACC
MAGEA8          ACCAAGAGTCTAGGTGACAACACTGAGGGAAGATTGAGGGTACC
MAGEA9          GAGAGCAGTCCAGGTGAGGAACCTAAGGGAGGATCGAGGGTACC
MAGEA10         GACAAGAGTCCAGGTAAGGAACCTGAGGGAAATCTGAGGGTACC
MAGEA11         GCCAAGAGTCCAGGTGAGAAACCTGAGGGAGGATTGAGGGTTCC
Series of mutations sequentially activating downstream acceptor sites (mutated AG in exon 4)
Ancestral gene: CCTCCTCACTTCTGTTTCCAGATCTCAGGGAGGTGAGG
MAGEA2          CCTCCTCACTTCTGTTTCCAGATCTCAGGGAGTTGATG
MAGEA12         CCTCCTCACTTCTGTTTCCAGATCTCAGGGAGTTGAGG
MAGEA1          TCTTTTCACTCCTGTTTCCAGATCTGGGGCAGGTGAGG
MAGEA4          CCTTCTCATTTCTGATTCCAGATCTCAGTGAGGTGAGG
MAGEA5          CCATCTCATTCCTGTTTTCAGATCTCGGGGAGGTGAGG
MAGEA8          GCTCCTCATTTCTCTCTTGAGATCTCAGGGAAGTGAGG
MAGEA9          CCTCCTCACCTCTGTTTCTGGATCTCAGGGAGGTGAGG
MAGEA10         CCTTCTTACTTTTGTTTTGGAATCTCAGGGAGGTGAGA
MAGEA11         CCC-CTTACTTCTGTTTTGGAATCTTGGGCAGGTGAGC
• Alternative splicing of orthologous human and mouse genes
• Sequence divergence in alternative and constitutive regions
• Evolution of splicing sites
• Alternative splicing and protein structure
Data
• Alternatively spliced genes (proteins) from SwissProt
  – human
  – mouse
• Protein structures from PDB
• Domains from InterPro
  – SMART
  – Pfam
  – Prosite
  – etc.
[Figure, panel a: bar chart, Expected vs. Observed (axis 0-100%), of how alternative regions affect annotated units. Categories: Non-domain functional units partially; Domains partially; No annotated unit affected; Non-domain functional units completely; Domains completely. Data labels as extracted: 6%, 10%, 15%, 37%, 40%, 34%, 21%, 19%, 6%, 13%.]
Alternative splicing avoids disrupting domains (and non-domain units)
Control:
fix the domain structure; randomly place alternative regions
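The control described here (domain structure fixed, alternative regions placed at random) can be sketched as a small Monte Carlo simulation. This is an assumption-laden illustration rather than the procedure actually used: uniform placement, half-open residue intervals, and the function names are all choices made for the example.

```python
import random


def affects_partially(region, domain) -> bool:
    """True if the region overlaps the domain without removing it entirely."""
    (rs, re), (ds, de) = region, domain   # half-open [start, end) intervals
    overlaps = rs < de and ds < re
    contains_domain = rs <= ds and de <= re
    return overlaps and not contains_domain


def expected_partial_fraction(protein_len, domains, region_len,
                              trials=10000, seed=0) -> float:
    """Fraction of random placements that partially affect at least one domain."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(0, protein_len - region_len + 1)
        region = (start, start + region_len)
        if any(affects_partially(region, d) for d in domains):
            hits += 1
    return hits / trials


# Hypothetical protein: 500 residues, two annotated domains, 30-residue region
print(expected_partial_fraction(500, [(50, 150), (300, 420)], 30))
```

The "Expected" bars would then come from such random placements, and the "Observed" bars from the annotated alternative regions themselves.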
… and this is not simply a consequence of the (disputed) exon-domain correlation
[Figure: AS & exon boundaries and SMART domains. Ratio (observed/expected), axis 0 to 1, for boundaries inside domains vs. outside domains; groups nonAS_Exons, AS_Exons, and AS, each shown for mouse and human.]
Positive selection towards domain shuffling (not simply avoidance of disrupting domains)
[Figure, panels a and b, Expected vs. Observed. Panel a: the five-category chart (Non-domain functional units partially; Domains partially; No annotated unit affected; Non-domain functional units completely; Domains completely). Panel b: three categories only (Domains completely; Non-domain units completely; No annotated units affected).]
Short (<50 aa) alternative splicing events within domains target protein functional sites
[Figure, panels a and c, Expected vs. Observed. Panel a: the five-category chart (Non-domain functional units partially; Domains partially; No annotated unit affected; Non-domain functional units completely; Domains completely). Panel c: Prosite patterns unaffected; Prosite patterns affected; FT positions unaffected; FT positions affected.]
An attempt at integration
• AS is often young (as opposed to degenerating)
• young AS isoforms are often minor and tissue-specific
• … but still functional
  – although unique isoforms may be the result of aberrant splicing
• AS regions show evidence for positive selection
  – excess damaging SNPs
  – excess non-synonymous codon substitutions
• MAGEA - not aberrant, because explainable by effects of mutations
What to do
• Each isoform (alternative region) can be characterized:
  – by conservation (between genomes)
  – if conserved, by selection (positive vs negative)
    • human-mouse, also add rat
  – pattern of SNPs (synonymous, benign, damaging)
  – tissue-specificity
    • in particular, whether it is cancer-specific
  – degree of inclusion (major/minor)
  – functionality (for isoforms)
    • whether it generates a frameshift
    • how bad it is (the distance between the stop-codon and the last exon-exon junction)
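The last item, the distance between the stop codon and the last exon-exon junction, is straightforward to compute once an isoform's exon lengths and reading frame are known. The sketch below is a hypothetical helper, not part of the talk; the 50-nt cutoff reflects the commonly cited rule of thumb for nonsense-mediated decay and is an assumption added here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(mrna: str, cds_start: int):
    """Index of the first in-frame stop codon, or None if there is none."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


def stop_to_last_junction(mrna: str, cds_start: int, exon_lengths: list):
    """Nucleotides between the end of the stop codon and the last junction.

    Positive values mean the stop codon lies upstream of the last junction.
    """
    stop = first_stop(mrna, cds_start)
    if stop is None or len(exon_lengths) < 2:
        return None
    last_junction = sum(exon_lengths[:-1])  # mRNA coordinate of the last junction
    return last_junction - (stop + 3)


def is_nmd_candidate(distance, threshold: int = 50) -> bool:
    """Assumed rule of thumb: stop more than ~50 nt upstream of the last junction."""
    return distance is not None and distance > threshold


# Toy isoform: two exons of 40 and 32 nt, stop codon at position 9
mrna = "ATG" + "GCC" * 2 + "TAA" + "A" * 60
dist = stop_to_last_junction(mrna, 0, [40, 32])
print(dist, is_nmd_candidate(dist))
```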
What to expect (hypotheses)
• Cancer-specific isoforms will be less functional and more often non-conserved
• Non-conserved isoforms will contain a larger fraction of non-functional isoforms; and this may influence evolutionary conclusions
• Still, after removal of non-functional isoforms, one should see positive selection in alternative regions (more non-synonymous substitutions compared to constant regions etc.); especially in tissue-specific ones.
Plans
• careful and detailed analysis of human-mouse-(rat)-((dog)) AS isoforms (human and mouse ESTs)
• conservation of AS regulatory sites
• mosquito-drosophila
• more families of paralogs; add mouse data
• AS of transcription factors and receptors
Acknowledgements
• Discussions
  – Vsevolod Makeev (GosNIIGenetika)
  – Eugene Koonin (NCBI)
  – Igor Rogozin (NCBI)
  – Dmitry Petrov (Stanford)
• Support
  – Ludwig Institute of Cancer Research
  – Howard Hughes Medical Institute
Authors
• Andrei Mironov (GosNIIGenetika) – spliced alignment
• Shamil Sunyaev (EMBL, now Harvard University Medical School) – protein structure
• Vasily Ramensky (Institute of Molecular Biology) – SNPs
• Irena Artamonova (Institute of Bioorganic Chemistry) – human/mouse comparison, MAGEA family
• Dmitry Malko (GosNIIGenetika) – mosquito/drosophila comparison
• Eugenia Kriventseva (EBI, now BASF) – protein structure
• Ramil Nurtdinov (Moscow State University) – human/mouse comparison
• Ekaterina Ermakova (Moscow State University) – evolution of alternative/constitutive regions
References
Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Human Molecular Genetics 12: 1313-1320.
Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S (2003) Increase of functional diversity by alternative splicing. Trends in Genetics 19: 124-128.
Brudno M, Gelfand MS, Spengler S, Zorn M, Dubchak I, Conboy JG (2001) Computational analysis of candidate intron regulatory elements for tissue-specific alternative pre-mRNA splicing. Nucleic Acids Research 29: 2338-2348.
Dralyuk I, Brudno M, Gelfand MS, Zorn M, Dubchak I (2000) ASDB: database of alternatively spliced genes. Nucleic Acids Research 28: 296-297.
Mironov AA, Fickett JW, Gelfand MS (1999) Frequent alternative splicing of human genes. Genome Research 9: 1288-1293.