Eukaryotic Gene Finding
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/
Prokaryotic vs. Eukaryotic Genes
Prokaryotes
• small genomes
• high gene density
• no introns (or splicing)
• no RNA processing
• similar promoters
• overlapping genes
Eukaryotes
• large genomes
• low gene density
• introns (splicing)
• RNA processing
• heterogeneous promoters
• polyadenylation
Pre-mRNA Splicing
[Figure: splicing signals and factors on a pre-mRNA. Labels: 5’ splice signal, branch signal, polyY tract, 3’ splice signal; exonic enhancers, exonic repressor, intronic enhancers, intronic repressor; U1 snRNP, U2 snRNP, U2AF65, U2AF35, SR proteins; exon definition vs. intron definition; followed by assembly of spliceosome and catalysis.]
Some Statistics
• On average, a vertebrate gene is about 30 kb long
• The coding region takes up about 1 kb
• Exon sizes vary from tens of bases to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp, but both can be much longer
Human Splice Signal Motifs
[Figure: sequence motifs for the 5’ splice signal and the 3’ splice signal.]
Semi-Markov HMM Model
GenScan HSMM
GenScan States
• N – intergenic region
• P – promoter
• F – 5’ untranslated region
• Esngl – single exon (intronless) (translation start -> stop codon)
• Einit – initial exon (translation start -> donor splice site)
• Ek – phase k internal exon (acceptor splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site -> stop codon)
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
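The phase bookkeeping behind the Ek and Ik states can be illustrated with a short Python sketch. It is not GenScan code: the function name `intron_phases` and the exon lengths are made up for illustration, and it assumes the lengths count coding bases only.

```python
def intron_phases(exon_lengths):
    """Return the phase of the intron that follows each exon except the last.

    exon_lengths holds the number of coding bases in each exon, in
    transcript order. Phase 0 means the intron falls between codons,
    phase 1 after the first base of a codon, phase 2 after the second.
    """
    phases = []
    coding_bases = 0
    for length in exon_lengths[:-1]:  # no intron follows the terminal exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Hypothetical gene: initial exon, two internal exons, terminal exon.
print(intron_phases([100, 50, 62, 88]))  # prints [1, 0, 2]
```

Because the phase of an intron depends only on the number of coding bases seen so far, a gene model needs just three intron states (I0, I1, I2) to keep the reading frame consistent across exons.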
GenScan features
• Model both strands at once
• Each state may output a string of symbols (according to some probability distribution)
• Explicit intron/exon length modeling
• Advanced splice site modeling
• Parameters learned from annotated genes
• Separate parameter training for different C+G content groups
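To make "each state outputs a string" and "explicit length modeling" concrete, here is a minimal Python sketch of a generative semi-Markov model. In an ordinary HMM the time spent in a state is geometric (first function). In a semi-Markov model the length is drawn from an explicit distribution and the state then emits a whole string of that length. All names and parameter values below are assumptions chosen for illustration; they are not GenScan's trained parameters.

```python
import random


def geometric_length_pmf(p_stay, d):
    """P(length = d) for an ordinary HMM state with self-transition p_stay."""
    return p_stay ** (d - 1) * (1.0 - p_stay)


def draw(dist, rng):
    """Draw one key from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_hsmm(start, transitions, length_dists, emissions, n_segments, rng):
    """Generate (state, string) segments from a toy semi-Markov model."""
    segments = []
    state = start
    for _ in range(n_segments):
        length = draw(length_dists[state], rng)  # explicit length model
        seq = "".join(draw(emissions[state], rng) for _ in range(length))
        segments.append((state, seq))            # the state emits a string
        state = draw(transitions[state], rng)    # no self-transitions needed
    return segments


# Toy, made-up parameters (not GenScan's trained values).
TRANSITIONS = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
LENGTHS = {"exon": {90: 0.5, 150: 0.5}, "intron": {60: 0.3, 400: 0.7}}
EMISSIONS = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

rng = random.Random(0)
for state, seq in sample_hsmm("exon", TRANSITIONS, LENGTHS, EMISSIONS, 4, rng):
    print(state, len(seq))
```

The geometric distribution always puts the most mass on length 1, which is a poor fit for exons. Drawing lengths from an explicit table removes that constraint.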
GenScan Signal Modeling
• PSSM (position-specific scoring matrix): P(S) = P_1(S_1) · P_2(S_2) · … · P_n(S_n)
  – PolyA signal
  – Translation initiation/termination signal
  – Promoters
• WAM (weight array model): P(S) = P_1(S_1) · P_2(S_2 | S_1) · … · P_n(S_n | S_{n-1})
  – 5’ and 3’ splice sites
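The two formulas differ only in whether each position is conditioned on the previous base. The Python sketch below trains both models from a handful of aligned sites and scores a sequence in log space. The training sites, the pseudocount, and all function names are assumptions made for illustration, not GenScan's data or code.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_pssm(sites, pseudocount=1.0):
    """Per-position frequencies P_i(x) from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        model.append({x: (counts[x] + pseudocount) / total for x in ALPHABET})
    return model


def train_wam(sites, pseudocount=1.0):
    """P_1(x) for the first position, then P_i(x | previous base) tables."""
    first = train_pssm([s[:1] for s in sites], pseudocount)[0]
    cond = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        table = {}
        for prev in ALPHABET:
            total = sum(pairs[(prev, x)] for x in ALPHABET) + pseudocount * len(ALPHABET)
            table[prev] = {x: (pairs[(prev, x)] + pseudocount) / total for x in ALPHABET}
        cond.append(table)
    return first, cond


def log_prob_pssm(model, s):
    """log P(S) = sum_i log P_i(S_i)."""
    return sum(math.log(model[i][x]) for i, x in enumerate(s))


def log_prob_wam(wam, s):
    """log P(S) = log P_1(S_1) + sum_{i>1} log P_i(S_i | S_{i-1})."""
    first, cond = wam
    lp = math.log(first[s[0]])
    for i in range(1, len(s)):
        lp += math.log(cond[i - 1][s[i - 1]][s[i]])
    return lp


# Made-up donor-like sites, for illustration only.
SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAGGT"]
pssm = train_pssm(SITES)
wam = train_wam(SITES)
for candidate in ("GTAAGT", "CCCCCC"):
    print(candidate, log_prob_pssm(pssm, candidate), log_prob_wam(wam, candidate))
```

The WAM needs four times as many parameters per position as the PSSM, so it only pays off when enough training sites are available to estimate the conditional tables.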
HMM-based Gene Finding
GENSCAN (Burge 1997)
FGENESH (Solovyev 1997)
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
VEIL (Henderson, Salzberg, & Fasman 1997)
GenomeScan
• Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan)
• Focus on ‘typical case’ when homologous but not identical proteins are available.
• Idea: We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
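The idea of letting homology raise the likelihood of an exon can be illustrated with a deliberately simple Python sketch. This is not GenomeScan's actual probabilistic formulation. It only shows the general pattern of adjusting an ab initio exon score according to whether a BLAST hit supports it. The function names, the `bonus` and `penalty` values, and the coordinates are all hypothetical.

```python
def overlaps(exon, hit):
    """True if two half-open intervals (start, end) share at least one base."""
    return exon[0] < hit[1] and hit[0] < exon[1]


def rescore_exons(exons, hits, bonus=2.0, penalty=0.5):
    """Adjust ab initio exon scores using extrinsic evidence.

    exons: list of (start, end, log_odds_score) candidate exons
    hits:  list of (start, end) regions with protein homology
    Supported exons gain `bonus`; unsupported ones lose `penalty`.
    """
    rescored = []
    for start, end, score in exons:
        supported = any(overlaps((start, end), hit) for hit in hits)
        new_score = score + bonus if supported else score - penalty
        rescored.append((start, end, new_score))
    return rescored


# Hypothetical candidate exons and BLAST hit regions.
candidates = [(100, 220, 1.2), (400, 480, 1.5), (900, 1010, 0.3)]
blast_hits = [(150, 210), (880, 1000)]
for exon in rescore_exons(candidates, blast_hits):
    print(exon)
```

In this toy example the weakly scored third exon overtakes the unsupported second one, which is the qualitative effect described above.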
GeneWise [Birney, Amitai]
• Motivation: use a high-quality database of protein domains (PFAM) to help annotate genomic DNA
• GeneWise algorithm aligns a profile HMM directly to the DNA
Sample GeneWise Output
Developing GeneWise Model
• Start with a PFAM domain HMM
• Replace AA emissions with codon emissions
  P(codon | M_i) = P(codon | aa) · P(aa | M_i)
• Allow for sequencing errors (deletions/insertions)
• Add a 3-state intron model
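The step from amino-acid emissions to codon emissions can be sketched in Python as follows. The sketch assumes the standard genetic code and, when no codon usage table is supplied, a uniform P(codon | aa) over synonymous codons. The function name `codon_emissions` and the example match-state probabilities are hypothetical, and this is not GeneWise's implementation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_STRING)}


def codon_emissions(aa_probs, codon_usage=None):
    """Compute P(codon | M_i) = P(codon | aa) * P(aa | M_i) for one match state.

    aa_probs:    dict amino acid -> P(aa | M_i)
    codon_usage: optional dict codon -> P(codon | aa); if None, synonymous
                 codons are treated as equally likely (an assumption).
    """
    synonyms = {}
    for codon, aa in CODON_TO_AA.items():
        synonyms.setdefault(aa, []).append(codon)

    emissions = {}
    for codon, aa in CODON_TO_AA.items():
        if aa == "*":
            continue  # a match state does not emit stop codons
        if codon_usage is not None:
            p_codon_given_aa = codon_usage[codon]
        else:
            p_codon_given_aa = 1.0 / len(synonyms[aa])
        emissions[codon] = p_codon_given_aa * aa_probs.get(aa, 0.0)
    return emissions


# Hypothetical match state that strongly prefers methionine.
state = {"M": 0.7, "L": 0.2, "W": 0.1}
em = codon_emissions(state)
print(em["ATG"], em["CTG"], em["TGG"])
```

Since each codon encodes exactly one amino acid, the codon emissions of a state sum to 1 whenever the amino-acid emissions do.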
GeneWise Model
GeneWise Intron Model
[Figure: intron model between the 5’ site and the 3’ site, with three states: central, PY tract, spacer.]
GeneWise Model
• Viterbi algorithm -> “best” alignment of DNA to protein domain
• Alignment gives exact exon-intron boundaries
• Parameters learned from species-specific statistics
GeneWise problems
• Only provides partial prediction, and only where the homology lies
  – Does not find “more” genes
• Pseudogenes and retrotransposons are picked up
• CPU intensive
  – Solution: pre-filter with BLAST
Summary
• Genes are complex structures which are difficult to predict with the required level of accuracy/confidence
• Different approaches to gene finding:
  – Ab initio: GenScan
  – Ab initio modified by BLAST homologies: GenomeScan
  – Homology guided: GeneWise