cs262 lecture 9, win07, batzoglou gene recognition
Post on 19-Dec-2015
223 views
TRANSCRIPT
CS262 Lecture 9, Win07, Batzoglou
Gene structure
exon1 exon2 exon3intron1 intron2
transcription
translation
splicing
exon = protein-codingintron = non-coding
Codon:A triplet of nucleotides that is converted to one amino acid
CS262 Lecture 9, Win07, Batzoglou
• Classes of Gene predictors Ab initio
• Only look at the genomic DNA of target genome
De novo• Target genome + aligned informant genome(s)
EST/cDNA-based & combined approaches• Use aligned ESTs or cDNAs + any other kind of evidence
Gene Finding
EXON EXON EXON EXON EXON
Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
CS262 Lecture 9, Win07, Batzoglou
Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.Start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)
CS262 Lecture 9, Win07, Batzoglou
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
Amino Acid SLC DNA CodonsIsoleucine I ATT, ATC, ATALeucine L CTT, CTC, CTA, CTG, TTA, TTGValine V GTT, GTC, GTA, GTGPhenylalanine F TTT, TTCMethionine M ATGCysteine C TGT, TGCAlanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCGThreonine T ACT, ACC, ACA, ACGSerine S TCT, TCC, TCA, TCG, AGT, AGCTyrosine Y TAT, TACTryptophan W TGGGlutamine Q CAA, CAGAsparagine N AAT, AACHistidine H CAT, CACGlutamic acid E GAA, GAGAspartic acid D GAT, GACLysine K AAA, AAGArginine R CGT, CGC, CGA, CGG, AGA, AGG
CS262 Lecture 9, Win07, Batzoglou
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
CS262 Lecture 9, Win07, Batzoglou
HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
exon exon exonintronintronintergene intergene
Intergene State
First Exon State
IntronState
CS262 Lecture 9, Win07, Batzoglou
HMMs for Gene Recognition
exon exon exonintronintronintergene intergene
Intergene State
First Exon State
IntronState
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
CS262 Lecture 9, Win07, Batzoglou
Duration HMMs for Gene Recognition
TAA A A A A A A A A A A AA AAT T T T TT TT T T TT T T TG GGG G G G GGGG G G G GCC C C C C C
Exon1 Exon2 Exon3
Duration d
iPINTRON(xi | xi-1…xi-w)
PEXON_DUR(d)iPEXON((i – j + 2)%3)) (xi | xi-1…xi-w)
j+2
P5’SS(xi-3…xi+4)
PSTOP(xi-4…xi+3)
CS262 Lecture 9, Win07, Batzoglou
Genscan
• Burge, 1997
• First competitive HMM-based gene finder, huge accuracy jump
• Only gene finder at the time, to predict partial genes and genes in both strands
Features– Duration HMM– Four different parameter sets
• Very low, low, med, high GC-content
CS262 Lecture 9, Win07, Batzoglou
Using Comparative Information
• Hox cluster is an example where everything is conserved
CS262 Lecture 9, Win07, Batzoglou
Patterns of Conservation
30% 1.3%0.14%
58%14%
10.2%
Genes Intergenic
Mutations Gaps Frameshifts
Separation
2-fold10-fold75-fold
CS262 Lecture 9, Win07, Batzoglou
Comparison-based Gene Finders
• Rosetta, 2000• CEM, 2000
– First methods to apply comparative genomics (human-mouse) to improve gene prediction
• Twinscan, 2001– First HMM for comparative gene prediction in two genomes
• SLAM, 2002– Generalized pair-HMM for simultaneous alignment and gene prediction in
two genomes
• NSCAN, 2006– Best method to-date based on a phylo-HMM for multiple genome gene
prediction
CS262 Lecture 9, Win07, Batzoglou
Twinscan
1. Align the two sequences (eg. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 lettersS = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions ek(b) where b { A-, A:, A|, …, T| }
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exonseI(x-) > eE(x-): gaps (and mismatches) favored in introns
Example
Human: ACGGCGACGUGCACGUMouse: ACUGUGACGUGCACUUAlignment: ||:|:|||||||||:|
CS262 Lecture 9, Win07, Batzoglou
SLAM – Generalized Pair HMM
d
e
Exon GPHMM1.Choose exon lengths (d,e).2.Generate alignment of length d+e.
CS262 Lecture 9, Win07, Batzoglou
NSCAN—Multiple Species Gene Prediction
• GENSCAN
• TWINSCAN
• N-SCAN
Target GGTGAGGTGACCAAGAACGTGTTGACAGTA
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAConservation |||:||:||:|||||:||||||||......sequence
Target GGTGAGGTGACCAAGAACGTGTTGACAGTAInformant1 GGTCAGC___CCAAGAACGTGTAG......Informant2 GATCAGC___CCAAGAACGTGTAG......Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA
...
),...,,...,|( 1 oiioiii TTP III
),...,|( 1 oiii TTTP
),...,,,...,|,( 11 oiioiiii TTTP III
Target sequence:
Informant sequences (vector):
Joint prediction (use phylo-HMM):
CS262 Lecture 9, Win07, Batzoglou
NSCAN—Multiple Species Gene Prediction
X
C Y
Z H
M R
)|()|()|(
)|()|()|()(
),,,,,,(
1
ZRPZMPYZP
YHPXYPXCPAP
ZYXRMCHP
X
C
Y
Z
H
M R
)|()|()|(
)|()|()|()(
),,,,,,(
ZRPZMPXCP
YZPYXPHYPHP
ZYXRMCHP
CS262 Lecture 9, Win07, Batzoglou
Performance Comparison
GENSCANGeneralized HMMModels human sequence
TWINSCANGeneralized HMMModels human/mouse alignments
N-SCANPhylo-HMMModels multiple sequence evolution
NSCAN human/mouse
>Human/multiple
informants
CS262 Lecture 9, Win07, Batzoglou
• 2-level architecture• No Phylo-HMM that models alignments
CONTRAST
Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg
SVM SVM
CRF
X
Y
a b a b
CS262 Lecture 9, Win07, Batzoglou
• log P(y | x) ~ wTF(x, y)
• F(x, y) = i f(yi-1, yi, i, x)
• f(yi-1, yi, i, x):
1{yi-1 = INTRON, yi = EXON_FRAME_1}
1{yi-1 = EXON_FRAME_1, xhuman,i-2,…, xhuman,i+3 = ACCGGT)
1{yi-1 = EXON_FRAME_1, xhuman,i-1,…, xdog,i+1 = ACC, AGC)
(1-c)1{a<SVM_DONOR(i)<b} (optional) 1{EXON_FRAME_1, EST_EVIDENCE}
CONTRAST - Features
CS262 Lecture 9, Win07, Batzoglou
• Accuracy increases as we add informants
• Diminishing returns after ~5 informants
CONTRAST – SVM accuracies
SN SP
CS262 Lecture 9, Win07, Batzoglou
CONTRAST - Decoding
Viterbi Decoding:
maximize P(y | x)
Maximum Expected Boundary Accuracy Decoding:
maximize i,B 1{yi-1, yi is exon boundary B} Accuracy(yi-1, yi, B | x)
Accuracy(yi-1, yi, B | x) = P(yi-1, yi is B | x) – (1 – P(yi-1, yi is B | x))
CS262 Lecture 9, Win07, Batzoglou
CONTRAST - Training
Maximum Conditional Likelihood Training:
maximize L(w) = Pw(y | x)
Maximum Expected Boundary Accuracy Training:
ExpectedBoundaryAccuracy(w) = i Accuracyi
Accuracyi = B 1{(yi-1, yi is exon boundary B} Pw(yi-1, yi is B | x) -
B’ ≠ B P(yi-1, yi is exon boundary B’ | x)
CS262 Lecture 9, Win07, Batzoglou
N-SCAN(Mouse)
CONTRAST(Mouse)
CONTRAST(Eleven
Informants)
Gene Sn 35.6 50.8 58.6
Gene Sp 25.1 29.3 35.5
Exon Sn 84.2 90.8 92.8
Exon Sp 64.6 70.5 72.5
Nucleotide Sn 90.8 96.0 96.9
Nucleotide Sp 67.9 70.0 72.0
Performance Comparison
N-SCAN(Mouse)
CONTRAST(Mouse)
CONTRAST(Eleven
Informants)
Gene Sn 46.8 60.7 65.4Gene Sp 31.7 40.6 46.2Exon Sn 89.7 92.6 93.9Exon Sp 66.9 74.8 76.2
Nucleotide Sn 93.7 95.7 96.7
Nucleotide Sp 69.3 74.3 75.8
De Novo
EST-assisted
HumanMacaqueMouseRatRabbitDogCowArmadilloElephantTenrecOpossumChicken