cs262 lecture 9, win07, batzoglou gene recognition

31
262 Lecture 9, Win07, Batzoglou Gene Recognition

Post on 19-Dec-2015

223 views

Category:

Documents


1 download

TRANSCRIPT

CS262 Lecture 9, Win07, Batzoglou

Gene Recognition

CS262 Lecture 9, Win07, Batzoglou

Gene structure

exon1 exon2 exon3intron1 intron2

transcription

translation

splicing

exon = protein-codingintron = non-coding

Codon:A triplet of nucleotides that is converted to one amino acid

CS262 Lecture 9, Win07, Batzoglou

Needles in a Haystack

CS262 Lecture 9, Win07, Batzoglou

• Classes of Gene predictors Ab initio

• Only look at the genomic DNA of target genome

De novo• Target genome + aligned informant genome(s)

EST/cDNA-based & combined approaches• Use aligned ESTs or cDNAs + any other kind of evidence

Gene Finding

EXON EXON EXON EXON EXON

Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

CS262 Lecture 9, Win07, Batzoglou

Signals for Gene Finding

1. Regular gene structure

2. Exon/intron lengths

3. Codon composition

4. Motifs at the boundaries of exons, introns, etc.Start codon, stop codon, splice sites

5. Patterns of conservation

6. Sequenced mRNAs

7. (PCR for verification)

CS262 Lecture 9, Win07, Batzoglou

Next Exon:Frame 0

Next Exon:Frame 1

CS262 Lecture 9, Win07, Batzoglou

Exon and Intron Lengths

CS262 Lecture 9, Win07, Batzoglou

Nucleotide Composition

• Base composition in exons is characteristic due to the genetic code

Amino Acid SLC DNA CodonsIsoleucine I ATT, ATC, ATALeucine L CTT, CTC, CTA, CTG, TTA, TTGValine V GTT, GTC, GTA, GTGPhenylalanine F TTT, TTCMethionine M ATGCysteine C TGT, TGCAlanine A GCT, GCC, GCA, GCG Glycine G GGT, GGC, GGA, GGG Proline P CCT, CCC, CCA, CCGThreonine T ACT, ACC, ACA, ACGSerine S TCT, TCC, TCA, TCG, AGT, AGCTyrosine Y TAT, TACTryptophan W TGGGlutamine Q CAA, CAGAsparagine N AAT, AACHistidine H CAT, CACGlutamic acid E GAA, GAGAspartic acid D GAT, GACLysine K AAA, AAGArginine R CGT, CGC, CGA, CGG, AGA, AGG

CS262 Lecture 9, Win07, Batzoglou

atg

tga

ggtgag

ggtgag

ggtgag

caggtg

cagatg

cagttg

caggccggtgag

CS262 Lecture 9, Win07, Batzoglou

Splice Sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

CS262 Lecture 9, Win07, Batzoglou

HMMs for Gene Recognition

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

exon exon exonintronintronintergene intergene

Intergene State

First Exon State

IntronState

CS262 Lecture 9, Win07, Batzoglou

HMMs for Gene Recognition

exon exon exonintronintronintergene intergene

Intergene State

First Exon State

IntronState

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

CS262 Lecture 9, Win07, Batzoglou

Duration HMMs for Gene Recognition

TAA A A A A A A A A A A AA AAT T T T TT TT T T TT T T TG GGG G G G GGGG G G G GCC C C C C C

Exon1 Exon2 Exon3

Duration d

iPINTRON(xi | xi-1…xi-w)

PEXON_DUR(d)iPEXON((i – j + 2)%3)) (xi | xi-1…xi-w)

j+2

P5’SS(xi-3…xi+4)

PSTOP(xi-4…xi+3)

CS262 Lecture 9, Win07, Batzoglou

Genscan

• Burge, 1997

• First competitive HMM-based gene finder, huge accuracy jump

• Only gene finder at the time, to predict partial genes and genes in both strands

Features– Duration HMM– Four different parameter sets

• Very low, low, med, high GC-content

CS262 Lecture 9, Win07, Batzoglou

Using Comparative Information

CS262 Lecture 9, Win07, Batzoglou

Using Comparative Information

• Hox cluster is an example where everything is conserved

CS262 Lecture 9, Win07, Batzoglou

Patterns of Conservation

30% 1.3%0.14%

58%14%

10.2%

Genes Intergenic

Mutations Gaps Frameshifts

Separation

2-fold10-fold75-fold

CS262 Lecture 9, Win07, Batzoglou

Comparison-based Gene Finders

• Rosetta, 2000• CEM, 2000

– First methods to apply comparative genomics (human-mouse) to improve gene prediction

• Twinscan, 2001– First HMM for comparative gene prediction in two genomes

• SLAM, 2002– Generalized pair-HMM for simultaneous alignment and gene prediction in

two genomes

• NSCAN, 2006– Best method to-date based on a phylo-HMM for multiple genome gene

prediction

CS262 Lecture 9, Win07, Batzoglou

Twinscan

1. Align the two sequences (eg. from human and mouse)

2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )

New “alphabet”: 4 x 3 = 12 lettersS = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

3. Run Viterbi using emissions ek(b) where b { A-, A:, A|, …, T| }

Emission distributions ek(b) estimated from real genes from human/mouse

eI(x|) < eE(x|): matches favored in exonseI(x-) > eE(x-): gaps (and mismatches) favored in introns

Example

Human: ACGGCGACGUGCACGUMouse: ACUGUGACGUGCACUUAlignment: ||:|:|||||||||:|

CS262 Lecture 9, Win07, Batzoglou

SLAM – Generalized Pair HMM

d

e

Exon GPHMM1.Choose exon lengths (d,e).2.Generate alignment of length d+e.

CS262 Lecture 9, Win07, Batzoglou

NSCAN—Multiple Species Gene Prediction

• GENSCAN

• TWINSCAN

• N-SCAN

Target GGTGAGGTGACCAAGAACGTGTTGACAGTA

Target GGTGAGGTGACCAAGAACGTGTTGACAGTAConservation |||:||:||:|||||:||||||||......sequence

Target GGTGAGGTGACCAAGAACGTGTTGACAGTAInformant1 GGTCAGC___CCAAGAACGTGTAG......Informant2 GATCAGC___CCAAGAACGTGTAG......Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA

...

),...,,...,|( 1 oiioiii TTP III

),...,|( 1 oiii TTTP

),...,,,...,|,( 11 oiioiiii TTTP III

Target sequence:

Informant sequences (vector):

Joint prediction (use phylo-HMM):

CS262 Lecture 9, Win07, Batzoglou

NSCAN—Multiple Species Gene Prediction

X

C Y

Z H

M R

)|()|()|(

)|()|()|()(

),,,,,,(

1

ZRPZMPYZP

YHPXYPXCPAP

ZYXRMCHP

X

C

Y

Z

H

M R

)|()|()|(

)|()|()|()(

),,,,,,(

ZRPZMPXCP

YZPYXPHYPHP

ZYXRMCHP

CS262 Lecture 9, Win07, Batzoglou

Performance Comparison

GENSCANGeneralized HMMModels human sequence

TWINSCANGeneralized HMMModels human/mouse alignments

N-SCANPhylo-HMMModels multiple sequence evolution

NSCAN human/mouse

>Human/multiple

informants

CS262 Lecture 9, Win07, Batzoglou

• 2-level architecture• No Phylo-HMM that models alignments

CONTRAST

Human tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Macaque tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Mouse ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca Rat ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca Rabbit t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta Dog t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta Cow t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-ctaArmadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta Elephant gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Tenrec tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta Opossum ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg Chicken ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

SVM SVM

CRF

X

Y

a b a b

CS262 Lecture 9, Win07, Batzoglou

CONTRAST

CS262 Lecture 9, Win07, Batzoglou

• log P(y | x) ~ wTF(x, y)

• F(x, y) = i f(yi-1, yi, i, x)

• f(yi-1, yi, i, x):

1{yi-1 = INTRON, yi = EXON_FRAME_1}

1{yi-1 = EXON_FRAME_1, xhuman,i-2,…, xhuman,i+3 = ACCGGT)

1{yi-1 = EXON_FRAME_1, xhuman,i-1,…, xdog,i+1 = ACC, AGC)

(1-c)1{a<SVM_DONOR(i)<b} (optional) 1{EXON_FRAME_1, EST_EVIDENCE}

CONTRAST - Features

CS262 Lecture 9, Win07, Batzoglou

• Accuracy increases as we add informants

• Diminishing returns after ~5 informants

CONTRAST – SVM accuracies

SN SP

CS262 Lecture 9, Win07, Batzoglou

CONTRAST - Decoding

Viterbi Decoding:

maximize P(y | x)

Maximum Expected Boundary Accuracy Decoding:

maximize i,B 1{yi-1, yi is exon boundary B} Accuracy(yi-1, yi, B | x)

Accuracy(yi-1, yi, B | x) = P(yi-1, yi is B | x) – (1 – P(yi-1, yi is B | x))

CS262 Lecture 9, Win07, Batzoglou

CONTRAST - Training

Maximum Conditional Likelihood Training:

maximize L(w) = Pw(y | x)

Maximum Expected Boundary Accuracy Training:

ExpectedBoundaryAccuracy(w) = i Accuracyi

Accuracyi = B 1{(yi-1, yi is exon boundary B} Pw(yi-1, yi is B | x) -

B’ ≠ B P(yi-1, yi is exon boundary B’ | x)

CS262 Lecture 9, Win07, Batzoglou

N-SCAN(Mouse)

CONTRAST(Mouse)

CONTRAST(Eleven

Informants)

Gene Sn 35.6 50.8 58.6

Gene Sp 25.1 29.3 35.5

Exon Sn 84.2 90.8 92.8

Exon Sp 64.6 70.5 72.5

Nucleotide Sn 90.8 96.0 96.9

Nucleotide Sp 67.9 70.0 72.0

Performance Comparison

N-SCAN(Mouse)

CONTRAST(Mouse)

CONTRAST(Eleven

Informants)

Gene Sn 46.8 60.7 65.4Gene Sp 31.7 40.6 46.2Exon Sn 89.7 92.6 93.9Exon Sp 66.9 74.8 76.2

Nucleotide Sn 93.7 95.7 96.7

Nucleotide Sp 69.3 74.3 75.8

De Novo

EST-assisted

HumanMacaqueMouseRatRabbitDogCowArmadilloElephantTenrecOpossumChicken

CS262 Lecture 9, Win07, Batzoglou

Performance Comparison