exam 2 study guide gene prediction - iowa state...


Upload: vuxuyen

Post on 16-Mar-2018


Page 1: Exam 2 Study Guide Gene Prediction - Iowa State Universityweb.cs.iastate.edu/~cs544/Lectures/GenePrediction.pdf · ... Gene Prediction 2 Announcements Exam 2 - next Friday Posted

Gene Prediction 10/21/05

D Dobbs ISU - BCB 444/544X 1

10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 1

10/21/05

Gene Prediction

(formerly Gene Prediction - 3)

Announcements

Exam 2 - next Friday
Posted online: Exam 2 Study Guide

544 Reading Assignment (2 papers)

Announcements

544 Semester Projects - Information needed:

Please send email to me (or David)
[email protected]

Briefly describe:
• Your background & current grad research
• Is there a problem related to your research you would like to learn more about & develop as a project for this course?
or
• What would your ‘dream’ project be?

Announcements

2 Bioinformatics Seminars today (Fri Oct 21)

12:10 PM BCB Faculty Seminar in E164 Lagomarcino
“Protein Networks”
Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021

4:10 PM GDCB Special Seminar in 1414 MBB
“Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis”
Ron Mittler, Dept. of Biochem & Mol Biology, University of Nevada, Reno


Gene Prediction & Regulation

Mon - Gene structure review: Eukaryotes vs prokaryotes

Wed - Regulatory regions: Promoters & enhancers

Fri - Predicting genes - Predicting regulatory regions (?)

• Next week: Predicting RNA structure (miRNAs, too)


Optional Reading

Reviews:

1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709

http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html

2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287

http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

Review last lecture: Gene Regulation
(formerly Gene Prediction-2)

cDNAs & ESTs
UniGene

Regulatory regions
Eukaryotes vs prokaryotes

[Figure: DNA → RNA → protein → phenotype, with cDNA shown alongside RNA]

[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance

Pevsner p160


UniGene: unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution

Pevsner p164

Today: Gene Prediction
(formerly Gene Prediction - 3)

Predicting genes

Mon - Predicting regulatory regions
  Focus on promoters
  Introduction to RNA

Later: Genome browsers

Gene Prediction

• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?

• Algorithms
  • HMMs, discriminant functions, neural nets

• Gene prediction software
  • 3 major types
  • many, many programs!

Predicting Genes - Basic steps:
• Obtain genomic sequence
• Translate in all 6 reading frames
• Compare with protein sequence database
• Perform database similarity search with EST & cDNA databases, if available
• Use gene prediction program to locate genes
• Analyze gene regulatory sequences
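
The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and an uppercase or lowercase DNA string; the function names (`six_frame_translation`, `translate`, `reverse_complement`) are made up for this example and do not come from any of the programs named in these slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six protein strings can then be compared against a protein database, which is the next step in the list.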


Overview of gene prediction strategies

What sequence signals can be used?
  Transcription: TF binding sites, promoter, initiation site, terminator
  Processing signals: splice donor/acceptors, polyA signal
  Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
  ORFs, codon usage

What other types of information can be used?
  cDNAs & ESTs (experimental data, pairwise alignment)
  homology (sequence comparison, BLAST)
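
As a rough illustration of how the translation signals define ORFs, the sketch below scans both strands of a DNA string for ATG ... stop stretches (ATG, TAA, TAG, TGA are the DNA forms of AUG, UAA, UAG, UGA). The function names and the `min_codons` cutoff are assumptions for this example. Real gene finders combine ORFs with codon usage and the other signals listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq, min_codons=3):
    """Return sorted (start, end) ORFs on one strand, 0-based half-open.

    An ORF runs from the first ATG after the previous in-frame stop
    through the next in-frame stop codon (the stop is included and
    counted in min_codons).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def find_orfs(seq, min_codons=3):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    hits += [("-", s, e) for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return hits


example = "CCATGGCCAAATAGTT"
print(find_orfs(example))
```

In the example string the only complete ORF is ATG GCC AAA TAG on the forward strand.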

Automated gene prediction strategies

1) Similarity-based or Comparative
• BLAST - Do other organisms have similar sequence? (Is sequence similar to known gene or protein?)

2) Ab initio = “from the beginning”
• Predict without explicit comparison with cDNA or proteins via “rule-based” gene models - but rules are derived from statistical analysis of datasets

3) Combined "evidence-based"
• Combine gene models with alignment to known ESTs & protein sequences

BEST RESULTS? Combined

Examples of gene prediction software

1) Similarity-based or Comparative
• BLAST
• SGP2 (extension of GeneID)

2) Ab initio = “from the beginning”
• GeneID - (used in lab this week)
• GENSCAN - (used in lab this week)
• GeneMark.hmm - (should try this!)

3) Combined "evidence-based"
• GeneSeqer (Brendel et al., ISU)

BEST? GENSCAN, GeneMark.hmm, GeneSeqer
but depends on organism & specific task

Gene prediction: Eukaryotes vs prokaryotes

Gene prediction is easier in microbial genomes

Why?
• Smaller genomes
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously, mostly HMM-based
• Now: similarity-based methods, because so many genomes are available


GeneSeqer - Brendel et al.

http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

Thanks to Volker Brendel, ISU, for the following Figs & Slides

Slightly modified from: BBSI Genome Informatics Module

http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB

V Brendel [email protected]

Signals: Pre-mRNA Splicing

[Figure: genomic DNA (start codon to stop codon) → transcription → pre-mRNA (Cap- ... -Poly(A)) → splicing → mRNA (Cap- ... -Poly(A)) → translation → protein. Splice sites: donor site (GT) at the exon-intron boundary, acceptor site (AG) at the intron-exon boundary.]

Brendel 2005

Brendel - Spliced Alignment I: Compare with cDNA or EST probes

[Figure: genomic DNA (start codon to stop codon) aligned with an mRNA probe (Cap-, 5’-UTR, start codon, stop codon, 3’-UTR, -Poly(A))]

Brendel 2005

Brendel - Spliced Alignment II: Compare with protein probes

[Figure: genomic DNA (start codon to stop codon) aligned with a protein probe]

Brendel 2005

Brendel Spliced Alignment Algorithm

• Perform pairwise alignment with large gaps in one sequence (introns)
  • Align genomic DNA with cDNA, EST or protein

• Score semi-conserved sequences at splice junctions

• Score coding constraints in translated exons

Brendel 2005
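
To make the idea of "pairwise alignment with large gaps in one sequence" concrete, here is a toy dynamic-programming sketch. It is not the GeneSeqer algorithm: it is a simplified two-state (exon/intron) alignment of a cDNA to genomic DNA. It assumes a flat intron penalty, strictly canonical GT...AG introns, and made-up scores (`match`, `mismatch`, `gap`, `intron`). It also ignores the coding-constraint scoring mentioned above.

```python
def spliced_align(genomic, cdna, match=1, mismatch=-1, gap=-2, intron=-3):
    """Align all of cdna to part of genomic, allowing GT...AG introns.

    Returns (score, exons); exons are 0-based half-open genomic intervals.
    All scoring values are illustrative assumptions.
    """
    n, m = len(genomic), len(cdna)
    NEG = float("-inf")
    E = [[NEG] * (m + 1) for _ in range(n + 1)]   # best score ending in exon state
    I = [[NEG] * (m + 1) for _ in range(n + 1)]   # best score ending inside an intron
    bE = [[None] * (m + 1) for _ in range(n + 1)]
    bI = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            # Intron state: open right after a GT donor, or extend at no cost.
            if i >= 2 and genomic[i - 2:i] == "GT" and E[i - 2][j] + intron > I[i][j]:
                I[i][j], bI[i][j] = E[i - 2][j] + intron, "open"
            if i >= 1 and I[i - 1][j] > I[i][j]:
                I[i][j], bI[i][j] = I[i - 1][j], "extend"
            # Exon state.
            cands = []
            if j == 0:
                cands.append((0, "start"))          # free start anywhere in genomic
            if i >= 1 and j >= 1:
                s = match if genomic[i - 1] == cdna[j - 1] else mismatch
                cands.append((E[i - 1][j - 1] + s, "diag"))
            if i >= 1:
                cands.append((E[i - 1][j] + gap, "skip_g"))
            if j >= 1:
                cands.append((E[i][j - 1] + gap, "skip_c"))
            if i >= 2 and genomic[i - 2:i] == "AG":
                cands.append((I[i][j], "close"))    # intron ends with AG acceptor
            E[i][j], bE[i][j] = max(cands, key=lambda c: c[0])
    # Free end in genomic: pick the best cell in the last cDNA column.
    end = max(range(n + 1), key=lambda i: E[i][m])
    score = E[end][m]
    exons, i, j, state, exon_end = [], end, m, "E", end
    while True:
        if state == "E":
            move = bE[i][j]
            if move == "start":
                exons.append((i, exon_end))
                break
            if move == "diag":
                i, j = i - 1, j - 1
            elif move == "skip_g":
                i -= 1
            elif move == "skip_c":
                j -= 1
            else:                                   # "close": exon starts at i
                exons.append((i, exon_end))
                state = "I"
        elif bI[i][j] == "open":
            i -= 2
            exon_end, state = i, "E"
        else:
            i -= 1
    exons.reverse()
    return score, exons


# Toy input: exon ACCG, intron GTATTTAG, exon TTGCA, with flanking bases.
toy_genomic = "TT" + "ACCG" + "GTATTTAG" + "TTGCA" + "CC"
toy_cdna = "ACCGTTGCA"
toy_score, toy_exons = spliced_align(toy_genomic, toy_cdna)
print(toy_score, toy_exons)
```

In the toy case the nine cDNA bases match in two blocks separated by one GT...AG intron, so the alignment pays the intron penalty once instead of eight single-base gap penalties.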

Donor (GT) & Acceptor (AG) Sites Used for Model Training

                              Number of True Splice Sites / Phase
Species                Type        1        2        3
Homo sapiens           GT       6586     5277     3037
                       AG       6555     5194     2979
Mus musculus           GT       1212     1185      521
                       AG       1194     1139      504
Rattus norvegicus      GT        450      408      147
                       AG        442      386      140
Gallus gallus          GT        288      238      107
                       AG        284      228      103
Drosophila             GT        989      670      524
                       AG       1001      671      536
C. elegans             GT      37029    20500    20789
                       AG      36864    20325    20626
S. pombe               GT        170      118      119
                       AG        179      122      118
Aspergillus            GT        221      176      157
                       AG        217      172      163
Arabidopsis thaliana   GT      23019     9297     8653
                       AG      22929     9247     8611
Zea mays               GT        316      107       88
                       AG        311      104       83

Brendel 2005

Splice Site Detection

• Information content $I_i$:

$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 (f_{iB})$$

• Extent of splice signal window:

$$I_i \ge \bar{I} + 1.96\, \sigma_{\bar{I}}$$

$i$: ith position in sequence
$f_{iB}$: frequency of base $B$ at position $i$
$\bar{I}$: average information content over all positions $i$ > 20 nt from splice site
$\sigma_{\bar{I}}$: average standard deviation of $\bar{I}$

Brendel 2005
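
A minimal sketch of the information-content calculation, assuming a list of equal-length aligned RNA sites as input. The helper names are hypothetical. The window rule below uses the sample standard deviation of background positions as a stand-in for $\sigma_{\bar{I}}$, which is an assumption about how that quantity is estimated.

```python
import math
import statistics
from collections import Counter


def information_content(aligned_sites, alphabet="ACGU"):
    """Return I_i = 2 + sum_B f_iB * log2(f_iB) for every alignment column i."""
    profile = []
    for i in range(len(aligned_sites[0])):
        counts = Counter(site[i] for site in aligned_sites)
        total = sum(counts[b] for b in alphabet)
        info = 2.0
        for b in alphabet:
            f = counts[b] / total
            if f > 0:                     # treat 0 * log2(0) as 0
                info += f * math.log2(f)
        profile.append(info)
    return profile


def signal_window(profile, background):
    """Indices where I_i >= mean(background) + 1.96 * stdev(background).

    `background` holds I values from positions far from the splice site
    (the slide uses positions more than 20 nt away).
    """
    threshold = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(profile) if value >= threshold]


toy_sites = ["AGGU", "CGGU", "GGGU", "UGGU"]
toy_profile = information_content(toy_sites)
print(toy_profile)
print(signal_window([0.1, 0.1, 1.5, 2.0, 0.1], [0.1, 0.12, 0.08, 0.1]))
```

A column with all four bases equally frequent carries 0 bits; a fully conserved column (such as the G and U of the donor dinucleotide) carries the maximum of 2 bits.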

Results?

[Figure: eight panels, each with a vertical axis from 0.0 to 0.8 and a horizontal axis from -50 to 50. Panels: Human T2_GT, Human T2_AG, Human F1_AG, Human Fi_AG, A. thaliana T2_GT, A. thaliana T2_AG, A. thaliana F1_AG, A. thaliana Fi_AG]

Brendel 2005

Bayesian Splice Site Prediction

Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\, \mathrm{GT}\, s_1\, s_2\, s_3 \ldots s_r$

$$P\{H \mid S\} = \frac{P\{S \mid H\}\, P\{H\}}{\sum_{H} P\{S \mid H\}\, P\{H\}}$$

where $H$ indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site

$$P\{S\} = p\{s_{-l}\} \prod_{i=-l+1}^{r} p\{s_i \mid s_{i-1}\}$$

with each factor estimated from the corresponding position-specific frequencies $f$.

Brendel 2005
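
The posterior calculation can be sketched as follows. This is a toy under stated assumptions: it uses only two hypotheses (T and F) instead of seven, and a homogeneous first-order Markov chain instead of position-specific frequencies. The transition values in `toy_models` are invented for illustration and are not trained on real splice sites.

```python
import math


def markov_log_likelihood(seq, initial, transition):
    """log P{S|H} = log p(s_first) + sum_i log p(s_i | s_{i-1})."""
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp


def posterior(seq, models, priors):
    """Return P{H|S} for every hypothesis H via Bayes' rule (computed in log space)."""
    logs = {h: markov_log_likelihood(seq, *models[h]) + math.log(priors[h])
            for h in models}
    top = max(logs.values())
    weights = {h: math.exp(v - top) for h, v in logs.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Hypothetical toy models: "T" favours transitions into A, "F" is uniform.
ALPHABET = "ACGT"
uniform = {b: 0.25 for b in ALPHABET}
toy_models = {
    "T": (uniform, {a: {b: (0.7 if b == "A" else 0.1) for b in ALPHABET}
                    for a in ALPHABET}),
    "F": (uniform, {a: dict(uniform) for a in ALPHABET}),
}
toy_priors = {"T": 0.5, "F": 0.5}

print(posterior("AAAA", toy_models, toy_priors))
print(posterior("CCCC", toy_models, toy_priors))
```

Working in log space avoids numerical underflow when the sequence window is long.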

Bayes Factor as Decision Criterion

$H_0$: $H = T$:

$$BF = \frac{p\{T \mid S\} / (1 - p\{T \mid S\})}{p\{T\} / (1 - p\{T\})}$$

- 2-class model:

$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$

- 7-class model:

$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\, p\{T_x\} \Big/ \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\, p\{F_x\} \Big/ \sum_{x=1,2,0,i} p\{F_x\}}$$

Brendel 2005

Interpretation of Bayes Factor

in terms of critical value $c = 2 \ln BF$

• Positive evidence for $H_0$ if 2 ≤ c ≤ 6
• Strong support for $H_0$ if 6 ≤ c ≤ 10
• Very strong support for $H_0$ if c > 10

Brendel 2005
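
The decision rule can be written as a small helper. This is a sketch; the function name is hypothetical. The bands on the slide overlap at c = 6, so the code assigns that boundary to the higher band, which is an assumption. Values below 2 are reported as falling below the lowest band because the slide does not label them.

```python
import math


def interpret_bayes_factor(bf):
    """Return (c, label) with c = 2 ln BF and the evidence band for H0."""
    c = 2 * math.log(bf)
    if c > 10:
        return c, "very strong support for H0"
    if c >= 6:
        return c, "strong support for H0"
    if c >= 2:
        return c, "positive evidence for H0"
    return c, "below the lowest band"


for bf in (1, 5, 50, 1000):
    c, label = interpret_bayes_factor(bf)
    print(f"BF = {bf:>4}: c = {c:5.2f} -> {label}")
```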

Evaluation of Splice Site Prediction

                     Actual True     Actual False
Predicted True           TP              FP           PP = TP + FP
Predicted False          FN              TN           PN = FN + TN
                     AP = TP + FN    AN = FP + TN

• Sensitivity (= coverage): $S_n = TP / AP = 1 - \alpha$

• Specificity: $S_p = TP / PP = 1 - \beta\, \frac{AN}{PP} = \frac{1 - \alpha}{1 - \alpha + r\beta}$, with $r = AN / AP$

• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$

• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$

Brendel 2005
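
These measures follow directly from the four confusion-matrix counts. The sketch below uses a hypothetical function name and invented counts, and shows that the two forms of specificity agree.

```python
def splice_site_metrics(tp, fp, fn, tn):
    """Compute Sn, Sp, normalized specificity and error rates from counts."""
    ap, an = tp + fn, fp + tn          # actual positives / actual negatives
    alpha, beta = fn / ap, fp / an     # misclassification rates
    r = an / ap                        # ratio of false to true candidate sites
    return {
        "alpha": alpha,
        "beta": beta,
        "Sn": 1 - alpha,
        "Sp": (1 - alpha) / (1 - alpha + r * beta),
        "sigma": (1 - alpha) / (1 - alpha + beta),
    }


metrics = splice_site_metrics(tp=90, fp=30, fn=10, tn=270)
print(metrics)
print("TP / PP =", 90 / (90 + 30))
```

Because $\sigma$ sets $r = 1$, it does not depend on how many false sites happen to be in the test set, which is why it is more comparable across species than $S_p$.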

Species        Model  Site   Test Site Set     Bayes    Sn(%)   σ(%)   Sp(%)
                             True    False     Factor
Homo sapiens   2C     GT      921    44411       0      98.5    90.5   16.4
                                                 3      91.7    96.3   34.8
                                                 6      66.3    98.5   57.6
                      AG      920    65103       0      96.3    88.4    9.7
                                                 3      90.3    92.9   15.7
                                                 6      76.1    96.1   25.6
Drosophila     2C     GT      329    11501       0      95.4    94.8   34.1
                                                 3      90.0    97.6   53.6
                                                 6      83.9    99.1   75.0
                      AG      329    14920       0      95.7    94.8   28.7
                                                 3      92.1    97.0   41.4
                                                 6      85.1    98.5   59.4
C. elegans     7C     GT      400     7460       0      97.8    92.7   40.4
                                                 3      94.2    97.1   64.3
                                                 6      84.8    99.1   85.4
                      AG      400    10132       0      98.8    97.2   58.2
                                                 3      96.2    98.8   76.9
                                                 6      90.2    99.5   88.5
A. thaliana    7C     GT      613     9027       0      99.5    93.2   48.1
                                                 3      95.6    97.6   73.2
                                                 6      87.1    99.3   91.0
                      AG      614    10196       0      99.2    92.3   41.9
                                                 3      96.4    96.4   62.0
                                                 6      87.1    98.6   81.2

Brendel 2005

Performance?

[Figure: six panels plotting Sn and σ (vertical axis 0.00 to 1.00) over a horizontal axis from -10 to 20. Panels: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site]

Brendel 2005

Markov Model for Spliced Alignment

[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition labels: PΔG, PA(n)PΔG, (1-PΔG)PD(n+1), (1-PΔG)(1-PD(n+1)), 1-PA(n)]

Brendel 2005

Performance vs other methods

• Comparison with ab initio gene prediction programs?

• Depends on:
  • Availability of ESTs
  • Availability of protein homologs

Brendel 2005

GeneSeqer vs NAP vs GENSCAN (Exon prediction)

[Figure: exon (Sn + Sp) / 2 (0.00 to 1.00) plotted against target protein alignment score (0 to 100) for GeneSeqer, NAP and GENSCAN]

GENSCAN - Burge, MIT

Brendel 2005

GeneSeqer vs NAP vs GENSCAN (Intron prediction)

[Figure: intron (Sn + Sp) / 2 (0.00 to 1.00) plotted against target protein alignment score (0 to 100) for GeneSeqer, NAP and GENSCAN]

GENSCAN - Burge, MIT

GeneSeqer

[Figure: GeneSeqer workflow. Components shown: genomic sequence; EST or protein database (suffix array / suffix tree); fast search; spliced alignment; assembly; output]

Brendel 2005
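
The "fast search" stage relies on an index such as a suffix array. The sketch below shows the general technique only, not GeneSeqer's implementation: it builds a suffix array naively by sorting suffixes (fine for short strings, too slow for real databases) and finds all exact occurrences of a seed by binary search. The function names and example sequence are assumptions.

```python
def build_suffix_array(text):
    """Return start positions of all suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of every exact match of pattern in text."""
    k = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + k]

    # Lower bound: first suffix whose k-prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose k-prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


toy_text = "GATTACAGATTACA"
toy_sa = build_suffix_array(toy_text)
print(find_occurrences(toy_text, toy_sa, "ATTA"))
print(find_occurrences(toy_text, toy_sa, "CCC"))
```

Once the index exists, each seed lookup takes a number of string comparisons that grows with the logarithm of the text length. This lets a pipeline shortlist candidate sequences quickly before running the slower spliced alignment on them.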


Gene Structure Annotation - Problems

False positive intergenic region:
• 2 annotated genes actually correspond to a single gene

False negative intergenic region:
• One annotated gene structure actually contains 2 genes

False negative gene prediction:
• Missing gene (no annotation)

Other:
• partially incorrect gene annotation
• missing annotation of alternative transcripts

Brendel 2005

Other Resources

Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html

Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences