10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 1

10/21/05

Gene Prediction

(formerly Gene Prediction - 3)


Announcements

Exam 2 - next Friday
Posted online: Exam 2 Study Guide

544 Reading Assignment (2 papers)


Announcements

544 Semester Projects - Information needed:

Please send email to me (or David): [email protected]

Briefly describe:
• Your background & current grad research
• Is there a problem related to your research you would like to learn more about & develop as a project for this course?
or
• What would your ‘dream’ project be?


Announcements

2 Bioinformatics Seminars today (Fri Oct 21)

12:10 PM BCB Faculty Seminar in E164 Lagomarcino
“Protein Networks”
Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021

4:10 PM GDCB Special Seminar in 1414 MBB
“Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis”
Ron Mittler, Dept. of Biochem & Mol Biology, University of Nevada, Reno


Gene Prediction & Regulation

Mon - Gene structure review: Eukaryotes vs prokaryotes

Wed - Regulatory regions: Promoters & enhancers

Fri - Predicting genes - Predicting regulatory regions (?)

• Next week: Predicting RNA structure (miRNAs, too)


Reading Assignment

Mount Bioinformatics
• Chp 9 Gene Prediction & Regulation
• pp 361-385 Predicting Promoters
• Check Errata: http://www.bioinformaticsonline.org/help/errata2.html

* Brown Genomes 2 (NCBI textbooks online)
• Sect 9 Overview: Assembly of Transcription Initiation Complex
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002
• Sect 9.1-9.3 DNA binding proteins, Transcription initiation
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016

* NOTE: Don’t worry about the details!!


Optional Reading

Reviews:

1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html

2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html


Review last lecture: Gene Regulation

(formerly Gene Prediction-2)

cDNAs & ESTs
UniGene

Regulatory regions
Eukaryotes vs prokaryotes


[Figure: DNA → RNA → protein → phenotype, with cDNA copied from RNA. Numbered steps:]

[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance

Pevsner p160


UniGene: unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution

Pevsner p164


Today: Gene Prediction (formerly Gene Prediction - 3)

Predicting genes

Mon - Predicting regulatory regions
Focus on promoters
Introduction to RNA

Later: Genome browsers


Gene Prediction

• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?

• Algorithms
  • HMMs, discriminant functions, neural nets

• Gene prediction software
  • 3 major types
  • many, many programs!


Predicting Genes - Basic steps:

• Obtain genomic sequence
• Translate in all 6 reading frames
• Compare with protein sequence database
• Perform database similarity search with EST & cDNA databases, if available
• Use gene prediction program to locate genes
• Analyze gene regulatory sequences
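The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a plain DNA string; the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels "+1" to "-3" are hypothetical choices, not part of any tool named in these slides.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Translate three frames on each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # forward strand, first frame
```

Each translated frame could then be compared against a protein database, as in the next step of the list.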


Overview of gene prediction strategies

What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator
• Processing signals: splice donor/acceptor sites, polyA signal
• Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage

What other types of information can be used?
• cDNAs & ESTs (experimental data, pairwise alignment)
• Homology (sequence comparison, BLAST)
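As a small illustration of using the translation signals, the sketch below scans the forward strand for open reading frames that begin with ATG and end at the first in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are assumptions made for this example; real gene finders also weigh codon usage and other signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # UAA, UAG, UGA at the RNA level
START_CODON = "ATG"                  # AUG = Met

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop codon (hypothetical cutoff).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGGCCTGATT"))
```

Running the same scan on the reverse complement would cover the other three frames.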


Automated gene prediction strategies

1) Similarity-based or Comparative
• BLAST - Do other organisms have similar sequence? (Is sequence similar to known gene or protein?)

2) Ab initio = “from the beginning”
• Predict without explicit comparison with cDNA or proteins, via “rule-based” gene models - but rules are derived from statistical analysis of datasets

3) Combined "evidence-based"
• Combine gene models with alignment to known ESTs & protein sequences

BEST RESULTS? Combined


Examples of gene prediction software

1) Similarity-based or Comparative
• BLAST
• SGP2 (extension of GeneID)

2) Ab initio = “from the beginning”
• GeneID - (used in lab this week)
• GENSCAN - (used in lab this week)
• GeneMark.hmm - (should try this!)

3) Combined "evidence-based"
• GeneSeqer (Brendel et al., ISU)

BEST? GENSCAN, GeneMark.hmm, GeneSeqer, but depends on organism & specific task


Gene prediction: Eukaryotes vs prokaryotes

Gene prediction is easier in microbial genomes

Why?
• Smaller genomes
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously, mostly HMM-based
• Now: similarity-based methods, because so many genomes available


GeneSeqer - Brendel et al.

http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi


Thanks to Volker Brendel, ISU for following Figs & Slides

Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB

V Brendel [email protected]


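Signals: Pre-mRNA Splicing

[Figure: Genomic DNA (with start codon and stop codon) is transcribed into pre-mRNA (Cap- ... -Poly(A)), spliced into mRNA (Cap- ... -Poly(A)), and translated into protein. Exons and introns are labeled; the splice sites are the donor site (GT) at the start of an intron and the acceptor site (AG) at its end.]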

Brendel 2005


Brendel - Spliced Alignment I: Compare with cDNA or EST probes

[Figure: Genomic DNA (start codon to stop codon) aligned with an mRNA probe (Cap-, 5’-UTR, start codon, stop codon, 3’-UTR, -Poly(A)).]

Brendel 2005


Brendel - Spliced Alignment II: Compare with protein probes

[Figure: Genomic DNA (start codon to stop codon) aligned with a protein probe.]

Brendel 2005


Brendel Spliced Alignment Algorithm

• Perform pairwise alignment with large gaps in one sequence (introns)
• Align genomic DNA with cDNA, EST or protein

• Score semi-conserved sequences at splice junctions

• Score coding constraints in translated exons

Brendel 2005
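To make the idea of "pairwise alignment with large gaps (introns)" concrete, here is a deliberately simplified sketch. It is not the GeneSeqer algorithm: it requires exact matches between cDNA and genomic DNA, treats GT...AG as a hard rule instead of scoring splice sites, and ignores coding constraints. The function name `spliced_align` and the `min_intron` parameter are hypothetical, and the recursion is only suitable for short sequences.

```python
from functools import lru_cache

def spliced_align(genomic, cdna, min_intron=4):
    """Find exon intervals (0-based, end-exclusive) in genomic that spell out cdna.

    Every gap between exons must start with GT and end with AG.
    Returns a list of (start, end) tuples, or None if no placement exists.
    """
    g, c = genomic.upper(), cdna.upper()
    if not c:
        return []

    @lru_cache(maxsize=None)
    def solve(gi, ci):
        # Place cDNA base ci on genomic base gi, then extend to the right.
        if gi >= len(g) or g[gi] != c[ci]:
            return None
        if ci == len(c) - 1:
            return ((gi, gi + 1),)
        # Option 1: stay in the same exon.
        rest = solve(gi + 1, ci + 1)
        if rest is not None:
            return ((gi, rest[0][1]),) + rest[1:]
        # Option 2: open an intron (GT ... AG) right after this base.
        donor = gi + 1
        if g[donor:donor + 2] == "GT":
            for exon_start in range(donor + min_intron, len(g)):
                if g[exon_start - 2:exon_start] == "AG":
                    rest = solve(exon_start, ci + 1)
                    if rest is not None:
                        return ((gi, gi + 1),) + rest
        return None

    for start in range(len(g)):
        exons = solve(start, 0)
        if exons is not None:
            return list(exons)
    return None

genomic = "AA" + "ATGGC" + "GTAAGTTTAG" + "CCTGA" + "TT"
print(spliced_align(genomic, "ATGGCCCTGA"))
```

A scoring version would replace the exact-match and hard GT/AG tests with match/mismatch scores and splice site probabilities, which is where the remaining bullets on this slide come in.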


Donor (GT) & Acceptor (AG) Sites Used for Model Training

Number of True Splice Sites / Phase

Species                 Type    1        2        3
Homo sapiens            GT      6586     5277     3037
                        AG      6555     5194     2979
Mus musculus            GT      1212     1185     521
                        AG      1194     1139     504
Rattus norvegicus       GT      450      408      147
                        AG      442      386      140
Gallus gallus           GT      288      238      107
                        AG      284      228      103
Drosophila              GT      989      670      524
                        AG      1001     671      536
C. elegans              GT      37029    20500    20789
                        AG      36864    20325    20626
S. pombe                GT      170      118      119
                        AG      179      122      118
Aspergillus             GT      221      176      157
                        AG      217      172      163
Arabidopsis thaliana    GT      23019    9297     8653
                        AG      22929    9247     8611
Zea mays                GT      316      107      88
                        AG      311      104      83

Brendel 2005


Splice Site Detection

• Information content I_i:

$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 (f_{iB})$$

• Extent of splice signal window:

$$\bar{I} + 1.96\,\sigma_{\bar{I}} \le I_i$$

i: ith position in sequence
f_iB: frequency of base B at position i
Ī: average information content over all positions i > 20 nt from splice site
σ_Ī: average standard deviation of Ī

Brendel 2005
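A minimal sketch of the information content calculation, assuming a list of equal-length, pre-aligned splice site sequences. The function names `information_content` and `signal_positions` are hypothetical, and the background mean and standard deviation are passed in by the caller instead of being estimated from positions more than 20 nt from the site.

```python
import math

def information_content(sites):
    """Per-position I_i = 2 + sum over B of f_iB * log2(f_iB).

    sites: equal-length aligned sequences; T is treated as U.
    Bases with zero frequency contribute nothing to the sum.
    """
    n = len(sites)
    profile = []
    for i in range(len(sites[0])):
        column = [s[i].upper().replace("T", "U") for s in sites]
        weighted_log_sum = 0.0
        for base in "UCAG":
            f = column.count(base) / n
            if f > 0:
                weighted_log_sum += f * math.log2(f)
        profile.append(2.0 + weighted_log_sum)
    return profile

def signal_positions(profile, background_mean, background_sd):
    """Indices whose information content reaches mean + 1.96 * sd."""
    threshold = background_mean + 1.96 * background_sd
    return [i for i, value in enumerate(profile) if value >= threshold]

profile = information_content(["GU", "GU", "GU", "GC"])
print(profile)
```

A fully conserved position gives the maximum of 2 bits, and a position with all four bases equally frequent gives 0 bits.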


Results?

[Figure: eight panels, each with y-axis ticks from 0.0 to 0.8 and x-axis ticks from -50 to 50. Panel titles: Human T2_GT, Human Fi_AG, Human T2_AG, Human F1_AG, A. thaliana T2_GT, A. thaliana F1_AG, A. thaliana Fi_AG, A. thaliana T2_AG.]

Brendel 2005


Bayesian Splice Site Prediction

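Let S = s_{-l} s_{-l+1} s_{-l+2} … s_{-1} GT s_1 s_2 s_3 … s_r

$$P\{H \mid S\} = \frac{P\{S \mid H\}\,P\{H\}}{\sum_{H} P\{S \mid H\}\,P\{H\}}$$

where H indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site

$$P\{S\} = p\{s_{-l}\} \prod_{i=-l+1}^{r} p\{s_i \mid s_{i-1}\}$$

with the probabilities p estimated from the corresponding frequencies f in the training sites.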

Brendel 2005
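The sketch below illustrates the general idea: a position-specific first-order Markov chain is trained for each hypothesis, and Bayes' rule converts the likelihoods into posterior probabilities. It is an assumption-laden illustration, not the published model: the add-one pseudocounts, the function names (`train_markov`, `likelihood`, `posterior`), and the toy training sequences are all invented for this example.

```python
from collections import Counter

ALPHABET = "ACGT"

def train_markov(seqs, pseudo=1.0):
    """Estimate a position-specific first-order Markov model.

    seqs: equal-length training windows for one hypothesis.
    pseudo: pseudocount (an assumption; avoids zero probabilities).
    """
    n = len(seqs)
    first = Counter(s[0] for s in seqs)
    init = {b: (first[b] + pseudo) / (n + 4 * pseudo) for b in ALPHABET}
    trans = []
    for i in range(1, len(seqs[0])):
        pairs = Counter((s[i - 1], s[i]) for s in seqs)
        prev = Counter(s[i - 1] for s in seqs)
        trans.append({(a, b): (pairs[(a, b)] + pseudo) / (prev[a] + 4 * pseudo)
                      for a in ALPHABET for b in ALPHABET})
    return init, trans

def likelihood(seq, model):
    """P{S} under one model: p{s_first} times the product of p{s_i | s_(i-1)}."""
    init, trans = model
    p = init[seq[0]]
    for i in range(1, len(seq)):
        p *= trans[i - 1][(seq[i - 1], seq[i])]
    return p

def posterior(seq, models, priors):
    """P{H|S} = P{S|H} P{H} / sum over H of P{S|H} P{H}."""
    joint = {h: likelihood(seq, models[h]) * priors[h] for h in models}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}

# Toy two-hypothesis example (T = true site, F = false site).
models = {"T": train_markov(["CAGGTA"] * 5), "F": train_markov(["TTGTTT"] * 5)}
print(posterior("CAGGTA", models, {"T": 0.5, "F": 0.5}))
```

With seven hypotheses instead of two, only the `models` and `priors` dictionaries would grow; the posterior calculation stays the same.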


Bayes Factor as Decision Criterion

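H0: H = T:

$$BF = \frac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$$

- 2-class model:

$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$

- 7-class model:

$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\,p\{T_x\} \;/\; \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\,p\{F_x\} \;/\; \sum_{x=1,2,0,i} p\{F_x\}}$$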

Brendel 2005


Interpretation of Bayes Factor

in terms of critical value c = 2 ln BF

• Positive evidence for H0 if 2 ≤ c ≤ 6
• Strong support for H0 if 6 < c ≤ 10
• Very strong support for H0 if c > 10

Brendel 2005
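A short sketch of how the Bayes factor and the critical value c = 2 ln BF might be computed. The function names are hypothetical, the handling of the boundary at c = 6 is an assumption, and the label returned for c < 2 is a placeholder because the slide gives no category below 2.

```python
import math

def bayes_factor(posterior_true, prior_true):
    """Posterior odds of a true site divided by the prior odds."""
    posterior_odds = posterior_true / (1.0 - posterior_true)
    prior_odds = prior_true / (1.0 - prior_true)
    return posterior_odds / prior_odds

def bayes_factor_7class(lik_true, prior_true, lik_false, prior_false):
    """Prior-weighted average likelihood of true classes over that of false classes.

    Each argument is a dict keyed by class (true: phases 1, 2, 0;
    false: phases 1, 2, 0 and within-intron 'i').
    """
    num = sum(lik_true[x] * prior_true[x] for x in lik_true) / sum(prior_true.values())
    den = sum(lik_false[x] * prior_false[x] for x in lik_false) / sum(prior_false.values())
    return num / den

def interpret(bf):
    """Return (c, label) with c = 2 ln BF; requires bf > 0."""
    c = 2.0 * math.log(bf)
    if c > 10:
        return c, "very strong support for H0"
    if c > 6:
        return c, "strong support for H0"
    if c >= 2:
        return c, "positive evidence for H0"
    return c, "below the scale given on the slide"  # placeholder label

print(interpret(bayes_factor(0.9, 0.5)))
```

For example, a posterior of 0.9 against a prior of 0.5 gives a Bayes factor of about 9, which falls in the "positive evidence" band.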


Evaluation of Splice Site Prediction

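• Sensitivity (= coverage):

$$S_n = TP / AP = 1 - \alpha$$

• Specificity:

$$S_p = TP / PP = 1 - \beta\,\frac{AN}{PP} = \frac{1 - \alpha}{1 - \alpha + r\beta}, \qquad r = \frac{AN}{AP}$$

• Normalized specificity:

$$\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$$

• Misclassification rates:

$$\alpha = \frac{FN}{AP}, \qquad \beta = \frac{FP}{AN}$$

                   Actual True     Actual False
Predicted True     TP              FP               PP = TP + FP
Predicted False    FN              TN               PN = FN + TN
                   AP = TP + FN    AN = FP + TN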

Brendel 2005
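These measures can be computed directly from the four confusion matrix counts. The sketch below is illustrative only; the function name `evaluate` and the example counts are made up.

```python
def evaluate(tp, fp, fn, tn):
    """Compute Sn, Sp, normalized specificity, and misclassification rates."""
    ap = tp + fn                 # actual positives
    an = fp + tn                 # actual negatives
    alpha = fn / ap              # fraction of true sites missed
    beta = fp / an               # fraction of false sites accepted
    r = an / ap                  # ratio of actual negatives to actual positives
    return {
        "alpha": alpha,
        "beta": beta,
        "Sn": 1.0 - alpha,                                 # = TP / AP
        "Sp": (1.0 - alpha) / (1.0 - alpha + r * beta),    # = TP / PP
        "sigma": (1.0 - alpha) / (1.0 - alpha + beta),     # Sp when r = 1
    }

# Hypothetical counts: 100 true sites and 900 false sites.
print(evaluate(tp=90, fp=30, fn=10, tn=870))
```

Because false sites usually far outnumber true sites (large r), Sp can be low even when beta is small; the normalized specificity removes that dependence on r.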


                               Test Site Set     Bayes
Species         Model   Site   True     False    Factor   Sn (%)   σ (%)   Sp (%)
Homo sapiens    2C      GT     921      44411    0        98.5     90.5    16.4
                                                 3        91.7     96.3    34.8
                                                 6        66.3     98.5    57.6
                        AG     920      65103    0        96.3     88.4    9.7
                                                 3        90.3     92.9    15.7
                                                 6        76.1     96.1    25.6
Drosophila      2C      GT     329      11501    0        95.4     94.8    34.1
                                                 3        90.0     97.6    53.6
                                                 6        83.9     99.1    75.0
                        AG     329      14920    0        95.7     94.8    28.7
                                                 3        92.1     97.0    41.4
                                                 6        85.1     98.5    59.4
C. elegans      7C      GT     400      7460     0        97.8     92.7    40.4
                                                 3        94.2     97.1    64.3
                                                 6        84.8     99.1    85.4
                        AG     400      10132    0        98.8     97.2    58.2
                                                 3        96.2     98.8    76.9
                                                 6        90.2     99.5    88.5
A. thaliana     7C      GT     613      9027     0        99.5     93.2    48.1
                                                 3        95.6     97.6    73.2
                                                 6        87.1     99.3    91.0
                        AG     614      10196    0        99.2     92.3    41.9
                                                 3        96.4     96.4    62.0
                                                 6        87.1     98.6    81.2

Brendel 2005


Performance?

[Figure: six panels plotting Sn and σ (y-axis ticks from 0.00 to 1.00) against x-axis values from -10 to 20. Panel titles: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site.]

Brendel 2005


Markov Model for Spliced Alignment

[Figure: state diagram with states e_n, e_{n+1}, i_n, i_{n+1}. Transition labels shown: PG; PA(n)PG; (1-PG)PD(n+1); (1-PG)(1-PD(n+1)); 1-PA(n).]

Brendel 2005


Performance vs other methods

• Comparison with ab initio gene prediction programs?

• Depends on:
  • Availability of ESTs
  • Availability of protein homologs

Brendel 2005


GeneSeqer vs NAP vs GENSCAN (Exon prediction)

[Figure: exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) plotted against target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN.]

GENSCAN - Burge, MIT

Brendel 2005


GeneSeqer vs NAP vs GENSCAN (Intron prediction)

[Figure: intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) plotted against target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN.]

GENSCAN - Burge, MIT

Brendel 2005


GeneSeqer

[Figure: workflow diagram. Boxes: Genomic Sequence; EST or protein database (Suffix Array / Suffix Tree); Fast Search; Spliced Alignment; Assembly; Output.]

Brendel 2005


Gene Structure Annotation - Problems

False positive intergenic region:
• 2 annotated genes actually correspond to a single gene

False negative intergenic region:
• One annotated gene structure actually contains 2 genes

False negative gene prediction:
• Missing gene (no annotation)

Other:
• Partially incorrect gene annotation
• Missing annotation of alternative transcripts

Brendel 2005


Other Resources

Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html

Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences