
BCB 444/544 F07 ISU Dobbs #28 - Promoter Prediction 10/29/07

BCB 444/544

Lecture 28

Gene Prediction - finish it

Promoter Prediction

#28_Oct29

Required Reading (before lecture)

Mon Oct 29 - Lecture 28

Promoter & Regulatory Element Prediction

• Chp 9 - pp 113 - 126

Wed Oct 31 - Lecture 29

Phylogenetics Basics

• Chp 10 - pp 127 - 141

Thurs Nov 1 - Lab 9

Gene & Regulatory Element Prediction

Fri Nov 2 - Lecture 30

Phylogenetic Tree Construction Methods & Programs

• Chp 11 - pp 142 - 169


Assignments & Announcements

Mon Oct 29 - HW#5 - will be posted today

HW#5 = Hands-on exercises with phylogenetics and tree-building software

Due: Mon Nov 5 (not Fri Nov 2 as previously posted)


BCB 544 "Team" Projects

Last week of classes will be devoted to Projects

• Written reports due:
  • Mon Dec 3 (no class that day)

• Oral presentations (20-30 min) will be:
  • Wed-Fri Dec 5, 6, 7
  • 1 or 2 teams will present during each class period

See Guidelines for Projects posted online


BCB 544 Only: New Homework Assignment

544 Extra#2

Due: √PART 1 - ASAP

PART 2 - meeting prior to 5 PM Fri Nov 2

Part 1 - Brief outline of Project, email to Drena & Michael

after response/approval, then:

Part 2 - More detailed outline of project

Read a few papers and summarize status of problem

Schedule meeting with Drena & Michael to discuss ideas


Seminars this Week

BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html

• Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Todd Yeates, UCLA: TBA - something cool about structure and evolution?

• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan, BBMB, ISU: Control of Protein Motions by Structure


Chp 8 - Gene Prediction

SECTION III GENE AND PROMOTER PREDICTION

Xiong: Chp 8 Gene Prediction

• Categories of Gene Prediction Programs

• Gene Prediction in Prokaryotes

• Gene Prediction in Eukaryotes

Computational Gene Prediction: Approaches

• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA

• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?

• Hybrid methods - best


Computational Gene Prediction: Algorithms

1. Neural Networks (NNs) (more on these later…)

e.g., GRAIL

2. Linear discriminant analysis (LDA) (see text)

e.g., FGENES, MZEF

3. Markov Models (MMs) & Hidden Markov Models (HMMs)

e.g., GeneSeqer - uses MMs

GENSCAN - uses 5th order HMMs - (see text)

HMMgene - uses conditional maximum likelihood (see text)


Signals Search

Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
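The signal-search approach can be illustrated with a small Python sketch: build a log-odds PSSM from a handful of aligned example sites, then slide it along a DNA sequence and report the best-scoring window. The example sites, the pseudocount, and the uniform 0.25 background are illustrative assumptions, not data from the lecture.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per motif position."""
    pssm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in "ACGT"
        })
    return pssm

def scan(seq, pssm):
    """Score every window of the sequence; return (position, score) pairs."""
    w = len(pssm)
    return [
        (i, sum(pssm[j][seq[i + j]] for j in range(w)))
        for i in range(len(seq) - w + 1)
    ]

# Hypothetical aligned sites (made up for illustration only).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATATT", "TATGAT"]
pssm = build_pssm(sites)
dna = "GGCGCTATAATGCGCC"
best_pos, best_score = max(scan(dna, pssm), key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

In practice a score threshold would be chosen from known sites, and every window above it would count as a detected instance.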


Content Search

Observation: Encoding a protein affects statistical properties of DNA sequence:

• Nucleotide/amino acid distribution
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
• Hexamer frequency - most discriminative of these for identifying coding potential

Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Human Codon Usage

Predicting Genes based on Codon Usage Differences

Algorithm: Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio:

$$\log\left(\frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$$

[Figure: Coding Profile of ß-globin gene, with exons marked]
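The sliding-window log-likelihood ratio might be sketched in Python as follows. This is only an illustration under stated assumptions: coding codon frequencies are estimated from a toy training sequence, non-coding DNA is modeled as uniform over the 64 codons, and only one reading frame is scored. The function names and sequences are hypothetical.

```python
import math
from collections import Counter

def codon_frequencies(training_seq, pseudocount=1.0):
    """Estimate in-frame codon frequencies from a known coding sequence."""
    codons = [training_seq[i:i + 3] for i in range(0, len(training_seq) - 2, 3)]
    counts = Counter(codons)
    total = len(codons) + 64 * pseudocount
    return lambda codon: (counts[codon] + pseudocount) / total

def coding_profile(seq, coding_freq, window=30, step=3):
    """log(P(S|coding) / P(S|non-coding)) for each window, frame 0 only."""
    noncoding = 1.0 / 64  # assumption: codons equally likely in non-coding DNA
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        llr = sum(
            math.log(coding_freq(seq[i:i + 3]) / noncoding)
            for i in range(start, start + window, 3)
        )
        profile.append((start, llr))
    return profile

# Toy data (made up): the first 30 nt reuse the training codons, the rest do not.
training = "GCTGCCGAA" * 20
seq = "GCTGCCGAA" * 3 + "GCT" + "TTTTACATA" * 3 + "TTT"
profile = coding_profile(seq, codon_frequencies(training))
for start, llr in profile:
    print(start, round(llr, 2))
```

Windows with a positive ratio are more likely coding than non-coding under this model; a real profile would score all six frames and use frequencies estimated from many known genes.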

Similarity-Based Methods: Database Search

In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)

Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.).

Problems:
• Will not find “new” or RNA genes (non-coding genes).
• Limits of similarity are hard to define
• Small exons might be overlooked

ATTGCGTAGGGCGCTTAACGCATCCCGCGA
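The six-frame translation step can be sketched in Python using the standard genetic code, applied here to the short sequence shown on the slide. The helper names are hypothetical. Real tools such as BLASTX and TBLASTX do this internally before searching; this sketch only shows the translation itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return the translations of frames +1..+3 and -1..-3."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

dna = "ATTGCGTAGGGCGCTTAACGCATCCCGCGA"  # sequence shown on the slide
frames = six_frame_translation(dna)
for name, protein in frames.items():
    print(name, protein)
```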

Similarity-Based Methods: Comparative Genomics

Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene

Advantages:
• May find uncharacterized or RNA genes

Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)

human  GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
         |     ||||| ||||| |||   ||||||||||||| |||||    |  |
mouse  C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
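As a small worked example, the percent identity of the human/mouse alignment above can be computed directly from the two gapped strings. Counting identity over all alignment columns, gaps included, is one convention among several and is an assumption here.

```python
def percent_identity(a, b):
    """Percent of alignment columns where both sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"
identity = percent_identity(human, mouse)
print(f"{identity:.1f}% identical over {len(human)} columns")
```

Computing the same value in a sliding window along a longer alignment would highlight the conserved, potentially functional regions.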

Human-Mouse Homology

Comparison of 1196 orthologous genes

• Sequence identity between genes in human vs mouse
  Exons: 84.6%
  Protein: 85.4%
  Introns: 35%
  5’ UTRs: 67%
  3’ UTRs: 69%


Thanks to Volker Brendel, ISU for the following Figs & Slides

Slightly modified from:

BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB

V Brendel [email protected]

Brendel et al (2004) Bioinformatics 20: 1157

Spliced Alignment Algorithm

GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences

• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM

• Score coding constraints in translated exons
  • Using Bayesian model

[Figure: intron bounded by splice sites, with GT at the donor site and AG at the acceptor site]

Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157

Brendel 2005

Splice Site Detection

Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal?

YES

• Information content $I_i$:

$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2(f_{iB})$$

• Extent of splice signal window:

$$\bar{I} + 1.96\,\sigma_{\bar{I}} \le I_i$$

i: ith position in sequence
Ī: avg information content over all positions >20 nt from splice site
σĪ: avg sample standard deviation of Ī

Brendel 2005
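The per-position information content can be sketched in Python as below, assuming that f_iB denotes the observed frequency of base B at alignment position i and that 0·log2(0) is taken as 0. The four aligned sites are made up for illustration; a real analysis would use thousands of annotated splice sites.

```python
import math

def information_content(aligned_sites, alphabet="ACGU"):
    """I_i = 2 + sum over B of f_iB * log2(f_iB), for each column i."""
    n = len(aligned_sites)
    profile = []
    for i in range(len(aligned_sites[0])):
        column = [s[i] for s in aligned_sites]
        info = 2.0
        for base in alphabet:
            f = column.count(base) / n
            if f > 0:  # convention: 0 * log2(0) = 0
                info += f * math.log2(f)
        profile.append(info)
    return profile

# Hypothetical donor-site alignment: invariant GU at columns 2-3.
sites = ["AGGUAAGU", "CAGUGAGU", "UGGUAUGA", "GAGUACGC"]
ic = information_content(sites)
for i, value in enumerate(ic):
    print(i, round(value, 3))
```

A column with all four bases equally frequent gives 0 bits and an invariant column gives 2 bits, so positions whose value exceeds the background mean by 1.96 standard deviations can be assigned to the signal window.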

Information Content vs Position

[Figure: two plots of information content (y-axis, 0.0 to 0.8) vs position relative to the splice site (x-axis, -50 to 50); panels HumanT2_GT and HumanT2_AG]

Which sequences are exons & which are introns? How can you tell?

Brendel 2005

Markov Model for Spliced Alignment

[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transitions are labeled PG, PA(n)PG, (1-PG)PD(n+1), (1-PG)(1-PD(n+1)) and 1-PA(n)]

Brendel 2005

Evaluation of Splice Site Prediction

Fig 5.11 Baxevanis & Ouellette 2005

TP = positive instance correctly predicted as positive
FP = negative instance incorrectly predicted as positive
TN = negative instance correctly predicted as negative
FN = positive instance incorrectly predicted as negative

Evaluation of Predictions

                   Actual True     Actual False
Predicted True     TP              FP              PP = TP + FP
Predicted False    FN              TN              PN = FN + TN
                   AP = TP + FN    AN = FP + TN

• Sensitivity (coverage, recall): $\text{Sensitivity} = \frac{TP}{AP} = 1 - \alpha$

• Specificity: $\text{Specificity} = \frac{TP}{PP} = \frac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = \frac{AN}{AP}$

• Misclassification rates: $\alpha = \frac{FN}{AP}$, $\beta = \frac{FP}{AN}$

• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$

[Figure: predicted positives divided into true positives and false positives]

Do not memorize this!
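These measures can be computed from the four counts with a few lines of Python. The sketch assumes the lecture's definition of specificity (TP/PP) and takes normalized specificity to be (1 - α)/(1 - α + β); the counts in the example are hypothetical.

```python
def evaluate(tp, fp, tn, fn):
    """Return the evaluation measures for one confusion matrix."""
    ap, an = tp + fn, fp + tn   # actual positives, actual negatives
    pp = tp + fp                # predicted positives
    alpha = fn / ap             # fraction of actual positives that were missed
    beta = fp / an              # fraction of actual negatives called positive
    return {
        "sensitivity": tp / ap,   # equals 1 - alpha
        "specificity": tp / pp,   # as defined in this lecture
        "alpha": alpha,
        "beta": beta,
        "normalized_specificity": (1 - alpha) / (1 - alpha + beta),
    }

# Hypothetical counts: 100 actual sites among 1000 candidates.
result = evaluate(tp=90, fp=30, tn=870, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With nine times more actual negatives than positives, specificity (0.75) is well below normalized specificity, which removes the effect of class imbalance by treating the two classes as equally sized.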

Evaluation of Predictions - in English

                   Actual True     Actual False
Predicted True     TP              FP              PP = TP + FP
Predicted False    FN              TN              PN = FN + TN
                   AP = TP + FN    AN = FP + TN

• Sensitivity = Coverage = Recall

In English? Sensitivity is the fraction of all positive instances having a true positive prediction.

• Specificity

In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.

IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")

IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!

Best Measures for Comparison?

• ROC curves (Receiver Operating Characteristic (?!!))

http://en.wikipedia.org/wiki/Roc_curve

In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate). Note that "specificity" in this definition is used in the medical sense (TN/AN), not in the sense of positive predictive value.

• Correlation coefficient: Matthews correlation coefficient (MCC)

MCC =  1 for a perfect prediction
       0 for a completely random assignment
      -1 for a "perfectly incorrect" prediction

Do not memorize this!
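For reference, MCC can be computed from the four confusion-matrix counts. The sketch below uses the standard definition, which this transcript does not spell out, and returns 0 when the denominator is zero, a common convention assumed here.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom

print(mcc(50, 0, 50, 0))     # perfect prediction
print(mcc(25, 25, 25, 25))   # no better than random
print(mcc(0, 50, 0, 50))     # every call wrong
```

Unlike sensitivity alone, MCC uses all four counts, so labeling every test case positive does not yield a high score.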

GeneSeqer: Input http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

Brendel 2005

GeneSeqer: Output

Brendel 2005

GeneSeqer: Gene Evidence Summary

Brendel 2005

Gene Prediction - Problems & Status?

Common errors?
• False positive intergenic regions:
  • 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
  • One annotated gene structure actually contains 2 genes
• False negative gene prediction:
  • Missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts

Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  • Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: Improved accuracy
  • Limitation? Availability of identifiable sequence homologs in databases


Recommended Gene Prediction Software

• Ab initio
  • GENSCAN: http://genes.mit.edu/GENSCAN.html
  • GeneMark.hmm: http://exon.gatech.edu/GeneMark/
  • others: GRAIL, FGENES, MZEF, HMMgene

• Similarity-based
  • BLAST, GenomeScan, EST2Genome, Twinscan

• Combined:
  • GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
  • ROSETTA

Consensus: because results depend on organisms & specific task, always use more than one program!

• Two servers that report consensus predictions
  • GeneComber
  • DIGIT


Other Gene Prediction Resources: at ISU http://www.bioinformatics.iastate.edu/bioinformatics2go/


Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.

Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)

Chapter 4 Finding Genes

4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html

http://cmgm.stanford.edu/classes/genefind/


Chp 9 - Promoter & Regulatory Element Prediction

SECTION III GENE AND PROMOTER PREDICTION

Xiong: Chp 9 Promoter & Regulatory Element Prediction

• Promoter & Regulatory Elements in Prokaryotes

• Promoter & Regulatory Elements in Eukaryotes

• Prediction Algorithms

Eukaryotes vs Prokaryotes: Genomes

Eukaryotic genomes
• Are packaged in chromatin & sequestered in a nucleus
• Are larger and have multiple linear chromosomes
• Contain mostly non-protein coding DNA (98-99%)

Prokaryotic genomes
• DNA is associated with a nucleoid, but no nucleus
• Much smaller, usually a single, circular chromosome
• Contain mostly protein encoding DNA

Eukaryotes vs Prokaryotes: Gene Structure

Eukaryotes vs Prokaryotes: Genes

Eukaryotic genes

• Are larger and more complex than in prokaryotes

• Contain introns that are “spliced” out to generate mature mRNAs*

• Often undergo alternative splicing, giving rise to multiple RNAs*

• Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)

* In biology, statements such as this include an implicit “usually” or “often”

Eukaryotes vs Prokaryotes: Levels of Gene Regulation

Primary level of control?

• Prokaryotes: Transcription initiation

• Eukaryotes: Transcription is also very important, but
  • Expression is regulated at multiple levels, many of which are post-transcriptional:
    • RNA processing, transport, stability
    • Translation initiation
    • Protein processing, transport, stability
    • Post-translational modification (PTM)
    • Subcellular localization

Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

Eukaryotes vs Prokaryotes: Regulatory Elements

• Prokaryotes:
  • Promoters & operators (for operons) - cis-acting DNA signals
  • Activators & repressors - trans-acting proteins (we won't discuss these…)

• Eukaryotes:
  • Promoters & enhancers (for single genes) - cis-acting
  • Transcription factors - trans-acting

• Important difference?
  • What the RNA polymerase actually binds

Prokaryotic Promoters

• RNA polymerase complex recognizes promoter sequences located very close to and on 5’ side (“upstream”) of transcription initiation site

• Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for “transcription factors” binding first

• Prokaryotic promoter sequences are highly conserved:
  • -10 region
  • -35 region
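Because the -35 and -10 regions are well conserved, a naive scan for them is easy to write. The Python sketch below rests on assumptions that are not taken from the slide: the E. coli σ70 consensus hexamers TTGACA (-35) and TATAAT (-10), a 15-19 bp spacer, and at most one mismatch per hexamer. The test sequence is made up.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacer=range(15, 20), max_mismatch=1):
    """Return (pos35, pos10, spacer_length) for candidate -35/-10 pairs."""
    hits = []
    for i in range(len(seq) - 6 + 1):
        if mismatches(seq[i:i + 6], box35) > max_mismatch:
            continue
        for gap in spacer:
            j = i + 6 + gap  # start of the candidate -10 hexamer
            if j + 6 <= len(seq) and mismatches(seq[j:j + 6], box10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits

# Hypothetical sequence: consensus boxes separated by a 17 bp spacer.
seq = "GCGC" + "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT" + "GCGCGCA"
hits = find_promoters(seq)
print(hits)
```

Real prokaryotic promoter predictors weight each position rather than counting mismatches, but the two-box-plus-spacer structure is the same.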


Eukaryotic Promoters

• Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences

• Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes

• Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:

• -30 region "TATA" box

• -100 region "CCAAT" box

Eukaryotic Promoters vs Enhancers

Both promoters & enhancers are binding sites for transcription factors (TFs)

• Promoters
  • essential for initiation of transcription
  • located “relatively” close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)

• Enhancers
  • needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
  • can be very far from start site (sometimes > 100 kb)

Eukaryotic genes are transcribed by 3 different RNA polymerases (Location of promoter regions, TFBSs & TFs differ, too)

BIOS Scientific Publishers Ltd, 1999

Brown Fig 9.18

mRNA

rRNA

tRNA, 5S RNA


Prokaryotic Genes & Operons

• Genes with related functions are often clustered within operons (e.g., lac operon)

• Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins

• mRNAs produced from operons are “polycistronic” - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule


Promoter of lac operon in E. coli

(Transcribed by prokaryotic RNA polymerase)

BIOS Scientific Publishers Ltd, 1999

Brown Fig 9.17

Eukaryotic genes

• Genes with related functions are occasionally, but not usually clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)

• Chromatin structure must also be “active” for transcription to occur

Eukaryotic genes have large & complex regulatory regions

• Cis-acting regulatory elements include: Promoters, enhancers, silencers

• Trans-acting regulatory factors include: Transcription factors (TFs), chromatin remodeling complexes, small RNAs

BIOS Scientific Publishers Ltd, 1999

Brown Fig 9.17

Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site

Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter

First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA

[Figure labels: ~250 bp; Pre-mRNA]

Eukaryotic promoters & enhancer regions often contain many different TFBS motifs

Fig 9.13 Mount 2004

Simplified View of Promoters in Eukaryotes

Fig 5.12 Baxevanis & Ouellette 2005

Eukaryotic Activators vs Repressors

Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)

Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription

Repressors block the action of activators

[Figure: enhancer and repressor sites located 100 - 50,000 bp from the promoter of a gene; enhancer proteins interact with RNAP at the promoter to drive transcription; a repressor prevents binding of the activator]

Eukaryotic Transcription Factors (TFs)

• Transcription factors = proteins that interact with the RNA polymerase complex to activate or repress transcription

• TFs often contain both:
  • a trans-activating domain
  • a DNA binding domain or motif (here motif = amino acid sequence in protein)

• TFs recognize and bind specific short DNA sequence motifs called “transcription factor binding sites” (TFBSs) (here motif = nucleotide sequence in DNA)

• Databases for TFs & TFBSs include:
  • TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
  • JASPAR

Zinc Finger Proteins - Transcription Factors

• Common in eukaryotic proteins

• ~ 1% of mammalian genes encode zinc-finger proteins (ZFPs)

• In C. elegans, there are > 500!

• Can be used as highly specific DNA binding modules

• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!

BIOS Scientific Publishers Ltd, 1999

Brown Fig 9.12

• Did you go to Dave Segal's seminar?
• Your TAs Pete & Jeff work on designing better ZFPs!


Promoter Prediction Algorithms & Software

Xiong -

Eukaryotes vs Prokaryotes: Promoter Prediction

Promoter prediction is much easier in prokaryotes

Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously: mostly HMM-based
• Now: similarity-based comparative methods, because so many genomes available

Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM - uses linear discriminant function


Predicting Promoters in Eukaryotes

Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify Transcription Start Site (TSS) (if possible!)
• Use Promoter Prediction Programs
• Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)

Predicting promoters: Steps & Strategies

Identify TSS - if possible?

• One of biggest problems is determining exact TSS!
  Not very many full-length cDNAs!

• Good starting point? (human & vertebrate genes)
  Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server

Fig 5.10 Baxevanis & Ouellette 2005

Automated Promoter Prediction Strategies

1) Pattern-driven algorithms (ab initio)

2) Sequence-driven algorithms (homology based)

3) Combined "evidence-based"

BEST RESULTS? Combined, sequential

1) Pattern-driven Algorithms

• Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)

• Tend to produce very large numbers of false positives (FPs)

• Why?
  • Binding sites for specific TFs are often variable
  • Binding sites are short (typically 6-10 bp)
  • Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs

• Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this
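A rough back-of-the-envelope sketch of why short sites give so many false positives. It is not from the lecture: the function name is hypothetical, and it assumes random DNA with equally frequent, independent bases and an exact (non-degenerate) site, which real genomes and real TFBSs do not satisfy. Treat the numbers as order-of-magnitude only.

```python
def expected_chance_hits(site_length, sequence_length, both_strands=True):
    """Expected number of exact matches to one fixed site in random DNA.

    Assumption (illustrative only): each base occurs independently with
    probability 0.25. Palindromic sites would be double counted when
    both_strands is True.
    """
    # Number of windows of this length that fit in the sequence.
    positions = max(sequence_length - site_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** site_length


if __name__ == "__main__":
    genome_length = 3_000_000_000  # rough size of the human genome, in bp
    for k in (6, 8, 10):
        hits = expected_chance_hits(k, genome_length)
        print(f"{k} bp site: ~{hits:,.0f} chance matches")
```

Allowing variable positions in the site (as real TFBSs do) only increases the expected number of chance matches.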

Ways to Reduce FPs in ab initio Prediction

• Take sequence context/biology into account
  Eukaryotes: clusters of TFBSs are common
  Prokaryotes: knowledge of σ (sigma) factors helps
• Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
  But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
• Do the wet lab experiments!
  But: Promoter-bashing can be tedious…
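One way to use the "clusters of TFBSs" idea can be sketched as a simple filter on predicted sites: keep only hits that fall in a short window containing several hits. This is an illustration, not a published method. The function name, the 200 bp window, and the minimum of 3 hits are arbitrary assumptions, and the hit list in the usage note is made-up placeholder data.

```python
def filter_clustered_hits(hits, window=200, min_hits=3):
    """Keep predicted sites that lie in a window containing >= min_hits sites.

    hits: list of (position, tf_name) tuples on one sequence.
    window and min_hits are arbitrary illustrative thresholds.
    """
    ordered = sorted(hits)
    keep = set()
    start = 0
    for end in range(len(ordered)):
        # Move the left edge until the span is at most `window` bp.
        while ordered[end][0] - ordered[start][0] > window:
            start += 1
        # If enough hits share this window, keep all of them.
        if end - start + 1 >= min_hits:
            keep.update(range(start, end + 1))
    return [ordered[i] for i in sorted(keep)]
```

For example, with hits at positions 100, 150, 220, 5000, 9000 and 9050, only the first three survive: the hit at 5000 is isolated and the pair near 9000 is below the minimum.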

2) Sequence-driven Algorithms

• Assumption: Common functionality can be deduced from sequence conservation (Homology)

• Alignments of co-regulated genes should highlight elements involved in regulation

Careful: How to determine co-regulation?
1. Orthologous genes from different species
2. Genes experimentally shown to be co-regulated (using microarrays??)

Comparative promoter prediction:
• Phylogenetic footprinting
• Expression profiling

Phylogenetic Footprinting

• Based on increasing availability of whole genome DNA sequences from many different species

• Selection of organisms for comparison is important
  • not too close, not too far: good = human vs mouse
• To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
  • use MSA algorithms (e.g., CLUSTAL)
  • more sensitive methods
    • Gibbs sampling
    • Expectation Maximization (EM) methods
• Examples of programs:
  • Consite, rVISTA, PromH(W), Bayes aligner, Footprinter
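The core of the footprinting idea can be sketched as a sliding-window identity scan. This is a minimal illustration, not what Consite, rVISTA or the other listed programs actually do. It assumes the two non-coding sequences have already been aligned (for example with an MSA program), and the function name, window size, and identity cutoff are hypothetical choices.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.9):
    """Return (start, identity) for alignment windows at or above min_identity.

    aligned_a / aligned_b: equal-length rows of a pairwise alignment, with '-'
    marking gaps. Gap columns count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 for an identical, non-gap column; 0 otherwise.
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    results = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            results.append((start, identity))
    return results
```

Windows that pass the cutoff are only candidate footprints. If the whole region is highly conserved, nearly every window passes and the scan tells you nothing.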

Expression Profiling

• Based on increasing availability of whole genome mRNA expression data, esp., microarray data

• High-throughput simultaneous monitoring of expression levels of thousands of genes

• Assumptions: (sometimes valid, sometimes NOT)
  1. Co-expression implies co-regulation
  2. Co-regulated genes share common regulatory elements
• Drawbacks:
  1. Signals are short & weak! Requires Gibbs sampling or EM: e.g., MEME, AlignACE, Melina
  2. Prediction depends on determining which genes are co-expressed - usually by clustering - which can be error prone
• Examples of programs:
  • INCLUSive - combined microarray analysis & motif detection
  • PhyloCon - combined phylo footprinting & expression profiling
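The "which genes are co-expressed?" step can be sketched, very roughly, as a thresholded Pearson correlation between expression profiles. This is a simplification offered as an assumption: real analyses normally use proper clustering, and the function names and the 0.9 cutoff are hypothetical. The upstream regions of the resulting gene groups would then go to a motif finder such as those named above.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold.

    profiles: dict mapping gene name -> list of expression values,
    one value per condition (same conditions, same order, for every gene).
    """
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            pairs.append((g1, g2))
    return pairs
```

A high correlation shows co-expression only. Whether the pair is truly co-regulated is exactly the assumption flagged above as "sometimes valid, sometimes NOT".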

Problems with Sequence-driven Algorithms

• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations or inversions that change order of functional elements
  • If background conservation of entire region is high, comparison is useless
  • Not enough data (but Prokaryotes >>> Eukaryotes)
• Complexity: many regulatory elements are not conserved across species!

TRANSFAC Matrix Entry: for TATA box

Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build
• Other info

Fig 5.13, Baxevanis & Ouellette 2005
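How a weight matrix like the one in such an entry is typically used can be sketched as follows: per-position base counts are converted to log-odds scores, and the matrix is slid along a sequence. This is a generic illustration, not TRANSFAC's own scoring scheme. The function names, the pseudocount, and the uniform background are assumptions, and any counts passed in (such as those in the usage note) are made up, not taken from the real TATA box entry.

```python
from math import log2

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts, one per motif position, mapping base -> count.
    pseudocount and the uniform background are illustrative assumptions.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: log2((column.get(b, 0) + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def best_match(sequence, matrix):
    """Return (score, start) of the best-scoring window.

    Assumes an uppercase A/C/G/T sequence at least as long as the matrix.
    """
    width = len(matrix)
    scored = [
        (sum(matrix[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

For instance, a toy four-column matrix built from `[{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]` places its best match at the TATA in "GGCTATAGC". In practice every window above some score cutoff is reported, which is where the false positives come from.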

Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS)

Fig 5.14, Baxevanis & Ouellette 2005

Annotated Lists of Promoter Databases & Promoter Prediction Software

• URLs from Mount textbook: Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html

• Table in Wasserman & Sandelin Nat Rev Genet article http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm

• URLs from Baxevanis & Ouellette textbook: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links

More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter

• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104

• http://www3.oup.co.uk/nar/database/subcat/1/4/

Check out Optional Review & Try Associated Tutorial:

Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287

http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

Check this out: http://www.phylofoot.org/NRG_testcases/

Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!