10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 1
10/21/05
Gene Prediction
(formerly Gene Prediction - 3)
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 2
Announcements
Exam 2 - next Friday
Posted online: Exam 2 Study Guide
544 Reading Assignment (2 papers)
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 3
Announcements
544 Semester Projects - Information needed:
Please send email to me (or David) [email protected]
Briefly describe:
• Your background & current grad research
• Is there a problem related to your research you would like to learn more about & develop as a project for this course?
or
• What would your ‘dream’ project be?
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 4
Announcements
2 Bioinformatics Seminars today (Fri Oct 21)
12:10 PM BCB Faculty Seminar in E164 Lagomarcino
“Protein Networks”
Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021
4:10 PM GDCB Special Seminar in 1414 MBB
“Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis”
Ron Mittler, Dept. of Biochem & Mol Biology
University of Nevada, Reno
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 5
Gene Prediction & Regulation
Mon - Gene structure review: Eukaryotes vs prokaryotes
Wed - Regulatory regions: Promoters & enhancers
Fri - Predicting genes - Predicting regulatory regions (?)
• Next week: Predicting RNA structure (miRNAs, too)
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 6
Reading Assignment
Mount Bioinformatics
• Chp 9 Gene Prediction & Regulation
• pp 361-385 Predicting Promoters
• Check Errata: http://www.bioinformaticsonline.org/help/errata2.html
* Brown Genomes 2 (NCBI textbooks online)
• Sect 9 Overview: Assembly of Transcription Initiation Complex
• http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002
• Sect 9.1-9.3 DNA binding proteins, Transcription initiation
• http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016
* NOTE: Don’t worry about the details!!
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 7
Optional Reading
Reviews:
1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 8
Review last lecture: Gene Regulation
(formerly Gene Prediction-2)
cDNAs & ESTs
UniGene
Regulatory regions
Eukaryotes vs prokaryotes
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 9
[Diagram: DNA → RNA → protein → Phenotype; cDNA is derived from RNA]
[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance
Pevsner p160
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 10
UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution
Pevsner p164
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 11
Today: Gene Prediction (formerly Gene Prediction - 3)
Predicting genes
Mon - Predicting regulatory regions
• Focus on promoters
• Introduction to RNA
Later: Genome browsers
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 12
Gene Prediction
• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• HMMs, discriminant functions, neural nets
• Gene prediction software
• 3 major types
• many, many programs!
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 13
Predicting Genes - Basic steps:
• Obtain genomic sequence
• Translate in all 6 reading frames
• Compare with protein sequence database
• Perform database similarity search with EST & cDNA databases, if available
• Use gene prediction program to locate genes
• Analyze gene regulatory sequences
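The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: it assumes the standard genetic code, an input containing only A/C/G/T, and hypothetical function names (`six_frame_translation`, `translate`, `reverse_complement`). Codons with any other character are written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(dna):
    """Translate both strands in all three offsets; keys are '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, s in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each translated frame could then be compared against a protein database, which is the next step in the list.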
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 14
Overview of gene prediction strategies
What sequence signals can be used?
• Transcription: TF binding sites, promoter, initiation site, terminator
• Processing signals: splice donor/acceptors, polyA signal
• Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
• ORFs, codon usage
What other types of information can be used?
• cDNAs & ESTs (experimental data, pairwise alignment)
• homology (sequence comparison, BLAST)
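As a small illustration of using translation signals, the sketch below scans the three forward reading frames for open reading frames that begin at a start codon (AUG, written ATG in DNA) and end at the first in-frame stop codon (UAA, UAG or UGA). The function name `find_orfs` and the `min_codons` parameter are hypothetical, and real gene finders combine this signal with many others.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # UAA, UAG, UGA written as DNA
START_CODON = "ATG"                  # AUG = Met


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes the
    stop codon, and min_codons is the minimum number of codons before the stop.
    """
    dna = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGGTAGCC"))
```

Scanning the reverse strand would work the same way on the reverse complement of the sequence.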
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 15
Automated gene prediction strategies
1) Similarity-based or Comparative
• BLAST - Do other organisms have similar sequence? (Is sequence similar to known gene or protein?)
2) Ab initio = “from the beginning”
• Predict without explicit comparison with cDNA or proteins, via “rule-based” gene models - but rules are derived from statistical analysis of datasets
3) Combined “evidence-based”
• Combine gene models with alignment to known ESTs & protein sequences
BEST RESULTS? Combined
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 16
Examples of gene prediction software
1) Similarity-based or Comparative
• BLAST
• SGP2 (extension of GeneID)
2) Ab initio = “from the beginning”
• GeneID - (used in lab this week)
• GENSCAN - (used in lab this week)
• GeneMark.hmm - (should try this!)
3) Combined “evidence-based”
• GeneSeqer (Brendel et al., ISU)
BEST? GENSCAN, GeneMark.hmm, GeneSeqer, but depends on organism & specific task
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 17
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
• Smaller genomes
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
Methods?
• Previously, mostly HMM-based
• Now: similarity-based methods, because so many genomes are available
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 18
GeneSeqer - Brendel et al.
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 19
Thanks to Volker Brendel, ISU for following Figs & Slides
Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel [email protected]
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 20
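Signals: Pre-mRNA Splicing
[Diagram: Genomic DNA (start codon ... stop codon) → Transcription → pre-mRNA (Cap- ... -Poly(A)) → Splicing → mRNA (Cap- ... -Poly(A)) → Translation → Protein. Splice sites mark each exon/intron boundary: donor site (GT) at the 5’ end of the intron, acceptor site (AG) at the 3’ end.]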
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 21
Brendel - Spliced Alignment I: Compare with cDNA or EST probes
[Diagram: genomic DNA (start codon ... stop codon) aligned with an mRNA probe: Cap- 5’-UTR, start codon ... stop codon, 3’-UTR -Poly(A)]
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 22
Brendel - Spliced Alignment II: Compare with protein probes
[Diagram: genomic DNA (start codon ... stop codon) aligned with a protein probe]
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 23
Brendel Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence (introns)
• Align genomic DNA with cDNA, EST or protein
• Score semi-conserved sequences at splice junctions
• Score coding constraints in translated exons
Brendel 2005
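The idea of "pairwise alignment with large gaps in one sequence" can be illustrated with a deliberately simplified sketch. This is not the GeneSeqer algorithm: it assumes the cDNA matches the genomic sequence exactly within exons, requires every intron to start with GT and end with AG, and applies no scoring at all. The function name `spliced_exact_align` and the `min_intron` parameter are hypothetical.

```python
from functools import lru_cache


def spliced_exact_align(genomic, cdna, min_intron=4):
    """Find exon intervals (0-based, end-exclusive) in genomic whose concatenation
    equals cdna, with each intron starting with GT and ending with AG.
    Returns the first alignment found, or None if there is none."""
    genomic, cdna = genomic.upper(), cdna.upper()
    n, m = len(genomic), len(cdna)

    @lru_cache(maxsize=None)
    def extend(g, c):
        """Align cdna[c:] with an exon that starts at genomic[g]."""
        k = 0
        while g + k < n and c + k < m and genomic[g + k] == cdna[c + k]:
            k += 1
            end, used = g + k, c + k
            if used == m:
                return ((g, end),)  # whole cDNA consumed
            # Try to open an intron here: donor GT ... acceptor AG
            if genomic[end:end + 2] == "GT":
                for acc in range(end + min_intron, n + 1):
                    if genomic[acc - 2:acc] == "AG":
                        rest = extend(acc, used)
                        if rest is not None:
                            return ((g, end),) + rest
        return None

    for start in range(n):
        exons = extend(start, 0)
        if exons is not None:
            return list(exons)
    return None


genomic = "CC" + "ATGGC" + "GTAAGTTTAG" + "GATTAA" + "CC"
cdna = "ATGGC" + "GATTAA"
print(spliced_exact_align(genomic, cdna))
```

A scoring version would replace the exact-match test with match/mismatch scores and weight each candidate GT/AG pair by its splice site score, which is where the splice site models in the following slides come in.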
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 24
Donor (GT) & Acceptor (AG) Sites Used for Model Training

Number of True Splice Sites / Phase
Species                 Type        1        2        3
Homo sapiens            GT       6586     5277     3037
                        AG       6555     5194     2979
Mus musculus            GT       1212     1185      521
                        AG       1194     1139      504
Rattus norvegicus       GT        450      408      147
                        AG        442      386      140
Gallus gallus           GT        288      238      107
                        AG        284      228      103
Drosophila              GT        989      670      524
                        AG       1001      671      536
C. elegans              GT      37029    20500    20789
                        AG      36864    20325    20626
S. pombe                GT        170      118      119
                        AG        179      122      118
Aspergillus             GT        221      176      157
                        AG        217      172      163
Arabidopsis thaliana    GT      23019     9297     8653
                        AG      22929     9247     8611
Zea mays                GT        316      107       88
                        AG        311      104       83
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 25
Splice Site Detection
• Information Content I_i:
$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2(f_{iB})$$
• Extent of Splice Signal Window:
$$\bar{I} + 1.96\,\sigma_{\bar{I}} \le I_i$$
i : ith position in sequence
Ī : average information content over all positions i > 20 nt from splice site
σ_Ī : average standard deviation of Ī
Brendel 2005
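The information content formula can be computed directly from a set of aligned splice site sequences. The sketch below is illustrative only: it takes f_iB to be the frequency of base B at column i, treats T as U, and uses the sample standard deviation of background columns for σ_Ī, which is an assumption about how that quantity is estimated. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev


def information_content(aligned_sites, alphabet="ACGU"):
    """I_i = 2 + sum over B of f_iB * log2(f_iB), for each column i.

    Assumes equal-length sequences and at least one A/C/G/U(T) per column.
    """
    length = len(aligned_sites[0])
    if any(len(s) != length for s in aligned_sites):
        raise ValueError("all sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(s[i].upper().replace("T", "U") for s in aligned_sites)
        total = sum(counts[b] for b in alphabet)
        info = 2.0
        for b in alphabet:
            f = counts[b] / total
            if f > 0:  # treat 0 * log2(0) as 0
                info += f * math.log2(f)
        profile.append(info)
    return profile


def signal_window(profile, background):
    """Columns whose information content is at least mean + 1.96 * sd of the
    background values (background: columns far from the splice site)."""
    threshold = mean(background) + 1.96 * stdev(background)
    return [i for i, value in enumerate(profile) if value >= threshold]


sites = ["GA", "GC", "GG", "GU"]
print(information_content(sites))
```

A fully conserved column reaches the maximum of 2 bits, and a column with all four bases equally frequent gives 0 bits.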
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 26
[Figure: eight panels, each with a horizontal axis from -50 to 50 and a vertical axis from 0.0 to 0.8. Panels: Human T2_GT, Human Fi_AG, Human T2_AG, Human F1_AG, A. thaliana T2_GT, A. thaliana F1_AG, A. thaliana Fi_AG, A. thaliana T2_AG]
Results?
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 27
Bayesian Splice Site Prediction
Let S = s_{-l} s_{-l+1} s_{-l+2} … s_{-1} GT s_1 s_2 s_3 … s_r

$$P\{H \mid S\} = \frac{P\{S \mid H\}\,P\{H\}}{\sum_{H} P\{S \mid H\}\,P\{H\}}$$

where H indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site

$$P\{S\} = p\{s_{-l}\} \prod_{i=-l+1}^{r} p\{s_i \mid s_{i-1}\} = p\{s_{-l}\} \prod_{i=-l+1}^{r} \frac{f_{s_{i-1},\,s_i}}{f_{s_{i-1}}}$$
Brendel 2005
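A toy version of this calculation is sketched below. It is not the trained model from the slides: it uses a single position-independent first-order Markov chain per hypothesis, two hypotheses ("T" and "F") instead of seven, and made-up probabilities chosen only to make the arithmetic easy to follow. All names and numbers are hypothetical.

```python
def markov_likelihood(seq, initial, transition):
    """P{S|H} for a first-order Markov chain: p(s_first) * product of p(s_i | s_{i-1})."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


def posterior(likelihoods, priors):
    """Bayes' rule: P{H|S} = P{S|H} P{H} / sum over H of P{S|H} P{H}."""
    evidence = sum(likelihoods[h] * priors[h] for h in likelihoods)
    return {h: likelihoods[h] * priors[h] / evidence for h in likelihoods}


BASES = "ACGT"
uniform = {b: 0.25 for b in BASES}

# Hypothetical models: "T" favours G after A, "F" is uniform.
transition_true = {b: dict(uniform) for b in BASES}
transition_true["A"] = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
transition_false = {b: dict(uniform) for b in BASES}

seq = "AAG"
likelihoods = {
    "T": markov_likelihood(seq, uniform, transition_true),
    "F": markov_likelihood(seq, uniform, transition_false),
}
priors = {"T": 0.1, "F": 0.9}  # made-up prior probabilities
post = posterior(likelihoods, priors)
print(likelihoods, post)
```

With seven hypotheses the same `posterior` function applies unchanged; only the dictionaries of likelihoods and priors grow.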
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 28
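Bayes Factor as Decision Criterion
H0: H = T:

$$BF = \frac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$$

- 2-class model:

$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$

- 7-class model:

$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\,p\{T_x\} \Big/ \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\,p\{F_x\} \Big/ \sum_{x=1,2,0,i} p\{F_x\}}$$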
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 29
Interpretation of Bayes Factor
in terms of Critical Value c = 2 ln BF
• Positive evidence for H0 if 2 ≤ c ≤ 6
• Strong support for H0 if 6 < c ≤ 10
• Very strong support for H0 if c > 10
Brendel 2005
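The Bayes factor and its critical value can be worked through numerically. The sketch below assumes the 2-class case and uses made-up likelihoods and a made-up prior. The label returned for c < 2 is not from the slide and is only a placeholder. Function names are hypothetical.

```python
import math


def bayes_factor(posterior_true, prior_true):
    """BF for H0 (site is true): posterior odds divided by prior odds."""
    posterior_odds = posterior_true / (1.0 - posterior_true)
    prior_odds = prior_true / (1.0 - prior_true)
    return posterior_odds / prior_odds


def critical_value(bf):
    """c = 2 ln BF."""
    return 2.0 * math.log(bf)


def interpret(c):
    """Map a critical value to the evidence categories listed on the slide."""
    if c > 10:
        return "very strong"
    if c > 6:
        return "strong"
    if c >= 2:
        return "positive"
    return "not supported"  # placeholder label, not from the slide


# Made-up 2-class example: p{S|T} = 0.02, p{S|F} = 0.001, p{T} = 0.1
lik_true, lik_false, prior_true = 0.02, 0.001, 0.1
posterior_true = lik_true * prior_true / (
    lik_true * prior_true + lik_false * (1.0 - prior_true)
)
bf = bayes_factor(posterior_true, prior_true)
c = critical_value(bf)
print(bf, c, interpret(c))
```

In the 2-class case the prior cancels, so the Bayes factor equals the likelihood ratio p{S|T} / p{S|F}, which is 20 for these numbers.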
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 30
Evaluation of Splice Site Prediction

                     Actual True     Actual False
Predicted True           TP              FP           PP = TP + FP
Predicted False          FN              TN           PN = FN + TN
                     AP = TP + FN    AN = FP + TN

• Sensitivity: $S_n = TP / AP = 1 - \alpha$ (= Coverage)
• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
• Specificity: $S_p = TP / PP = \dfrac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = AN / AP$
• Normalized specificity: $\sigma = \dfrac{1 - \alpha}{1 - \alpha + \beta}$
Brendel 2005
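These measures follow directly from the four cells of the confusion matrix. The sketch below uses made-up counts and a hypothetical function name. It also checks numerically that TP / PP equals (1 - α) / (1 - α + rβ), and shows why normalized specificity is useful: it removes the dependence on r, the ratio of false to true sites in the test set.

```python
def splice_site_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and normalized specificity from confusion counts."""
    ap, an = tp + fn, fp + tn   # actual positives / actual negatives
    pp = tp + fp                # predicted positives
    alpha = fn / ap             # fraction of true sites missed
    beta = fp / an              # fraction of false sites predicted as true
    r = an / ap                 # ratio of false to true sites
    return {
        "Sn": tp / ap,
        "Sp": tp / pp,
        "sigma": (1 - alpha) / (1 - alpha + beta),
        "alpha": alpha,
        "beta": beta,
        "r": r,
    }


# Made-up counts: 100 true sites, 1000 false sites
metrics = splice_site_metrics(tp=90, fp=50, fn=10, tn=950)
sp_from_rates = (1 - metrics["alpha"]) / (
    1 - metrics["alpha"] + metrics["r"] * metrics["beta"]
)
print(metrics, sp_from_rates)
```

With ten times more false sites than true sites, Sp is about 0.64 while σ is about 0.95, even though the error rates α and β are the same in both.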
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 31
Species        Model   Site   Test Site Set        Bayes     Sn (%)   σ (%)   Sp (%)
                              True     False       Factor
Homo sapiens   2C      GT      921     44411         0        98.5     90.5    16.4
                                                     3        91.7     96.3    34.8
                                                     6        66.3     98.5    57.6
                       AG      920     65103         0        96.3     88.4     9.7
                                                     3        90.3     92.9    15.7
                                                     6        76.1     96.1    25.6
Drosophila     2C      GT      329     11501         0        95.4     94.8    34.1
                                                     3        90.0     97.6    53.6
                                                     6        83.9     99.1    75.0
                       AG      329     14920         0        95.7     94.8    28.7
                                                     3        92.1     97.0    41.4
                                                     6        85.1     98.5    59.4
C. elegans     7C      GT      400      7460         0        97.8     92.7    40.4
                                                     3        94.2     97.1    64.3
                                                     6        84.8     99.1    85.4
                       AG      400     10132         0        98.8     97.2    58.2
                                                     3        96.2     98.8    76.9
                                                     6        90.2     99.5    88.5
A. thaliana    7C      GT      613      9027         0        99.5     93.2    48.1
                                                     3        95.6     97.6    73.2
                                                     6        87.1     99.3    91.0
                       AG      614     10196         0        99.2     92.3    41.9
                                                     3        96.4     96.4    62.0
                                                     6        87.1     98.6    81.2
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 32
[Figure: six panels showing Sn and σ curves, each with a horizontal axis from -10 to 20 and a vertical axis from 0.00 to 1.00. Panels: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana AG site, A. thaliana GT site]
Brendel 2005
Performance?
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 33
Markov Model for Spliced Alignment
[Diagram: state-transition graph over exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition labels shown: PG; PA(n)PG; (1-PG)PD(n+1); (1-PG)(1-PD(n+1)); 1-PA(n)]
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 34
Performance vs other methods
• Comparison with ab initio gene prediction programs?
• Depends on:
• Availability of ESTs
• Availability of protein homologs
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 35
GeneSeqer vs NAP vs GENSCAN (Exon prediction)
[Figure: Exon (Sn + Sp) / 2 (vertical axis, 0.00 to 1.00) versus target protein alignment score (horizontal axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]
Brendel 2005
GENSCAN - Burge, MIT
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 36
GeneSeqer vs NAP vs GENSCAN (Intron prediction)
[Figure: Intron (Sn + Sp) / 2 (vertical axis, 0.00 to 1.00) versus target protein alignment score (horizontal axis, 0 to 100) for GeneSeqer, NAP, and GENSCAN]
Brendel 2005
GENSCAN - Burge, MIT
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 37
GeneSeqer
Genomic Sequence
EST or protein database
(Suffix Array/ Suffix Tree)
Fast Search
Spliced Alignment
Output
Assembly
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 44
Gene Structure Annotation - Problems
False positive intergenic region:
• 2 annotated genes actually correspond to a single gene
False negative intergenic region:
• One annotated gene structure actually contains 2 genes
False negative gene prediction:
• Missing gene (no annotation)
Other:
• partially incorrect gene annotation
• missing annotation of alternative transcripts
Brendel 2005
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 46
Other Resources
Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html
Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences