Page 1: gene prediction

gene prediction

Roderic Guigó i Serra, IMIM/UPF/CRG

Page 2: gene prediction

number of genes in chromosome 22

• initial annotation: 545 (Dunham et al., 1999)

• genscan + RT-PCR: 590 (Das et al., 2001)

• genscan + microarrays: 730 (Shoemaker et al., 2001)

• reviewed annotation: 726 (chr22 team, Sanger, 2001)

• mouse shotgun data: +20 (our data)

• geneid predictions: 794

• genscan predictions: 1128

Page 3: gene prediction

number of genes in human genome

• Consortium: 30,000-40,000 (2001)

• Celera: 27,000-38,000 (2001)

• Consortium + Celera: 50,000 (Hogenesch et al., 2001)

• DB searches: 65,000-75,000 (Wright et al., 2001)

• Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)

Page 4: gene prediction

decoding the genome

ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAAC
CCCGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC

the human genome sequence

Page 5: gene prediction

QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP

the amino acid sequence of the proteins

Page 6: gene prediction

gene structure

[Figure: diagram of a gene, labeled with upstream regulatory element, promoter, exons, introns, and downstream regulatory element]

Page 7: gene prediction

from DNA to RNA

Page 8: gene prediction

from RNA to protein

Page 9: gene prediction

molecular mechanism

Page 10: gene prediction

Prediction of splice sites
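The slide text does not say which model is used for splice sites. One common approach, shown here only as a minimal sketch and not as the method behind this slide, is a log-odds position weight matrix (PWM) built from aligned known sites and slid along the sequence. The toy donor sites, the pseudocount, the uniform background, and all function names below are illustrative assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned sites of equal length.

    Assumes a uniform background frequency (0.25 per base) and adds a
    pseudocount so that unseen bases do not produce log(0).
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_site(pwm, window):
    """Sum the per-position log-odds for one candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence, threshold=0.0):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = score_site(pwm, window)
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy training set (an assumption, not real data): 3 exonic bases followed
# by 6 intronic bases, with the GT dinucleotide at positions 3-4.
TOY_DONORS = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGC", "CAGGTATGT"]

if __name__ == "__main__":
    pwm = build_pwm(TOY_DONORS)
    for start, score in scan(pwm, "TTTCAGGTAAGTCCC", threshold=5.0):
        print(start, round(score, 2))
```

A real predictor would train on thousands of annotated sites and often uses richer models (for example, Markov dependencies between positions); the threshold trades sensitivity against specificity.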

Page 11: gene prediction

accuracy of gene prediction programs

Page 12: gene prediction

accuracy of gene prediction programs

Page 13: gene prediction

accuracy of gene prediction programs

Page 14: gene prediction

• rosetta (Batzoglou et al., 2000)

• cem (Bafna and Huson, 2000)

• sgp1 (Wiehe et al., 2000)

• twinscan (Korf et al., 2001)

• slam (Pachter et al., 2001)

• sgp2 (Guigó et al., in preparation)

comparative gene prediction

Page 15: gene prediction

query sequence

tblastx HSPs

geneid exons

HSP projections

SGP exons

syntenic gene prediction (sgp2)
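The labels on this slide suggest a pipeline in which tblastx HSPs are projected onto the query sequence and then used to adjust geneid exon candidates. The sketch below illustrates that idea only: the projection keeps the best HSP score at each position, and the rescoring rule, the `weight` parameter, and all names are assumptions, not the actual sgp2 scoring scheme.

```python
from typing import List, Tuple

# (start, end, score) on the query sequence, 0-based, end exclusive
Interval = Tuple[int, int, float]

def project_hsps(hsps: List[Interval]) -> List[Interval]:
    """Collapse overlapping HSPs into non-overlapping segments.

    Each output segment carries the maximum score among the HSPs that
    cover it; adjacent segments with the same score are merged.
    """
    points = sorted({p for start, end, _ in hsps for p in (start, end)})
    segments: List[Interval] = []
    for left, right in zip(points, points[1:]):
        covering = [s for start, end, s in hsps if start <= left and right <= end]
        if not covering:
            continue  # gap between HSPs
        best = max(covering)
        if segments and segments[-1][1] == left and segments[-1][2] == best:
            segments[-1] = (segments[-1][0], right, best)
        else:
            segments.append((left, right, best))
    return segments

def rescore_exon(exon: Interval, projection: List[Interval], weight: float = 0.1) -> float:
    """Add a similarity bonus to an exon score (hypothetical rule).

    The bonus is the projected score averaged over the exon length,
    scaled by `weight`. Exons without HSP support keep their score.
    """
    start, end, score = exon
    bonus = 0.0
    for seg_start, seg_end, seg_score in projection:
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            bonus += seg_score * overlap / (end - start)
    return score + weight * bonus

if __name__ == "__main__":
    hsps = [(10, 50, 30.0), (40, 80, 50.0), (100, 120, 20.0)]
    projection = project_hsps(hsps)
    print(projection)
    print(rescore_exon((30, 60, 5.0), projection))
```

Projecting before rescoring means that a region hit by many redundant HSPs is not counted several times over.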

Page 16: gene prediction

benchmarking sgp2 - accuracy

scimog

mit

Page 17: gene prediction

Predicting “novel” genes in the human genome

[Figure: pipeline diagram, labeled with golden path annotations; additional blastn matches to ENSEMBL + REFSEQ; tblastx; geneid exons; sgp genes]

Golden Path Oct 7, 2000 freeze, RepeatMasked

TraceDB, as of February 2001

Page 18: gene prediction

“novel” genes ?

• 48,890 genic regions (known genes or similar)

• 15,489 genes longer than 100 aa predicted by sgp

• 13,302 non redundant predictions

• 8,416 supported by tblastx hits to mouse 1.5

• 3,331 predicted genes with at least two exons supported by tblastx hits

• + 719 predicted genes supported by tblastx hits covering at least 75% of the prediction

4,050 supported sgp predictions

25% of them not overlapping genscan predictions

Page 19: gene prediction

validation of predictions

EST identity 18%

NR similarity 31%

CDD (NCBI) 24%

Mouse ESTs 28%

Rat ESTs 19%

Tetraodon 15%

at least one of the above 56%

Page 20: gene prediction

Experimental validation

Page 21: gene prediction

chr22

chr21

human genome vs. Mouse traceDB

Page 22: gene prediction

              SN    SP    CC    SNe   SPe   SNSP  ME    WE
chr22.assem.  0.87  0.65  0.75  0.69  0.54  0.62  0.14  0.33
chr22.shot.   0.82  0.66  0.72  0.63  0.54  0.58  0.20  0.31

human genome vs. Mouse assemblies
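The column headers in the accuracy table are not defined on the slide. Assuming they follow the usual conventions of the gene-finding literature (SN and SP as nucleotide-level sensitivity and specificity, CC as the correlation coefficient, SNe and SPe as exon-level sensitivity and specificity, SNSP as their mean, ME and WE as the fractions of missed and wrong exons), they could be computed as in this sketch. The function names and the toy coordinates are illustrative assumptions.

```python
import math

def nucleotide_accuracy(actual, predicted, length):
    """Nucleotide-level SN, SP and CC for exons given as (start, end), end exclusive.

    Note that SP here is TP / (TP + FP), as is customary in gene prediction.
    Assumes at least one actual and one predicted coding base.
    """
    real = {i for s, e in actual for i in range(s, e)}
    pred = {i for s, e in predicted for i in range(s, e)}
    tp = len(real & pred)
    fp = len(pred - real)
    fn = len(real - pred)
    tn = length - tp - fp - fn
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else float("nan")
    return {"SN": tp / (tp + fn), "SP": tp / (tp + fp), "CC": cc}

def exon_accuracy(actual, predicted):
    """Exon-level SNe, SPe, SNSP, ME and WE.

    An exon counts as correct only if both boundaries match exactly;
    missed and wrong exons are those with no overlap at all.
    Assumes both exon lists are non-empty.
    """
    real, pred = set(actual), set(predicted)

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    correct = len(real & pred)
    missed = sum(1 for a in real if not any(overlaps(a, p) for p in pred))
    wrong = sum(1 for p in pred if not any(overlaps(p, a) for a in real))
    sne = correct / len(real)
    spe = correct / len(pred)
    return {
        "SNe": sne,
        "SPe": spe,
        "SNSP": (sne + spe) / 2,
        "ME": missed / len(real),
        "WE": wrong / len(pred),
    }

if __name__ == "__main__":
    # Toy annotation and prediction on a 1000 bp sequence (not real data).
    actual = [(100, 200), (300, 400), (500, 600)]
    predicted = [(100, 200), (310, 400), (700, 750)]
    print(nucleotide_accuracy(actual, predicted, 1000))
    print(exon_accuracy(actual, predicted))
```

Under these assumed definitions the table is internally consistent: for chr22.assem., (0.69 + 0.54) / 2 = 0.615, which agrees with the reported SNSP of 0.62 to within rounding.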

Page 23: gene prediction

                  chr22   chr21
predicted           776     420
known              -655    -326
low complexity      -25      -5
short               -26     -11
intronless          -19     -34
                     45      36

testing novel predictions experimentally

In total, 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.

Page 24: gene prediction

Positive controls              N    Success rate
refseq                         78   96%
Known tissue specific genes    20   25%
Low expressing genes           13   Not ready
Twinscan with EST support           Not ready

Test sets                      N    Success rate
Twinscan                            Not ready
SGP                            40   28%

preliminary results

Page 25: gene prediction

acknowledgments

IMIM-UPF-CRG, Barcelona

• Josep F. Abril

• Genís Parra

• Roderic Guigó

GlaxoSmithKline, King of Prussia

• Pankaj Agarwal

Max Planck Institute for Chemical Ecology, Jena

• Thomas Wiehe

Whitehead Institute/MIT Center for Genome Research, Cambridge

• Gwen Acton

• Dan Brown

• Kerstin

Mouse Sequence Consortium