gene prediction roderic guigó i serra imim/upf/crg

Post on 17-Jan-2016


Page 1: Gene prediction roderic guigó i serra IMIM/UPF/CRG

gene prediction

roderic guigó i serra

IMIM/UPF/CRG

Page 2: Gene prediction roderic guigó i serra IMIM/UPF/CRG

number of genes in chromosome 22

• initial annotation 545 Dunham et al., 1999

• genscan+RT-PCR 590 Das et al., 2001

• genscan+microarrays 730 Shoemaker et al., 2001

• reviewed annotation 726 chr22 team, Sanger, 2001

• mouse shotgun data +20 (our data)

• geneid predictions 794

• genscan predictions 1128

Page 3: Gene prediction roderic guigó i serra IMIM/UPF/CRG

number of genes in human genome

• Consortium 30,000-40,000 2001

• Celera 27,000-38,000 2001

• Consortium+Celera 50,000 Hogenesch et al., 2001

• DBsearches 65,000-75,000 Wright et al., 2001

• HumanGenomeSciences 90,000-120,000 Haseltine, 2001

Page 4: Gene prediction roderic guigó i serra IMIM/UPF/CRG

decoding the genome

ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAAC
CCCGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCACCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC

the human genome sequence

Page 5: Gene prediction roderic guigó i serra IMIM/UPF/CRG

QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP

the amino acid sequence of the proteins

Page 6: Gene prediction roderic guigó i serra IMIM/UPF/CRG

EXONS

INTRONS

‘UPSTREAM’ REGULATORY ELEMENT

‘DOWNSTREAM’ REGULATORY ELEMENT

PROMOTER

Gene structure

Page 7: Gene prediction roderic guigó i serra IMIM/UPF/CRG

From DNA to RNA

Page 8: Gene prediction roderic guigó i serra IMIM/UPF/CRG

From RNA to protein

Page 9: Gene prediction roderic guigó i serra IMIM/UPF/CRG

Molecular mechanism

Page 10: Gene prediction roderic guigó i serra IMIM/UPF/CRG

Prediction of splice sites

Page 11: Gene prediction roderic guigó i serra IMIM/UPF/CRG

accuracy of gene prediction programs

Page 12: Gene prediction roderic guigó i serra IMIM/UPF/CRG

accuracy of gene prediction programs

Page 13: Gene prediction roderic guigó i serra IMIM/UPF/CRG

accuracy of gene prediction programs

Page 14: Gene prediction roderic guigó i serra IMIM/UPF/CRG

• rosetta (Batzoglou et al., 2000)

• cem (Bafna and Huson, 2000)

• sgp1 (Wiehe et al., 2000)

• twinscan (Korf et al., 2001)

• slam (Pachter et al., 2001)

• sgp2 (Guigó et al., in preparation)

comparative gene prediction

Page 15: Gene prediction roderic guigó i serra IMIM/UPF/CRG

Query sequence

tblastx HSPs

geneid exons

HSP projections

SGP exons

syntenic gene prediction (sgp2)
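The tracks on this slide (query sequence, tblastx HSPs, geneid exons, HSP projections, SGP exons) suggest a two-step idea: collapse the similarity hits onto the query axis, then use them to rescore the ab initio exons. The following is only a minimal sketch of that idea under stated assumptions, not the sgp2 implementation. The function names, the max-per-position projection rule, the mean-support rescoring, and the `weight` parameter are all hypothetical choices made for illustration.

```python
def project_hsps(hsps, length):
    """Collapse overlapping HSPs onto the query axis.

    hsps: iterable of (start, end, score), 0-based, end exclusive.
    Returns a per-position profile holding the best HSP score at each base
    (assumed projection rule, for illustration only).
    """
    profile = [0.0] * length
    for start, end, score in hsps:
        for pos in range(max(start, 0), min(end, length)):
            profile[pos] = max(profile[pos], score)
    return profile


def rescore_exons(exons, profile, weight=0.1):
    """Add similarity support to ab initio exon scores.

    exons: iterable of (start, end, score), 0-based, end exclusive.
    The bonus is weight * mean projected HSP score over the exon
    (hypothetical scoring scheme).
    """
    rescored = []
    for start, end, score in exons:
        support = sum(profile[start:end]) / (end - start)
        rescored.append((start, end, score + weight * support))
    return rescored


# Toy data: two overlapping HSPs and two candidate exons.
hsps = [(10, 20, 50.0), (15, 30, 80.0)]
exons = [(10, 20, 2.0), (32, 38, 3.0)]
profile = project_hsps(hsps, length=40)
rescored = rescore_exons(exons, profile)
```

In this toy run the first exon gains score because it lies under the projected HSPs, while the second exon, which has no similarity support, keeps its original score.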

Page 16: Gene prediction roderic guigó i serra IMIM/UPF/CRG

benchmarking sgp2 - accuracy

scimog

mit

Page 17: Gene prediction roderic guigó i serra IMIM/UPF/CRG

Predicting “novel” genes in the human genome

golden path annotations

additional blastn matches to ENSEMBL + REFSEQ

tblastx

geneid exons

tblastx

sgp genes

Golden Path Oct 7, 2000 freeze, RepeatMasked

TraceDB, as of February 2001

Page 18: Gene prediction roderic guigó i serra IMIM/UPF/CRG

“novel” genes ?

• 48,890 genic regions (known genes or similar)

• 15,489 genes longer than 100 aa predicted by sgp

• 13,302 non-redundant predictions

• 8,416 supported by tblastx hits to mouse 1.5

• 3,331 predicted genes with at least two exons supported by tblastx hits

• + 719 predicted genes supported by tblastx hits covering at least 75% of the prediction

4,050 supported sgp predictions

25% of them not overlapping genscan predictions
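The counts on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers shown above. The variable names are illustrative, and the derived fractions are simple ratios computed here, not figures reported on the slide.

```python
# Counts taken from the slide.
non_redundant = 13_302          # non-redundant sgp predictions
with_mouse_hits = 8_416         # supported by tblastx hits to mouse
two_exons_supported = 3_331     # at least two exons supported by tblastx hits
coverage_supported = 719        # hits covering at least 75% of the prediction

# The two supported classes add up to the slide's total.
supported = two_exons_supported + coverage_supported

# Derived ratios (computed here, not stated on the slide).
fraction_with_hits = with_mouse_hits / non_redundant
fraction_supported = supported / non_redundant

# "25% of them not overlapping genscan predictions", as an approximate count.
approx_not_in_genscan = 0.25 * supported

print(supported)
print(f"{fraction_with_hits:.1%} of non-redundant predictions have mouse hits")
print(f"{fraction_supported:.1%} pass the stricter support criteria")
print(f"about {approx_not_in_genscan:.0f} supported predictions fall outside genscan")
```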

Page 19: Gene prediction roderic guigó i serra IMIM/UPF/CRG

validation of predictions

EST identity 18%

NR similarity 31%

CDD (NCBI) 24%

Mouse ESTs 28%

Rat ESTs 19%

Tetraodon 15%

at least one of the above 56%

Page 20: Gene prediction roderic guigó i serra IMIM/UPF/CRG

Experimental validation

Page 21: Gene prediction roderic guigó i serra IMIM/UPF/CRG

chr22

chr21

human genome vs. Mouse traceDB

Page 22: Gene prediction roderic guigó i serra IMIM/UPF/CRG

              SN    SP    CC    SNe   SPe   SNSP  ME    WE
chr22.assem.  0.87  0.65  0.75  0.69  0.54  0.62  0.14  0.33
chr22.shot.   0.82  0.66  0.72  0.63  0.54  0.58  0.20  0.31

human genome vs. Mouse assemblies
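The column headers in this table are the usual gene prediction accuracy measures. As a hedged sketch, assuming the standard definitions (SN and SP at the nucleotide level, CC as the correlation coefficient, SNe and SPe at the exon level, SNSP as their average, ME and WE as missing and wrong exons), they could be computed as follows. The function names and the toy data are illustrative and are not taken from the slides. Note that SP here is the fraction of predicted coding bases that are truly coding, which is the convention in this field.

```python
from math import sqrt


def nucleotide_accuracy(tp, tn, fp, fn):
    """Nucleotide-level SN, SP and CC from coding/non-coding base counts."""
    sn = tp / (tp + fn)  # true coding bases that were predicted as coding
    sp = tp / (tp + fp)  # predicted coding bases that are truly coding
    cc = (tp * tn - fn * fp) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual, predicted):
    """Exon-level measures from two collections of (start, end) exons."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted  # both boundaries exactly right
    sne = len(correct) / len(actual)
    spe = len(correct) / len(predicted)
    missing = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    return {
        "SNe": sne,
        "SPe": spe,
        "SNSP": (sne + spe) / 2,
        "ME": len(missing) / len(actual),
        "WE": len(wrong) / len(predicted),
    }


# Toy example (made-up coordinates).
actual = [(1, 10), (20, 30), (40, 50), (60, 70)]
predicted = [(1, 10), (20, 28), (80, 90)]
exon_stats = exon_accuracy(actual, predicted)
sn, sp, cc = nucleotide_accuracy(tp=80, tn=900, fp=20, fn=20)
```

Under these definitions the SNSP column agrees with the table above: (0.69 + 0.54) / 2 = 0.615 and (0.63 + 0.54) / 2 = 0.585, which match the reported 0.62 and 0.58 up to rounding of the inputs.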

Page 23: Gene prediction roderic guigó i serra IMIM/UPF/CRG

                 chr22   chr21
Predicted          776     420
known             -655    -326
low complexity     -25      -5
short              -26     -11
intronless         -19     -34
                    45      36

testing novel predictions experimentally

In total, 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.

Page 24: Gene prediction roderic guigó i serra IMIM/UPF/CRG

                                 N    Success rate
Positive controls
  refseq                        78    96%
  Known tissue-specific genes   20    25%
  Low expressing genes          13    Not ready
  Twinscan with EST support           Not ready
Test sets
  Twinscan                            Not ready
  SGP                           40    28%

preliminary results

Page 25: Gene prediction roderic guigó i serra IMIM/UPF/CRG

acknowledgments

IMIM-UPF-CRG, Barcelona

• Josep F. Abril

• Genís Parra

• Roderic Guigó

GlaxoSmithKline, King of Prussia

• Pankaj Agarwal

Max Planck Institute for Chemical Ecology, Jena

• Thomas Wiehe

Whitehead Institute/MIT Center for Genome Research, Cambridge

• Gwen Acton

• Dan Brown

• Kerstin

Mouse Sequence Consortium