Nikolaj Blom, Center for Biological Sequence Analysis, BioCentrum-DTU
DESCRIPTION
”Resources of Biomolecular Data: Sequences, Structures and Functionality”, PhD course #27803. Nikolaj Blom, Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, nikob@cbs.dtu.dk. Outline: Magnitudes and Scales; Resources: Data Sources & Tools.
TRANSCRIPT

Center for Biologisk Sekvensanalyse
Nikolaj Blom
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
”Resources of Biomolecular Data: Sequences, Structures and Functionality”
PhD course #27803

Outline
Magnitudes and Scales
Resources: Data Sources & Tools
• Primary DNA sources
• Sequence Repositories
• Structure Repositories
• Functional Categorization
• Integration of Databases
• The Human Genome
• Genome Browsers
• Prediction Tools
• Evaluation of Prediction Servers
Starting points
• Link collections

Resources: Sources & Tools
There are A LOT OF biomolecular databases/sources
A LOT OF overlap of information/redundancy
A LOT OF TOOLS
Personal picks/preferences
• User-friendliness
• Update intervals
• Curation efforts / error correction
• Linkage to other DBs

Faster than Moore’s law...

Human Genome Published
HUGO: Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001

Magnitudes and Scales
Human genome: 3,200,000,000 bp
• Single base pair → full genome is 9 orders of magnitude
Genome = Football field: ~3 billion leaves of grass
Single base A T G C (or SNP) = 1 leaf of grass
Genome browsing
• Zooming from whole stadium to single leaf
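
The "9 orders of magnitude" figure can be checked with a line of arithmetic. The sketch below is only an illustration: the genome size is the one quoted on the slide, while the tenfold zoom step is an assumed value and not a property of any particular genome browser.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size quoted on the slide

# Orders of magnitude separating a single base pair from the whole genome.
orders = math.log10(GENOME_BP)
print(f"log10({GENOME_BP:,}) = {orders:.2f}")

# Assumed zoom factor per step (hypothetical, for illustration only).
ZOOM_FACTOR = 10
zoom_steps = math.ceil(math.log(GENOME_BP, ZOOM_FACTOR))
print(f"{zoom_steps} zoom steps of {ZOOM_FACTOR}x span genome to single base")
```

With 3.2 billion bp the logarithm lands between 9 and 10, which is why "whole stadium to single leaf" is a fair picture of genome browsing.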

How we got the sequence
Sanger chain termination method

Primary DNA sources
Trace files repositories
Single read: 500-1000 bp (~golf ball size / jigsaw puzzle)
Variable quality
• WashU-Merck Human EST Project / Trace files
• ”Base-calling” non-trivial

Assembly is Non-trivial!
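
To give a feel for the jigsaw puzzle, here is a toy greedy overlap merger. It is a minimal sketch under strong assumptions (error-free reads, one strand only, no repeats, made-up sequences), not the method used by any genome project; real reads break every one of these assumptions, which is exactly why assembly is non-trivial.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Hypothetical error-free reads, invented for this example.
reads = ["ATGCGT", "CGTACC", "ACCTGA"]
print(greedy_assemble(reads))
```

Sequencing errors, repeats longer than a read, and reads from the opposite strand all defeat this simple scheme.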

Sequence repositories - GenBank et al.
GenBank / EMBL / DDBJ
• Highly redundant (many versions of same gene)
• Cross-updated daily
• Version history is recorded
• Previous sequence records can be retrieved
• Contigs/HTGS (100-200 kb) finishing at different stages
• Draft → Finished
• Includes genomic DNA, cDNA, ESTs, translated peptides

Non-redundant and Curated databases
Non-redundant
• Manual or automatic curation
• DNA
  • RefSeq (NCBI; semi-automated)
  • Ensembl gene index (automated)
• Protein
  • RefSeq (NCBI; semi-automated)
  • TrEMBL (EMBL; automated)

Curated database: UniProt/SwissProt
SIB - Swiss Institute of Bioinformatics
Protein Knowledgebase / Sequence Database
• Highly curated
• Experimental evidence evaluated (e.g. modifications)
• All 80,000 entries checked by Amos Bairoch himself ;-)
ExPASy - Expert Protein Analysis System
• Proteomics tools: links + local servers

Structure databases / Protein Data Bank (PDB)
X-ray, NMR biomolecular structures
Protein Data Bank (PDB)
>22,000 structures (April 2003)
http://www.rcsb.org/pdb/

Functional Categorization
Gene Ontology (GO)
• Hierarchical
• Controlled vocabulary

Functional Categorization
Gene Ontology (GO) http://www.geneontology.org/
• Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
• Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
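
What "hierarchical" buys in practice: a gene product annotated with a specific term is implicitly annotated with every broader term above it. The sketch below uses a tiny hand-made hierarchy; the term names echo the examples on the slide, but the parent links are assumptions made for illustration and are not copied from the actual GO files.

```python
# Toy GO-like hierarchy: term -> list of parent terms.
# Parent links are illustrative assumptions, not real GO relationships.
PARENTS = {
    "biological process": [],
    "cell cycle": ["biological process"],
    "mitosis": ["cell cycle"],
    "metabolism": ["biological process"],
    "purine metabolism": ["metabolism"],
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(ancestors("mitosis"))
```

A query for everything annotated to "cell cycle" would therefore also pick up gene products annotated only to "mitosis".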

Integration of databases - Webs of web-sites
Links, links, links...
SRS = Sequence Retrieval System
• Powerful, complex query language
BioDAS – Distributed Annotation System
http://srs.ebi.ac.uk/

For ’my gene’, how do I:
Get an overview of the sequence information known? (GeneCards)
Examine the ’Genome Neighbourhood’? (Genome Browsers)
Predict protein post-translational modifications (PTMs)? (Prediction servers)
• (Evaluate the value of predicted features)

GeneCards http://nciarray.nci.nih.gov/cards/

GeneCards-II

GeneCards-III

GeneCards-IV

GeneCards-V

Genetic/Medical Information
OMIM, Online Mendelian Inheritance in Man (NCBI)
• The OMIM database is a catalog of human genes and genetic disorders
• >13,000 entries (April 2002)
• Examples: cystic fibrosis, prions, amyloid precursor protein
• Condensed, highly curated descriptions of genetics/disease/animal models/references

OMIM-I (http://www3.ncbi.nlm.nih.gov/Omim/)

OMIM-II

OMIM-III

For ’my gene’, how do I:
Get an overview of the sequence information known? (GeneCards)
Examine the ’Genome Neighbourhood’? (Genome Browsers)
Predict protein post-translational modifications (PTMs)? (Prediction servers)
• (Evaluate the value of predicted features)

Genome Browsing
Three public
• Open access
• Use same genome build/assembly
• NCBI (U.S.)
• UCSC (Santa Cruz, U.S.)
• EnsEmbl (EBI, EU)
One private
• Restricted, commercial
• Academic, free usage: 1 Mbase/week
• Proprietary assembly
• Celera Genomics (U.S.)

Celera Human/Mouse Genomes

Genome Browsers - Portals to the Genomic World
NCBI – National Center for Biotechnology Information (U.S.)
• http://www.ncbi.nlm.nih.gov/Genomes/index.html
UCSC – Univ. California – Santa Cruz (U.S.)
• http://genome.ucsc.edu/
EnsEmbl – European Molecular Biology Laboratory (E.U.)
• http://www.ensembl.org/

NCBI

UCSC – Genome Browser

UCSC – Genome Browser II

EnsEmbl – Genome Browser

For ’my gene’, how do I:
Get an overview of the sequence information known? (GeneCards)
Examine the ’Genome Neighbourhood’? (Genome Browsers)
Predict protein post-translational modifications (PTMs) or Gene Structure? (Prediction servers)
• ...and evaluate the reliability of prediction methods

CBS Services/Toolbox http://www.cbs.dtu.dk/services/

NetPhos – a prediction server
http://www.cbs.dtu.dk/services/NetPhos/

Evaluating Prediction Servers
Performance on independent/cross-validated data presented?
Published in peer-reviewed journal?
Cited by others?
• Science Citation Index
Linked to from credible web sites?
• Google Page-rank
• ”link:URL” search
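
When a server does report performance on independent data, the numbers usually boil down to a comparison of predicted and experimentally observed sites. The sketch below shows one common way to summarise such a comparison (sensitivity, specificity, Matthews correlation coefficient). The prediction and observation lists are invented for illustration, and the choice of measures is an assumption, not the evaluation protocol of NetPhos or any other specific server.

```python
import math


def confusion_counts(predicted, observed):
    """Count true/false positives and negatives for paired boolean labels."""
    pairs = list(zip(predicted, observed))
    tp = sum(p and o for p, o in pairs)
    fp = sum(p and not o for p, o in pairs)
    fn = sum((not p) and o for p, o in pairs)
    tn = sum((not p) and (not o) for p, o in pairs)
    return tp, fp, fn, tn


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical site-by-site results on data NOT used to train the predictor.
predicted = [True, True, False, False, True, False, False, True]
observed = [True, False, False, False, True, True, False, True]

tp, fp, fn, tn = confusion_counts(predicted, observed)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
mcc = matthews_cc(tp, fp, fn, tn)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} MCC={mcc:.2f}")
```

The same figures computed on the training data would look better and mean far less, which is why the first question on the slide asks about independent or cross-validated data.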

2can Bioinformatics Education
At EBI – European Bioinformatics Institute
http://www.ebi.ac.uk/2can/index.html
Tutorials, resource links, etc.

Starting Points
General Bioinformatics
• NCBI, National Center for Biotechnology Information, U.S.
• EBI, European Bioinformatics Institute
Prediction Tools
• CBS, DK
• Expasy (Protein analysis), Switzerland

Dynamic Resources
Pros
• Includes most recent developments
• Updated regularly
• User interface improves (usually)
Cons
• Difficult to keep pace
• Tutorials and lectures hard to recycle ;-(
• Difficult to use at irregular intervals

Genome Browsers - Portals to the Genomic World
Three main entry points:
• NCBI, UCSC, EnsEmbl
• Essentially contain same information
• High degree of linking to secondary databases
• Advisable to become familiar with only one genome browser
• Learn to navigate and make queries
GeneCards and OMIM
• Well suited for getting a quick overview of a gene of interest

Prediction Servers
Evaluate scientific ’soundness’
• Look for indications of quality (citations, etc.)
Remember that prediction servers provide... well, predictions!
The End