Genome, Protein and Model Organism Databases
Anne Estreicher, Swiss-Prot Group
Swiss Institute of Bioinformatics, Geneva – Switzerland
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009
Outline
1. Introduction (definitions, history…)
2. From DNA sequence to genomic tools
3. The flow of information: from DNA to proteins
4. Protein sequence databases
5. MODs at a glance
What is a database?
• A collection of related data, which are:
– structured
– searchable
– updated periodically
– cross-referenced
• Also includes the associated tools necessary for access/query, download, etc.
Why do we need databases?
Data need to be stored, curated and made available for analysis and knowledge discovery
Efficient way of sharing data, independently of regular publications
Essential resources for both experimental and computational biologists
Databases in biology: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure: the first protein sequence "database", by Margaret Dayhoff, contained 65 proteins
• Mid 70s Improvements in DNA sequencing
• 1979 Los Alamos Sequence Library (Walter Goad)
• 1980 ~ 80 genes fully sequenced
-> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
-> ARCHIVE
-> RACE for the central position in life sciences… And the winner is…
Databases: not a new issue…
EMBL-Bank - Europe 1980
GenBank - USA 1982
DDBJ - Asia 1986
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
www.insdc.org
EMBL-Bank - GenBank - DDBJ
• Main resources for DNA and RNA sequences;
• Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications:
“Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.”
1. True for nucleic acid, not for protein sequences;
2. Not always put into practice
=> Sequences that are not submitted are LOST!!!
• Archives (primary databases) => data belong to the submitters
• Minimal checks, such as vector contamination
• Annotation by the submitters
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA
• 1986 DDBJ – DNA
-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE
• 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases
The Swiss-Prot concept
Non-redundant: protein products of 1 gene / 1 species -> 1 entry,
Manually annotated (=> curator judgement on data!),
Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org).
Databases: not a new issue…
• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA; Protein Information Resource (PIR) – protein sequences
• 1986 DDBJ – DNA; Swiss-Prot – protein sequences
• 1996 TrEMBL (Translated EMBL) – protein sequences. Complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION!
[Figure: UniProtKB/Swiss-Prot growth, number of entries (0 to 500'000) versus release number (2 to 57)]
1986: 3’939 entries
1996: creation of TrEMBL (Swiss-Prot: 52’205 entries; TrEMBL: 61’137 entries)
Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
[Figure: UniProtKB growth, number of entries (0 to 9'000'000) versus release number, 1986 - 1996 - 2009; TrEMBL: automated curation, Swiss-Prot: manual curation]
TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
TrEMBL growth (sequences/day): 2004: 1’500; 2006-2007: 3’500; 2008: >5’000; 2009: ~8’000
New challenge
Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data;
Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.
Complex system
(R)evolution of these last 20 years
List of parts
??
Science (1993) 262, 502
Danger !
EMBL Database Growth
http://www.ebi.ac.uk/embl/Services/DBStats/
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months, 374 new genomes and 77 were completed
~ 100 genomes/month (in 2008 -> ~50 genomes/month)
+ ~2’360 viral (& viroid) genomes => Total ~ 5’600 genomes
http://genomesonline.org/index2.htm
http://www.genomesonline.org/gold.cgi
Metagenomics: study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter)
• Whale fall
• Soil, sand beach, New-York air, …
• Human fluids, mouse gut
• …
Venter’s Sorcerer II
Flood in the world of proteins…
1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
July 2009: ~ 20 million unique protein sequences (source UniParc - http://www.uniprot.org/uniparc/)
UniParc: non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome database (SGD), TAIR Arabidopsis thaliana Information Resource, TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).
New challenge
Flood of data
Flood of databases…
The first NAR issue of the year is always dedicated to databases, and a "clean" list of databases is provided (! not exhaustive !)
The NAR Online Molecular Biology Database collection in 2009
A total of 1’170 databases (19 obsolete removed)
The "clean" list of databases can be found in the NAR online molecular biology database collection:
http://www.oxfordjournals.org/nar/database/a/
Most recent NAR paper about the database (not available for all databases; some are described in other journals)
BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways
• Databases of protein interactions
• Databases of taxonomy
• …
Databases containing sequences or data directly derived from sequences.
DNA sequences & genomic tools
What? Where? How?
NCBI, UCSC
GenBank entry AF415175
http://www.ncbi.nlm.nih.gov/nuccore/16589063
• Accession number: stable accession number (should always be cited in publications)
• Molecule type. Possible molecule types: genomic DNA and RNA; mRNA; other DNA and RNA; rRNA; transcribed RNA; tRNA; unassigned DNA and RNA; viral cRNA
• Date of submission
• Definition
• Taxonomy
• References
• Features: information provided by the submitter; may include annotation of the sequence (organism, molecule type, chromosomal location, tissue type, gene name, CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number))
• Protein sequence: gives access to the nucleic acid sequence of the CDS (not of the entire mRNA)
• Nucleotide sequence
"Features" may provide much more information, depending upon the sequence and the submitter…
3’end of chromosome Y
EMBL #AJ271736
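The accession numbers quoted above (AF415175, AJ271736) share a simple letters-plus-digits shape. The sketch below checks that shape and splits off an optional version suffix. The pattern is an assumption covering only the classic nucleotide formats (1 letter + 5 digits, or 2 letters + 6 digits); other formats exist and are not handled, and `split_accession` is a hypothetical helper name.

```python
import re

# Assumed pattern (not taken from the lecture): 1 letter + 5 digits or
# 2 letters + 6 digits, optionally followed by ".<version>".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(text):
    """Return (accession, version or None), or None if the text does not match."""
    match = ACCESSION_RE.match(text.strip().upper())
    if match is None:
        return None
    version = int(match.group(2)) if match.group(2) else None
    return match.group(1), version


if __name__ == "__main__":
    for candidate in ("AF415175", "AJ271736.1", "not-an-accession"):
        print(candidate, "->", split_accession(candidate))
```

Citing the bare accession keeps a reference stable across updates, while the version suffix pins one exact sequence.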
Very similar view, links and options from the 3 sites: EMBL-Bank – GenBank – DDBJ
http://www.ddbj.nig.ac.jp/
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
How to find a DNA sequence at the NCBI…
http://www.ncbi.nlm.nih.gov/
Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others
=> Maximal interconnectivity
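Entrez records can also be retrieved by script through NCBI's E-utilities, which the lecture does not cover. The sketch below only builds an `efetch` URL for a nucleotide record and makes no network call. The base URL and the `db`, `id`, `rettype` and `retmode` parameters are given as I understand the E-utilities interface and should be checked against the current NCBI documentation; `efetch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed E-utilities base URL; verify against the NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build (but do not send) an efetch request URL for one record."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # GenBank flat file for the entry shown in the lecture.
    print(efetch_url("AF415175"))
```

Restricting `db` to a single database is one way around the "search everywhere" behaviour of the web interface.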
Simple search with an EMBL-Bank/GenBank/DDBJ accession number
Searching from a bibliographic reference…
Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records
Search result 1 -> corresponds to the RefSeq database…
RefSeq (Reference Sequence)
• Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins;
• Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)
• Some entries based on predictions (accession: XM_; XR_; XP_; ZP_);
• Currently, 8'665 species represented;
• Annotation: Manual annotation (only in entries tagged as "reviewed"); Collaboration; Propagation from other sources; Computation.
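The accession prefixes listed above make it easy to flag prediction-based RefSeq entries automatically. A minimal sketch, using only the four prefixes named on the slide; the function names are hypothetical and all other prefixes are simply treated as "not flagged".

```python
# Prefixes the slide lists as marking entries based on predictions.
PREDICTED_PREFIXES = {"XM", "XR", "XP", "ZP"}


def refseq_prefix(accession):
    """Return the two-letter prefix of a RefSeq-style accession such as NM_015595."""
    prefix, separator, _ = accession.partition("_")
    if not separator or len(prefix) != 2:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    return prefix.upper()


def is_prediction_based(accession):
    """True if the accession carries one of the prediction prefixes."""
    return refseq_prefix(accession) in PREDICTED_PREFIXES


if __name__ == "__main__":
    for acc in ("NM_015595", "NP_612564", "XM_000001"):  # last one is invented
        print(acc, is_prediction_based(acc))
```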
RefSeq (Reference Sequence)
Status code: CURATION
GENOME ANNOTATION: No
INFERRED: No
MODEL: No
PREDICTED: No
PROVISIONAL: No
REVIEWED: Yes (sequence + functional information and features)
VALIDATED: Yes (initial sequence)
WGS: No
RefSeq entry NM_015595: SGEF mRNA
Accession number, definition, taxonomy, list of references
Gene name, exon annotation, CDS annotation and sequence
Sequence
Searching with the gene name…
Etc.
GenBank
RefSeq
NCBI Entrez system: looks for the request in all NCBI databases
Cannot be ignored -> no simple way to search only in your favourite NCBI database
Searching using BLAST…
RefSeq
UniGene: clusters of transcript sequences that appear to come from the same transcription locus
!? UniSTS:62643 maps to multiple loci in Homo sapiens
Information on tissue expression
Map viewer:
• UniGene
• Mapping of known genes
• Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)
• Mapping of RefSeq RNA
This default view can be customized:
1. Choose the desired option;
2. Add it (and/or remove undesired ones);
3. Apply the new display.
Zoom out -> a better view of the genomic context of the sequence of interest
Original view
Map viewer: ~ 110 organisms represented in Genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
Genomic tools on the UCSC server: BLAT search
And: A. gambiae, A. mellifera, S. cerevisiae
a total of 47 organisms
http://genome.ucsc.edu/cgi-bin/hgBlat
Feb. 2009 assembly: not all data implemented! It may be better to use the former assembly for the time being.
Genome browser @ UCSC
cDNA sequence
Chromosomal location
Consensus CDS & other sequences from reliable resources
gDNA sequence
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation):
* European Bioinformatics Institute (EBI)
* National Center for Biotechnology Information (NCBI)
* Wellcome Trust Sanger Institute (WTSI)
* University of California, Santa Cruz (UCSC)
Currently available only for human and mouse genomes (July 2009):
20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
(Human) ESTs (including unspliced)
(Human) spliced ESTs
(Human) mRNAs
All sequences can be retrieved
The view can be completely customized…
…including with various tools allowing comparative genomics
…and including your own data!
http://genome.ucsc.edu/
Back to the BLAT viewer
Arrows >>>> show the direction of transcription
2 transcripts from the same locus: BDNF (Brain-Derived Neurotrophic Factor) and BDNFOS (BDNF Opposite Strand)
Exons
View of alternative exons
Alternative exons
Constitutive exons
Interested in this exon? Just zoom in…
The Genome browser @ UCSC has many great options, give it a try!
http://genome.ucsc.edu/
Typical problems, or why wonderful tools will never replace the brain of a life scientist!
… Once upon a time, there was a gene on chromosome 11…
2 essential genome resources are missing from this lecture:
Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes;
Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).
Please go and visit them!
The flow of information
From DNA sequences to protein sequences:
A little biology and a few databases
From genome to proteome: the example of human
Genome: ~ 20’500 human protein-encoding genes
-> Increase in complexity 2-5 x: alternative promoter usage, alternative splicing, trans-splicing, mRNA editing, …
Transcriptome: ~ 100’000 human transcripts
-> Increase in complexity 5-10 x: post-translational modifications (PTMs)
Proteome: ~ 1'000'000 human proteins
Most PTMs cannot be predicted from DNA sequences
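The human figures quoted above can be checked with a line of arithmetic per step. The sketch below computes the fold increase from genes to transcripts and from transcripts to proteins. The three counts are the approximate values from the slide; the variable and function names are illustrative.

```python
# Approximate human counts quoted in the lecture.
GENES = 20_500
TRANSCRIPTS = 100_000
PROTEINS = 1_000_000


def fold_increase(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


if __name__ == "__main__":
    # Roughly 4.9-fold: inside the quoted 2-5 x range.
    print(f"genes -> transcripts: {fold_increase(GENES, TRANSCRIPTS):.1f} x")
    # 10-fold: the upper end of the quoted 5-10 x range.
    print(f"transcripts -> proteins: {fold_increase(TRANSCRIPTS, PROTEINS):.1f} x")
```

Taken together, the two steps give roughly a 50-fold increase from protein-encoding genes to protein species.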
The hectic life of a protein sequence…
cDNAs, ESTs, genomes, …
Data not submitted to public databases, delayed or cancelled…
Nucleic acid databases: EMBL, GenBank, DDBJ (www.insdc.org, International Nucleotide Sequence Database Collaboration)
Nucleic acid databases -> protein sequence databases …if a Coding Sequence (CDS) is submitted
no CDS -> gene prediction: RefSeq, Ensembl + some MODs
Protein sequence databases also receive sequences from publications (journal scan) and direct submissions
!!!! 99% of the protein sequences found in databases come from the translation of nucleotide sequences => Experimental evidence may be lacking!
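Because most protein sequences in databases are conceptual translations of a CDS, it helps to see how mechanical that step is. The sketch below translates a CDS with the standard genetic code. Assumptions: only the standard code is used (alternative codes, such as mitochondrial ones, are ignored), the CDS starts in frame, translation stops at the first stop codon, and the example sequences are invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    protein = []
    for start in range(0, len(cds), 3):
        # Codons containing ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCAAATGA"))  # invented 4-codon CDS
```

Nothing in this step confirms that the protein is actually made, which is the point of the warning above.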
EMBL (DNA): translated CDS, product name, tissue, reference
TrEMBL (Translated EMBL): translated CDS, protein name, reference + tissue
Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation.
A similar pipeline is used at the NCBI to go from GenBank to GenPept
!!!! The quality of UniProtKB/TrEMBL (& GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.
EMBL -> TrEMBL -> Swiss-Prot
Swiss-Prot: manual annotation of the sequence and review of associated biological information
• Protein name
• Many more references
• Translated CDS + SAPs + isoforms + …
• Full annotation
Sequence, sequence features, ontologies, references, nomenclature, splice variants, annotations
Evidence for protein existence: annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
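The five evidence levels form an ordered scale, which makes them convenient to filter on in scripts. A minimal sketch encoding them as an enumeration; the class and function names are hypothetical, and treating levels 1 and 2 as "direct" evidence is an illustrative choice rather than a UniProt definition.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five UniProtKB evidence levels, in the order given in the lecture."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def has_direct_evidence(level):
    """Illustrative cut-off: evidence at protein or transcript level."""
    return level <= ProteinExistence.TRANSCRIPT_LEVEL


if __name__ == "__main__":
    for level in ProteinExistence:
        print(int(level), level.name, has_direct_evidence(level))
```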
http://www.uniprot.org/uniprot/P35613
http://www.uniprot.org/uniprot/Q9Y471
UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
Protein-protein interaction dbs: DIP, IntAct
Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
Polymorphism dbs: dbSNP
Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
Ontologies: GO
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
The UniProt consortium
• Swiss Institute of Bioinformatics
• European Bioinformatics Institute (European Molecular Biology Laboratory)
• Protein Information Resource
UniProt mission:
Provide a comprehensive high-quality and freely accessible resource of protein sequence and functional annotation.
New release every 3 weeks
Update frequency: a crucial issue!!
• Sometimes very difficult, or even impossible, to find;
• Crucial not only for the database itself, but also for tools using databases.
Update frequency
http://www.matrixscience.com/search_intro.html
Mascot MS/MS identification tool is fine, but it cannot be used from this website!
Solution: Download the database of interest and make sure you work with an up-to-date version.
Never hesitate to ask for an update
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)
UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)
A UniParc entry contains all records for a unique sequence in major publicly available databases.
TrEMBL entry merged into Swiss-Prot => it does not exist anymore
UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100 8’474’689 entries; UniRef90 5’668'669 entries; UniRef50 2'729'565 entries)
UniRef100, 90 and 50
One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl & RefSeq).
One UniRef90 entry -> sequences that are at least 90% identical. Built from UniRef100.
One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.
UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (6'028'191 entries)
What is "Non-Redundancy"?
• UniParc
– One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).
• UniRef
– One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
• UniProtKB/Swiss-Prot
– One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors…
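The difference between these notions of non-redundancy comes down to the key used to group records. The sketch below groups a handful of invented records by exact sequence (the UniParc idea) and by gene plus species (the Swiss-Prot idea). All record values and function names are made up for illustration, and real Swiss-Prot merging relies on curator judgement rather than a dictionary key.

```python
from collections import defaultdict

# Invented records: (source database, identifier, gene, species, sequence).
RECORDS = [
    ("DB_A", "A1", "geneX", "Homo sapiens", "MKTLLV"),
    ("DB_B", "B7", "geneX", "Homo sapiens", "MKTLLV"),  # identical sequence
    ("DB_C", "C3", "geneX", "Homo sapiens", "MKTLIV"),  # variant of geneX
    ("DB_A", "A2", "geneY", "Mus musculus", "MSSPQR"),
]


def group_by_sequence(records):
    """UniParc-style: one group per 100% identical full-length sequence."""
    groups = defaultdict(list)
    for _source, identifier, _gene, _species, sequence in records:
        groups[sequence].append(identifier)
    return dict(groups)


def group_by_gene_and_species(records):
    """Swiss-Prot-style: one group per gene in one species."""
    groups = defaultdict(list)
    for _source, identifier, gene, species, _sequence in records:
        groups[(gene, species)].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_sequence(RECORDS))
    print(group_by_gene_and_species(RECORDS))
```

The variant record gets its own sequence-level group but joins the geneX group at the gene level, which is why a single Swiss-Prot entry can cover several UniParc entries.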
Comparing searches: NCBI and UniProt
Search for the human Toll-like receptor 4 in Entrez Protein (NCBI)
Sequences retrieved in Entrez Protein:
• Swiss-Prot: O00206
• RefSeq: NP_612564*
• GenPept (identical sequences): AAC34135, CAH72619
• GenPept (identical sequences): AAF05316, BAG55035, CAH72618, AAI17423, AAF89753
*Based on A126770, BC117422, AL160272 and AA598398
Search for the human Toll-like receptor 4 in UniProtKB
• Swiss-Prot
Major protein sequence resources
UniProtKB: Swiss-Prot + TrEMBL
Entrez Protein: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
UniProtKB/Swiss-Prot: manually annotated protein sequences (~12’000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL); automated annotation (~202’000 species)
GenPept: submitted CDS (GenBank)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
Resources integrated in the entries
Resources integrated in the search engine
Model Organism Databases (MODs) at a glance
Model organism
Species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
Model organism | MOD | URL
Mus musculus | MGI | http://www.informatics.jax.org/
Rattus norvegicus | RGD | http://rgd.mcw.edu/
Oryza sativa | RAP-DB | http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana | TAIR | http://www.arabidopsis.org/
Drosophila melanogaster | FlyBase | http://flybase.org/
Schizosaccharomyces pombe | S. pombe GeneDB | http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae | SGD | http://www.yeastgenome.org/
Caenorhabditis elegans | WormBase | http://www.wormbase.org/
Dictyostelium discoideum | dictyBase | http://dictybase.org/
Bacillus subtilis | SubtiList | http://genolist.pasteur.fr/SubtiList/
Escherichia coli | EcoGene | http://ecogene.org/
Danio rerio (zebrafish) | ZFIN | http://zfin.org/
Just a few examples, not an exhaustive list!
Methanocaldococcus jannaschii -> no MOD
Model organism databases (MODs)
• Genome annotation;
• Gene models;
• Gene mapping;
• Official nomenclature;
• Gene expression;
• Functional annotation;
• Interactions;
• Information about mutants/knockout/transgenic animals;
• Phenotypes;
• (Cross-)references;
• Species-specific reagents…
Key resources for information on a given organism
Service provided to/from a given community
MODs do not necessarily store sequences, but give access to them
Link to cDNA sequences
http://gmod.org/wiki/Main_Page
The world of databases is a jungle
A few points to remember when using databases
- Content;
- Primary / secondary / meta-databases;
- Curated / non-curated;
- Manual / automated curation;
- Redundant / non-redundant;
- Update frequency;
- Stable identifiers;
- Strategy;
- Dataflow;
- Collaborations between databases.
Test a few genomic databases and tools
Genomes and genomic tools: a few sites
NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI: http://www.ebi.ac.uk/genomes/
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, …
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools
Generic Model Organism Database: http://gmod.org/wiki/Main_Page
Genomes and genomic tools: Hands-on
• Find your favorite (completely sequenced) organism in a genome db;
• Follow the links to see the options on different sites;
• Find the sequences;
• Look at the annotation of your favorite gene;
• Compare the entries corresponding to this gene across sites;
• Test search engines (restrict searches, compare results, …)
Whenever possible use on-line tutorials, such as: http://www.ensembl.org/info/website/tutorials/index.html
Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
Play around with the BLAT search, customize display, follow the links, …
Genomes and genomic tools: Hands-on
Go and visit databases cited in this lecture;
The databases/tools that should be "familiar" to all are:
http://genome.ucsc.edu/cgi-bin/hgBlat
http://www.ensembl.org/index.html
gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
If none of the databases is of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find the databases that are closest to your interests;
Play around…
Hands-on protein sequence databases and UniProt: http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
Thank You!