genome, protein and model organism databases

Genome, Protein and Model Organism Databases

Anne Estreicher Swiss-Prot Group

Swiss Institute of Bioinformatics, Geneva – Switzerland

[email protected]

Bioinformatic and Comparative Genome Analysis Course

HKU-Pasteur Research Centre - Hong Kong, China

August 17 - August 29, 2009


Outline

1. Introduction (definitions, history…)

2. From DNA sequence to genomic tools

3. The flow of information: from DNA to proteins

4. Protein sequence databases

5. MODs at a glance

What is a database?

• A collection of related data, which are:
– structured
– searchable
– updated periodically
– cross-referenced

• Also includes the associated tools necessary for access/query, download, etc.

Why do we need databases?

Data need to be stored, curated and made available for analysis and knowledge discovery

Efficient way of sharing data, independently of regular publications

Essential resources for both experimental and computational biologists

Databases in biology: not a new issue…

• 1954 First protein sequence (insulin, by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure: the first protein sequence "database", by Margaret Dayhoff, contained 65 proteins
• Mid 70s Improvements in DNA sequencing
• 1979 Los Alamos Sequence Library (Walter Goad)
• 1980 ~80 genes fully sequenced

-> Need to store the data and to make them available for analysis (in format acceptable for human eyes and machines)

-> ARCHIVE

-> RACE for the central position in life sciences…

And the winner is…


EMBL-Bank - Europe 1980
GenBank - USA 1982
DDBJ - Asia 1986

leading to the establishment of the INSDC (International Nucleotide Sequence

Database Collaboration) -> daily exchanges of data

www.insdc.org

EMBL-Bank - GenBank - DDBJ

• Main resources for DNA and RNA sequences;

• Used to be retrieved from publications -> direct submissions from individual researchers, genome sequencing projects and patent applications:

“Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.”

1. True for nucleic acid sequences, not for protein sequences;
2. Not always put into practice

=> Sequences that are not submitted are LOST!!!

• Archives (primary databases)
• Data belong to the submitters
• Minimal checks, such as vector contamination
• Annotation by the submitters



• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA
• 1986 DDBJ – DNA

-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE

• 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases

The Swiss-Prot concept

Non-redundant: protein products of 1 gene / 1 species -> 1 entry,

Manually annotated (=> curator judgement on data!),

Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org).

Databases: not a new issue…

• 1954 First protein sequence (insulin by F. Sanger)
• 1965 Atlas of Protein Sequence and Structure (65 proteins)
• 1979 Los Alamos Sequence Library (Walter Goad) – DNA
• 1982 EMBL-Bank – DNA
• 1984 GenBank – DNA; Protein Information Resource (PIR) – protein sequences
• 1986 DDBJ – DNA; Swiss-Prot – protein sequences
• 1996 TrEMBL (Translated EMBL) – protein sequences. Complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION!

[Figure: UniProtKB/Swiss-Prot growth. Number of entries (0 to 500'000) vs. release number (2 to 57)]

1986: 3'939 entries
1996: creation of TrEMBL. Swiss-Prot: 52'205 entries; TrEMBL: 61'137 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470'369 entries

[Figure: UniProtKB growth, 1986 to 2009. Number of entries (0 to 9'000'000) vs. release number; TrEMBL: automated curation, Swiss-Prot: manual curation]

TrEMBL rel. 40.5 (07-Jul-2009): 8'594'382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470'369 entries

TrEMBL growth (sequences/day)
2004: 1'500
2006-2007: 3'500
2008: >5'000
2009: ~8'000

New challenge

Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data;

Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.

Complex system

(R)evolution of these last 20 years

List of parts

??

Science (1993) 262, 502

Danger !

EMBL Database Growth
http://www.ebi.ac.uk/embl/Services/DBStats/

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat

In 4 months, 374 new genomes and 77 were completed
~100 genomes/month
(in 2008 -> ~50 genomes/month)

+ ~2'360 viral (& viroid) genomes
=> Total ~5'600 genomes

http://genomesonline.org/index2.htm

http://www.genomesonline.org/gold.cgi

Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter)

• Whale fall

• Soil, sand beach, New-York air, …

• Human fluids, mouse gut

• …

Venter’s Sorcerer II

Flood in the world of proteins…

1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)

July 2009: ~20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)

UniParc: non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome database (SGD), TAIR Arabidopsis thaliana Information Resource, TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).

New challenge

Flood of data

Flood of databases…

NAR 1st issue of the year is always

dedicated to databases + "clean" list of

databases provided

(! not exhaustive !)

The NAR Online Molecular Biology Database collection in 2009

A total of 1’170 databases (19 obsolete removed)

http://www.oxfordjournals.org/nar/database/a/

NAR "clean" list of databases
http://www.oxfordjournals.org/nar/database/a/

Most recent NAR paper about the database

(not available for all db, some described in

other journals)

A "clean" list of databases can be found in the NAR online molecular biology database collection

http://www.oxfordjournals.org/nar/database/a/

BIOLOGICAL DATABASE CATEGORIES

• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways
• Databases of protein interactions
• Databases of taxonomy
• …

Databases containing sequences or data directly derived from sequences.

DNA sequences & genomic tools

What? Where? How?

NCBI, UCSC

GenBank entry AF415175
http://www.ncbi.nlm.nih.gov/nuccore/16589063

Accession number
Molecule type
Date of submission
Definition

Nucleotide sequence

Stable accession number (should always be cited in publications)

Possible molecule types: genomic DNA, genomic RNA, mRNA, other DNA, other RNA, rRNA, transcribed RNA, tRNA, unassigned DNA, unassigned RNA, viral cRNA


Nucleotide sequence

Taxonomy

References

Features:
Information provided by the submitter
May include annotation of the sequence

Organism
Molecule type
Chromosomal location
Tissue type
Gene name
CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number)

Protein sequence

Gives access to the nucleic acid sequence of the CDS (not of the entire mRNA)

"Features" may provide much more information, depending upon the sequence and the submitter…

3’end of chromosome Y

EMBL #AJ271736

Very similar view, links and options from the 3 sites: EMBL-Bank – GenBank – DDBJ
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
http://www.ddbj.nig.ac.jp/

How to find a DNA sequence at the NCBI…

http://www.ncbi.nlm.nih.gov/

Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html

The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others

=> Maximal interconnectivity

Simple search with an EMBL-Bank/GenBank/DDBJ accession number
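The same accession-number lookup can also be scripted. The sketch below only builds a retrieval URL and performs no network access; it assumes NCBI's E-utilities `efetch` endpoint, which is not covered in this lecture, and the helper name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities efetch endpoint (not part of the lecture).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build a URL that asks Entrez for one record in flat-file format.

    db="nuccore" targets the nucleotide database; rettype="gb" requests
    the GenBank flat file, rettype="fasta" the bare sequence.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return EUTILS_EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # AF415175 is the GenBank entry shown earlier in the lecture.
    print(efetch_url("AF415175"))
```

Opening the printed URL in a browser should return the same flat file as the web search, provided the service is reachable.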

Searching from a bibliographic reference…

Search result 1
-> corresponds to the RefSeq database…

Search results 2 and 3
-> accession numbers provided by the authors in the article
-> GenBank records

RefSeq (Reference Sequence)

• Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins;

• Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)

• Some entries based on predictions (accession: XM_; XR_; XP_; ZP_);

• Currently, 8'665 species represented;

• Annotation: Manual annotation (only in entries tagged as "reviewed"); Collaboration; Propagation from other sources; Computation.
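The accession prefixes listed above make it easy to flag prediction-based entries in a result list. This is a minimal sketch: the function name is hypothetical and only the four prefixes named on the slide are checked.

```python
# Prefixes that the slide lists as prediction-based RefSeq entries.
PREDICTED_PREFIXES = ("XM_", "XR_", "XP_", "ZP_")


def is_predicted_refseq(accession):
    """Return True if the accession starts with one of the 'predicted' prefixes."""
    return accession.upper().startswith(PREDICTED_PREFIXES)


if __name__ == "__main__":
    # NM_015595 and NP_612564 appear in this lecture; XM_000001 is a made-up example.
    for acc in ("NM_015595", "NP_612564", "XM_000001"):
        print(acc, "predicted" if is_predicted_refseq(acc) else "not flagged as predicted")
```

A prefix check says nothing about curation status; that information is carried by the status code of the entry.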

RefSeq (Reference Sequence)

Status code: CURATION
GENOME ANNOTATION: No
INFERRED: No
MODEL: No
PREDICTED: No
PROVISIONAL: No
REVIEWED: Yes, Yes (sequence + functional information and features)
VALIDATED: Yes, Yes (initial sequence)
WGS: No

RefSeq entry NM_015595: SGEF mRNA

Accession number
Definition
Taxonomy
List of references

Gene name
Exon annotation
CDS annotation and sequence


Sequence

Searching with the gene name…

Etc.

GenBank

Refseq

NCBI Entrez system: looks for the request in all NCBI databases

Cannot be ignored -> no simple way to search only in your favourite NCBI database

Searching using BLAST…

RefSeq

UniGene: clusters of transcript sequences that appear to come from the same transcription locus

!?UniSTS:62643 maps to multiple loci in Homo sapiens

Information on tissue expression

UniGene

Mapping of known genes

Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)

Mapping of RefSeq RNA

This default view can be customized:

1. Choose the desired option;
2. Add it (and/or remove undesired ones);
3. Apply the new display

Zoom out -> a better view of the genomic

context of the sequence of interest

Original view

Map viewer: ~110 organisms represented in the Genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome).

Genomic tools on the UCSC server: BLAT search

And: A. gambiae, A. mellifera, S. cerevisiae
a total of 47 organisms

http://genome.ucsc.edu/cgi-bin/hgBlat

Feb. 2009 assembly: not all data implemented! It may be better to use the former assembly for the time being.

Genome browser @ UCSC

cDNA sequence

Chromosomal location

Consensus CDS & other sequences from reliable resources

gDNA sequence

http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi

Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.

CCDS database goal: provide a standard set of gene annotations.

Collaborative project involving teams (manual and automated annotation):
* European Bioinformatics Institute (EBI)
* National Center for Biotechnology Information (NCBI)
* Wellcome Trust Sanger Institute (WTSI)
* University of California, Santa Cruz (UCSC)

Currently available only for human and mouse genomes (July 2009):
20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes

(Human) ESTs(including unspliced)

(Human) spliced ESTs

(Human) mRNAs

All sequences can be retrieved

The view can be completely customized…

…including with various tools allowing comparative genomics

…and including your own data!

http://genome.ucsc.edu/

Back to the BLAT viewer

Arrows >>>> show the direction of transcription

2 transcripts from the same locus:
BDNF (Brain-Derived Neurotrophic Factor)
BDNFOS (BDNF Opposite Strand)

View of alternative exons

Alternative exons

Constitutive exons

Interested in this exon?

Just zoom in…

The Genome browser @ UCSC has many great options, give it a try!

http://genome.ucsc.edu/

Typical problems

or

Why wonderful tools will never replace the brain of a life scientist!

… Once upon a time, there was a gene on chromosome 11…

2 essential genome resources are missing from this lecture:

Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes;

Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).

Please go and visit them!

The flow of information

From DNA sequences to protein sequences: a little biology and a few databases

From genome to proteome: the example of human

Genome: ~20'500 human protein-encoding genes
-> increase in complexity 2-5 x (alternative promoter usage, alternative splicing, trans-splicing, mRNA editing, …)
Transcriptome: ~100'000 human transcripts
-> increase in complexity 5-10 x (post-translational modifications (PTMs))
Proteome: ~1'000'000 human proteins

Most PTMs cannot be predicted from DNA sequences
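The orders of magnitude on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the 2-5 x factor applies to the gene-to-transcript step and the 5-10 x factor to the transcript-to-protein step; the helper name is hypothetical.

```python
def fold_range(base, low, high):
    """Return the (minimum, maximum) count after a low-to-high fold increase."""
    return base * low, base * high


GENES = 20_500         # ~ human protein-encoding genes (from the slide)
TRANSCRIPTS = 100_000  # ~ human transcripts (from the slide)

if __name__ == "__main__":
    # 2-5 x more transcripts than genes
    print("transcripts:", fold_range(GENES, 2, 5))
    # 5-10 x more protein forms than transcripts
    print("proteins:", fold_range(TRANSCRIPTS, 5, 10))
```

The upper bounds land close to the ~100'000 transcripts and ~1'000'000 proteins quoted above.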

The hectic life of a protein sequence…

cDNAs, ESTs, genomes, …

Data not submitted to public databases, delayed or cancelled…

Nucleic acid databases: EMBL, GenBank, DDBJ
www.insdc.org
International Nucleotide Sequence Database Collaboration

…if a Coding Sequence (CDS) is submitted -> protein sequence databases

No CDS -> gene prediction: RefSeq, Ensembl + some MODs

Sequences from publications: journal scan

Direct submissions

!!!! 99% of the protein sequences found in databases come from the translation of nucleotide sequences
=> Experimental evidence may be lacking!

EMBL (DNA) -> TrEMBL (Translated EMBL)

Translated CDS
Reference + tissue
Protein name

Translated CDS
Product name
Tissue
Reference

Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation.

A similar pipeline is used at the NCBI to go from GenBank to GenPept
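The "translated CDS" step of such a pipeline can be illustrated with a toy translator. This is a sketch under stated assumptions, not the actual TrEMBL or GenPept code: it uses the standard genetic code only, reads the CDS in frame from its first base, and the function name is hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are rendered as 'X';
    a trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```

If the submitted CDS has the wrong frame or boundaries, the translation is wrong too, which is why the quality of the original submission matters so much.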

!!!! The quality of UniProtKB/TrEMBL (& GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.

EMBL (DNA) -> TrEMBL -> Swiss-Prot

TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.

Swiss-Prot: manual annotation of the sequence and review of associated biological information

Protein name
Many more references
Translated CDS + SAPs + isoforms + …
Full annotation: sequence, sequence features, ontologies, references, nomenclature, splice variants, annotations

Evidence for protein existence: annotation in UniProtKB

5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
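When filtering entries in a script, the five levels can be handled as an ordered scale. The sketch below is illustrative only: the class and function names are hypothetical, and treating levels 1 and 2 as "experimental" is an assumption made for the example.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five evidence levels listed above; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def is_experimental(level):
    """Assumed cut-off: evidence at protein or transcript level."""
    return level <= ProteinExistence.TRANSCRIPT_LEVEL


if __name__ == "__main__":
    for level in ProteinExistence:
        print(level.value, level.name, is_experimental(level))
```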

http://www.uniprot.org/uniprot/P35613

http://www.uniprot.org/uniprot/Q9Y471

UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!

2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE

Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs

Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN

Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB

Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase

Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome

Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq

3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR

PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite

Proteomic dbs: PeptideAtlas, PRIDE, ProMEX

Protein-protein interaction dbs: DIP, IntAct

Phylogenomic dbs: HOGENOM, HOVERGEN, OMA

Polymorphism dbs: dbSNP

Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline

Ontologies: GO

Others: BindingDB, PMAP-CutDB, DrugBank, NextBio

The UniProt consortium

Protein Information Resource
European Bioinformatics Institute, European Molecular Biology Laboratory
Swiss Institute of Bioinformatics

UniProt mission:

Provide a comprehensive high-quality and freely accessible resource of protein sequence and functional annotation.

New release every 3 weeks

Update frequency: a crucial issue!!

• Sometimes very difficult, or even impossible, to find;

• Crucial not only for the database itself, but also for tools using databases.

Update frequency

http://www.matrixscience.com/search_intro.html

Mascot MS/MS identification tool is fine, but it cannot be used from this website !

Solution: Download the database of interest and make sure you work with an up-to-date version.

Never hesitate to ask for an update

UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9'232'223 entries)

UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20'070'606 entries)

A UniParc entry contains all records for a unique sequence in major publicly available databases.

TrEMBL entry merged into Swiss-Prot => does not exist anymore



UniRef: 3 sets of clusters of protein sequences with 100, 90 and 50% identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100 8'474'689 entries; UniRef90 5'668'669 entries; UniRef50 2'729'565 entries)

UniRef100, 90 and 50

One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl & RefSeq).

One UniRef90 entry -> sequences that have at least 90% identity. Built from UniRef100.

One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.



UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (UniMES 6'028'191 entries)

What is "Non-Redundancy"?

• UniParc
– One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).

• UniRef
– One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.

• UniProtKB/Swiss-Prot
– One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors…
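The three meanings of "non-redundant" can be compared on a toy data set. The sketch below is an illustration, not the way the real databases are built: all accessions, gene names and sequences are invented, fragments are detected by plain substring matching, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (source database, accession, gene, species, sequence)
RECORDS = [
    ("GenPept", "ACC1", "geneA", "Homo sapiens", "MKTAYIAK"),
    ("RefSeq", "ACC2", "geneA", "Homo sapiens", "MKTAYIAK"),   # identical to ACC1
    ("GenPept", "ACC3", "geneA", "Homo sapiens", "KTAYI"),     # fragment of ACC1
    ("GenPept", "ACC4", "geneA", "Homo sapiens", "MKTAYLAK"),  # variant of ACC1
    ("GenPept", "ACC5", "geneB", "Mus musculus", "MSSPQ"),
]


def uniparc_like(records):
    """One entry per sequence that is 100% identical over its entire length."""
    entries = defaultdict(list)
    for _db, accession, _gene, _species, sequence in records:
        entries[sequence].append(accession)
    return dict(entries)


def uniref100_like(records):
    """Like uniparc_like, but fragments join the entry of a longer sequence."""
    by_sequence = uniparc_like(records)
    clusters = {}
    # Longest sequences first, so that fragments find their representative.
    for sequence in sorted(by_sequence, key=lambda s: (-len(s), s)):
        representative = next((r for r in clusters if sequence in r), None)
        if representative is None:
            clusters[sequence] = list(by_sequence[sequence])
        else:
            clusters[representative].extend(by_sequence[sequence])
    return clusters


def swissprot_like(records):
    """One entry per gene and species, whatever the sequence differences."""
    entries = defaultdict(list)
    for _db, accession, gene, species, _sequence in records:
        entries[(gene, species)].append(accession)
    return dict(entries)


if __name__ == "__main__":
    print(len(uniparc_like(RECORDS)), "UniParc-like entries")
    print(len(uniref100_like(RECORDS)), "UniRef100-like entries")
    print(len(swissprot_like(RECORDS)), "Swiss-Prot-like entries")
```

The same five records collapse to a different number of entries under each definition, which is why entry counts cannot be compared directly between these resources.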

Comparing searches: NCBI and UniProt

Search for the human Toll-like receptor 4 in Entrez Protein (NCBI)

[Screenshot: results from GenPept, Swiss-Prot and RefSeq]

Identical sequences
AAC34135 CAH72619
Identical sequences
AAF05316 BAG55035 CAH72618 AAI17423 AAF89753
NP_612564 O00206

Search for the human Toll-like receptor 4 in UniProtKB

[Screenshot: result from Swiss-Prot]

Sequences retrieved in Entrez Protein:
O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564*, AAC34135

*Based on A126770, BC117422, AL160272 and AA598398

Major protein sequence resources

UniProtKB: Swiss-Prot + TrEMBL

Entrez Protein: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (~12’000 species)

UniProtKB/TrEMBL: submitted CDS (EMBL); automated annotation (~202’000 species)

GenPept: submitted CDS (GenBank)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank; 3D data and associated sequences

PRF: journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation

Resources integrated in the entries

Resources integrated in the search engine

Model Organism Databases (MODs) at a glance

Model organism

Species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.

Model organism    MOD    URL

Mus musculus    MGI    http://www.informatics.jax.org/
Rattus norvegicus    RGD    http://rgd.mcw.edu/
Oryza sativa    RAP-DB    http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana    TAIR    http://www.arabidopsis.org/
Drosophila melanogaster    FlyBase    http://flybase.org/
Schizosaccharomyces pombe    S. pombe GeneDB    http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae    SGD    http://www.yeastgenome.org/
Caenorhabditis elegans    WormBase    http://www.wormbase.org/
Dictyostelium discoideum    dictyBase    http://dictybase.org/
Bacillus subtilis    SubtiList    http://genolist.pasteur.fr/SubtiList/
Escherichia coli    EcoGene    http://ecogene.org/
Danio rerio (zebrafish)    ZFIN    http://zfin.org/

Just a few examples, not an exhaustive list!

Methanocaldococcus jannaschii -> no MOD
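In a script, the organism-to-MOD mapping above is simply a lookup table. The sketch below copies a few rows of the slide; the dictionary and function names are hypothetical.

```python
# A few rows of the table above: organism -> (MOD name, URL)
MODS = {
    "Mus musculus": ("MGI", "http://www.informatics.jax.org/"),
    "Arabidopsis thaliana": ("TAIR", "http://www.arabidopsis.org/"),
    "Drosophila melanogaster": ("FlyBase", "http://flybase.org/"),
    "Saccharomyces cerevisiae": ("SGD", "http://www.yeastgenome.org/"),
    "Danio rerio": ("ZFIN", "http://zfin.org/"),
}


def find_mod(organism):
    """Return (MOD name, URL) for an organism, or None if no MOD is listed."""
    return MODS.get(organism)


if __name__ == "__main__":
    print(find_mod("Danio rerio"))
    # Not every sequenced organism has a MOD:
    print(find_mod("Methanocaldococcus jannaschii"))
```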

Model organism databases (MODs)

Genome annotation;
Gene models;
Gene mapping;
Official nomenclature;
Gene expression;
Functional annotation;
Interactions;
Information about mutants/knockout/transgenic animals;
Phenotypes;
(Cross-)references;
Species-specific reagents…

Key resources for information on a given organism
Service provided to/from a given community

MODs do not necessarily store sequences, but give access to them

Link to cDNA sequences

http://gmod.org/wiki/Main_Page

The world of databases is a jungle

A few points to remember when using databases

- Content;
- Primary / secondary / meta-databases;
- Curated / non-curated;
- Manual / automated curation;
- Redundant / non-redundant;
- Update frequency;
- Stable identifiers;
- Strategy;
- Dataflow;
- Collaborations between databases.

Test a few genomic databases and tools

NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI: http://www.ebi.ac.uk/genomes/
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi

Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, …
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools

Generic Model Organism Database http://gmod.org/wiki/Main_Page

Genomes and genomic tools: a few sites

Find your favorite (completely sequenced) organism in a genome db;
Follow the links to see the options on different sites;
Find the sequences;
Look at the annotation of your favorite gene;
Compare the entries corresponding to this gene across sites;
Test search engines (restrict searches, compare results, …)

Whenever possible, use on-line tutorials, such as:
http://www.ensembl.org/info/website/tutorials/index.html

Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)

Play around with the BLAT search, customize display, follow the links, …

Genomes and genomic tools: Hands-on

Go and visit databases cited in this lecture;

The databases/tools that should be "familiar" to all are:
http://genome.ucsc.edu/cgi-bin/hgBlat
http://www.ensembl.org/index.html
gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/

If none of the databases is of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find the databases that are closest to your interests;

Play around…

Hands-on: protein sequence databases and UniProt:
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)

Thank You!
