Post on 23-Jan-2016
Tunis, March 2007, A. Auchincloss
UniProtKB and ExPASy
UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics
Andrea Auchincloss
Tunis, March 19, 2007
Outline
• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools
Swiss Institute of Bioinformatics (SIB)
• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which existed and collaborated long before the foundation of the institute), about 170 researchers in 2006.
www.isb-sib.ch
SIB missions
• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of bioinformatics research scientists. This includes a master’s degree in proteomics and bioinformatics, several weekly courses and a doctoral school;
• Services to the Swiss Life Sciences community (EMBnet node).
Swiss Institute of Bioinformatics: 20 research and service groups
Proteins are organic compounds made of amino acids arranged in a linear chain and joined by peptide bonds…
Wikipedia
Different ‘views’ of a protein
Proteins are composed of 20 "standard" amino acids, symbolised by a LETTER.
Proteins can also work together to perform a particular function, and they often associate to form complexes.
Proteins are essential parts of all living organisms and participate in every process within cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response, cell adhesion, cell cycle, toxins….
Proteins are a necessary component in our diet, since animals cannot synthesize all the amino acids and must obtain essential amino acids from food.
Protein/Gene number
Organism: Number
Bacteria: 182-8,591
S. cerevisiae: 6,127
C. elegans: 17,947
Drosophila: 13,849
A. thaliana: ~25,674
Human: ~21,000
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
The universe in which protein databases evolve
AMB, SP20
179,000,021,000
1st estimate: ~30 million species (1.5 million named)
2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes
The calculation: $2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times21000 + 21000$ (you!)
Caveat: this is an estimate of the number of potential sequence entries, but not that of the number of distinct protein entities in the biosphere. AMB, SP20
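The estimate can be re-derived in a few lines. This is only an illustrative sketch: the dictionary layout and the function name are assumptions, and the species counts and gene numbers are the ones listed on the slide. With these figures the sum comes to 178,200,021,000, slightly below the headline figure shown on the slide; the slide does not explain the difference.

```python
# Estimated number of species and genes per genome, as listed on the slide.
GROUPS = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 21_000),
}
YOU = 21_000  # the final "+21000 (you!)" term of the calculation


def potential_sequence_entries(groups=GROUPS, extra=YOU):
    """Sum species x genes over all groups, plus the extra term."""
    return sum(species * genes for species, genes in groups.values()) + extra


if __name__ == "__main__":
    print(f"{potential_sequence_entries():,} potential sequence entries")
```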
What sequencing is underway right now?
Many eukaryotic & bacterial genomes (varying sizes)
Metagenomics (environmental samples)
~ 6 million sequences submitted/published in December 2006,
~ 17 million sequences being generated at the Venter Institute, 6 million proteins are being submitted from the GOS (Global Ocean Sampling) trip
Protein sequences; what is sequenced?
Currently about 3.5 to 4.0 million ‘known’ protein sequences
More than 99% of these are derived by translation of nucleotide sequences
Less than 1%: direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from…
(sequence & gene prediction quality)!
Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA or EST …)
- Gene prediction quality; programs used, is there manual intervention afterwards?
For example: authors can specify the nature of the CDS in the nucleotide databases by using qualifiers: "/evidence=experimental" or "/evidence=not_experimental".
Very rarely done…
The hectic life of a sequence …
[Diagram: cDNAs, ESTs, genomes, … go to the public nucleic acid databases (EMBL, GenBank, DDBJ), except for data not submitted to public databases, delayed or cancelled… If the submitters provide an annotated Coding Sequence (CDS), the CDS provided by the submitters and the CDS translation provided by EMBL feed the public protein sequence databases.]
The first Met !
CDS: CoDing Sequence (CDS)
Complete genome (submitted)
only ~ 1,858 CDS available!
Data not submitted
Issue for the users: the protein database jungle
The hectic life of a sequence … (diagram, continued)
Nucleic acid side: cDNAs, ESTs, genomes, … go to EMBL, GenBank, DDBJ; data not submitted to public databases, delayed or cancelled…
Protein side, fed by the CoDing Sequences provided by submitters: TrEMBL, GenPept, Swiss-Prot (manually annotated), RefSeq*, PRF (sequences derived from scientific publications), EnsEMBL*, IPI, CCDS, UniParc, UniProtKB, PDB, PIR
* Also gene prediction
+ species-specific databases (EcoGene, TubercuList, TIGR…)
Major public protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
PIR PDB PRF
UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non redundant with Swiss-Prot (127,000 species)
GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Databank: 3D data and associated sequences
PRF: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)
Integrated resources
‘cross-references’
Separated resources
Other protein sequence databases
CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species)
Consensus human and mouse sequences between 4 institutions… Combining different approaches – ab initio, by similarity - and taking advantage of the expertise acquired by different institutes, including manual annotation…
EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species)
aligns some eukaryotic genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→ known genes); also does some gene prediction (→ novel genes)
IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species)
provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes.
…
The UniProt consortium
Protein Information Resource
European Bioinformatics Institute, European Molecular Biology Laboratory
Swiss Institute of Bioinformatics
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
The Universal Protein Resource components
UniProtKB Release 9.7 consists of:
UniProtKB (UniProt Knowledgebase), produced by SIB and EBI
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~10,000 species
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,600,000 entries, ~100,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
Allows comprehensive BLAST similarity searches by providing sets of representative sequences
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.
UniParc (UniProt Archive), produced by EBI; ~8,000,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…
The Universal Protein Resource components
UniProtKB (UniProt Knowledgebase), produced by SIB and EBI
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~11,000 species
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,900,000 entries, ~127,000 species
UniRef100, UniRef90, UniRef50, produced by PIR
UniParc (UniProt Archive), produced by EBI; ~8,800,000 entries
UniProt web sites…
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful search engine….
http://beta.uniprot.org/
Test it! Logon: guest Password: amazing
The UniProt groups from SIB, EBI and PIR (Antibes, September 2004)
In Geneva (SIB):
2 Group Leaders
44 Annotators
4 Prosite annotators
22 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
1 GISAID
------------------
85 people
At EBI (Swiss-Prot + EMBL + TrEMBL): 75 people (29 Annotators)
At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
------------------
26 people
UniProtKB has biweekly releases; available from about 100 servers, the main sources being ExPASy and www.uniprot.org
UniProtKB: from EMBL (DNA) to TrEMBL (protein)
EMBL → TrEMBL (diagram labels: Reference, Gene/protein name, CDS, Taxonomy)
Automated extraction of the protein sequence (CDS), gene name, taxonomy and references.
Automated annotation (keywords and protein family).
! TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived protein sequences can be experimentally proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry).
TrEMBL does not validate any sequences.
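To illustrate what "taking the existing CDS" means in practice, here is a minimal sketch that pulls the submitter-provided protein translation out of the feature table of an EMBL-format entry. It is not the TrEMBL production pipeline: the function name and the sample record are invented for illustration, and the fixed column positions (feature key in columns 6-21, qualifiers from column 22) are assumed from the usual EMBL flat-file layout.

```python
def cds_translations(embl_text):
    """Return the protein strings found in /translation qualifiers of CDS features."""
    proteins = []
    in_cds = False
    parts = None  # pieces of the translation currently being collected
    for line in embl_text.splitlines():
        if not line.startswith("FT"):
            in_cds, parts = False, None
            continue
        key = line[5:21].strip()   # feature key column (empty on qualifier lines)
        body = line[21:].strip()   # location or qualifier text
        if key:                    # a new feature starts here
            in_cds, parts = (key == "CDS"), None
            continue
        if not in_cds:
            continue
        if parts is None and body.startswith('/translation="'):
            parts = []
            body = body[len('/translation="'):]
        if parts is not None:
            if body.endswith('"'):          # closing quote ends the qualifier
                parts.append(body[:-1])
                proteins.append("".join(parts))
                parts = None
            else:
                parts.append(body)
    return proteins


def ft(key, text):
    """Build one feature-table line with the assumed column layout."""
    return "FT   " + key.ljust(16) + text


# Invented sample record, for illustration only.
SAMPLE = "\n".join([
    "ID   X00001; SV 1; linear; mRNA; STD; HUM; 60 BP.",
    ft("source", "1..60"),
    ft("", '/organism="Homo sapiens"'),
    ft("CDS", "1..60"),
    ft("", '/gene="demo"'),
    ft("", '/translation="MKTAYIAKQR'),
    ft("", 'QISFVKSHF"'),
    "//",
])

if __name__ == "__main__":
    print(cds_translations(SAMPLE))
```

Note that nothing in the extracted string says whether the CDS was proven experimentally or predicted, which is exactly the caveat made above.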
!!!!
The quality of UniProtKB/TrEMBL data is directly dependent on the information provided by the submitter of the original nucleotide entry.
UniProtKB: from TrEMBL to Swiss-Prot
EMBL → TrEMBL (CDS): automated extraction of the protein sequence (CDS), gene name and references. Automated annotation.
TrEMBL → Swiss-Prot: manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…). Annotation of sequence differences (conflicts, variants, splicing…).
Average of 6 independent sequence reports for each human protein
Distinguishing Swiss-Prot and TrEMBL
– A TrEMBL entry is a computer-annotated record derived from a coding sequence (CDS) in the nucleotide sequence databases that is not yet in Swiss-Prot, produced after some redundancy removal and automated annotation.
– A Swiss-Prot entry is a manually annotated record for a given protein.
UniProtKB From TrEMBL to Swiss-Prot
Step 1: Sequence check
UniProtKB/Swiss-Prot
Non-redundant 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and amino acid) derived from the same gene
-> decreases redundancy and improves sequence reliability
ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
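The two steps (merge the reports, then annotate the differences) can be pictured with a toy example. This is a deliberately simplified sketch, not the curators' procedure: it assumes the sequence reports are already aligned and of equal length, and it uses a majority vote to pick the displayed residue, whereas in Swiss-Prot this decision is made manually. All names and sequences are invented.

```python
from collections import Counter


def merge_reports(reports):
    """Merge pre-aligned, equal-length sequence reports for one gene.

    reports: dict mapping a source label to its sequence.
    Returns (merged_sequence, conflicts), where each conflict is
    (position, merged_residue, reported_residue, source), positions 1-based.
    """
    if len({len(seq) for seq in reports.values()}) != 1:
        raise ValueError("this sketch only handles equal-length reports")
    merged, conflicts = [], []
    for pos, column in enumerate(zip(*reports.values()), start=1):
        residue = Counter(column).most_common(1)[0][0]  # majority vote (assumption)
        merged.append(residue)
        for source, reported in zip(reports, column):
            if reported != residue:
                conflicts.append((pos, residue, reported, source))
    return "".join(merged), conflicts


# Invented reports for a single gene, for illustration only.
REPORTS = {
    "mRNA_1": "MKTAYIAK",
    "mRNA_2": "MKTAYIAK",
    "genomic_1": "MKTSYIAK",
}

if __name__ == "__main__":
    sequence, conflicts = merge_reports(REPORTS)
    print(sequence)
    for pos, kept, seen, source in conflicts:
        print(f"conflict at {pos}: {kept} -> {seen} (in {source})")
```

One entry then carries a single sequence plus a record of where each report disagrees, instead of three nearly identical entries.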
260,000 + 3,800,000 3,600,000
Redundancy…
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
In the future: redundancy is going to decrease: "new" genome sequencing → "new" proteins
UniProtKB/Swiss-Prot ~11,000 species
UniProtKB/TrEMBL ~127,000 species
- 13 sequences (complete or partial) - derived from mRNA (n=6) or genomic DNA (n=7)
All alternatively spliced sequences are available for BLAST searches, protein identification tools and are downloadable…
Human: ~2/3 of the human genes are alternatively spliced
- 6 genomic sequences (complete or partial) - 1 protein sequence from PIR
Multiple alignment of the available clpB sequences
Within Swiss-Prot?
• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.
• At least 43,000 entries (19% of Swiss-Prot) required a minimal amount of annotation effort to obtain the “correct” sequence.
Quality of protein information from genome projects
• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to annotate it when it was sequenced, but where nothing has been done since (at least in the public view): 19.5% of the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
– Bacteria and Archaea have almost no splicing, so prediction is “easier”; however, errors are still made…
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• It is important to pursue our efforts in making sure we provide to our users the most correct set of sequences for a given organism.
New ‘Protein existence evidence’ tag
• As most protein sequences are derived from translation of nucleotide sequence and are only predictions, the new PE line indicates whether there is any evidence that proves the existence of a protein;
• The ‘Protein existence evidence’ will have 5 different qualifiers:
1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
- Unassigned (used mostly in TrEMBL)
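A user with a local flat-file copy could filter entries on this tag. The sketch below assumes the PE line is written as a level number, a colon and the label, terminated by a semicolon (for example "PE   2: Evidence at transcript level;"); this layout, the function name and the sample entry are assumptions for illustration, and the handling of the "Unassigned" case is left out because the slide does not say how it is written.

```python
def protein_existence(entry_text):
    """Return (level, label) from the PE line of one flat-file entry, or None."""
    for line in entry_text.splitlines():
        if line.startswith("PE   "):
            level, _, label = line[5:].partition(":")
            return int(level), label.strip().rstrip(";")
    return None  # no PE line found


# Invented sample entry, for illustration only.
SAMPLE_ENTRY = "\n".join([
    "ID   DEMO_HUMAN   Reviewed;   120 AA.",
    "AC   P12345;",
    "PE   2: Evidence at transcript level;",
    "//",
])

if __name__ == "__main__":
    print(protein_existence(SAMPLE_ENTRY))
```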
Righting the wrongs
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10,000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hadley, EMBO reports, 4(9), 2003.
UniProtKB
From TrEMBL to Swiss-Prot
Step 2: Annotation: literature, controlled vocabulary
Annotation
• The focal point of the efforts to maintain and develop UniProtKB/Swiss-Prot;
• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates templates for automatic annotation for the many organisms whose genome sequence is/will be available but whose proteins will not be characterized;
– provides well annotated entries (a corpus) to train literature mining tools (text mining).
(…)
Source of data:
- publications (> 1,700 journals cited)
- also external scientific expertise & other databases
Comments: “structured free text”, 27 defined topics
Manually annotated
Information from papers, specialized databases, computer prediction, external experts, brain storming
Distinction between data obtained experimentally and computerized inferences
UniProtKB
From TrEMBL to Swiss-Prot
Step 3: Sequence analysis (bioinformatics tools)
Annotators could not work without the help of our software developers;
The annotation platform
Anabelle: much more than a domain annotation platform
We manually check the results !
What else is in a UniProtKB/Swiss-Prot entry?
Cross-references; a central hub
• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ)
– 3D-structure (PDB)
– Family and domain (InterPro, HAMAP, PROSITE, Pfam, etc.)
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.)
– 2D-gel (e.g. SWISS-2DPAGE)
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS)
– literature (PubMed)
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub for the data available about the protein it describes
Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001)
www.expasy.org/cgi-bin/lists?dbxref.txt
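The "central hub" idea is visible directly in the flat file, where each explicit cross-reference is a DR line. The following sketch groups the DR lines of one entry by target database. It assumes semicolon-separated fields with the database name first and the primary identifier second; the function name, sample lines and identifiers are invented for illustration.

```python
from collections import defaultdict


def cross_references(entry_text):
    """Group the primary identifiers of an entry's DR lines by database name."""
    xrefs = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("DR   "):
            fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
            if len(fields) >= 2:
                database, primary_id = fields[0], fields[1]
                xrefs[database].append(primary_id)
    return dict(xrefs)


# Invented sample lines, for illustration only.
SAMPLE_ENTRY = "\n".join([
    "ID   DEMO_HUMAN   Reviewed;   120 AA.",
    "DR   EMBL; X00001; CAA00001.1; -; mRNA.",
    "DR   EMBL; X00002; CAA00002.1; -; Genomic_DNA.",
    "DR   PDB; 1ABC; X-ray; 2.00 A; A=1-120.",
    "DR   InterPro; IPR999999; Demo_domain.",
    "//",
])

if __name__ == "__main__":
    for database, ids in cross_references(SAMPLE_ENTRY).items():
        print(database, ids)
```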
UniProtKB/Swiss-Prot explicit links
2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN
Enzyme and pathway databases: BioCyc, Reactome
Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC
Sequence databases: EMBL, PIR, UniGene
3D structure databases: HSSP, PDB, SMR
PTM databases: GlycoSuiteDB, PhosSite
Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR
Implicit cross-references on new web server and ExPASy
Implicit X-references to 26 additional databases are added by the ExPASy server on the web (e.g. GeneCards, ModBase, etc.)
These X-refs are not present as hard-coded DR lines in the Swiss-Prot entry as it can be downloaded by ftp, but are added on the fly when someone views an entry on ExPASy. This can be done because enough information is present in the UniProtKB entry to access the related information in another database. Example: all Swiss-Prot/TrEMBL entries are linked to the BLOCKS domain database via the Swiss-Prot/TrEMBL accession number
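The mechanism amounts to filling a URL template with a key that the entry already contains. A minimal sketch, in which the template table, the function name and the example.org URLs are all placeholders invented for illustration (they are not the real ExPASy, BLOCKS or ModBase link formats):

```python
# Hypothetical URL templates keyed by database name; "{ac}" is replaced by the
# accession number. These are placeholders, not real link formats.
IMPLICIT_LINK_TEMPLATES = {
    "BLOCKS": "https://example.org/blocks?ac={ac}",
    "ModBase": "https://example.org/modbase/{ac}",
}


def implicit_cross_references(accession, templates=IMPLICIT_LINK_TEMPLATES):
    """Build links on the fly from the accession number alone."""
    return {db: url.format(ac=accession) for db, url in templates.items()}


if __name__ == "__main__":
    for db, url in implicit_cross_references("P12345").items():
        print(db, url)
```

Because the links are computed at display time, they never appear in the downloaded flat file, which is why a local copy shows fewer cross-references than the web view.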
Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further facilitate information retrieval via controlled vocabularies
In a UniProtKB/Swiss-Prot entry, you can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D-structures, etc.…;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.
Monitoring entry history: The UniProtKB Sequence/Annotation Version archive
… and many useful links:
And on the new website
other tools are not yet available…
UniProt Knowledgebase
• Swiss-Prot: Manually annotated section
• TrEMBL: Automatically annotated section
Distinguishing Swiss-Prot and TrEMBL
Accession number: to be used whenever you cite a UniProt entry (never cite the entry name (ID) alone)
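In the flat file the two identifiers sit on different lines: the entry name on the ID line and the accession numbers on the AC line(s), with the first accession being the primary one to cite. A minimal sketch for reading both, where the function name and the sample entry are invented for illustration:

```python
def entry_name_and_accessions(entry_text):
    """Return (entry_name, [accession numbers]) for one flat-file entry."""
    name, accessions = None, []
    for line in entry_text.splitlines():
        if line.startswith("ID   "):
            name = line[5:].split()[0]
        elif line.startswith("AC   "):
            accessions += [ac.strip() for ac in line[5:].split(";") if ac.strip()]
    return name, accessions


# Invented sample entry, for illustration only.
SAMPLE_ENTRY = "\n".join([
    "ID   DEMO_HUMAN   Reviewed;   120 AA.",
    "AC   P12345; Q54321;",
    "//",
])

if __name__ == "__main__":
    name, accessions = entry_name_and_accessions(SAMPLE_ENTRY)
    print("entry name (may change):", name)
    print("primary accession (cite this):", accessions[0])
```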
Non-Redundant Complete Proteome Sets
• Text search UniProtKB keyword “Complete proteome”, combined with an organism name
• Or download precomputed sets (bacteria, archaea, some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries
• Or EBI Integr8 http://www.ebi.ac.uk/integr8/
Swiss-Prot annotation priorities
The main annotation programs:
• HAMAP (High quality Automated and Manual Annotation of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs
• 3D-structure
• Protein-protein interactions
• Quality assurance, includes controlled vocabularies
Model organisms
• Organisms for which we want to have a more in-depth coverage;
• Completeness, links with specialized databases, specific documents;
• Examples: E.coli, B.subtilis, human, mouse, fruitfly, C.elegans, yeast, S.pombe, A.thaliana.
Human Proteomics Initiative (HPI)
From genome to proteome
~ 21,000 human genes
→ alternative splicing of mRNA (2-5 fold increase) → ~ 100,000 human transcripts
→ post-translational modifications of proteins (PTMs) (5-10 fold increase) → ~ 1,000,000 human proteins
Considerable increase in complexity
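The orders of magnitude follow from multiplying the gene count by the two fold-increase ranges. A small sketch of that arithmetic (the function and variable names are illustrative; the inputs are the figures on the slide):

```python
HUMAN_GENES = 21_000


def apply_fold_range(count_range, fold_low, fold_high):
    """Scale a (low, high) count range by a (fold_low, fold_high) increase."""
    low, high = count_range
    return low * fold_low, high * fold_high


# Alternative splicing: 2-5 fold increase over the number of genes.
transcripts = apply_fold_range((HUMAN_GENES, HUMAN_GENES), 2, 5)
# Post-translational modifications: 5-10 fold increase over the transcripts.
proteins = apply_fold_range(transcripts, 5, 10)

if __name__ == "__main__":
    print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
    print(f"proteins:    {proteins[0]:,} to {proteins[1]:,}")
```

The upper ends of the two ranges (105,000 and 1,050,000) correspond to the rounded ~100,000 transcripts and ~1,000,000 proteins quoted above.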
In the case of human genes, the Swiss-Prot/TrEMBL redundancy is still very high:
15,803 + 53,100 about 20,000*
* human gene number estimation:21,000-35,000
MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins
What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (Patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …
Post-translational modifications (PTMs)
PTM definition
A post-translational modification (PTM) is a modification of a polypeptide chain involving the making or the breaking of covalent bond(s) that occurs during (co-translational class) or after translation.
PTMs influence or even define protein function
• Phosphorylation, and possibly GlcNAcylation and S-nitrosylation, are a means of transducing extracellular signals to the inside of the cell.
• Methylation has a role in nuclear protein import.
• Lipid addition allows the association of proteins with membranes (e.g. GPI-anchor, myristate, palmitate).
• Intrachain disulfide bonds and N-glycosylation influence protein folding.
• Interchain disulfide bonds bind subunits together.
• Other PTMs are directly involved in the protein function, for example the binding of cofactors (e.g. pyridoxal phosphate), or the synthesis of a cofactor by the modification of amino acids present in the protein (e.g. quinones).
PTM variety
[Figure: the 20 amino acids (Gly Ala Val Leu Ile Lys Arg His Asp Glu Asn Gln Cys Ser Thr Met Pro Phe Tyr Trp) and the modifications they can carry]
Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar
N-terminal modifications: acetylation, methylation, acylation, crosslinks
C-terminal modifications: GPI, amidation, crosslinks, methylation
Colour code: black = cytoplasmic modifications; dark grey = both cytoplasmic and extracellular modifications, depending on the exact type; light grey = extracellular modifications
Each protein can be modified at various sites…which gives a high number of ‘alternative’ peptides.
283 different protein modifications are annotated in UniProtKB/Swiss-Prot…
Large scale experiments (LSE) for PTMs!
• PTM information can now be obtained from results of proteomics large scale experiments (LSE);
• In the past 12 months we have added about 6,000 experimental PTMs using data originating from some of these projects.
AMB, SP20
Proteomic studies have led to the updating of 2,767 human Swiss-Prot entries, mainly with PTM information (UniProt release 10.0, March 2007):
Phosphorylation (83%)
Glycosylation (9%)
Other PTMs (4%)
Subcellular location (4%)
Bacteria and Archaea (HAMAP)
In 2006, ≈130 new bacterial and archaeal genomes (not WGS) were submitted to the DNA databases;
If on "average" 4,000 proteins/genome=>500,000 proteins!
How to cope????
HAMAP: High quality Automated and Manual Annotation of microbial Proteomes
Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?
http://www.expasy.org/unirule/MF_00319
Automatic annotation of proteins belonging to specified families (1)
• This program requires the continuous development and adaptation of software tools as well as the development of a database of annotation rules for each family (so far about 1,400).
Allows us to annotate automatically, yet with a very high level of quality, proteins that belong to well defined protein families;
Can be applied to both characterized proteins and to some UPF’s (Uncharacterized Protein Family);
The families are based on UniProtKB/Swiss-Prot entries, so we first do all the annotation steps described earlier!
Using HAMAP, we can currently annotate to Swiss-Prot quality level between 10% and 50% of a complete microbial proteome (next step: HAMAP for Fungi…)
/www.expasy.ch/sprot/hamap/
Updates
• DNA sequence archives
  – EMBL/GenBank/DDBJ is an archive
    • All submitted data goes into the archive
    • Submitters are responsible for the submitted sequences and the accompanying annotation
    • Nobody else can change them (including the curators at EMBL/GenBank/DDBJ)
• Protein sequence databases
  – UniProtKB/Swiss-Prot is NOT an archive
    • Swiss-Prot chooses what goes into the database and where to place it
    • Swiss-Prot updates annotation and sequences when necessary
**ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006;
**ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006;
**ZB CHH, 05-DEC-2006;
User updates or annotation requests
Accessing & Searching UniProtKB
Direct access (keyword search)
• New search tool – we’ll use it later
• Sequence Retrieval System (SRS, Europe), will disappear
• Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not TrEMBL) is integrated in GenPept, but with a changed format, and some information (e.g. implicit cross-references) is missing
• Query tools on ExPASy & UniProt (http://www.expasy.org/sprot/, http://www.uniprot.org)
Indirect access (sequence search)
• Bioinformatics & sequence analysis tools (Blast, Fasta, GCG, Emboss, MS Identification tools…)
Downloading the UniProt Knowledgebase
http://www.expasy.org/sprot/download.html
• Swiss-Prot and TrEMBL form a complete, non-redundant database, the UniProt Knowledgebase
• Can be downloaded from ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
• In “Swiss-Prot” format, fasta or xml format
• Complemented by sequences of alternative splice isoforms
• “everything” about “all” proteins! (at least all CDS submitted to the public nucleotide sequence databases)
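Once a fasta-format file has been downloaded, it can be read with very little code. The sketch below is a generic FASTA reader that makes no assumption about the content of the header lines; the function name and the two sample records are invented for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented sample records, for illustration only.
SAMPLE_LINES = [
    ">demo_entry_1 first invented protein",
    "MKTAYIAKQR",
    "QISFVKSHF",
    ">demo_entry_2 second invented protein",
    "MKVLA",
]

if __name__ == "__main__":
    for header, sequence in read_fasta(SAMPLE_LINES):
        print(header.split()[0], len(sequence))
```

For a real download, pass an open file object instead of the sample list, for example `read_fasta(open(path))`, where `path` points to the local fasta file.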
If you want to develop tools to work with your local copy of UniProtKB:
Swissknife – a Perl parser for UniProtKB
Constantly updated according to latest format changes
Advantage: you do not need to know how exactly the information is stored in the flat file
• http://swissknife.sourceforge.net/
• ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
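To give a feeling for what such a parser hides from you, here is a minimal Python sketch (not Swissknife, and not its API) that reads entries from a flat file. It assumes only the general flat-file layout: each line starts with a two-letter line code followed by three spaces, "ID" lines carry the entry name, "AC" lines carry semicolon-separated accession numbers, and "//" ends an entry. The function names and the sample entries are invented for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Entry = Dict[str, List[str]]


def iter_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one dict per entry, mapping a line code to its line contents."""
    entry: Entry = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):  # entry terminator
            if entry:
                yield entry
            entry = {}
            continue
        code, content = line[:2], line[5:]  # code, 3 spaces, then content
        entry.setdefault(code, []).append(content)
    if entry:  # tolerate a final entry without a terminator
        yield entry


def entry_name(entry: Entry) -> str:
    """Return the entry name: the first token of the ID line."""
    return entry["ID"][0].split()[0]


def accessions(entry: Entry) -> List[str]:
    """Return all accession numbers; the first one is the primary AC."""
    acs: List[str] = []
    for content in entry.get("AC", []):
        acs.extend(ac.strip() for ac in content.split(";") if ac.strip())
    return acs


# Invented sample data, for illustration only.
SAMPLE = """\
ID   EXAMPLE1_HUMAN          Reviewed;         105 AA.
AC   P99999; Q00001;
DE   Hypothetical example protein.
//
ID   EXAMPLE2_MOUSE          Unreviewed;        10 AA.
AC   Q00002;
//
"""

if __name__ == "__main__":
    for e in iter_entries(SAMPLE.splitlines()):
        print(entry_name(e), accessions(e))
```

For real work, a maintained parser such as Swissknife remains the safer choice, because it follows format changes for you.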
Take home message
• Swiss-Prot is the non-redundant, manually annotated and highly cross-referenced section of the UniProt Knowledgebase
• Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant
• Always cite the accession number (AC), not the entry name
– The AC is stable
– The entry name might change
We need your feedback and your [email protected]
http://www.expasy.org/sprot/update.html (and from every UniProtKB entry page on our servers)
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
UniRef100, 90 and 50 clusters
One UniRef100 entry -> all identical sequences from UniProtKB and some sections of UniParc (including fragments, Swiss-Prot splice variants).
One UniRef90 entry -> sequences that are at least 90% identical.
One UniRef50 entry -> sequences that are at least 50% identical.
UniRef100, 90 and 50 clusters
One cluster can contain sequences of several species: clustering is done independently of the organism
Each cluster has a “representative”, “reference” sequence, preferably that of the best-annotated Swiss-Prot entry
UniRef identifiers are of the form UniRef100_P99999, UniRef50_P00414 – not stable, as clusters are recomputed with every biweekly release, and cluster representatives can change!
UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.
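As a small illustration of the identifier form quoted above, the following Python sketch splits a UniRef identifier into its identity level (100, 90 or 50) and the identifier of the representative sequence. The function and class names are hypothetical, and the part after the underscore is treated as an opaque string rather than validated as an accession number.

```python
import re
from typing import NamedTuple


class UniRefId(NamedTuple):
    identity_level: int   # 100, 90 or 50
    representative: str   # identifier of the cluster representative


# Assumed shape: "UniRef" + level + "_" + representative identifier.
_UNIREF_RE = re.compile(r"^UniRef(100|90|50)_(\S+)$")


def parse_uniref_id(identifier: str) -> UniRefId:
    """Split a UniRef identifier into identity level and representative."""
    match = _UNIREF_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a UniRef identifier: {identifier!r}")
    return UniRefId(int(match.group(1)), match.group(2))


if __name__ == "__main__":
    for uid in ("UniRef100_P99999", "UniRef50_P00414"):
        print(uid, "->", parse_uniref_id(uid))
```

Because clusters are recomputed with every release, a parsed representative should be stored together with the release it came from rather than used as a stable key.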
Implicit cross-link UniProtKB to UniRef:
new web view:
UniParc – the UniProt Archive
• 8.8 million sequences
• Sequences and cross-references (AC numbers)
• A comprehensive collection of the raw protein sequences in public databases (including those not submitted to the DNA databases): Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• UniParc can be used to track sequence versions
Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc… and is highly redundant!
UniParc tracks a protein sequence and its integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
UniParc entry UPI0000033477 part 2
TrEMBL entry was merged into Swiss-Prot
TrEMBL entry probably to be merged into Swiss-Prot
www.expasy.ch/prosite
PROSITE
A database of protein families and domains using two kinds of motif descriptors:
Patterns or regular expressions:
• User friendly (easy to understand and to use)
• Well designed for the detection of biologically meaningful sites such as residues playing a structural or functional role
• Can be used to scan a protein database in reasonable time on any computer
Generalized profiles or weight matrices:
• Well adapted to cover the full length of the protein or domain
• Are able to detect highly divergent families or domains with only a few well conserved positions
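To show why patterns are described as regular expressions, here is a minimal Python sketch that translates a pattern written in PROSITE syntax into a Python regular expression. It assumes the commonly documented syntax: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, and "<" / ">" for N- and C-terminal anchors. Rarer constructs (such as ">" inside square brackets) are not handled, and the function name is hypothetical. This is not how ScanProsite is implemented.

```python
import re

# One pattern element: a core, optionally followed by "(n)" or "(n,m)".
_ELEMENT_RE = re.compile(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")  # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = _ELEMENT_RE.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":                    # any residue
            residue = "."
        elif core.startswith("[") and core.endswith("]"):
            residue = core                 # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            residue = "[^" + core[1:-1] + "]"  # excluded residues
        elif len(core) == 1 and core.isalpha() and core.isupper():
            residue = core                 # a single fixed residue
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + residue + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex)
    print(bool(re.search(regex, "MKANGSAL")))
```

Once translated, the pattern can be applied with `re.search` or `re.finditer` to any protein sequence held as a string of one-letter codes.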
Identification of protein domains and families
• There are two non-exclusive approaches for the determination of the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles
• Most proteins can be grouped into families. Proteins belonging to a particular family share functional attributes and are derived from a common ancestor;
• Some regions in the sequence are more conserved than others during evolution because they are important for the function or the structure of the protein;
• Like fingerprints for police identification, signatures built out of sequence patterns or profiles can be used to formulate hypotheses about the function of uncharacterized proteins.
Definitions of conserved regions
Conserved regions can be classified into 5 different groups:
• Families: proteins that have the same domain arrangement, whether one or many domains.
• Domains: specific combinations of secondary structures that assume characteristic three-dimensional structures or folds.
• Repeats: structural units always found in two or more copies that assemble into a specific fold. Assemblies of repeats might also be thought of as domains.
• Motifs: short regions with conserved active- or binding-sites that usually adopt a folded conformation only in association with their ligands.
• Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues)
Conserved regions (2)
[Figure labels: CSA_PPIASE, TPR, TPR, TPR]
PPID family: 1 CSA_PPIASE domain + 3 TPR repeats
Cys 181: active site residue
Binding cleft (motif)
http://www.expasy.org/tools/scanprosite/
Functionally and structurally relevant residues in PROSITE motif descriptors
A new concept to extract more information from profiles
Principle:
• Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
• Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
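The tagging principle can be sketched in a few lines of Python. This is a deliberate simplification and not the ProRule implementation: it assumes the profile match is ungapped, so that a tagged profile position maps directly onto a position in the matched sequence region. The function name, the tag format (1-based position mapped to the residues accepted there) and the example data are all hypothetical.

```python
from typing import Dict, List, Tuple


def check_tagged_residues(matched_region: str,
                          tags: Dict[int, str]) -> List[Tuple[int, str, bool]]:
    """Check tagged positions against a matched sequence region.

    matched_region: the part of the sequence matched by the profile
                    (assumed ungapped in this sketch).
    tags: 1-based position in the match -> residues accepted there.
    Returns (position, observed residue, is_expected) for every tag;
    positions outside the match are reported with residue "-".
    """
    results = []
    for position in sorted(tags):
        if 1 <= position <= len(matched_region):
            residue = matched_region[position - 1]
        else:
            residue = "-"  # tag falls outside the matched region
        results.append((position, residue, residue in tags[position]))
    return results


if __name__ == "__main__":
    # Invented example: a Cys expected at position 3, Ser or Thr at position 5.
    for position, residue, ok in check_tagged_residues("GACDE", {3: "C", 5: "ST"}):
        status = "present" if ok else "absent"
        print(f"position {position}: found {residue} ({status})")
```

A feature such as an active site would then be annotated only when every residue tagged for it is present in the matched sequence.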
ProRule
Aim:
• Provide users with biologically meaningful functional and structural information: active sites, post-translational modification sites, binding sites, disulfide bonds, transmembrane regions.
• Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity: domain name and boundaries, keywords and linked GO terms, EC numbers, false negative PROSITE patterns.
Sigrist et al.: Bioinformatics 21:4060-4066(2005)
www.expasy.ch/prosite/prorule.html
Other methods for protein/domain identification
Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden Markov Models (HMM), Probabilistic models;
PRINTS: “Unweighted” matrices; protein fingerprints
BLOCKS: Weight matrix derived from ungapped alignments;
PIRSF, SUPERFAMILY: classification system based on evolutionary relationship of whole proteins
ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.
The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein Families, Domains and Functional Sites
• Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www servers;
• Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL)
• Has progressively incorporated other databases
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProtKB/Swiss-Prot + TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites. Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.
http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
InterPro: Graphical domain representation
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
The ExPASy www server
• First molecular biology server on the Web (August 1993); ~500 million accesses since;
• Dedicated to proteomics:
– Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
– Many 2D/MS protein identification/characterization and sequence analysis tools;
• Mirror sites in Australia, Brazil, Canada, China and Korea: http://{au|br|ca|cn|kr|www}.expasy.org
ExPASy software tools
• Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.);
• Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.);
• Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.);
• 3D-structure analysis and display tools (Swiss-Model, Swiss-PDBviewer)
Identification: Aldente, TagIdent, AACompIdent, MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph, PeptideCutter, ProtScale, ProtParam
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyper-links between tools and databases
http://www.expasy.org/tools/
http://www.expasy.org/links.html
Finding out about recent developments:
UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/
What’s new on ExPASy: http://www.expasy.org/history.html
References (1)
UniProtKB/Swiss-Prot: http://www.expasy.org/sprot/sprot-ref.html
Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191(2006).
Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-99(2005).
Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55(2004).
Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550(2004).
Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55(2001).
References (2)
PROSITE:
Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).
Sigrist C.J.A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3:265-274(2002).
Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108(2002).
Sigrist C.J.A. et al. ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-4066(2005).
ExPASy:
Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788(2003).
Useful general publications
• Nucleic Acids Res. Database issue 2006, vol. 34, supplement 1: http://nar.oupjournals.org/content/vol34/suppl_1/
• Nucleic Acids Res. Web server issue 2005, vol. 33, supplement 2: http://nar.oupjournals.org/content/vol33/suppl_2/
• Book: Bioinformatics for Dummies, by J.-M. Claverie and C. Notredame. Publisher: For Dummies; 2nd edition (December, 2006). ISBN: 0764516965
Take home message
• We need your [email protected]
Or via the website
Before the introduction to Swiss-Prot/ExPASy…
After the introduction to Swiss-Prot /ExPASy …
Some practical exercises:
http://education.expasy.org/cours/Tunis/
1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets