MGM workshop. 19 Oct 2010
Functional annotation
Datasources
Konstantinos Mavrommatis
Head of Omics group
DOE-JGI
[email protected]@lbl.gov
Outline
Genome annotation (Functional)
How do we know it is correct?
How do we do it?
Data collections
Protein families
Pathway collections
Genome annotation: The process of identifying the locations and functions of coding sequences.
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
Molecular/enzymatic: methyltransferase
Reaction: methylation
Substrate: cobalt-precorrin-4
Ligand: S-adenosyl-L-methionine
Metabolic: cobalamin biosynthesis
Physiological: maintenance of healthy nerve and red blood cells, through B12
Functional annotation helps make sense out of nonsense
But it only directs us to the potential of the organism
Function prediction is mainly based on homology detection
Homology implies a common evolutionary origin, not retention of similarity in any of the homologs' properties.
Homology ≠ similarity of function.
Function transfer by homology
Conservative amino acid substitution
Low complexity region
Gap (insertion or deletion)
Function transfer based on homology is error prone
Punta & Ofran. PLOS Comp Biol. 2008
Limits in transfer of annotation based on homology
Punta & Ofran. PLOS Comp Biol. 2008
If no similarity is detected use alternative methods to predict function
Subcellular localization
Gene context
Special sequence motifs features
[Figure labels: Cytoplasm, S~S, Periplasm]
Annotation should make sense in the context of the cell metabolism
Model pathway: Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 1, Enzyme 2, Enzyme 3
Genome annotation: Enzyme 1 ?, Enzyme 2 ?, Enzyme 3 ✓
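The idea that annotation should make sense in the context of metabolism can be sketched as a simple consistency check. The snippet below is only an illustration: the linear order of the pathway, the `MODEL_PATHWAY` structure and the `missing_steps` helper are assumptions made for the example, not part of any annotation pipeline.

```python
# Hypothetical model pathway: (substrate, enzyme, product) for each step.
# The linear A -> B -> C -> D order is an assumption for illustration.
MODEL_PATHWAY = [
    ("Substrate A", "Enzyme 1", "Substrate B"),
    ("Substrate B", "Enzyme 2", "Substrate C"),
    ("Substrate C", "Enzyme 3", "Substrate D"),
]


def missing_steps(pathway, annotated_enzymes):
    """Return the pathway steps whose enzyme is absent from the annotation."""
    return [step for step in pathway if step[1] not in annotated_enzymes]


# Assumed annotation result: only Enzyme 3 was found in the genome.
annotated = {"Enzyme 3"}
gaps = missing_steps(MODEL_PATHWAY, annotated)

for substrate, enzyme, product in gaps:
    print(f"{enzyme} ({substrate} -> {product}) is not annotated: "
          "candidate missing gene")
```

A pathway with one confirmed step and two unexplained gaps is a hint to look again for the missing genes rather than proof that the pathway is absent.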
Annotation should make sense.
Missing genes may be present.
Genome annotation: The process of identifying the locations and functions of coding sequences.
Helps prediction.
Is error prone.
Has to make sense.
There are multiple datasources to help organize information and facilitate annotation
Sequence databases
Protein classification databases
Specialized databases
Primary databases store raw information from various sources
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
Archive containing all sequences from all sources
GenBank/UniProt contain translations of sequences.
Year  Base pairs      Sequences
2004  44,575,745,176  40,604,319
2005  56,037,734,462  52,016,762
2006  69,019,290,705  64,893,747
2007  83,874,179,730  80,388,382
2008  99,116,431,942  98,868,465
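To put the growth figures listed above in perspective, a short calculation can turn them into fold changes and mean record lengths. This is a minimal sketch: the numbers are copied from the slide, while the names `GENBANK_TOTALS`, `fold_change` and `mean_length` are made up for the example.

```python
# Totals as listed on the slide: year -> (base pairs, sequences).
GENBANK_TOTALS = {
    2004: (44_575_745_176, 40_604_319),
    2005: (56_037_734_462, 52_016_762),
    2006: (69_019_290_705, 64_893_747),
    2007: (83_874_179_730, 80_388_382),
    2008: (99_116_431_942, 98_868_465),
}


def fold_change(start_year, end_year, column):
    """Ratio of one column (0 = base pairs, 1 = sequences) between two years."""
    return GENBANK_TOTALS[end_year][column] / GENBANK_TOTALS[start_year][column]


def mean_length(year):
    """Average number of base pairs per sequence record in a given year."""
    base_pairs, sequences = GENBANK_TOTALS[year]
    return base_pairs / sequences


print(f"Base pairs, 2004 to 2008: {fold_change(2004, 2008, 0):.2f}-fold")
print(f"Sequences, 2004 to 2008: {fold_change(2004, 2008, 1):.2f}-fold")
for year in sorted(GENBANK_TOTALS):
    print(f"{year}: mean record length {mean_length(year):.0f} bp")
```

The number of records grows faster than the number of base pairs, so the mean record length falls over the period.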
Primary databases accumulate errors in sequences and annotations
In the sequences themselves: sequencing errors; cloning vector sequences.
In the annotations: inaccuracies, omissions, and even mistakes; inconsistencies between some fields; redundancy.
IMG is using RefSeq as its primary source
[Figure: sequence fragments in GenBank, UniGene, RefSeq, and Genome Assembly; labels: Labs, Curators, Algorithms]
Protein families use different methods to classify proteins
COG/KOG
Pfam
TIGRfam
KEGG Orthologs
InterPro
What are COGs/KOGs? How much can I trust them?
Reciprocal best hit: bidirectional best hit
BLAST best hit: unidirectional best hit
[Figure labels: COG1, COG2]
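The difference between a unidirectional and a reciprocal (bidirectional) best hit can be sketched in a few lines. This is an illustration only: the scores are invented, and `best_hits` and `reciprocal_best_hits` are hypothetical helper names, not functions of any COG tool.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Invented scores between proteins of genome A and genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90,
          ("a2", "b1"): 150, ("a2", "b2"): 40}
b_vs_a = {("b1", "a1"): 198, ("b1", "a2"): 149,
          ("b2", "a1"): 88, ("b2", "a2"): 35}

print("Unidirectional best hits:", best_hits(a_vs_b))
print("Reciprocal best hits:", reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data both a1 and a2 have b1 as their best hit, but only the a1/b1 pair is reciprocal, which is why a unidirectional best hit deserves less trust.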
>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)
Query: 4   SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
            FPK F +G + A FQ E          +DW VWVHD   I   LVSGD PE    ++
Sbjct: 3   KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60
Query: 64  YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
           Y+     A +MGL+  R ++EWSRIFP
Sbjct: 61  YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94
Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
             N++ +R Y  +F +LKARGI   +  YH+ LPLW+  P            GW + +TV
Sbjct: 95  --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142
Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
             FAR+AA    +F D    + T NEPNVV   GY+    G PP  ++ + + +V  +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200
Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
            AHA A  A+K I+ K  +GII   +   PL+DK  D KA E A+      F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260
Query: 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
                         L  +   DL    +  +D+IG+NYY+ + VK         + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
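The Identities, Positives and Gaps figures in the BLAST report above are simple ratios over the alignment length. The sketch below recomputes them. Note that the reported percentages match truncation rather than rounding (176/503 is about 34.99%, reported as 34%); this is an inference from these numbers, and the helper name `truncated_percent` is made up for the example.

```python
def truncated_percent(count, alignment_length):
    """Integer percentage of alignment columns, truncated (not rounded)."""
    return count * 100 // alignment_length


ALIGNMENT_LENGTH = 503  # total alignment columns, gaps included

for label, count in [("Identities", 176), ("Positives", 251), ("Gaps", 75)]:
    percent = truncated_percent(count, ALIGNMENT_LENGTH)
    print(f"{label} = {count}/{ALIGNMENT_LENGTH} ({percent}%)")
```

An identity of about one third over an alignment with 14% gaps is well inside the range where function transfer should be treated with caution, even though the E-value is extremely low.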
Pfam families are based on the detection of domains
http://pfam.sanger.ac.uk
HMMs of protein alignments: local (for domains) or global (covering the whole protein)
TIGRfam
Full length alignments.
Domain alignments.
Equivalogs: families of proteins with specific function.
Superfamilies: families of homologous genes.
HMMs
http://www.tigr.org/TIGRFAMs/
How can we search Pfam and TIGRfam?
Hits to other models
Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence               Description
    ------- ------ -----    ------- ------ -----   ---- --  --------               -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1
Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84
Alignments for each domain:
== domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
                           fs+   g+v+si+ w  l   ++     + e gfr +   iev v+g l  v +d
tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
                           899999*********98877765....3569*********************99864 PP
•GA Gathering method: Search threshold to build the full alignment.
•TC Trusted Cutoff: Lowest sequence score and domain score of match in the full alignment.
•NC Noise Cutoff: Highest sequence score and domain score of match not in full alignment.
Noise cutoff
Gathering cutoff
Trusted cutoff
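The three cutoffs can be read as an ordered set of bit-score thresholds (noise, then gathering, then trusted). The sketch below shows how a hit score might be compared against them. The cutoff values are invented for illustration and are not the real thresholds of any family, and `FamilyCutoffs` and `classify_hit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FamilyCutoffs:
    """Per-family bit-score thresholds (values used below are invented)."""
    noise: float      # NC: highest score of a match not in the full alignment
    gathering: float  # GA: threshold used to build the full alignment
    trusted: float    # TC: lowest score of a match in the full alignment


def classify_hit(score, cutoffs):
    """Place a bit score relative to the trusted and gathering thresholds."""
    if score >= cutoffs.trusted:
        return "at or above trusted cutoff"
    if score >= cutoffs.gathering:
        return "passes gathering threshold"
    return "below gathering threshold"


# Invented cutoffs for a hypothetical family.
example_cutoffs = FamilyCutoffs(noise=19.0, gathering=20.0, trusted=20.5)

for score in (35.0, 20.2, 10.5):
    print(f"score {score}: {classify_hit(score, example_cutoffs)}")
```

Under these assumed thresholds a weak hit such as the 10.5-bit domain score in the search output would not be accepted as a family member, even though its E-value looks small.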
InterPro. Composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
Release 30.0 (Dec10) contains 21178 entries
Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/
KEGG orthology
Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793
E-value < 10^-5; rank ≤ 5; ≥ 70% query length; ≥ 30% identity
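The thresholds listed on this slide can be applied as a simple filter over similarity search hits. This is a minimal sketch under stated assumptions: "70% query length" is read here as the fraction of the query covered by the alignment, and the `BlastHit` fields and `passes_thresholds` function are hypothetical names rather than part of any published tool.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    evalue: float
    rank: int               # 1 = best hit for this query
    aligned_query_len: int  # query residues covered by the alignment
    query_len: int
    identity_pct: float


def passes_thresholds(hit):
    """Apply the four slide criteria (coverage reading is an assumption)."""
    coverage = hit.aligned_query_len / hit.query_len
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and coverage >= 0.70
        and hit.identity_pct >= 30.0
    )


# Invented hits for illustration.
good_hit = BlastHit(evalue=1e-40, rank=1, aligned_query_len=300,
                    query_len=320, identity_pct=45.0)
short_hit = BlastHit(evalue=1e-12, rank=2, aligned_query_len=100,
                     query_len=320, identity_pct=60.0)

print(passes_thresholds(good_hit))
print(passes_thresholds(short_hit))
```

The second hit is significant and highly identical but covers less than a third of the query, so it is rejected; this kind of partial, domain-only match is a common source of wrong function transfer.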
ENZYME
Pathway collections: KEGG
Contains information about biochemical pathways, and protein interactions.
http://www.kegg.com
Pathway collections: MetaCyc
Functional annotation
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
RNA structural and functional annotation are coupled
SILVA alignments of rRNAs are used to generate models
Covariance models for each RNA class are used to predict genes
There is a plethora of specialized databases that one needs to search
http://www.oxfordjournals.org/nar/database/c
In most cases databases are interconnected but …
SWISS-PROT
ENZYME
PDB
HSSP
SWISSNEW
YPDREF
YPD
PDBFINDER
ALI
DSSP
FSSP
NRL_3D
PMD
PIR
ProtFam
FlyGene
TFSITE
TFACTOR
EMBL
TrEMBL
ECDC
TrEMBLNEW
EMNEW
EPD
GenBank MOLPROBE
OMIM
MIMMAP
REBASE
PROSITE ProDom
PROSITEDOC
Blocks
SWISSDOM
... not all databases are updated regularly.
Changes of annotation in one database are not reflected in others.
There are multiple datasources to help organize information and facilitate annotation
Sequence databases: contain sequences deposited by various sources
Protein classification databases: utilize sequence homology or other criteria to group together proteins (COG, Pfam, TIGRfam, InterPro, KO terms)
Specialized databases: start by searching for available resources
Question?
Genome annotation (Functional)
How do we know it is correct?
How do we do it?
Data collections
Protein families
Pathway collections