
MGM workshop. 19 Oct 2010

Functional annotation

Datasources

Konstantinos Mavrommatis

Head of Omics group

DOE-JGI

[email protected]@lbl.gov


Outline

Genome annotation (Functional)

How do we know it is correct?

How do we do it?

Data collections

Protein families

Pathway collections


Genome annotation: The process of identifying the locations and functions of coding sequences.

cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)

molecular/enzymatic (methyltransferase)

Reaction (methylation)

Substrate (cobalt-precorrin-4)

Ligand (S-adenosyl-L-methionine)

metabolic (cobalamin biosynthesis)

physiological (maintenance of healthy nerve and red blood cells, through B12).
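The levels of function listed above can be kept together as one record per gene product. The following is only a sketch of such a record; the class name `FunctionalAnnotation`, its field names and the helper `levels` are assumptions made for illustration and do not come from IMG or any other annotation system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalAnnotation:
    """One gene product described at several levels of function (hypothetical layout)."""
    product: str
    molecular: str
    reaction: str
    substrate: str
    ligand: str
    metabolic: str
    physiological: str


# The CbiF example from the slide, one field per level.
cbif = FunctionalAnnotation(
    product="cobalt-precorrin-4 methyltransferase (CbiF)",
    molecular="methyltransferase",
    reaction="methylation",
    substrate="cobalt-precorrin-4",
    ligand="S-adenosyl-L-methionine",
    metabolic="cobalamin biosynthesis",
    physiological="maintenance of healthy nerve and red blood cells, through B12",
)


def levels(annotation):
    """Return (level, value) pairs, from the molecular level to the physiological one."""
    order = ["molecular", "reaction", "substrate", "ligand", "metabolic", "physiological"]
    return [(name, getattr(annotation, name)) for name in order]


for name, value in levels(cbif):
    print(f"{name}: {value}")
```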


Functional annotation helps make sense out of nonsense

But it only directs us to the potential of the organism


Function prediction is mainly based on homology detection

Homology implies a common evolutionary origin, not retention of similarity in any of the proteins' properties.

Homology ≠ similarity of function.

Function transfer by homology

Conservative amino acid substitution

Low complexity region

Gap (insertion or deletion)
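The features labelled above (identical positions, conservative substitutions, gaps) can be counted directly from a pairwise alignment. This is a minimal sketch, not how BLAST scores an alignment: the residue groups in `CONSERVATIVE_GROUPS` are a simplified assumption standing in for a substitution matrix such as BLOSUM62, and the function name `alignment_stats` is hypothetical.

```python
# Simplified, assumed grouping of chemically similar residues. Real tools use a
# substitution matrix (for example BLOSUM62) rather than fixed groups.
CONSERVATIVE_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def alignment_stats(query, subject):
    """Count identities, conservative substitutions and gap columns.

    Both arguments are rows of the same alignment, with '-' marking a gap.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = conservative = gaps = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
        elif any(q in group and s in group for group in CONSERVATIVE_GROUPS):
            conservative += 1
    return {
        "length": len(query),
        "identities": identities,
        "conservative": conservative,
        "gaps": gaps,
    }


# A 17-column fragment taken from the BLAST alignment shown later in this deck.
stats = alignment_stats("VWVHDPENIASGLVSGD", "VWVHDE--IPGRLVSGD")
print(stats)
```

Percent identity follows as `stats["identities"] / stats["length"]`. Note that the denominator includes gap columns, as in the BLAST summary line.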


Function transfer based on homology is error prone

Punta & Ofran. PLOS Comp Biol. 2008


Limits in transfer of annotation based on homology

Punta & Ofran. PLOS Comp Biol. 2008


If no similarity is detected use alternative methods to predict function

Subcellular localization

Gene context

Special sequence motifs and features

[Figure labels: Cytoplasm, S ~ S, Periplasm]


Annotation should make sense in the context of the cell metabolism

Model pathway: Substrate A, Substrate B, Substrate C, Substrate D, linked by Enzyme 1, Enzyme 2 and Enzyme 3

Genome annotation: Enzyme 1 (?), Enzyme 2 (?), Enzyme 3 (✓)


Annotation should make sense.

Missing genes may be present.
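Checking an annotation against a model pathway amounts to asking which steps have no annotated enzyme. The sketch below assumes a linear pathway with the substrate and enzyme names used on the slide; the function names and the `(substrate, enzyme, product)` step layout are hypothetical.

```python
def missing_steps(pathway, annotated_enzymes):
    """Return the pathway steps whose enzyme was not found in the genome annotation.

    pathway: ordered list of (substrate, enzyme, product) tuples.
    annotated_enzymes: set of enzyme names present in the annotation.
    """
    return [step for step in pathway if step[1] not in annotated_enzymes]


def pathway_coverage(pathway, annotated_enzymes):
    """Fraction of pathway steps that have an annotated enzyme."""
    if not pathway:
        return 0.0
    missing = missing_steps(pathway, annotated_enzymes)
    return (len(pathway) - len(missing)) / len(pathway)


# Assumed linear model pathway: A -> B -> C -> D.
model_pathway = [
    ("Substrate A", "Enzyme 1", "Substrate B"),
    ("Substrate B", "Enzyme 2", "Substrate C"),
    ("Substrate C", "Enzyme 3", "Substrate D"),
]

# Only Enzyme 3 was confidently annotated in this made-up genome.
annotated = {"Enzyme 3"}

for substrate, enzyme, product in missing_steps(model_pathway, annotated):
    print(f"no gene found for {enzyme} ({substrate} -> {product})")
```

A step reported here is a candidate for a second look (weaker homologs, gene context) rather than proof that the gene is absent.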


Helps prediction

Is error prone.

Has to make sense.

Genome annotation: The process of identifying the locations and functions of coding sequences.


There are multiple datasources to help organize information and facilitate annotation

Sequence databases

Protein classification databases

Specialized databases


Primary databases store raw information from various sources

EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)

Archive containing all sequences from all sources

GenBank/UniProt contain translations of sequences.

Year   Base pairs       Sequences
2004   44,575,745,176   40,604,319
2005   56,037,734,462   52,016,762
2006   69,019,290,705   64,893,747
2007   83,874,179,730   80,388,382
2008   99,116,431,942   98,868,465
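The growth in these figures can be summarised as a fold increase and an average yearly growth rate. The sketch below uses only the numbers from the table; the names `GENBANK_GROWTH`, `fold_increase` and `annual_growth_rate` are made up for this example.

```python
# year -> (base pairs, sequences), copied from the table above
GENBANK_GROWTH = {
    2004: (44_575_745_176, 40_604_319),
    2005: (56_037_734_462, 52_016_762),
    2006: (69_019_290_705, 64_893_747),
    2007: (83_874_179_730, 80_388_382),
    2008: (99_116_431_942, 98_868_465),
}

BASE_PAIRS, SEQUENCES = 0, 1


def fold_increase(first_year, last_year, column=BASE_PAIRS):
    """Ratio of the last year's value to the first year's value."""
    return GENBANK_GROWTH[last_year][column] / GENBANK_GROWTH[first_year][column]


def annual_growth_rate(first_year, last_year, column=BASE_PAIRS):
    """Compound annual growth rate between two years, as a fraction."""
    years = last_year - first_year
    return fold_increase(first_year, last_year, column) ** (1 / years) - 1


print(f"base pairs, 2004 to 2008: x{fold_increase(2004, 2008):.2f}")
print(f"average yearly growth:    {annual_growth_rate(2004, 2008):.1%}")
```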


Primary databases accumulate errors in sequences and annotations

In the sequences themselves:

Sequencing errors.

Cloning vector sequences.

In the annotations:

Inaccuracies, omissions, and even mistakes.

Inconsistencies between some fields.

Redundancy.


IMG is using RefSeq as its primary source

[Figure labels: Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, Genome Assembly]


Protein families use different methods to classify proteins

COG/KOG

Pfam

TIGRfam

KEGG Orthologs

InterPro


What are COGs/KOGs? How much can I trust them?

Reciprocal best hit (bidirectional best hit)

BLAST best hit (unidirectional best hit)

COG1

COG2
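The difference between a unidirectional and a reciprocal (bidirectional) best hit can be made concrete with a few lines of code. This is a sketch of the general idea only, not the COG construction procedure; the hit tuples, gene names and function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples from one search direction.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up search results between genome A and genome B.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b1", 90.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0)]

print(best_hits(a_vs_b))                     # unidirectional best hits
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # bidirectional best hits only
```

In this toy data `a2` has `b1` as its best hit, but `b1` prefers `a1`, so the `a2`/`b1` pair is only a unidirectional best hit and deserves less trust.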

>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460

Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query: 4   SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
            FPK F +G + A FQ E          +DW VWVHD   I   LVSGD PE    ++
Sbjct: 3   KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query: 64  YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
           Y+     A +MGL+  R ++EWSRIFP
Sbjct: 61  YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94

Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
             N++ +R Y  +F +LKARGI   +  YH+ LPLW+  P            GW + +TV
Sbjct: 95  --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142

Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
             FAR+AA    +F D    + T NEPNVV   GY+    G PP  ++ + + +V  +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
            AHA A  A+K I+ K  +GII   +   PL+DK  D KA E A+      F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query: 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
                         L  +   DL    +  +D+IG+NYY+ + VK         + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317


Pfam families are based on the detection of domains

http://pfam.sanger.ac.uk

HMMs built from protein alignments: local (covering a domain) or global (covering the whole protein)


TIGRfam

Full length alignments.

Domain alignments.

Equivalogs: families of proteins with specific function.

Superfamilies: families of homologous genes.

HMMs

http://www.tigr.org/TIGRFAMs/


Hits to other models

How can we search Pfam and TIGRfam?

Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence               Description
    ------- ------ -----    ------- ------ -----   ---- --  --------               -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84

  Alignments for each domain:
  == domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
                             fs+   g+v+si+ w  l   ++     + e gfr +   iev v+g l  v +d
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
                             899999*********98877765....3569*********************99864 PP

• GA Gathering method: Search threshold to build the full alignment.
• TC Trusted Cutoff: Lowest sequence score and domain score of match in the full alignment.
• NC Noise Cutoff: Highest sequence score and domain score of match not in full alignment.

Noise cutoff

Gathering cutoff

Trusted cutoff
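One way to read the three cutoffs is as boundaries on the bit score of a hit. The sketch below follows the definitions given above; the function name, the category labels and the threshold values in the example are assumptions for illustration and are not the curated cutoffs of PF02327 or any other model.

```python
def classify_hit(score, noise_cutoff, gathering_cutoff, trusted_cutoff):
    """Place a bit score relative to the NC, GA and TC thresholds of a model.

    Assumes NC <= GA <= TC, which follows from the definitions: TC is the lowest
    score inside the full alignment, NC the highest score outside it.
    """
    if not noise_cutoff <= gathering_cutoff <= trusted_cutoff:
        raise ValueError("expected noise <= gathering <= trusted cutoff")
    if score >= trusted_cutoff:
        return "trusted"
    if score >= gathering_cutoff:
        return "member"
    if score > noise_cutoff:
        return "uncertain"
    return "noise"


# Hypothetical thresholds, chosen only to illustrate the four outcomes.
NC, GA, TC = 24.0, 25.0, 25.5

# 11.2 is the full-sequence score of the hit shown in the search output above.
for score in (11.2, 24.5, 25.2, 30.0):
    print(score, classify_hit(score, NC, GA, TC))
```

With these assumed thresholds the glycoside hydrolase hit to BChl_A falls in the noise range, even though its E-value looks small at first glance.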


InterPro. Composite pattern databases

To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource – InterPro

Release 30.0 (Dec 2010) contains 21178 entries

Central annotation resource, with pointers to its satellite dbs

http://www.ebi.ac.uk/interpro/


KEGG orthology

Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793

E-value < 10^-5; rank ≤ 5; ≥ 70% of query length; ≥ 30% identity
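The four thresholds can be applied as a simple filter over search hits. This is a sketch under stated assumptions: "query length" is read here as the fraction of the query covered by the alignment, and taking the KO term of the best-ranked passing hit is a choice made for this example, not a description of the published method. The `Hit` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One search hit of a query protein against a KO-annotated protein (hypothetical layout)."""
    ko: str                        # KO term carried by the subject protein
    evalue: float
    rank: int                      # 1 = best hit for this query
    aligned_query_fraction: float  # aligned length / query length, 0..1
    identity: float                # fraction of identical positions, 0..1


def passes_ko_criteria(hit: Hit) -> bool:
    """Apply the four thresholds listed on the slide."""
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and hit.aligned_query_fraction >= 0.70
        and hit.identity >= 0.30
    )


def assign_ko(hits: List[Hit]) -> Optional[str]:
    """Return the KO term of the best-ranked hit that passes, or None (assumed policy)."""
    passing = [hit for hit in hits if passes_ko_criteria(hit)]
    if not passing:
        return None
    return min(passing, key=lambda hit: hit.rank).ko


# Made-up hits for one query protein.
hits = [
    Hit(ko="K00001", evalue=1e-3, rank=1, aligned_query_fraction=0.95, identity=0.60),
    Hit(ko="K00002", evalue=1e-40, rank=2, aligned_query_fraction=0.85, identity=0.45),
    Hit(ko="K00003", evalue=1e-30, rank=3, aligned_query_fraction=0.50, identity=0.50),
]
print(assign_ko(hits))
```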


ENZYME


Pathway collections: KEGG

Contains information about biochemical pathways, and protein interactions.

http://www.kegg.com


Pathway collections: MetaCyc


Functional annotation

http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf


RNA structural and functional annotation are coupled

SILVA alignments of rRNAs are used to generate models

Covariance models for each RNA class are used to predict genes


There is a plethora of specialized databases that one needs to search

http://www.oxfordjournals.org/nar/database/c


In most cases databases are interconnected but …

[Figure: interconnected databases: SWISS-PROT, ENZYME, PDB, HSSP, SWISSNEW, YPDREF, YPD, PDBFINDER, ALI, DSSP, FSSP, NRL_3D, PMD, PIR, ProtFam, FlyGene, TFSITE, TFACTOR, EMBL, TrEMBL, ECDC, TrEMBLNEW, EMNEW, EPD, GenBank, MOLPROBE, OMIM, MIMMAP, REBASE, PROSITE, ProDom, PROSITEDOC, Blocks, SWISSDOM]

... not all databases are updated regularly.

Changes of annotation in one database are not reflected in others


There are multiple datasources to help organize information and facilitate annotation

Sequence databases: contain sequences deposited by various sources

Protein classification databases: utilize sequence homology or other criteria to group together proteins (COG, Pfam, TIGRfam, InterPro, KO terms)

Specialized databases: start by searching for available resources


Question?

Genome annotation (Functional)

How do we know it is correct?

How do we do it?

Data collections

Protein families

Pathway collections
