modeling functional genomics datasets cvm8890-101

Modeling Functional Genomics Datasets CVM8890-101, Lesson 2, 13 June 2007, Teresia Buza

DESCRIPTION

Modeling Functional Genomics Datasets CVM8890-101. Lesson 2, 13 June 2007, Teresia Buza. Lesson 2: Introduction to functional annotation. Orthologs and homologs; clusters of orthologous genes (COGs) and the gene ontology (GO); and how to find what functional annotation is available.

TRANSCRIPT

Page 1: Modeling Functional Genomics Datasets CVM8890-101

Modeling Functional Genomics Datasets

CVM8890-101

Lesson 2

13 June 2007, Teresia Buza

Page 2: Modeling Functional Genomics Datasets CVM8890-101

Lesson 2: Introduction to functional annotation.

Orthologs and homologs; clusters of orthologous genes (COGs) and the gene ontology (GO); and how to find what functional annotation is available.

Page 3: Modeling Functional Genomics Datasets CVM8890-101

1. Introduction to Functional Annotation

Page 4: Modeling Functional Genomics Datasets CVM8890-101

Where are we?

Diagram: ATGTCCTATCCATGTCGTACAGATTGACGAGAT. What is all this?
Genomic hypothesis / Central Dogma: Genome, Gene; Transcriptome, mRNA transcript; Proteome, Protein.
New technology: Genome sequencing; Transcript profiling; Protein quantification.
What next? Structural annotation; Functional annotation.

Page 5: Modeling Functional Genomics Datasets CVM8890-101

Genome Annotation

Biologists refer to both the annotation of the genome and functional annotation of gene products:

“Structural” Annotation & “Functional” Annotation

Page 6: Modeling Functional Genomics Datasets CVM8890-101

Structural & Functional Annotation

Structural annotation

Identification of genomic elements.
• ORFs predicted during genome assembly
• Location of ORFs
• Gene structure
• Coding regions
• Location of regulatory motifs, etc.

Functional annotation

Attaching biological information to genomic elements.
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression, etc.

These steps may involve both biological experiments and in silico analysis.

http://en.wikipedia.org/wiki/Genome_annotation#Genome_annotation (with modifications)
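The structural-annotation step "location of ORFs" listed on this slide can be illustrated with a minimal sketch. It scans only the forward strand of the example sequence shown on the "Where are we?" slide, assumes the standard genetic code, and assumes the input contains only A, C, G and T. The function name `find_orfs` is hypothetical; real gene-prediction pipelines do considerably more than this.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def find_orfs(dna):
    """Return (start, end, peptide) for each ATG...stop ORF on the forward strand.

    Assumption: dna contains only A, C, G, T (any other letter raises KeyError).
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        peptide = []
        # Walk codon by codon in the frame set by this start codon.
        for pos in range(start, len(dna) - 2, 3):
            amino_acid = CODON_TABLE[dna[pos:pos + 3]]
            if amino_acid == "*":  # stop codon closes the ORF
                orfs.append((start, pos + 3, "".join(peptide)))
                break
            peptide.append(amino_acid)
    return orfs


print(find_orfs("ATGTCCTATCCATGTCGTACAGATTGACGAGAT"))
# [(0, 27, 'MSYPCRTD')]
```

The second ATG in the sequence (position 11) never reaches a stop codon, so it is not reported as a complete ORF.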

Page 7: Modeling Functional Genomics Datasets CVM8890-101

Why Functional Annotation?

Enables you to take large “laundry lists” of genes/proteins and turn them into a biologically useful model.

Page 8: Modeling Functional Genomics Datasets CVM8890-101

Functional Annotation

• Annotation of gene products = Gene Ontology (GO) annotation
• Initially, predicted ORFs have no functional literature, and GO annotation relies on computational methods (rapid, but quantity vs quality?)
• Functional literature exists for many genes/proteins prior to genome sequencing (annotating from it is slow but provides high-quality annotations)
• GO annotation does not rely on a completed genome sequence!

Page 9: Modeling Functional Genomics Datasets CVM8890-101

Types of Functional Annotation

Based on direct experimental evidence of function
Experiments in the same organism, for example:
• Enzyme assays
• Binding experiments
• Pathway analysis
• Synthetic lethals
• Functional complementation
• Gene mutations
• RNAi
• 2-hybrid interactions, etc.

Indirect evidence of function
• Expression analysis
• Structure analysis
• Sequence analysis

Page 10: Modeling Functional Genomics Datasets CVM8890-101

Functional Annotation

Problem:
• Many genes/proteins have no annotation
• Some have unknown functions

Challenge:
• We want to get the maximum functional annotation for modeling our data

Solution:
• Read papers (PubMed etc.)
• Search for homologs/orthologs of known function
• Homologs and orthologs help assign function

Page 11: Modeling Functional Genomics Datasets CVM8890-101

2. Finding Function: orthologs and homologs

Page 12: Modeling Functional Genomics Datasets CVM8890-101

What are Homologs, Orthologs, Paralogs?

Homolog

Homologs are genes related by descent from a common ancestral gene, separated by either a speciation event or a gene duplication event.

Ortholog

Orthologs are homologous genes in different species that evolved from a common ancestral gene by speciation. Normally (but not always), orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.

Paralog

Paralogs are homologous genes related by duplication within a genome. Paralogs often evolve new functions, even if these are related to the original one.

http://homepage.usask.ca/~ctl271/857/def_homolog.shtml

Page 13: Modeling Functional Genomics Datasets CVM8890-101

Orthologs & Paralogs

Figure: example gene tree with orthologs and paralogs marked (http://www.ensembl.org/info/data/compara/tree_example1.jpg)

Page 14: Modeling Functional Genomics Datasets CVM8890-101

How to search for orthology?

BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
• Sequence alignment search tool
• Uses a heuristic algorithm

MPsrch: http://www.ebi.ac.uk/MPsrch/
• Sequence comparison tool
• Implements the Smith & Waterman algorithm
• Uses an exhaustive algorithm

Domain analysis: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Analysis of regions of sequence homology among sets of proteins that are not all full-length homologs.
• Homology domains often, but not always, correspond to recognizable protein folding domains.

Protein family databases (e.g. COGs & KOGs)
• Superfamily: complete set of proteins having sequence homology over essentially their full length.
• Subfamilies: incomplete sets of homologous proteins, which may still encompass proteins of diverse function.
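To make the "exhaustive" Smith & Waterman approach mentioned above concrete, here is a minimal sketch of the local-alignment scoring recurrence. It is not how MPsrch itself is implemented: the match/mismatch/gap values are illustrative assumptions, the gap penalty is linear, and protein searches would normally use a substitution matrix instead of a flat match score. The function name is hypothetical.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Best local alignment score between sequences a and b.

    Fills the full dynamic-programming matrix row by row (exhaustive search),
    keeping only the previous row in memory. Scoring values are assumptions.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))  # 13
```

Every cell of the matrix is evaluated, so the cost grows with the product of the two sequence lengths. Heuristic tools such as BLAST avoid this by extending only from short seed matches, trading some sensitivity for speed.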

Page 15: Modeling Functional Genomics Datasets CVM8890-101

Systems for Functional Annotation

1. Clusters of Orthologous Groups (COGs): prokaryotes

2. euKaryote Orthologous Groups (KOGs): eukaryotes

3. Gene Ontology (GO)

Page 16: Modeling Functional Genomics Datasets CVM8890-101

COGs & KOGs

Both are based on orthology.
Genes are assigned to broad categories (A-Z).
Each category corresponds to an ancient conserved domain.

COGs: prokaryotes
KOGs: eukaryotes

Page 17: Modeling Functional Genomics Datasets CVM8890-101

Clusters of Orthologous Groups (COGs)
http://www.ncbi.nlm.nih.gov/COG/

COGs have 25 functional categories (A-Z) in four broad groups:

1. Information storage and processing
2. Cellular processes and signaling
3. Metabolism
4. Poorly characterized

Text search:

Page 18: Modeling Functional Genomics Datasets CVM8890-101

COGs Categories

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

ftp://ftp.ncbi.nih.gov/pub/COG/COG/fun.txt

Page 19: Modeling Functional Genomics Datasets CVM8890-101

COGs Categories (continued)

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

ftp://ftp.ncbi.nih.gov/pub/COG/COG/fun.txt
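The category letters listed on these two slides map naturally onto a lookup table. The sketch below shows one way to roll one-letter COG assignments up into the four broad groups. The input string is a made-up example, not data from the lecture, and the names `COG_GROUPS` and `count_by_group` are hypothetical.

```python
from collections import Counter

# One-letter COG categories grouped as on the slides (25 letters in total).
COG_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": "JAKLB",
    "CELLULAR PROCESSES AND SIGNALING": "DYVTMNZWUO",
    "METABOLISM": "CGEFHIPQ",
    "POORLY CHARACTERIZED": "RS",
}
LETTER_TO_GROUP = {
    letter: group for group, letters in COG_GROUPS.items() for letter in letters
}


def count_by_group(cog_letters):
    """Count COG assignments per broad group.

    Letters not in the table (for example '-' for genes without a COG)
    are counted under 'UNASSIGNED'.
    """
    return Counter(LETTER_TO_GROUP.get(letter, "UNASSIGNED") for letter in cog_letters)


# Hypothetical assignments for eight genes.
for group, count in count_by_group("JJKCEG-S").items():
    print(group, count)
```

Counts like these, computed separately for up- and down-regulated genes, are the kind of summary a per-category bar chart displays.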

Page 20: Modeling Functional Genomics Datasets CVM8890-101

Example 1

Classification of COGs by functional categories (figure)

Tatusov et al., 2000: The COG database: a tool for genome-scale analysis of protein functions and evolution

Page 21: Modeling Functional Genomics Datasets CVM8890-101

Example 2

Effects of antibiotics on the Pasteurella multocida transcriptome (Nanduri et al 2006)

Figure: three bar-chart panels (AMX, CTC, ENR), each showing Increase and Decrease bars on a 0 to 40 axis for the COG categories -, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V.

Page 22: Modeling Functional Genomics Datasets CVM8890-101

The Gene Ontology (GO)

• The Gene Ontology (GO) is the de facto standard for functional annotation
• GO functional annotation is based on orthology AND direct experimental evidence
• GO terms allow much more detailed functional analysis (> 24,000 terms) than COGs & KOGs (25 broad terms)
• GO is a controlled vocabulary of terms split into three related ontologies covering basic areas of molecular biology:
  molecular function: 8,123 terms
  biological process: 13,960 terms
  cellular component: 2,071 terms

GO Report 2007-04

Page 23: Modeling Functional Genomics Datasets CVM8890-101

Example 3

Functional Annotation of Chicken Proteomic data

Figure: bar chart of the number of GO terms (axis 0 to 350) per Cellular Component category: Nucleus, Cell, Cytoplasm, Mitochondrion, Plasma membrane, Cytosol, Cytoskeleton, Extracellular matrix, Nucleoplasm, Endoplasmic, Golgi apparatus, Intracellular, Endosome, Cytoplasmic, Chromosome, Nucleolus, Lysosome, Nuclear envelope, Extracellular space, Extracellular region, Cellular_component, Cilium, Nuclear chromosome, Ribosome, Peroxisome, Microtubule, Vacuole, Unlocalized protein.

Page 24: Modeling Functional Genomics Datasets CVM8890-101

Use GO for...

• Modeling function in high-throughput datasets (arrays!); GO was started by the fly, yeast and mouse databases (Ashburner et al 2000, 2001)
• Grouping gene products by biological function
• Determining which classes of gene products are over-represented or under-represented
• Focusing on particular biological pathways and functions (hypothesis-driven)
• Relating a protein’s location to its function
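One common way to decide whether a class of gene products is over-represented is a one-sided hypergeometric test. The slides do not name a specific test, so this is an assumed choice. The sketch asks: if `n` genes are drawn from a background of `N` genes, of which `K` carry a given GO term, how likely is it to see `k` or more annotated genes by chance? The function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    k: genes in the study set annotated to the term
    n: size of the study set
    K: genes in the background annotated to the term
    N: size of the background
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers: 12 of 150 study genes carry a term that
# 200 of 10,000 background genes carry.
print(enrichment_p_value(12, 150, 200, 10_000))
```

When many GO terms are tested at once, the resulting p-values would normally need a multiple-testing correction, which this sketch leaves out. Under-representation can be tested the same way by summing the lower tail instead.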

Page 25: Modeling Functional Genomics Datasets CVM8890-101

Annotating to the GO

• Need to show the type of evidence of function
  Literature curation: read and interpret reviewed literature (IDA, IGI, IMP, IPI, IGC) (TAS, NAS)
  Computational analysis (RCA, ISS, IEA)

http://www.geneontology.org/GO.evidence.shtml
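The evidence-code grouping on this slide can be expressed as a small lookup, which is handy when filtering annotations by how they were made. The sketch below follows the slide's two groups and also places IEP (listed with the manual codes on a later slide) under literature curation. The gene names in the example are hypothetical, and the function names are illustrative.

```python
# Evidence codes grouped as on the slide; IEP is added from the later flowchart slide.
LITERATURE_CURATION = {"IDA", "IGI", "IMP", "IPI", "IGC", "IEP", "TAS", "NAS"}
COMPUTATIONAL_ANALYSIS = {"RCA", "ISS", "IEA"}


def evidence_class(code):
    """Return the slide's evidence group for a GO evidence code."""
    code = code.strip().upper()
    if code in LITERATURE_CURATION:
        return "literature curation"
    if code in COMPUTATIONAL_ANALYSIS:
        return "computational analysis"
    return "unknown"


def split_by_evidence(annotations):
    """Group (gene, go_id, evidence_code) tuples by evidence group."""
    groups = {}
    for annotation in annotations:
        groups.setdefault(evidence_class(annotation[2]), []).append(annotation)
    return groups


# Hypothetical annotations (gene names are made up; GO:0005634 is nucleus).
example = [
    ("geneA", "GO:0005634", "IDA"),
    ("geneB", "GO:0005634", "IEA"),
    ("geneC", "GO:0005737", "ISS"),
]
for group, members in split_by_evidence(example).items():
    print(group, [gene for gene, _, _ in members])
```

Keeping the two groups separate makes it easy to check how much of a dataset's annotation rests on automated inference rather than curated literature.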

Page 26: Modeling Functional Genomics Datasets CVM8890-101

4. How to find functional annotation for your species

Page 27: Modeling Functional Genomics Datasets CVM8890-101

How to find functional annotation

For a quick search you need to know:
• Name of your species (e.g. Sus scrofa, Aspergillus flavus)
• Taxonomy ID (e.g. 9823 for S. scrofa, 5059 for A. flavus)
• Database to look in (e.g. NCBI, UniProt, EBI-GOA, GOC, AgBase)

Not all functional annotation for a species will be in one database!

Not very many species have broad coverage of GO annotation...

BUT do not worry:
• Searching for their homologs might help
• You may rely on manual annotation from literature (see the Manual Annotation course by Fiona McCarthy)
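Once an annotation file has been downloaded, filtering it by taxonomy ID is straightforward. The sketch below assumes the tab-delimited GO annotation file (GAF) layout with 15 leading columns, comment lines starting with "!", and a taxon column of the form `taxon:9823`. The records are toy data with made-up identifiers, and the helper names are hypothetical.

```python
GAF_COLUMNS = (
    "db", "object_id", "symbol", "qualifier", "go_id", "reference",
    "evidence", "with_from", "aspect", "name", "synonym", "object_type",
    "taxon", "date", "assigned_by",
)


def parse_gaf(text):
    """Yield one dict per annotation line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        yield dict(zip(GAF_COLUMNS, line.split("\t")))


def annotations_for_taxon(records, taxon_id):
    """Keep records whose taxon column mentions the given NCBI taxonomy ID."""
    tag = f"taxon:{taxon_id}"
    return [r for r in records if tag in r["taxon"].split("|")]


def toy_line(object_id, go_id, evidence, taxon_id):
    """Build a toy GAF line; every value here is illustrative, not real data."""
    fields = ["UniProtKB", object_id, object_id, "", go_id, "REF:toy", evidence,
              "", "C", "", "", "protein", f"taxon:{taxon_id}", "20070613", "toy"]
    return "\t".join(fields)


toy_gaf = "\n".join([
    "!gaf-version: toy example",
    toy_line("TOY00001", "GO:0005634", "IDA", 9823),
    toy_line("TOY00002", "GO:0005737", "IEA", 9823),
    toy_line("TOY00003", "GO:0005634", "ISS", 5059),
])

pig = annotations_for_taxon(parse_gaf(toy_gaf), 9823)
print([(r["object_id"], r["go_id"], r["evidence"]) for r in pig])
```

Running the same filter over files from several sources is one way to see how annotation for a species is spread across databases.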

Page 28: Modeling Functional Genomics Datasets CVM8890-101

Functional annotation (flowchart)

Are the genes/proteins in GenBank? Check by Taxon ID.

Yes, known (NM_, NP_) and in UniProtKB:
• GOA make GO annotations (IEA) using automated methods
• Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes)
• GOA collect all GO annotations & submit to GOC
• GOA maintain annotation file
• GOC maintain annotation files (unfiltered GOA, filtered GOA)

Yes, but no GO (including UniParc/IPI entries with no GO):
• Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes)
• Annotate by structural/sequence similarity to orthologs (ISS code)

No (no GO):
• Manual annotations from literature (IDA, IMP, IPI, IGI, IEP codes)
• Annotate by structural/sequence similarity to orthologs (ISS code)

For new annotations:
• Fill in GO association file
• Submit to AgBase (agricultural species)
• AgBase maintains annotation file

Page 29: Modeling Functional Genomics Datasets CVM8890-101

Demonstration

Page 30: Modeling Functional Genomics Datasets CVM8890-101
Page 31: Modeling Functional Genomics Datasets CVM8890-101
Page 32: Modeling Functional Genomics Datasets CVM8890-101
Page 33: Modeling Functional Genomics Datasets CVM8890-101