scalable data mining for functional genomics and metagenomics

Post on 30-Dec-2015


DESCRIPTION

Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 09-16-10. Harvard School of Public Health, Department of Biostatistics. Greatest discoveries in biology? Our job is to create computational microscopes.

TRANSCRIPT

Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower

09-16-10

Harvard School of Public Health

Department of Biostatistics

2

Greatest discoveries in biology?

Our job is to create computational microscopes:

To ask and answer specific biological questions using millions of experimental results

3

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Network models of microbial communities

4

A computational definition of functional genomics

Genomic data, prior knowledge

Data → Function

Function → Function

Gene → Gene

Gene → Function

5

A framework for functional genomics

[Figure: integrating gold-standard gene pairs with genomic datasets]

Gold-standard gene pairs: G1-G2 (+), G4-G9 (+), G3-G6 (-), G7-G8 (-), G2-G5 (?)

Example dataset values and labels per gene pair:

0.9 0.7 … 0.1 0.2 … 0.8

+ - … - - … +

0.8 0.5 … 0.05 0.1 … 0.6

[Panels: frequency histograms per dataset: high vs. low correlation; let. vs. not let.; similar vs. dissim.; high vs. low similarity]

P(G2-G5|Data) = 0.85

100Ms of gene pairs × 1Ks of datasets
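The framework above can be illustrated with a small naive Bayes sketch. This is an assumption about the general approach, not the deck's implementation: the helper names, the two-bin discretization, the add-one smoothing, the 0.5 prior, and the mapping of the slide's example values to specific gene pairs are all hypothetical.

```python
from collections import Counter

def learn_bins(values, labels, n_bins=2):
    """Estimate P(bin | label) for one dataset from gold-standard pairs.
    Values are assumed to lie in [0, 1]; add-one smoothing is an assumption."""
    counts = {True: Counter(), False: Counter()}
    for v, y in zip(values, labels):
        counts[y][min(int(v * n_bins), n_bins - 1)] += 1
    return {
        y: [(counts[y][b] + 1) / (sum(counts[y].values()) + n_bins)
            for b in range(n_bins)]
        for y in (True, False)
    }

def posterior(prior, tables, observed, n_bins=2):
    """P(related | data) for one gene pair, treating datasets as independent."""
    like = {True: prior, False: 1.0 - prior}
    for table, v in zip(tables, observed):
        b = min(int(v * n_bins), n_bins - 1)
        for y in (True, False):
            like[y] *= table[y][b]
    return like[True] / (like[True] + like[False])

# Gold standard (assumed order): G1-G2 (+), G4-G9 (+), G3-G6 (-), G7-G8 (-)
labels = [True, True, False, False]
dataset_a = [0.9, 0.7, 0.1, 0.2]    # hypothetical assignment of slide values
dataset_b = [0.8, 0.5, 0.05, 0.1]
tables = [learn_bins(dataset_a, labels), learn_bins(dataset_b, labels)]

# Query pair G2-G5, assumed to carry the values 0.8 and 0.6
p = posterior(0.5, tables, observed=[0.8, 0.6])
print(round(p, 3))
```

With only four training pairs the toy posterior will not match the 0.85 on the slide; the point is the shape of the computation, which scales by repeating the per-dataset table lookup for every gene pair.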

6

Functional network prediction and analysis

Global interaction network

Carbon metabolism network Extracellular signaling network Gut community network

Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

HEFalMp

7

Functional network prediction from diverse microbial data

Expression data: 486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species

Interaction data: 307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions

Integrated functional interaction networks in 15 species

[Figure: E. coli integration; axes: precision vs. recall]

8

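Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

Fisher z-transform of correlations: $z'' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

Simple regression: all datasets are equally accurate

Random effects: variation within and among datasets and interactions

[Equations: per-dataset effect $y_{e,i}$ for interaction $e$ in dataset $i$, combined across datasets with weights $w^*_{e,i}$ derived from the variances $s^2_{e,i}$]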
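A minimal sketch of random-effects pooling for one gene pair, assuming the DerSimonian-Laird estimator for the among-dataset variance and a sampling variance of 1/(n - 3) for each Fisher-transformed correlation. The estimator choice, function names, and example numbers are assumptions for illustration; the slide does not state which estimator was used.

```python
import math

def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

def random_effects(effects, variances):
    """Pool per-dataset effects with DerSimonian-Laird random effects.
    Returns (pooled effect, estimated among-dataset variance tau^2)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-dataset variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2

# Hypothetical correlations of one gene pair in three datasets
rhos = [0.8, 0.6, 0.7]
sizes = [13, 23, 53]                      # conditions per dataset
zs = [fisher_z(r) for r in rhos]
variances = [1.0 / (n - 3) for n in sizes]
pooled_z, tau2 = random_effects(zs, variances)
print(round(math.tanh(pooled_z), 3))      # back-transform to a correlation
```

When tau^2 is zero the weights reduce to the inverse variances alone, so every dataset is trusted according to its size only.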

9

Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$z'' = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

10

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

11

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

X?

12

Predicting gene function

Cell cycle genes

Predicted relationships between genes

[Legend: edge confidence, from high to low]

13

Predicting gene function

Predicted relationships between genes

[Legend: edge confidence, from high to low]

Cell cycle genes

14

Cell cycle genes

Predicting gene function

Predicted relationships between genes

[Legend: edge confidence, from high to low]

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
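One simple way to turn those edges into a per-gene score is to average a candidate gene's predicted edge weights to the genes already annotated to the process. This is a sketch under that assumption; the gene names, weights, and the choice of a plain mean are hypothetical and not taken from the deck.

```python
def process_score(gene, process_genes, weights):
    """Mean predicted edge weight between `gene` and the known process genes.
    Missing edges are treated as weight 0 (an assumption)."""
    others = [g for g in process_genes if g != gene]
    if not others:
        return 0.0
    total = sum(weights.get(frozenset((gene, g)), 0.0) for g in others)
    return total / len(others)

# Hypothetical confidence-weighted network
weights = {
    frozenset(("G1", "G2")): 0.9,
    frozenset(("G1", "G3")): 0.8,
    frozenset(("G2", "G3")): 0.7,
    frozenset(("G4", "G1")): 0.6,
    frozenset(("G4", "G2")): 0.8,
    frozenset(("G4", "G3")): 0.7,
    frozenset(("G5", "G1")): 0.1,
}
cell_cycle = {"G1", "G2", "G3"}           # genes already annotated

ranked = sorted(["G4", "G5"],
                key=lambda g: process_score(g, cell_cycle, weights),
                reverse=True)
print(ranked)
```

Candidates with many high-confidence edges into the process rank first, which mirrors how the highlighted edges are read on the slide.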

15

Comprehensive validation of computational predictions

Genomic data

Computational Predictions of Gene Function

MEFIT

SPELL (Hibbs et al 2007)

bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory experiments: petite frequency, growth curves, confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Prior knowledge

16

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 original GO annotations

135 under-annotations

82 novel confirmations, first iteration

17 novel confirmations, second iteration

340 total: >3x previously known genes in ~5 person-months

17

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 original GO annotations

95 under-annotations

40 confirmed under-annotations

80 novel confirmations, first iteration

17 novel confirmations, second iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

18

Functional mapping: mining integrated networks

Predicted relationships between genes

[Legend: edge confidence, from high to low]

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

19

Functional mapping: mining integrated networks

Predicted relationships between genes

[Legend: edge confidence, from high to low]

Chemotaxis

20

Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Predicted relationships between genes

[Legend: edge confidence, from high to low]

Chemotaxis

21

Functional mapping: Associations among processes

Edges: associations between processes (very strong to moderately strong)

[Figure: process network. Nodes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

22

Functional mapping: Associations among processes

Edges: associations between processes (very strong to moderately strong)

Borders: data coverage of processes (well covered to sparsely covered)

[Figure: process network. Nodes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

23

Functional mapping: Associations among processes

Edges: associations between processes (very strong to moderately strong)

Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)

Borders: data coverage of processes (well covered to sparsely covered)

[Figure: process network. Nodes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

24

Functional mapping: Associations among processes

Edges: associations between processes (very strong to moderately strong)

Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)

Borders: data coverage of processes (well covered to sparsely covered)

25

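Cross-species knowledge transfer using functional data

Pinaki Sarder

$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\,P(FR_s)$

$P(FR_s, FR_t)$

$P(FR_s \mid D) = P(FR_s \mid \{FR_{t \ne s}, D_s\})$

$\propto P(\{FR_{t \ne s}, D_s\} \mid FR_s)\,P(FR_s)$

$= P(FR_s) \prod_{D \in D_s} P(D \mid FR_s) \prod_{t \ne s} P(FR_t \mid FR_s)$

TaFTan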

26

TaFTan: Cross-species knowledge transfer using functional data

E. coli

B. subtilis

P. aeruginosa

M. tuberculosis

Species-specific data

Species’ data excluded

All species’ data

[Axes: log(precision/random) vs. log(recall)]

• Important to take advantage of all available data for any one organism

• Important to take advantage of all available data for every organism

• Scalable to dozens of organisms with hundreds of functional datasets

• Currently working on making this more context-specific

27

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Network models of microbial communities

28

~2000

AML/ALL, Survival

Mutation

Gene expression

Batch effects

Functional modules

So what does all of this have to do with microbial communities?

29

~2005

Healthy/Diabetes, BMI, M/F

SNP genotypes

Population structure

LD

30

2010

Healthy/IBD, Temperature, Location

Taxa & Orthologs

???

Niches & Phylogeny

Test for correlates

Multiple hypothesis correction

Feature selection (p >> n)

Confounds/stratification/environment

Cross-validate

Biological story?

Independent sample

Intervention/perturbation
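The multiple hypothesis correction step matters most when features far outnumber samples (p >> n). The slide does not name a method; the sketch below assumes the Benjamini-Hochberg false discovery rate procedure as one common choice, with hypothetical p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-feature p-values (e.g. one per taxon or ortholog)
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

Features that survive the correction would still need the later steps on the slide: confound checks, cross-validation, and an independent sample.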

31

What’s metagenomics?

Total collection of microorganisms within a community

Also microbial community or microbiota

Total genomic potential of a microbial community

Total biomolecular repertoire of a microbial community

Study of uncultured microorganisms from the environment, which can include humans or other living hosts

32

The Human Microbiome Project

2006 - ongoing

• 300 “normal” adults, 18-40

• 16S rDNA + WGS

• 5 sites/18 samples + blood

  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth

  • Skin: ears, inner elbows

  • Nasal cavity

  • Gut: stool

  • Vagina: introitus, mid, fornix

• Reference genomes (~200-800)

All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection…

Hamady, 2009

33

What features to test?

16S reads

WGS reads

Taxa

Orthologous clusters

Pathways/modules

Functional roles

Pathway activity

Genomic data (Reference genomes)

Functional data (Experimental models)

Binning

Clustering

Microbiome data

34

HMP: Data features

16S reads

Orthologous clusters

Pathways/modules

Taxa

Genes (KOs)

Pathways (KEGGs)

35

HMP: Body sites

Taxa

KOs

KEGGs

Vanilla linear SVM

36

HMP: Subjects

Taxa

KEGGs

We can tell who you are by the bugs in

your mouth!

37

HMP: Metabolic reconstruction

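WGS reads

Pathways/modules

Genes (KOs)

Pathways (KEGGs)

Functional seq.: KEGG + MetaCYC; CAZy, TCDB, VFDB, MEROPS…

BLAST → Genes

[Equation: gene abundance $c(g)$ computed from the BLAST alignments $a$ of each read $r$, weighted by alignment p-values $p$]

Genes → Pathways: MinPath (Ye 2009)

Smoothing: Witten-Bell

$c'(g) = \begin{cases} \frac{T}{V-T}\cdot\frac{N}{N+T} & \text{if } c(g) = 0 \\ c(g)\,\frac{N}{N+T} & \text{otherwise} \end{cases}$

Gap filling

300 subjects, 1-3 visits/subject, 15-18 body sites/visit, 10-20M reads/sample, 100bp reads

BLAST ?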
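The Witten-Bell smoothing step can be sketched as follows, assuming the standard discounted-count form: N is the total observed abundance, T the number of genes with nonzero abundance, and V the size of the gene catalog. The function name and the KO identifiers are hypothetical.

```python
def witten_bell(counts, vocabulary):
    """Witten-Bell smoothed abundances for every gene in `vocabulary`.
    Seen genes are discounted by N / (N + T); the freed mass is shared
    equally among the V - T unseen genes."""
    n = sum(counts.get(g, 0.0) for g in vocabulary)
    t = sum(1 for g in vocabulary if counts.get(g, 0.0) > 0)
    v = len(vocabulary)
    if n == 0:
        return {g: 0.0 for g in vocabulary}
    scale = n / (n + t)
    unseen = (t / (v - t)) * scale if v > t else 0.0
    return {
        g: counts[g] * scale if counts.get(g, 0.0) > 0 else unseen
        for g in vocabulary
    }

# Hypothetical gene (KO) abundances in one sample
vocab = ["K1", "K2", "K3", "K4", "K5"]
counts = {"K1": 60.0, "K2": 30.0, "K3": 10.0}
smoothed = witten_bell(counts, vocab)
print({g: round(c, 3) for g, c in smoothed.items()})
```

The smoothed abundances still sum to N, so the total is preserved while genes with no hits receive a small nonzero value.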

38

HMP: Metabolic reconstruction

Pathway coverage Pathway abundance

39

HMP: Metabolic reconstruction

Pathway coverage

Pathway abundance

[Heatmap axes: samples (columns) × pathways (rows)]

Aerobic body sites

Gastrointestinal body sites

All body sites (“core”)

40

MetaHIT: Data features

WGS reads

Pathways/modules

85 healthy, 15 IBD + 12 healthy, 12 IBD

ReBLASTed against KEGG since published data obfuscates read counts

10x bootstrap within training cohort, test on 12+12 as validation

Taxa: Phymm (Brady 2009)

Genes (KOs)

Pathways (KEGGs)

41

MetaHIT: Taxonomic CD biomarkers

Bacteroidetes

Firmicutes

Methanomicrobia

Enterobacteriaceae

Chromatiales

Desulfobacterales

Oxalobacteraceae

Rhodobacteraceae

Bradyrhizobiaceae

iTOL (Letunic 2007)

42

MetaHIT: Taxonomic CD biomarkers

Down in CD

Up in CD

43

MetaHIT: Functional CD biomarkers

Growth/replication Motility Transporters Sugar metabolism

Down in CD

Up in CD

44

MetaHIT: KO IBD biomarkers

Transporters

Growth/replication

Motility

Sugar metabolism

Down in IBD

Up in IBD

LEfSe (Nicola Segata)

Metagenomic differential analysis: LEfSe

1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis… LEfSe: p(ANOVA) < 0.05

2. Is the difference biologically significant? Expert supervision, specific post-hoc tests… LEfSe: pairwise post-hoc Wilcoxon OK

3. How large is the difference? PCA, LDA, mean difference, class or cluster distance… LEfSe: Log(Score(LDA)) = 3.68
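The first and third questions can be sketched for a single feature and two classes. This is not the LEfSe implementation: it assumes a Kruskal–Wallis H statistic (one of the tests listed above, here without tie correction) compared against the chi-square critical value for 1 degree of freedom, and it uses a log-scaled mean difference as a simple stand-in for the LDA score. The function names and abundances are hypothetical, and the pairwise post-hoc step is omitted.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

CHI2_CRIT_DF1 = 3.841  # chi-square critical value, alpha = 0.05, 1 d.o.f.

def screen_feature(class_a, class_b):
    """Return an effect-size stand-in if the classes differ, else None."""
    if kruskal_wallis_h([class_a, class_b]) <= CHI2_CRIT_DF1:
        return None
    diff = abs(sum(class_a) / len(class_a) - sum(class_b) / len(class_b))
    return math.log10(1 + diff)

# Hypothetical abundances of one feature in two classes
print(screen_feature([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))   # separated
print(screen_feature([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))   # interleaved
```

Only features that pass the significance screen receive an effect size, which keeps the final ranking focused on differences that are both consistent and large.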

45

46

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001

ANOVA: p < 0.05

LEfSe: DIFF!

High-level functional categories: Carbohydrates; Transporters; Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial, Viral

Dinsdale 2008

47

Sleipnir: Software for scalable functional genomics

Massive datasets require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics

• Data types for biological entities

  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.

• Network communication, parallelization

• Efficient machine learning algorithms

  • Generative (Bayesian) and discriminative (SVM)

• And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3hrs.

48

Outline

1. Data mining: Integrating very large genomic data compendia

• Network framework for scalable data integration

• HEFalMp: human data integration

• TaFTan: cross-species knowledge transfer from functional data

2. Metagenomics: Network models of microbial communities

• 16S and WGS community metabolic reconstruction

• LEfSe: biologically relevant community differences

• Sleipnir: software for scalable genomic data mining

49

Thanks!

http://huttenhower.sph.harvard.edu/sleipnir

Jacques Izard

Wendy Garrett

Sarah Fortune

Pinaki Sarder Nicola Segata

Levi Waldron

Larisa Miropolsky

Willythssa Pierre-Louis

Interested? We’re looking for postdocs!

http://huttenhower.sph.harvard.edu

Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong

Hilary Coller, Erin Haley

51

HEFalMp: Predicting human gene function

HEFalMp

52

HEFalMp: Predicting human genetic interactions

HEFalMp

53

HEFalMp: Analyzing human genomic data

HEFalMp

54

HEFalMp: Understanding human disease

HEFalMp

55

Validating Human Predictions

Autophagy

Luciferase (negative control)

ATG5 (positive control), LAMP2, RAB11A

Not starved

Starved (autophagic)

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

56

Functional Mapping: Scoring Functional Associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes

• Edges within each set

• The background edges incident to each set

• The baseline of all edges in the network

$FA_{G_1,G_2} = \frac{between(G_1,G_2)}{background(G_1,G_2)} \cdot \frac{baseline}{within(G_1,G_2)}$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
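A minimal sketch of the association score, assuming each of the four measures is a mean edge weight over the corresponding gene pairs. That averaging convention, the function name, the treatment of missing edges as weight 0, and the toy graph are all assumptions; the slide only names the four measures and how they combine.

```python
from itertools import combinations

def functional_association(g1, g2, genes, weights):
    """FA = (between / background) * (baseline / within), with each term
    taken as a mean edge weight (an assumed convention)."""
    def mean_weight(pairs):
        unique = {frozenset(p) for p in pairs if p[0] != p[1]}
        if not unique:
            return 0.0
        return sum(weights.get(p, 0.0) for p in unique) / len(unique)

    between = mean_weight((a, b) for a in g1 for b in g2)
    within = mean_weight(list(combinations(sorted(g1), 2)) +
                         list(combinations(sorted(g2), 2)))
    union = set(g1) | set(g2)
    background = mean_weight((a, b) for a in union for b in genes)
    baseline = mean_weight(combinations(sorted(genes), 2))
    if within == 0.0 or background == 0.0:
        raise ValueError("within and background weights must be nonzero")
    return (between / background) * (baseline / within)

# Toy graph: six genes, uniform weight 0.5 on every edge
genes = ["A", "B", "C", "D", "E", "F"]
weights = {frozenset(p): 0.5 for p in combinations(genes, 2)}
uniform = functional_association({"A", "B"}, {"C", "D"}, genes, weights)

# Strengthen the edges between the two sets
for a in ("A", "B"):
    for b in ("C", "D"):
        weights[frozenset((a, b))] = 0.9
linked = functional_association({"A", "B"}, {"C", "D"}, genes, weights)
print(round(uniform, 3), round(linked, 3))
```

In a graph with uniform weights the score is 1, and it rises once the edges between the two sets become stronger than the background. Sets with a single gene have no within-set edges, so this sketch rejects them rather than guessing a convention.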

57

Functional Mapping: Bootstrap p-values

• Scoring functional associations is great… how do you interpret an association score?

  – For gene sets of arbitrary sizes?

  – In arbitrary graphs?

  – Each with its own bizarre distribution of edges?

Empirically!

[Figure: histograms of FAs for random sets, for gene set sizes 1, 5, 10, and 50]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1.

The standard deviation is asymptotic in the sizes of both gene sets.

This maps FA scores to p-values for any gene sets and underlying graph.

[Figure: null distribution σs for one graph, plotted against |G1| and |G2| on log-scaled axes]

$\hat{\mu}_{FA}(G_i, G_j) = 1$

[Equation: $\hat{\sigma}_{FA}(G_i, G_j)$ as a function of $|G_i|$ and $|G_j|$ with graph-specific terms $A(G)$, $B(G)$, and $C(G)$]

$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
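The empirical null can be sketched as follows: draw many random, disjoint gene sets of the sizes of interest, score them, fit a normal distribution, and read the p-value from its upper tail. The scoring function here is a simplified stand-in (mean between-set weight over the overall mean), and the random graph, set sizes, and number of draws are hypothetical. The sketch also estimates the mean from the draws rather than fixing it at 1.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev

def null_parameters(score, genes, size1, size2, n_draws=200, seed=0):
    """Mean and standard deviation of scores for random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        drawn = rng.sample(genes, size1 + size2)
        scores.append(score(drawn[:size1], drawn[size1:]))
    return mean(scores), stdev(scores)

def p_value(fa, mu, sigma):
    """Upper-tail probability of a score under the fitted normal null."""
    return 1.0 - NormalDist(mu, sigma).cdf(fa)

# Toy graph with random edge weights (hypothetical data)
genes = list(range(20))
graph_rng = random.Random(1)
weights = {frozenset(p): graph_rng.random() for p in combinations(genes, 2)}
overall = mean(weights.values())

def toy_score(g1, g2):
    # Simplified stand-in for the FA score
    return mean(weights[frozenset((a, b))] for a in g1 for b in g2) / overall

mu, sigma = null_parameters(toy_score, genes, 3, 3)
print(round(mu, 2), round(sigma, 2), round(p_value(1.4, mu, sigma), 4))
```

Fitting the null once per pair of set sizes is what makes the mapping cheap: afterwards any observed score converts to a p-value with a single normal tail lookup.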

58

Functional maps for cross-species knowledge transfer

[Figure: network of genes G1 to G17, grouped into O1 to O9]

O1: G1, G2, G3

O2: G4

O3: G6

…

ECG1, ECG2

BSG1

ECG3, BSG2

…

59

Functional maps for functional metagenomics

GOS 4441599.3: Hypersaline Lagoon, Ecuador

[Figure: KEGG pathways vs. organisms, with organisms labeled as pathogens or environmental (Env.)]

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

= Mapping organisms into phyla

60

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

61

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping

• Very large collections of genomic data

• Specific predicted molecular interactions

• Pathway, process, or disease associations

• Underlying experimental results and functional activities in data

62

Functional maps for cross-species knowledge transfer

← Precision ↑, Recall ↓

Following up with unsupervised and partially anchored network alignment

63

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001

ANOVA: p < 0.05

LEfSe: DIFF!

High-level functional categories: Carbohydrates; Membrane Transport; Nitrogen Metabolism; Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial, Viral
