Supervised and unsupervised methods for large scale genomic data integration

Curtis Huttenhower

03-25-10, Harvard School of Public Health, Department of Biostatistics

2

Greatest Biological Discoveries?

3

Are We There Yet?

• How much biology is out there?
• How much have we found?
• How fast are we finding it?

Figure panels:
• Human Proteins with Annotated Biological Roles (# Distinct Roles; Matt Hibbs)
• Age-Adjusted Citation Rates for Major Sequencing Projects
• Species Diversity of Environmental Samples (Fierer 2008)

4

Are We There Yet?

• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough

Figure panels:
• Human Proteins with Annotated Biological Roles (# Distinct Roles; Matt Hibbs)
• Age-Adjusted Cost per Citation for Major Sequencing Projects
• Species Diversity of Environmental Samples (Fierer 2008)

Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results

5

Outline

1. Big picture: Algorithms for mining genome-scale datasets

2. Details: Recovering mechanistic detail from high-throughput data

3. Applications: Microbial communities and functional metagenomics

6

A framework for functional genomics

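[Figure: Gene pairs with known relationships (G1-G2 +, G4-G9 +, …; G3-G6 -, G7-G8 -, …) and an unknown pair (G2-G5 ?) are scored in each dataset, with per-dataset values such as (0.9, 0.7, …, 0.1, 0.2, …, 0.8), (+, -, …, -, -, …, +), and (0.8, 0.5, …, 0.05, 0.1, …, 0.6). For each dataset, frequencies are tallied for High Correlation vs. Low Correlation, Coloc. vs. Not coloc., and Similar vs. Dissim. (High Similarity vs. Low Similarity), and the evidence is combined:]

P(G2-G5|Data) = 0.85

Scale: 100Ms of gene pairs × 1Ks of datasets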
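The framework above can be sketched as a small naive Bayes classifier. This is only an illustration, assuming that datasets are conditionally independent given the relationship label and that each dataset is discretized into "high" and "low" bins at a fixed threshold; the function names, the threshold, the prior, and the toy values are all hypothetical and are not taken from the talk.

```python
from math import prod

def learn_likelihoods(scores, labels, threshold):
    """Estimate P(high | related) and P(high | unrelated) for one dataset
    from labeled gene pairs, using add-one smoothing (an assumption)."""
    pos = [s >= threshold for s, y in zip(scores, labels) if y]
    neg = [s >= threshold for s, y in zip(scores, labels) if not y]
    p_high_pos = (sum(pos) + 1) / (len(pos) + 2)
    p_high_neg = (sum(neg) + 1) / (len(neg) + 2)
    return p_high_pos, p_high_neg

def posterior(observations, likelihoods, prior):
    """Posterior probability that a gene pair is related.
    observations[i] is True when dataset i shows a 'high' value for the pair."""
    like_pos = prod(p if o else 1 - p for o, (p, _) in zip(observations, likelihoods))
    like_neg = prod(q if o else 1 - q for o, (_, q) in zip(observations, likelihoods))
    num = prior * like_pos
    return num / (num + (1 - prior) * like_neg)

# Toy training pairs: two known positives, two known negatives (hypothetical).
labels = [True, True, False, False]
dataset_a = [0.9, 0.7, 0.1, 0.2]
dataset_b = [0.8, 0.5, 0.05, 0.1]
likelihoods = [
    learn_likelihoods(dataset_a, labels, threshold=0.5),
    learn_likelihoods(dataset_b, labels, threshold=0.5),
]

# Unknown pair observed at 0.8 in dataset A and 0.6 in dataset B.
p = posterior([0.8 >= 0.5, 0.6 >= 0.5], likelihoods, prior=0.5)
print(round(p, 3))
```

The smoothing keeps a dataset with few labeled pairs from producing probabilities of exactly 0 or 1, which would otherwise dominate the product.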

7

Functional network prediction and analysis

Global interaction network

Metabolism network, Signaling network, Gut community network

Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases

HEFalMp

8

HEFalMp: Predicting human gene function

HEFalMp

9

HEFalMp: Predicting human genetic interactions

HEFalMp

10

HEFalMp: Analyzing human genomic data

HEFalMp

11

HEFalMp: Understanding human disease

HEFalMp

12

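Meta-analysis for unsupervised functional data integration

Evangelou 2007
Huttenhower 2006
Hibbs 2007

Fisher z-transform of each correlation:

$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$

Simple regression: All datasets are equally accurate

$y_{e,i} = \mu_e + \varepsilon_{e,i}$

Random effects: Variation within and among datasets and interactions

$y_{e,i} = \mu_e + \delta_{e,i} + \varepsilon_{e,i}$

$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$

$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$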
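A minimal sketch of the random-effects idea for a single gene pair follows. It assumes the standard DerSimonian-Laird estimator for the between-dataset variance, that each dataset contributes a Fisher z-transformed correlation, and that the variance of each z is 1/(n - 3) for a dataset with n conditions; the weights are normalized so that the pooled estimate is a weighted mean. The function names and example numbers are hypothetical, and the talk does not confirm this exact estimator.

```python
from math import atanh

def fisher_z(rho):
    """Fisher z-transform, 0.5 * log((1 + rho) / (1 - rho))."""
    return atanh(rho)

def random_effects(y, s2):
    """Pooled effect for one gene pair across datasets.
    y[i]:  effect size in dataset i (e.g. a z-transformed correlation)
    s2[i]: within-dataset variance of y[i]
    Returns (pooled estimate, between-dataset variance tau^2)."""
    w = [1.0 / v for v in s2]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to every within-dataset variance.
    w_star = [1.0 / (v + tau2) for v in s2]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return pooled, tau2

# Hypothetical example: one gene pair correlated in three datasets.
correlations = [0.9, 0.7, 0.8]
conditions = [20, 50, 30]
z = [fisher_z(r) for r in correlations]
variances = [1.0 / (n - 3) for n in conditions]
pooled_z, tau2 = random_effects(z, variances)
print(pooled_z, tau2)
```

When the datasets agree, the estimated between-dataset variance shrinks to zero and the result reduces to the fixed-effect (inverse-variance) average.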

13

Meta-analysis for unsupervised functional data integration

Following up with semi-supervised approach

Evangelou 2007
Huttenhower 2006
Hibbs 2007

$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$

14

Functional mapping: mining integrated networks

Predicted relationships between genes

Legend: High Confidence / Low Confidence

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

15



Legend: High Confidence / Low Confidence

Chemotaxis

16


Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Legend: High Confidence / Low Confidence

Chemotaxis

17

Functional Mapping: Scoring Functional Associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
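The four measures can be sketched in code as follows. This is an illustration under stated assumptions: the network is a symmetric matrix of edge weights, each measure is taken to be a mean edge weight (the slide does not say how edges are aggregated), and the two sets are disjoint with at least one set containing two or more genes. The function name and the toy network are hypothetical.

```python
from itertools import combinations

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def functional_association(w, g1, g2):
    """FA score for two disjoint sets of gene indices in weight matrix w.
    Each of the four measures is a mean edge weight (an assumption)."""
    n = len(w)
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    # Edges joining a gene in g1 to a gene in g2.
    between = mean(w[i][j] for i in g1 for j in g2 if i != j)
    # Edges inside g1 and inside g2, pooled.
    within = mean(w[i][j] for g in (g1, g2)
                  for i, j in combinations(sorted(g), 2))
    # All edges with at least one endpoint in either set.
    background = mean(w[i][j] for i, j in combinations(range(n), 2)
                      if i in union or j in union)
    # All edges in the network.
    baseline = mean(w[i][j] for i, j in combinations(range(n), 2))
    return (between / background) * (baseline / within)

# Toy 6-gene network: weak edges everywhere, strong edges between the sets.
n = 6
w = [[0.1] * n for _ in range(n)]
for i in (0, 1):
    for j in (2, 3):
        w[i][j] = w[j][i] = 0.9
fa = functional_association(w, {0, 1}, {2, 3})
print(fa)
```

In a network where every edge has the same weight, all four measures coincide and the score is 1, which matches the idea of a baseline with no specific association.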

18

Functional Mapping: Bootstrap p-values

• Scoring functional associations is great… how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?

Empirically!

[Figure: Histograms of FAs for random sets, for gene set sizes (# Genes) 1, 5, 10, and 50]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1.

Standard deviation is asymptotic in the sizes of both gene sets.

Maps FA scores to p-values for any gene sets and underlying graph.

[Figure: Null distribution σs for one graph, plotted against |G1| and |G2| on logarithmic axes]

$\hat{\mu}_{FA}(G_i, G_j) = 1$

$\hat{\sigma}_{FA}(G_i, G_j)$: a function of $|G_i|$ and $|G_j|$ with fitted terms $A$, $B$, and $C$

$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
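The empirical procedure can be sketched as below: sample random gene sets of the sizes of interest, score them, take the standard deviation of those null scores, and convert an observed score to an upper-tail normal p-value with mean 1. The helper names, the use of disjoint random sets, and the sample count are assumptions for illustration; `score_fn` stands in for whatever association score is being calibrated.

```python
import random
from math import erfc, sqrt
from statistics import pstdev

def null_sigma(score_fn, n_genes, size1, size2, n_samples=1000, seed=0):
    """Estimate the null standard deviation of score_fn for random,
    disjoint gene sets of the given sizes (disjointness is an assumption)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        genes = rng.sample(range(n_genes), size1 + size2)
        scores.append(score_fn(genes[:size1], genes[size1:]))
    return pstdev(scores)

def fa_pvalue(fa, sigma, mean=1.0):
    """Upper-tail p-value P(FA > fa) under a Normal(mean, sigma) null."""
    return 0.5 * erfc((fa - mean) / (sigma * sqrt(2.0)))

# Hypothetical usage: an observed score of 1.4 against a null sigma of 0.2
# lies two standard deviations above the null mean of 1.
print(fa_pvalue(1.4, 0.2))
```

Because the null standard deviation depends on both set sizes, it would be estimated (or fitted) separately for each pair of sizes before converting scores to p-values.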

19

Functional Mapping: Functional Associations Between Processes

Legend:
• Edges: Associations between processes (Very Strong to Moderately Strong)
• Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)
• Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Protein Processing; Peptide Metabolism; Cell Redox Homeostasis; Aldehyde Metabolism; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Organelle Fusion; Protein Depolymerization; Organelle Inheritance

20

Functional Mapping: Functional Associations Between Processes

Legend:
• Edges: Associations between processes (Very Strong to Moderately Strong)
• Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)
• Borders: Data coverage of processes (Well Covered to Sparsely Covered)

21

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

22

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

23

Outline






• Gene expression

• Physical PPIs

• Genetic interactions

• Colocalization

• Sequence

• Protein domains

• Regulatory binding sites

…

How do functional interactions become pathways?

24

+ =

Functional genomic data

25

With Chris Park, Olga Troyanskaya

Simultaneous inference of physical, genetic, regulatory, and functional networks

Functional interactions

Regulatory interactions

Post-transcriptional regulation

Metabolic interactions

Phosphorylation Protein complexes

26

Learning a compendium of interaction networks

Train one SVM per interaction type

Resolve consistency using hierarchical Bayes net

27

Learning a compendium of interaction networks

[Figure: AUC per interaction type, axis from 0.5 to 1.0]

Both presence/absence and directionality of interactions are accurately inferred
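For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen true interaction is scored above a randomly chosen non-interaction, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it directly from that definition; the function name and the example scores are hypothetical and are not the evaluation code behind the slide.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison:
    the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for four candidate interactions.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking
print(auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0]))  # no better than chance
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be preferable at genome scale.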

28

Using network compendia to predict complete pathways

Additional 20 novel synthetic lethality predictions tested, 14 confirmed (>100x better than random)

Confirmed

Unconfirmed

With David Hess

29

Interactive aligned network viewer: coming soon!

Graphle

30

Outline






31

Microbial Communities and Functional Metagenomics

• Metagenomics: data analysis from environmental samples
  – Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
  – Must include datasets from multiple organisms
• What questions can we answer?
  – What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community?
  – What’s shared within community X? What’s different? What’s unique?
  – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  – Current functional methods annotate ~50% of synthetic data, <5% of environmental data

[Figure: network of genes from several organisms: PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

With Jacques Izard, Wendy Garrett

32

Data Integration for Microbial Communities

[Figure: network of genes from several organisms: PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

~300 available expression datasets

~30 species

Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997

• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology

33

Functional network prediction from diverse microbial data

486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species

Integrated functional interaction networks in 15 species

307 bacterial interaction experiments
154,796 raw interactions
114,786 postprocessed interactions

E. coli Integration

← Precision ↑, Recall ↓

34

Functional maps for cross-species knowledge transfer

← Precision ↑, Recall ↓

Following up with unsupervised and partially anchored network alignment

35

• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

It’s also speedy: microbial data integration computation takes <3hrs.

36

Outline






• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections

• Simultaneous inference of an interaction network compendium
• Accurate prediction of interaction types and directionality
• Validated pathways and specific individual interactions in yeast

• Integration for microbial communities and metagenomics
• Sleipnir software for efficient large scale data mining

37

Thanks!

NIGMS

http://function.princeton.edu/hefalmp

http://huttenhower.sph.harvard.edu/sleipnir

Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong

Hilary Coller, Erin Haley

Jacques Izard

Wendy Garrett

Sarah Fortune, Tracy Rosebrock

39

Validating Human Predictions

Autophagy

[Figure: Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A under Not Starved and Starved (autophagic) conditions]

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

40

Functional maps for cross-species knowledge transfer

[Figure: genes G1 to G17 assigned to groups O1 to O9 (O1: G1, G2, G3; O2: G4; O3: G6; …), with each group collecting genes across species (ECG1, ECG2; BSG1; ECG3, BSG2; …)]

41

Functional maps for functional metagenomics

GOS 4441599.3, Hypersaline Lagoon, Ecuador

[Figure: KEGG Pathways against samples grouped as Organisms, Pathogens, and Env.]

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

Mapping organisms into phyla

=

42

Functional maps for functional metagenomics

Legend:
• Nodes: Process cohesiveness in obesity (Very Downregulated, Baseline (no change), Very Upregulated)
• Edges: Process association in obesity (More Coregulated, Less Coregulated, Baseline (no change))

43

Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort

With Shuji Ogino, Charlie Fuchs

Cohort (Health Professionals Follow-Up Study, Nurses’ Health Study):
• ~3,100 gastrointestinal subjects
• ~3,800 tissue samples
• ~1,450 colon cancer samples
• ~1,150 CpG island methylation
• ~1,200 LINE-1 methylation
• ~700 TMA immunohistochemistry
• ~2,100 cancer mutation tests
• ~775 gene expression

LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level

DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida

44

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation

[Figure: Nonnegative matrix factorization heatmap, Genes (rows) × Tumors (columns), with clusters C1, C2, C3, C4 and gene groups labeled:]
• Chr. 19 rearrangement, membrane receptors/channels
• HSC signature
• Neural/ESC signature
• Angiogenesis, proliferation
• BRCA interactors, chrom. stability factors
• Cell cycle regulation
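As a rough illustration of how nonnegative matrix factorization can group tumors, the sketch below factors a genes × tumors matrix V into nonnegative W (genes × k) and H (k × tumors) and assigns each tumor to the component with the largest coefficient in H. It assumes the classic Lee-Seung multiplicative updates for squared error and a random initialization; the talk does not state which NMF variant, rank selection, or clustering rule was used, and all names and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative v (n x m) into w (n x k) and h (k x m)
    with multiplicative updates; eps guards against division by zero."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy expression matrix: rows are genes, columns are tumors (hypothetical).
v = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
w, h = nmf(v, k=2)
# Assign each tumor (column) to its dominant component.
clusters = [max(range(2), key=lambda a: h[a][j]) for j in range(len(v[0]))]
print(clusters, frobenius_error(v, w, h))
```

The result depends on the random start, so in practice the factorization is typically repeated from several seeds and the most stable clustering is kept.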

45

Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation

Subramanian et al, 2005

[Figure: Chr. 19q with the Neural, Hematopoietic, and Embryonic Stem Cell Signatures; markers: BAX, CD133 + Bcl-X(L), CD44 + CD166]

Hypotheses?
• Two main pathways to proliferation:
  • HSC program + BAX
  • ESC/NSC program
• Two main pathways to deregulation:
  • Angiogenesis + chrom. instability
  • Cell cycle disruption (MSI?)

Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers…

Testing now for correlation with outcome.

46

Epigenetics of Colorectal Cancer: LINE-1 methylation levels

[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject. x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80. ρ = 0.718, p < 0.01]

Ogino et al, 2008

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

What does it all mean?? What is the biological mechanism linking LINE-1 methylation to colon cancer?

47


[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject. x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80. ρ = 0.718, p < 0.01]

Ogino et al, 2008

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

This suggests a genetic effect.

This suggests a copy number variation.

This suggests linkage to a cancer-related pathway.

Is anything different about these outliers?

What is the biological mechanism linking LINE-1 methylation to colon cancer?

48


What is the biological mechanism linking LINE-1 methylation to colon cancer?

Preliminary Data
• 10 genes differentially expressed even using simple methods
  • 1/3 are from the same family with known GI tumor prognostic value
  • 1/3 are X-chromosome testis/cancer-specific antigens
  • 1/2 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a cascade of antigens/membrane receptors/TFs
  • Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
• GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates

Check back in a couple of months!
