large scale genomic data mining

Large scale genomic data mining. Curtis Huttenhower, 10-23-09. Harvard School of Public Health, Department of Biostatistics.

Page 1: Large scale genomic data mining

Large scale genomic data mining

Curtis Huttenhower

10-23-09
Harvard School of Public Health
Department of Biostatistics

Page 4: Large scale genomic data mining

Mining Biological Data

~100 GB

More than 100GB

How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?

Page 5: Large scale genomic data mining

5

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

Page 6: Large scale genomic data mining

6

A Definition of Functional Genomics

Inputs: genomic data and prior knowledge

Data → Function
Function → Function
Gene → Gene
Gene → Function

Page 7: Large scale genomic data mining

7

MEFIT: A Framework for Functional Genomics

Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…

[Figure: histogram of frequency versus correlation (low to high) for related gene pairs, feeding into MEFIT]

Page 8: Large scale genomic data mining

8

MEFIT: A Framework for Functional Genomics

Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…

Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…

[Figure: histograms of frequency versus correlation (low to high) for related and unrelated gene pairs, feeding into MEFIT]
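The comparison of related and unrelated gene pairs can be sketched as a small classifier. This is a minimal illustration, assuming a naive Bayes treatment of one dataset's binned correlations; the function names, the pseudocount, and the prior value are assumptions made for the example and are not taken from MEFIT itself.

```python
import numpy as np

def learn_bin_likelihoods(corr_related, corr_unrelated, bin_edges, pseudocount=1.0):
    """Estimate P(bin | related) and P(bin | unrelated) from gold-standard pairs."""
    rel_counts, _ = np.histogram(corr_related, bins=bin_edges)
    unrel_counts, _ = np.histogram(corr_unrelated, bins=bin_edges)
    # Pseudocounts keep empty bins from producing zero probabilities.
    p_rel = (rel_counts + pseudocount) / (rel_counts.sum() + pseudocount * len(rel_counts))
    p_unrel = (unrel_counts + pseudocount) / (unrel_counts.sum() + pseudocount * len(unrel_counts))
    return p_rel, p_unrel

def posterior_related(corr, p_rel, p_unrel, bin_edges, prior=0.05):
    """Posterior probability that a pair with this correlation is functionally related."""
    # Locate the correlation bin, clamping values that fall on the outer edges.
    b = int(np.clip(np.digitize(corr, bin_edges) - 1, 0, len(p_rel) - 1))
    numerator = p_rel[b] * prior
    return numerator / (numerator + p_unrel[b] * (1.0 - prior))

# Toy gold standard taken from the example pairs on the slide.
edges = np.linspace(-1.0, 1.0, 5)
p_rel, p_unrel = learn_bin_likelihoods([0.9, 0.8, 0.85], [0.1, 0.2, 0.15], edges)
print(posterior_related(0.9, p_rel, p_unrel, edges, prior=0.5))
```

With several datasets, the per-dataset likelihoods would be multiplied together under the same independence assumption before normalizing.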

Page 9: Large scale genomic data mining

9

MEFIT: A Framework for Functional Genomics

Golub 1999

Butte 2000

Whitfield 2002

Hansen 1998

Functional Relationship

Page 10: Large scale genomic data mining

10

MEFIT: A Framework for Functional Genomics

Golub 1999

Butte 2000

Whitfield 2002

Hansen 1998

Functional Relationship

Biological Context

Functional area, tissue, disease, …

Page 11: Large scale genomic data mining

11

Functional Interaction Networks

MEFIT

Global interaction network

Autophagy network
Vacuolar transport network
Translation network

Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

Page 12: Large scale genomic data mining

12

Predicting Gene Function

Cell cycle genes

Predicted relationships between genes (high confidence to low confidence)

Page 13: Large scale genomic data mining

13

Predicting Gene Function

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes

Page 14: Large scale genomic data mining

14

Predicting Gene Function

Cell cycle genes

Predicted relationships between genes (high confidence to low confidence)

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
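One way to make "specifically participate" concrete is to compare a gene's edges into the process against its edges to the rest of the genome. The sketch below assumes a symmetric matrix of predicted relationship confidences and uses a simple ratio of means; the function name and the ratio are illustrative assumptions, not the scoring used in the talk.

```python
import numpy as np

def process_participation(weights, gene, process_genes):
    """Mean edge weight from `gene` into the process, relative to its mean weight to all other genes."""
    others = np.array([g for g in range(weights.shape[0]) if g != gene])
    members = np.array([g for g in process_genes if g != gene])
    to_process = weights[gene, members].mean()
    to_all = weights[gene, others].mean()
    # Values above 1 mean the gene is connected to the process more strongly than to the genome at large.
    return to_process / to_all

# Hypothetical 4-gene network; genes 1 and 2 form the process of interest.
w = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
])
print(process_participation(w, 0, [1, 2]), process_participation(w, 3, [1, 2]))
```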

Page 15: Large scale genomic data mining

15

Comprehensive Validation of Computational Predictions

Genomic data

Computational Predictions of Gene Function: MEFIT, SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory Experiments: petite frequency, growth curves, confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Prior knowledge

Page 16: Large scale genomic data mining

16

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Page 17: Large scale genomic data mining

17

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Page 18: Large scale genomic data mining

18

Functional Associations Between Contexts

Predicted relationships between genes (high confidence to low confidence)

The average strength of these relationships indicates how cohesive a process is.

Cell cycle genes

Page 19: Large scale genomic data mining

19

Functional Associations Between Contexts

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes

Page 20: Large scale genomic data mining

20

Functional Associations Between Contexts

DNA replication genes

The average strength of these relationships indicates how associated two processes are.

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes

Page 21: Large scale genomic data mining

21

Functional mapping: Scoring functional associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
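The four measures can be sketched on a dense weight matrix. This is a minimal illustration, assuming each measure is a mean edge weight and that within-set and background edges are pooled over both sets; the helper names and the fallback to the baseline for single-gene sets are assumptions, not details given on the slide.

```python
import numpy as np

def _pair_weights(weights, rows, cols):
    """Edge weights for all pairs (r, c) with r in rows and c in cols, skipping self-pairs."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    sub = weights[np.ix_(rows, cols)]
    return sub[rows[:, None] != cols[None, :]]

def functional_association(weights, g1, g2):
    """FA = (between / background) * (baseline / within) for disjoint gene sets g1 and g2."""
    everyone = np.arange(weights.shape[0])
    baseline = _pair_weights(weights, everyone, everyone).mean()
    between = _pair_weights(weights, g1, g2).mean()
    within_vals = np.concatenate([_pair_weights(weights, g1, g1),
                                  _pair_weights(weights, g2, g2)])
    # Assumption: sets with no internal edges (single genes) fall back to the baseline.
    within = within_vals.mean() if within_vals.size else baseline
    background = np.concatenate([_pair_weights(weights, g1, everyone),
                                 _pair_weights(weights, g2, everyone)]).mean()
    return (between / background) * (baseline / within)

# Hypothetical 6-gene network: uniform weights give FA = 1, the expected null value.
uniform = np.full((6, 6), 0.5)
np.fill_diagonal(uniform, 0.0)
# Same network, but with strong edges between sets {0, 1} and {2, 3}.
linked = np.full((6, 6), 0.1)
np.fill_diagonal(linked, 0.0)
linked[np.ix_([0, 1], [2, 3])] = 0.9
linked[np.ix_([2, 3], [0, 1])] = 0.9
print(functional_association(uniform, [0, 1], [2, 3]),
      functional_association(linked, [0, 1], [2, 3]))
```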

Page 22: Large scale genomic data mining

22

Functional mapping: Bootstrap p-values

• Scoring functional associations is great… how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?

Empirically!

[Figure: histograms of FAs for random sets, for gene set sizes of 1, 5, 10, and 50 genes]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1. The standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph.

[Figure: null distribution σs for one graph, plotted against |G1| and |G2| on logarithmic axes]

$$\hat{\mu}_{FA}(G_i, G_j) = 1$$

with $\hat{\sigma}_{FA}(G_i, G_j)$ fit as a function of $|G_i|$ and $|G_j|$ using graph-specific parameters $A(G)$, $B(G)$, and $C(G)$, and

$$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$$
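The empirical procedure can be sketched as follows. This is a minimal illustration, assuming the null parameters are estimated directly from random draws for one pair of set sizes rather than from the fitted σ curve; `score_fn` stands in for any association score taking two gene index arrays, and all names are hypothetical.

```python
import math
import numpy as np

def null_parameters(score_fn, n_genes, size1, size2, n_draws=200, seed=0):
    """Mean and standard deviation of scores for random disjoint gene sets of the given sizes."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_draws):
        # Draw both sets at once so they never share a gene.
        picked = rng.choice(n_genes, size=size1 + size2, replace=False)
        scores.append(score_fn(picked[:size1], picked[size1:]))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def fa_p_value(x, mu, sigma):
    """Upper-tail p-value under a normal null: 1 - Phi((x - mu) / sigma)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# A score two standard deviations above the null mean of 1.
print(fa_p_value(1.4, 1.0, 0.2))
```

Precomputing the parameters once per graph and per size pair is what lets any later score be converted to a p-value without rerunning the random draws.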

Page 23: Large scale genomic data mining

23

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

[Figure: network of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Page 24: Large scale genomic data mining

24

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

Borders: data coverage of processes (well covered to sparsely covered)

[Figure: network of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Page 25: Large scale genomic data mining

25

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive)

Borders: data coverage of processes (well covered to sparsely covered)

[Figure: network of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance]

Example gene lists shown for two processes: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

Page 26: Large scale genomic data mining

26

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Page 27: Large scale genomic data mining

27

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Page 28: Large scale genomic data mining

28

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

Page 29: Large scale genomic data mining

29

HEFalMp: Predicting human gene function

HEFalMp

Page 30: Large scale genomic data mining

30

HEFalMp: Predicting human genetic interactions

HEFalMp

Page 31: Large scale genomic data mining

31

HEFalMp: Analyzing human genomic data

HEFalMp

Page 32: Large scale genomic data mining

32

HEFalMp: Understanding human disease

HEFalMp

Page 33: Large scale genomic data mining

33

Validating Human Predictions

Autophagy

[Figure: microscopy panels for Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A, each shown not starved and starved (autophagic)]

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

Page 34: Large scale genomic data mining

34

Current Work: Molecular Mechanisms in a Colon Cancer Cohort

With Shuji Ogino, Charlie Fuchs

Cohorts: Health Professionals Follow-Up Study, Nurses' Health Study

~3,100 gastrointestinal subjects
~3,800 tissue samples
~1,450 colon cancer samples
~1,150 CpG island methylation
~1,200 LINE-1 methylation
~700 TMA immunohistochemistry
~2,100 cancer mutation tests
~775 gene expression

LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level

DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida

Page 35: Large scale genomic data mining

35

Colon Cancer: LINE-1 methylation levels

[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject; x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes run from 30 to 80; ρ = 0.718, p < 0.01; Ogino et al, 2008]

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

What does it all mean?

What is the biological mechanism linking LINE-1 methylation to colon cancer?

With Shuji Ogino, Charlie Fuchs

Page 36: Large scale genomic data mining

36

Colon Cancer: LINE-1 methylation levels

[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject; x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes run from 30 to 80; ρ = 0.718, p < 0.01; Ogino et al, 2008]

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

This suggests a genetic effect.

This suggests a copy number variation.

This suggests linkage to a cancer-related pathway.

Is anything different about these outliers?

What is the biological mechanism linking LINE-1 methylation to colon cancer?

With Shuji Ogino, Charlie Fuchs

Page 37: Large scale genomic data mining

37

Colon Cancer: LINE-1 methylation levels

What is the biological mechanism linking LINE-1 methylation to colon cancer?

Preliminary Data
• Six genes differentially expressed even using naïve methods
• One uncharacterized, one oncogene, three malignancy, one histone
• 1/3 are from a family with known variable GI expression, prognostic value
• 2/3 fall in the same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a set of transmembrane receptors/channels
• Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)

Check back in a couple of months!

Page 38: Large scale genomic data mining

38

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

Page 39: Large scale genomic data mining

39

Next Steps: Microbial Communities

• Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge
  • Concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease

Page 40: Large scale genomic data mining

40

Next Steps: Microbial Communities

• What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge
  • Concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease

Page 41: Large scale genomic data mining

41

Next Steps: Functional Metagenomics

• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Another data integration problem
– Must include datasets from multiple organisms
• Another context-specificity problem
– Now “context” can also mean “species”
• What questions can we answer?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– What’s shared within community X? What’s different? What’s unique?
– What’s perturbed in disease state Y? One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data, <5% of environmental data

[Figure: cross-species gene network: PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

Page 42: Large scale genomic data mining

42

Next Steps: Microbial Communities

[Figure: cross-species gene network, shown three times: PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

~120 available expression datasets

~70 species

Weskamp et al 2004

Flannick et al 2006

Kanehisa et al 2008

Tatusov et al 1997

• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!

Page 43: Large scale genomic data mining

43

Functional Maps for Functional Metagenomics

[Figure: network of genes YG1 to YG17 projected onto a network of groups KO1 to KO9]

KO1: YG1, YG2, YG3
KO2: YG4
KO3: YG6
…

ECG1, ECG2
PAG1
ECG3, PAG2
…

Page 44: Large scale genomic data mining

44

Functional Maps for Functional Metagenomics

Page 45: Large scale genomic data mining

45

Validating Orthology-Based Functional Mapping

Does unweighted data integration predict functional relationships?

What is the effect of “projecting” through an orthologous space?

[Figure: four panels (KEGG and GO gold standards for each question) plotting log(Precision/Random) against Recall, comparing unsupervised integration with individual datasets]
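The evaluation axes used in these plots can be computed from a ranked list of predictions. The sketch below assumes predictions are scored gene pairs with binary gold-standard labels, that "Random" is the fraction of positives in the gold standard, and that the logarithm is base 2; the function name and the base are assumptions, since the slide does not state them.

```python
import numpy as np

def log_precision_over_random(scores, labels):
    """Return (recall, log2(precision / random precision)) at each rank cutoff."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # highest-confidence predictions first
    hits = np.cumsum(labels[order])              # true positives at each cutoff
    ranks = np.arange(1, len(scores) + 1)
    precision = hits / ranks
    recall = hits / labels.sum()
    random_precision = labels.mean()             # expected precision of random guessing
    keep = hits > 0                              # log is undefined before the first hit
    return recall[keep], np.log2(precision[keep] / random_precision)

recall, lift = log_precision_over_random([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(recall, lift)
```

Dividing by the random expectation makes curves comparable across gold standards with different numbers of positives, which matters when KEGG and GO are shown side by side.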

Page 46: Large scale genomic data mining

46

Validating Orthology-Based Functional Mapping

[Figure: network of genes YG1 to YG17, divided into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]

Page 47: Large scale genomic data mining

47

Validating Orthology-Based Functional Mapping

Page 48: Large scale genomic data mining

Validating Orthology-Based Functional Mapping

Can subsets of the yeast genome predict a heldout subset’s functional maps?

Can subsets of the yeast genome predict a heldout subset’s interactome?

[Figure: KEGG and GO panels for each question; values shown: 0.68, 0.48, 0.39, 0.25, 0.30, 0.37, 0.27, 0.39, 0.43, 0.40]

What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case

Page 49: Large scale genomic data mining

49

Functional Maps for Functional Metagenomics

Now, what happens if you do this for characterized microbes?

• ~20 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species

[Figure: log(Precision/Random) against Recall for unsupervised integrations, evaluated using KEGG]

Page 50: Large scale genomic data mining

50

Next Steps: Missing Methodology, Mining

• Most machine learning algorithms are optimized for one of two cases:

– Small, dense data

– Large, sparse data

• HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts

Feature selection

Regularization

Dimension reduction

Simple models, efficient algorithms

Slightly less

Page 51: Large scale genomic data mining

51

Next Steps: Missing Methodology, Models

Dataset #1

Dataset #2

Dataset #3 …

Functional Relationship

Page 52: Large scale genomic data mining

52

Next Steps: Missing Methodology, Models

Dataset #1

Dataset #2

Dataset #3 …

Functional Relationship

Biological Context

Page 53: Large scale genomic data mining

53

Next Steps: Missing Methodology, Models

Dataset #1

Dataset #2

Dataset #3 …

Functional Relationship

Cellular Processes

Tissue/Cell Lineage

Disease State

Developmental Stage

Cross-Species Orthology

This is clearly not a sustainable system; novel large-scale hierarchical modeling is needed to capture the complex biology of metazoan and metagenomic interaction networks.

Types of Interactions

Regulation

Page 54: Large scale genomic data mining

54

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
– Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
– Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
– And it’s fully documented!

It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.

Page 55: Large scale genomic data mining

55

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
– Data types for biological entities: microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
– Efficient machine learning algorithms: generative (Bayesian) and discriminative (SVM)
– And it’s fully documented!

[Figure: original processing time versus current processing time; values shown: 8 hours, 1 minute, 30 years, 2 months, 18 hours, 2-3 hours]

Page 56: Large scale genomic data mining

56

Outline

1. Methodology: Algorithms for mining genome-scale datasets
• Bayesian system for genomic data integration
• Sleipnir software for efficient large scale data mining
• Functional mapping to statistically summarize large data collections

2. Applications: Human molecular data and clinical cancer cohorts
• HEFalMp system for human data analysis and integration
• Six confirmed predictions in autophagy
• Ongoing analysis of LINE-1 methylation in colon cancer

3. Next steps: Methods for microbial communities and functional metagenomics
• Data integration applied to microbial communities and functional metagenomics
• Efficient machine learning for large, dense feature spaces

Page 57: Large scale genomic data mining

57

Thanks!

NIGMS

http://function.princeton.edu/sleipnir

http://function.princeton.edu/hefalmp

Interested? We’re looking for students and postdocs!
Biostatistics Department
http://huttenhower.sph.harvard.edu

Hilary Coller, Erin Haley, Tsheko Mutungu

Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz

Shuji Ogino, Charlie Fuchs

Page 58: Large scale genomic data mining
Page 59: Large scale genomic data mining

59

Colon Cancer: Immunohistochemistry

Genes (rows) by conditions (columns); entries are quantities:

        Tumor #1   Tumor #2   …   Tumor #700
AKT1    0          11         …   55
AURKA   0          5          …   0
CCND1   25         0          …   30
…       …          …          …   …

The world’s smallest, cheapest microarray!

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Page 60: Large scale genomic data mining

60

Colon Cancer: Immunohistochemistry

[Figure: heat map of ~700 tumor samples, separating LINE-1 hypomethylated outliers from samples with “normal” LINE-1 methylation]

[Figure: IHC pseudoexpression (axis 0 to 80) for tumors with low versus normal LINE-1 methylation, across STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1]

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

The world’s smallest, cheapest microarray!

Page 61: Large scale genomic data mining

61

Colon Cancer: Mining Microarrays

[Figure: log2(Low / Normal) (axis -0.6 to 1.2) across STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1]

~650 datasets

~15,000 expression conditions

~24,000 genes

[Figure: expression conditions ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature; 26 genes in signature]

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

Page 62: Large scale genomic data mining

62

Colon Cancer: Mining Microarrays

“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”

Subramanian et al, 2005
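The idea in the quotation can be sketched as a running-sum statistic. This is a simplified, unweighted illustration (every hit counts equally, and no permutation-based significance is computed), so it is a sketch of the general approach rather than the full published GSEA procedure; the function name and the toy gene names are assumptions.

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of a running sum walked down the ranked list L."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hits = in_set.sum()
    n_miss = len(ranked_genes) - n_hits
    # Step up for members of S and down for non-members, so the walk ends at zero.
    steps = np.where(in_set, 1.0 / n_hits, -1.0 / n_miss)
    running = np.cumsum(steps)
    return running[np.argmax(np.abs(running))]

ranked = ["a", "b", "c", "d"]
print(enrichment_score(ranked, {"a", "b"}))   # members at the top of the list
print(enrichment_score(ranked, {"c", "d"}))   # members at the bottom of the list
```

A strongly positive score means the set clusters near the top of L, and a strongly negative score means it clusters near the bottom. The sketch assumes the set contains at least one, but not all, of the ranked genes.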

Datasets ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature:
• Bleomycin effect on mutagen-sensitive lymphoblastoid cells
• Folic acid deficiency effect on colon cancer cells
• Bladder tumor stage classification
• Normal tissue of diverse types
• Muscle function and aging
• Non-diseased lung tissue

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Dataset 1: Condition X, Condition Y, Condition Z

Dataset 2: Condition A, Condition B, Condition C, Condition D, Condition E

Page 63: Large scale genomic data mining

63

Colon Cancer: Mining Microarrays

“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”

Subramanian et al, 2005

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

CNV 1: Gene X, Gene Y, Gene Z

CNV 2: Gene A, Gene B, Gene C, Gene D, Gene E

[Figure: CNV-linked genes ranked from most upregulated in significantly enriched datasets to most downregulated]

PSGs (11 genes on 19q13.3)

PCDHs (~50 genes on 5q31.3)

Misc. ~12 genes on 16p13.3

Iafrate et al, 2005

?

Page 64: Large scale genomic data mining

64

Colon Cancer: Mining Microarrays

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Iafrate et al, 2005

Pregnancy specific β glycoproteins

Salahshor et al, 2005, “Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis”:

“PSG9 is not found in the non-pregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”

Page 65: Large scale genomic data mining

65

Colon Cancer: Generating a Hypothesis

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Pregnancy specific β glycoproteins

Page 67: Large scale genomic data mining

67

Colon Cancer: Using All the Data

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Pregnancy specific β glycoproteins

GI cancers and chemotherapy

Yes (caveat investigator)

Get back to me in a couple of months…

What’s the state of the data?
• Extremely hypomethylated colon cancer carries a significantly poor prognosis
• In our cohort, these ~20 tumors are weakly enriched for a protein activity signature based on IHC
• The expression datasets most enriched for the same signature represent mainly GI cancer and chemotherapy conditions
• The PSG gene family is upregulated in these datasets and is linked to a known CNV
• HEFalMp associates the PSGs with cancer based on correlation with known colorectal cancer genes in a variety of expression datasets

Nothing definite – yet.

Page 68: Large scale genomic data mining

68

Human Regulatory Networks

Quiescence: reversible exit from the cell cycle

[Figure: expression heat map of 6,829 genes in clusters G0 and I to X over serum-starved and serum re-stimulated time courses (time points in hrs: 1, 2, 4, 8, 24, 96 and 1, 2, 4, 8, 24, 48); cluster annotations: Development, Development, Cholesterol, Protein localization, Cell cycle, RNA processing, Metabolism; motifs from FIRE (Elemento et al. 2007): Elk-1, Sp1, NF-Y, YY1]

• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes

Page 69: Large scale genomic data mining

69

Regulatory Modules: Expression Biclusters + Sequence Motifs

[Figure: expression matrix of genes CRG1 to CRG4 and RND1 to RND8 across conditions 1 to 8]

Bicluster: Coregulated subset of genes and conditions

Page 70: Large scale genomic data mining

70

Regulatory Modules: Expression Biclusters + Sequence Motifs

[Figure: expression matrix of genes CRG1 to CRG4 and RND1 to RND8 across conditions 1 to 8, reordered to group the coregulated genes and conditions]

Bicluster: Coregulated subset of genes and conditions

Page 71: Large scale genomic data mining

71

Regulatory Modules: Expression Biclusters + Sequence Motifs

[Figure: expression matrix of genes CRG1 to CRG4 and RND1 to RND8 across conditions 1 to 8, reordered to group the coregulated genes and conditions]

Bicluster: Coregulated subset of genes and conditions

…any dataset can contain many overlapping biclusters…

…any gene or condition can participate in multiple biclusters…

…do all that, and simultaneously find (under)enriched sequence motifs!

Page 72: Large scale genomic data mining

72

COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction

Inputs: Gene Expression; DNA Sequence (5’ UTR, 3’ UTR, upstream flank, downstream flank); Evolutionary Conservation; Nucleosome Positions

Steps shown:
• Identify conditions where genes coexpress
• Identify motifs enriched in genes’ sequences
• Create a new module
• Select genes based on conditions and motifs
• Subtract mean from all data

Regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs

Feature selection: Tests for differential expression/frequency

Bayesian integration

Page 73: Large scale genomic data mining

73

COALESCE: Selecting Coexpressed Conditions

• For each gene expression condition…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the condition

Preserving data structure:
• If multiple conditions derive from the same dataset, they can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient
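The per-condition comparison can be sketched with a two-sample z-test. This is a simplified univariate version, assuming a genes-by-conditions matrix and a normal approximation; the multivariate test over whole datasets mentioned on the slide is not implemented here, and the function name and significance threshold are assumptions.

```python
import math
import numpy as np

def select_conditions(expr, in_module, alpha=0.05):
    """Return indices of conditions whose values differ between module and non-module genes.

    expr: genes x conditions array; in_module: boolean mask over genes.
    """
    inside, outside = expr[in_module], expr[~in_module]
    selected = []
    for c in range(expr.shape[1]):
        a, b = inside[:, c], outside[:, c]
        # Standard error of the difference in means (unequal variances).
        se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        z = (a.mean() - b.mean()) / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        if p < alpha:
            selected.append(c)
    return selected

# Hypothetical data: module genes are shifted upward in condition 0 only.
expr = np.array([
    [5.0, 0.0], [5.1, 0.1], [4.9, -0.1], [5.2, 0.2], [4.8, -0.2],   # module genes
    [0.0, 0.1], [0.1, 0.0], [-0.1, -0.2], [0.2, 0.2], [-0.2, -0.1], # other genes
])
mask = np.array([True] * 5 + [False] * 5)
print(select_conditions(expr, mask))
```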

Page 74: Large scale genomic data mining

74

COALESCE: Selecting Significant Motifs

• COALESCE looks for three kinds of motifs:
– K-mers (e.g. ACGACGT)
– Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
– Probabilistic Suffix Trees (PSTs)
• For every possible motif…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the motif

• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)
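The first two motif types can be illustrated by counting k-mers and merging each k-mer with its reverse complement. This is a minimal sketch; the function names and the choice of the lexicographically smaller string as the shared key are assumptions, and probabilistic suffix trees are not covered.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def rc_pair_counts(sequence, k):
    """Count each k-mer together with its reverse complement under one canonical key."""
    counts = Counter()
    for kmer, n in kmer_counts(sequence, k).items():
        counts[min(kmer, reverse_complement(kmer))] += n
    return counts

print(reverse_complement("ACGACAT"))
print(kmer_counts("ACGACGT", 3))
```

Per-gene counts like these are the values whose distributions would then be compared between genes inside and outside a module.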

Page 75: Large scale genomic data mining

75

COALESCE: Selecting Probable Genes

• For each gene in the genome…
  For each significant condition…
  For each significant motif…

What’s the probability the gene came from the module’s distribution?

What’s the probability that it came from outside the module?

The probability of a gene being in the module given some data:

$$P(g \in M \mid D) = \frac{P(D \mid g \in M)\,P(g \in M)}{P(D \mid g \in M)\,P(g \in M) + P(D \mid g \notin M)\,P(g \notin M)}$$

Distributions of each feature in and out of the developing module are observed from the data.

Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
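Bayes' rule for module membership can be sketched numerically. The example assumes features are conditionally independent and normally distributed inside and outside the module, which is an illustrative modeling choice; the function names and parameter layout are hypothetical rather than taken from COALESCE.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_in_module(values, inside_params, outside_params, prior_in):
    """P(gene in module | data) for one gene's values over the selected features.

    inside_params / outside_params: one (mean, sd) pair per feature.
    """
    # Work in log space so that many features do not underflow the product.
    log_in = math.log(prior_in)
    log_out = math.log(1.0 - prior_in)
    for x, (mu_i, sd_i), (mu_o, sd_o) in zip(values, inside_params, outside_params):
        log_in += math.log(normal_pdf(x, mu_i, sd_i))
        log_out += math.log(normal_pdf(x, mu_o, sd_o))
    return 1.0 / (1.0 + math.exp(log_out - log_in))

# One feature, module mean 1 versus background mean 0, both with unit variance.
print(posterior_in_module([3.0], [(1.0, 1.0)], [(0.0, 1.0)], prior_in=0.5))
```

Raising `prior_in` for genes that were in the module on the previous iteration is one way to realize the stabilizing prior.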

Page 76: Large scale genomic data mining

76

COALESCE: Integrating Additional Data Types

Nucleosome placement, evolutionary conservation

• Can be included as additional datasets and feature selected just like expression conditions/motifs.
• Or can be used as a prior or weight on the values of individual motifs.

      N     C
G1    2.5   0.0
G2    0.6   0.5
G3    1.2   0.9
…     …     …

TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG

Page 77: Large scale genomic data mining

77

COALESCE Results: S. cerevisiae Modules

The haystack: ~2,200 conditions, ~6,000 genes

A needle: 100 genes, 80 conditions

Page 78: Large scale genomic data mining

78

COALESCE Results: Yeast TF/Target Accuracy

[Figure: z-score (axis -0.3 to 1.3) for COALESCE, cMonkey, FIRE, and Weeder across transcription factors Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p]

Page 79: Large scale genomic data mining

79

COALESCE Results: Yeast Clustering Accuracy

• ~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology

Page 80: Large scale genomic data mining

80

COALESCE Results: Yeast Clustering Accuracy

• ~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology

ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR

M. musculus: Up in callosal and motor neurons

C. elegans: Up in larvae, down in adults

GATA in 5’ flank, miR-788 seed in 3’ UTR

AAGGGGC (zf?) and enriched in 5’ flank

H. sapiens: Up in normal muscle, down in diabetic

Page 81: Large scale genomic data mining

81

COALESCE: Coregulated Quiescence Modules

Down during quiescence entry, up during quiescence exit, down with adenoviral infection

Specific predicted uncharacterized reverse complement motif

Up during quiescence entry, down during quiescence exit

Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)

Down during quiescence entry, enriched for transport/trafficking

miR-297 motif predicted in 3’ UTR (CACATAC)

Down with let-7 exposure

let-7 motifs predicted in 3’ UTR (UACCUC)

Page 82: Large scale genomic data mining

82

Summary

• COALESCE algorithm for regulatory module prediction

– Biclustering + putative de novo motifs

– Optimized for complex organisms (fast!)
  • Large genomes, large data collections

– High accuracy, low false positives

– Leverage prior knowledge, multiple data types