Sylvia Richardson, Centre for Biostatistics, Imperial College, London. Bayesian hierarchical modelling of genomic data. In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s), Helen Causton and Tim Aitman (Hammersmith), Peter Green (Bristol), Philippe Broët (INSERM, Paris). BBSRC Exploiting Genomics grant.


Post on 28-Mar-2015


Page 1

Sylvia Richardson

Centre for Biostatistics

Imperial College, London

Bayesian hierarchical modelling of genomic data

In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)

Helen Causton and Tim Aitman (Hammersmith)

Peter Green (Bristol)

Philippe Broët (INSERM, Paris)

BBSRC Exploiting Genomics grant

Page 2

Outline

• Introduction

• A fully Bayesian gene expression index (BGX)

• Differential expression and array effects

• Mixture models

• Discussion

Page 3

Part 1 Introduction

• Recent developments in genomics have led to techniques
  – capable of interrogating the genome at different levels
  – aiming to capture one or several stages of the biological process

DNA -> mRNA -> protein -> phenotype

Page 4

DNA -> mRNA -> protein

Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html

Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins

Fundamental process

Page 5

What are gene expression data?

DNA microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions.

The fundamental principle used to measure expression is hybridisation between a sample and probes:

– Known sequences of single-stranded DNA representing genes are immobilised on the microarray
– Tissue sample (with unknown concentration of RNA) is fluorescently labelled
– Sample is hybridised to the array
– Array is scanned to measure the amount of RNA present for each sequence

The expression levels of tens of thousands of probes are measured on a single microarray!

Figure labels: gene expression profile; gene expression measure

Page 6

Variation and uncertainty

Gene expression data (e.g. Affymetrix) are the result of multiple sources of variability:

• condition/treatment
• biological
• array manufacture
• imaging
• technical
• gene specific variability of the probes for a gene
• within/between array variation

Structured statistical modelling allows all sources of uncertainty to be considered at once

Page 7

Example of within vs between strains gene variability

• 7 cross-bred strains of mice that differ only by a small portion of chromosome 1

• Strains have different phenotypes related to immunological disorders

• For each line, 9 animals were used to obtain 3 pooled RNA extracts from spleen, giving 7 x 3 = 21 samples

Excellent experimental design to minimise “biological variability between replicate animals”

Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1

Page 8

Biological variability is large!

Figure panels:

– Total variance calculated over the 21 samples
– Average (over the 7 groups) of within strain variance calculated from the 3 pooled samples
– Ratio within/total
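The three quantities plotted on this slide can be computed gene by gene as in the sketch below. It is only an illustration: the function name, the use of the unbiased (ddof=1) variance and the toy data are assumptions, not taken from the talk.

```python
import numpy as np

def within_total_variance_ratio(expr, groups):
    """expr: (n_genes, n_samples) log-expression matrix.
    groups: length n_samples array of group (strain) labels.
    Returns per-gene total variance, average within-group variance, and their ratio."""
    expr = np.asarray(expr, dtype=float)
    groups = np.asarray(groups)
    total = expr.var(axis=1, ddof=1)
    within = np.mean(
        [expr[:, groups == g].var(axis=1, ddof=1) for g in np.unique(groups)],
        axis=0,
    )
    return total, within, within / total

# Toy layout mirroring the design: 7 strains x 3 pooled samples = 21 columns.
rng = np.random.default_rng(0)
strain = np.repeat(np.arange(7), 3)
strain_effect = rng.normal(0.0, 1.0, size=(100, 7))[:, strain]  # between-strain signal
toy = strain_effect + rng.normal(0.0, 0.3, size=(100, 21))      # within-strain noise
total, within, ratio = within_total_variance_ratio(toy, strain)
```

A ratio close to 1 for a gene means that almost all of its variability is within strain, so the gene carries little information about differences between the lines.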

Page 9

Figure panels:

– 1000 genes most variable between strains: hierarchical clustering recovers the cross-bred lines structure
– Random set of 1000 genes

Page 10

Common characteristics of genomics data sets

• High dimensional data (tens of thousands of genes) and few samples
• Many sources of variability (low signal/noise ratio)

Common issues

• Pre-processing and data reduction
• Multiple testing
• Need to borrow information
• Importance of including prior biological knowledge

Page 11

Part 2

• Introduction
• A fully Bayesian gene expression index (BGX)
  – Single array model
  – Multiple array model
• Differential expression and array effects
• Mixture models
• Discussion

Page 12

A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays

Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green

Diagram labels: Raw intensities (PM/MM probe pairs); Background correction; Gene specific variability (probe); Gene index BGX

Page 13

Affymetrix GeneChips:

Figure (slide courtesy of Affymetrix): Image of Hybridised Array; Zoom Image of Hybridised Array; Hybridised Spot; Expressed PM; Non-expressed PM

Each gene g is represented by a probe set (J: 11-20):

Perfect match: $PM_{g1}, \ldots, PM_{gJ}$

Mis-match: $MM_{g1}, \ldots, MM_{gJ}$

-> expression measure for gene g

Page 14

Commonly used methods for estimating expression levels from GeneChips

MAS5:

• Uses PMs and MMs. Imputes IMs (ideal mismatches) from the MMs so that all PM-MM differences are positive
• Gene expression measure: estimate obtained by applying the Tukey biweight to the set of log(PM-MM) values in the probe set

Characteristics: positive, robust, noisy at low levels

RMA:

• Uses PMs only
• Fits a model with additive gene and probe effects to log-scale background-corrected PMs using median polish

Characteristics: positive, robust, attenuated signal detection
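To make the RMA summarisation step concrete, here is a minimal sketch of Tukey's median polish applied to one probe set. It is an illustration only and not the RMA implementation: background correction and normalisation are omitted, and the function name, iteration limit and tolerance are assumptions.

```python
import numpy as np

def median_polish(x, max_iter=10, tol=1e-8):
    """Fit x[i, j] ~ overall + row[i] + col[j] by alternately sweeping out
    row and column medians. Here rows are probes and columns are arrays,
    with x holding log-scale PM intensities for one probe set."""
    resid = np.array(x, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)          # keep column effects centred
        col -= shift
        overall += shift
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)          # keep row effects centred
        row -= shift
        overall += shift
        if np.abs(row_med).sum() + np.abs(col_med).sum() < tol:
            break
    return overall, row, col, resid

# Toy probe set: 3 probes (rows) on 4 arrays (columns), exactly additive.
probe_effect = np.array([0.0, 1.0, 2.0])
array_level = np.array([0.0, 10.0, 20.0, 30.0])
x = 5.0 + probe_effect[:, None] + array_level[None, :]
overall, row, col, resid = median_polish(x)
expression_per_array = overall + col   # one summary value per array
```

Because medians rather than means are swept out, a single outlying probe has little influence on the per-array summary, which is the robustness property listed on the slide.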

Page 15

Variability across conditions is conditioned by the choice of summary measure! Beware of filtering

Figure: mean (left) and empirical standard deviation (right) over 7 conditions (arrays) for 45000 genes, estimated by 2 different methods for quantifying gene expression

Page 16

Model assumptions and key biological parameters

• The intensity of the PM measurement for probe (reporter) j and gene g is due to
  – binding of labelled fragments that perfectly match the oligos in the spot: the true signal $S_{gj}$
  – binding of labelled fragments that do not perfectly match these oligos: the non-specific hybridisation $H_{gj}$

• The intensity of the corresponding MM measurement is caused
  – by a binding fraction $\Phi$ of the true signal $S_{gj}$
  – by non-specific hybridisation $H_{gj}$

Page 17

BGX single array model: g = 1,…,G (thousands), j = 1,…,J (11-20)

$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$

$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$

($S_{gj}$: signal; $H_{gj}$: non-specific hybridisation; $\Phi$: fraction; $\tau^2$: background noise, additive)

$\log(S_{gj}+1) \sim TN(\mu_g, \xi_g^2)$, j = 1,…,J

Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$

$\log(H_{gj}+1) \sim TN(\lambda, \eta^2)$ (array-wide distribution)

Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$

“Pools” information over probes j = 1,…,J

Priors: “vague”: $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$; “Empirical Bayes” for $(a, b)$
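As a rough illustration of the single array model, the sketch below simulates PM and MM intensities for one probe set and evaluates the BGX index as the median of the truncated normal. Several points are assumptions rather than statements from the slides: the truncated normal TN is taken to be a normal restricted to [0, ∞), logs are natural logs, and all function names and parameter values are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def sample_truncated_normal(rng, mean, sd, size):
    """Draw from N(mean, sd^2) restricted to [0, inf) by simple rejection."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draw = rng.normal(mean, sd, size=size)
        keep = draw[draw >= 0][: size - filled]
        out[filled:filled + keep.size] = keep
        filled += keep.size
    return out

def truncated_normal_median(mean, sd):
    """Median of N(mean, sd^2) restricted to [0, inf)."""
    std = NormalDist()
    p_below_zero = std.cdf(-mean / sd)
    return mean + sd * std.inv_cdf(0.5 * (1.0 + p_below_zero))

def simulate_probe_set(rng, mu_g, xi_g, lam, eta, phi, tau, n_probes=11):
    """Simulate (PM, MM) for one gene from the first level of the model."""
    log_s = sample_truncated_normal(rng, mu_g, xi_g, n_probes)  # log(S + 1)
    log_h = sample_truncated_normal(rng, lam, eta, n_probes)    # log(H + 1)
    s, h = np.exp(log_s) - 1.0, np.exp(log_h) - 1.0
    pm = rng.normal(s + h, tau)
    mm = rng.normal(phi * s + h, tau)
    return pm, mm

rng = np.random.default_rng(1)
pm, mm = simulate_probe_set(rng, mu_g=6.0, xi_g=0.5, lam=4.0, eta=0.7, phi=0.1, tau=10.0)
bgx_index = truncated_normal_median(6.0, 0.5)
```

For a gene whose mean is well above zero the truncation hardly matters and the index is essentially the mean; for weakly expressed genes the median is pulled above the mean, which keeps the index non-negative.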

Page 18

Inference:

• Implemented in WinBUGS and C
• Allows joint estimation of parameters in a full Bayesian framework
• Obtains posterior distributions of parameters (and functions of these) in the model

For each gene g:

– Log-scale true signals: $\log(S_{gj}+1)$, j = 1,…,J
– Log-scale non-specific hybridisation: $\log(H_{gj}+1)$, j = 1,…,J
– BGX gene expression index: $\theta_g$ (NB! a distribution)

Figure: posterior distributions, each summarised by its mean and 2.5-97.5% credibility interval
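Since the output for each gene is a posterior distribution rather than a single number, the mean and 2.5-97.5% credibility interval shown on the slide would typically be read off the MCMC draws. A minimal sketch, assuming the draws are stored as an array with one row per sweep and one column per gene (the function name and layout are illustrative):

```python
import numpy as np

def summarise_draws(draws, level=0.95):
    """draws: (n_sweeps, n_genes) array of posterior draws.
    Returns posterior mean and the central credibility interval per gene."""
    draws = np.asarray(draws, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail], axis=0)
    return draws.mean(axis=0), lower, upper
```

The same function applies to any function of the parameters, for example the difference between the indices of a gene in two conditions, by transforming the draws first.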

Page 19

Computational issues

• We found mixing slow, with large autocorrelation, for the gene specific parameters $(\mu_g, \xi_g^2)$
• For low signal (bottom 25%) there is more variability of $S_{gj}$ and $H_{gj}$ and less separation, so there is less information on $(\mu_g, \xi_g^2)$ and longer runs are needed
• For the full hierarchical model, the convergence of the hyperparameters for the distribution of $\xi_g^2$ was problematic
• We studied sensitivity to a range of plausible values for those hyperparameters and implemented an “empirical Bayes” version of the model, which was reproducible with sensible run lengths

Page 20

Figure: posterior mean of $\theta_g$ using a run of 30 000 sweeps versus those obtained from runs of 5 000, 10 000 and 20 000 sweeps

Reproducibility is obtained with short runs for large expression values

Longer runs are necessary for low expression values

Page 21

Single array model performance: data set with varying concentrations (GeneLogic)

• 14 samples of cRNA from an acute myeloid leukemia (AML) tumor cell line
• In sample k, each of 11 genes is spiked in at concentration $c_k$:

sample k:         1    2    3     4    5    6    7    8    9     10   11   12   13   14
conc. c_k (pM):   0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25   50   75   100  150

• Each sample is hybridised to an array

Consider a subset consisting of 500 normal genes + 11 spike-ins

Page 22

Single array model: examples of posterior distributions of BGX expression indices

Each curve (truncated normal with median param.) represents a gene

Examples with data:

o: $\log(PM_{gj} - MM_{gj})$, j = 1,…,$J_g$ (at 0 if not defined)

Mean +- 1 SD

Page 23

Single array model performance:

Comparison with other expression measures

11 genes spiked in at 13 (increasing) concentrations

BGX index $\theta_g$ increases with concentration … except for gene 7 (spiked-in??)

Indication of smooth and sustained increase over a wider range of concentrations

Page 24

2.5-97.5% credibility intervals for the Bayesian expression index

11 spike-in genes at 13 different concentrations (data set A)

Note how the variability is substantially larger for low expression levels

Each colour corresponds to a different spike-in gene

Gene 7: broken red line

Page 25

Integrated modelling of Affymetrix data

Diagram labels, repeated for Condition 1 and Condition 2: PM/MM probe pairs; Gene specific variability (probe); Gene index BGX; Hierarchical model of replicate (biological) variability and array effect; Distribution of expression index for gene g in that condition

Combined output: Distribution of differential expression parameter

Page 26

BGX multiple array model: conditions c = 1,…,C, replicates r = 1,…,$R_c$

$PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$

$MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$

(background noise: additive, array specific)

$\log(S_{gjcr}+1) \sim TN(\mu_{gc}, \xi_{gc}^2)$

Gene and condition specific BGX: $\theta_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$

“Pools” information over replicate probe sets j = 1,…,J, r = 1,…,$R_c$

$\log(H_{gjcr}+1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$

(array-specific distribution of non-specific hybridisation)

Page 27

Subset of AffyU133A spike-in data set (AffyComp)

Consider:

• Six arrays, 1154 genes (every 20th and 42 spike-ins)
• Same cRNA hybridised to all arrays EXCEPT for spike-ins:

                            1     2     3     …   12     13     14
Spike-in genes:             1-3   4-6   7-9   …   34-36  37-39  40-42
Spike-in conc (pM):
  Condition 1 (array 1-3):  0.0   0.25  0.50  …   128    256    512
  Condition 2 (array 4-6):  0.25  0.50  1.00  …   256    512    0.00
Fold change:                -     2     2     …   2      2      -

Page 28

BGX: measure of uncertainty provided

Posterior mean +- 1 SD credibility intervals

$\mathrm{diff}_g = \mathrm{bgx}_{g,1} - \mathrm{bgx}_{g,2}$

Spike-ins (genes 1113-1154) are above the blue line

Blue stars show RMA measure

Page 29

Part 3

• Introduction
• A fully Bayesian gene expression index (BGX)
• Differential expression and array effects
  – Non linear array effects
  – Model checking
• Mixture models
• Discussion

Page 30

Differential expression and array effects

Alex Lewin, SR, Natalia Bochkina, Tim Aitman

Page 31

Data Set and Biological question

Biological Question

Understand the mechanisms of insulin resistance, using animal models where key genes are knocked out and comparing gene expression between wildtype (normal) and knockout mice

Data set A (MAS 5) (12000 genes on each array)

3 wildtype mice compared with 3 mice with Cd36 knocked out

Data set B (RMA) (22700 genes on each array)

8 wildtype mice compared with 8 knocked out mice

Page 32

Diagram labels: Start with given point estimates of expression (Condition 1, Condition 2); Hierarchical model of replicate variability and array effect (one per condition); Differential expression parameter; Posterior distribution (flat prior); Mixture modelling for classification

Page 33

Exploratory analysis of array effect

Mouse data set A

Figure panels: Condition 1 (3 replicates); Condition 2 (3 replicates). Spline curves shown

Needs ‘normalisation’

Page 34

Model for Differential Expression

• Expression-level-dependent normalisation
• Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates
• To select interesting genes:
  – Use posterior distribution of quantities of interest, functions of parameters, ranks …
  – Use mixture prior on the differential expression parameter

Page 35

Bayesian hierarchical model for differential expression

Data: $y_{gsr}$ = log gene expression for gene g, condition s, replicate r

$\alpha_g$ = gene effect

$\delta_g$ = differential effect for gene g between 2 conditions

$\beta_{r(g)s}$ = array effect (expression-level dependent)

$\sigma_{gs}^2$ = gene variance

• 1st level

$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1}, \sigma_{g1}^2)$,

$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2}, \sigma_{g2}^2)$,

$\sum_r \beta_{r(g)s} = 0$, $\beta_{r(g)s}$ = function of $\alpha_g$, parameters {a} and {b}

• 2nd level

Priors for $\alpha_g$, $\delta_g$, coefficients {a} and {b}

$\sigma_{gs}^2 \sim \mathrm{lognormal}(\mu_s, \tau_s)$

Page 36

Details of array effects (Normalization)

• Piecewise polynomial with unknown break points:
  $\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$, with coefficients $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, k = 1, …, # breakpoints
  – Locations of break points not fixed
  – Must do sensitivity checks on # break points

• Joint estimation of array effects and differential expression. In comparison to a 2 step method:
  – More accurate estimates of array effects
  – Lower percentage of false positives (simulation study)
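One way to picture an expression-level-dependent array effect of this kind is a continuous piecewise quadratic curve in the gene effect, centred so that the effects of the replicate arrays sum to zero for every gene. The sketch below is an assumed parameterisation for illustration only: each segment has a linear and a quadratic coefficient measured from its left break point, continuity fixes the segment intercepts, and the break points are treated as given rather than estimated.

```python
import numpy as np

def piecewise_quadratic(x, knots, b1, b2, intercept=0.0):
    """Continuous piecewise quadratic curve.
    knots: increasing array of K+1 break points (both ends included).
    b1, b2: linear and quadratic coefficients for each of the K segments."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    b1 = np.asarray(b1, dtype=float)
    b2 = np.asarray(b2, dtype=float)
    n_seg = len(knots) - 1
    # Value of the curve at the left end of each segment (enforces continuity).
    start = np.empty(n_seg)
    start[0] = intercept
    for k in range(1, n_seg):
        width = knots[k] - knots[k - 1]
        start[k] = start[k - 1] + b1[k - 1] * width + b2[k - 1] * width ** 2
    seg = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, n_seg - 1)
    h = x - knots[seg]
    return start[seg] + b1[seg] * h + b2[seg] * h ** 2

def centred_array_effects(alpha, curves):
    """curves: one callable per replicate array. Returns an (R, G) matrix of
    array effects whose columns sum to zero over the R replicates."""
    raw = np.vstack([curve(alpha) for curve in curves])
    return raw - raw.mean(axis=0, keepdims=True)
```

With the break points themselves unknown, as on the slide, they would be additional parameters updated within the MCMC run; the sketch only covers evaluation of the curve for one set of values.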

Page 37

Mouse Data set A

3 replicate arrays (wildtype mouse data)

Model: posterior means $E(\beta_{r(g)s} \mid \text{data})$ v. $E(\alpha_g \mid \text{data})$

Data: $y_{gsr} - E(\alpha_g \mid \text{data})$

For this data set, cubic fits well

Page 38

Bayesian Model Checking

• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?
• Predict sample variance $S_g^{2\,\mathrm{new}}$ (a chosen checking function) from the model specification (not using the data for this)
• Compare predicted $S_g^{2\,\mathrm{new}}$ with observed $S_g^{2\,\mathrm{obs}}$:
  ‘Bayesian p-value’: $\mathrm{Prob}(S_g^{2\,\mathrm{new}} > S_g^{2\,\mathrm{obs}})$
• Distribution of p-values approx. Uniform if model is ‘true’ (Marshall and Spiegelhalter, 2003)
• Easily implemented in MCMC algorithm
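The check described above could be coded roughly as follows. This is a sketch under stated assumptions, not the authors' code: gene variances are taken to be lognormal with the second parameter read as the standard deviation of the log, the data within a gene are taken to be normal so that a sample variance is the true variance times a scaled chi-square, and the function and argument names are hypothetical.

```python
import numpy as np

def mixed_predictive_pvalues(s2_obs, mu_draws, tau_draws, n_rep, rng):
    """s2_obs: observed sample variances, one per gene.
    mu_draws, tau_draws: MCMC draws of the lognormal hyperparameters.
    n_rep: number of replicates behind each sample variance.
    Returns Prob(S2_new > S2_obs) for every gene."""
    s2_obs = np.asarray(s2_obs, dtype=float)
    n_genes = s2_obs.size
    exceed = np.zeros(n_genes)
    for mu, tau in zip(mu_draws, tau_draws):
        # New variance from the second level, then a new sample variance given it.
        sigma2_new = rng.lognormal(mean=mu, sigma=tau, size=n_genes)
        s2_new = sigma2_new * rng.chisquare(n_rep - 1, size=n_genes) / (n_rep - 1)
        exceed += s2_new > s2_obs
    return exceed / len(mu_draws)
```

A histogram of the returned p-values that is far from flat, for instance piled up near 0 or 1, would suggest that the assumed distribution for the gene variances is inadequate.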

Page 39

Data set A

Page 40

Possible Statistics for Differential Expression

$\delta_g \approx$ log fold change

$\delta_g^* = \delta_g / (\sigma_{g1}^2 / R_1 + \sigma_{g2}^2 / R_2)^{1/2}$ (standardised difference)

• We obtain the posterior distribution of all $\{\delta_g\}$ and/or $\{\delta_g^*\}$
• Can compute directly the posterior probability of genes satisfying criterion X of interest:
  $p_{g,X} = \mathrm{Prob}(g \text{ of “interest”} \mid \text{Criterion } X, \text{data})$
• Can compute the distributions of ranks
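Both quantities are simple functions of the MCMC output. The sketch below assumes draws are stored with one row per sweep and one column per gene, and uses as an example a criterion of the form "absolute log fold change above a threshold and overall log expression above a floor"; the function names and default thresholds are illustrative.

```python
import numpy as np

def prob_of_interest(delta_draws, alpha_draws, fold=2.0, min_expr=4.0):
    """Posterior probability, per gene, that |delta_g| > log(fold)
    and alpha_g > min_expr, estimated as the fraction of sweeps satisfying both."""
    hit = (np.abs(delta_draws) > np.log(fold)) & (alpha_draws > min_expr)
    return hit.mean(axis=0)

def rank_intervals(delta_draws, level=0.95):
    """Rank the genes within every sweep (rank 1 = most negative delta_g)
    and summarise the posterior distribution of each gene's rank."""
    ranks = delta_draws.argsort(axis=1).argsort(axis=1) + 1
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(ranks, [tail, 100.0 - tail], axis=0)
    return ranks.mean(axis=0), lower, upper
```

Ranking within each sweep, rather than ranking posterior means, is what gives an interval for the rank and so shows which highly ranked genes are uncertain.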

Page 41

Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5

Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4

Plot of log fold change versus overall expression level

The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X} < 0.2$

Genes with $p_{g,X} > 0.5$ (green): 280

Genes with $p_{g,X} > 0.8$ (red): 46

Plot annotation: $p_{g,X} = 0.49$

Genes with low overall expression have a greater range of fold change than those with higher expression

Page 42

Experiment: 8 wildtype mice compared to 8 knockout mice, RMA

Criterion X: gene is of interest if |log fold change| > log(1.5)

Plot of log fold change versus overall expression level

The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X} < 0.2$

Genes with $p_{g,X} > 0.5$ (green): 292

Genes with $p_{g,X} > 0.8$ (red): 139

Page 43

Posterior probabilities and log fold change

Figure panels: Data set A (3 replicates, MAS5); Data set B (8 replicates, RMA)

Page 44

Credibility intervals for ranks

100 genes with lowest rank (most under/over expressed)

Low rank, high uncertainty

Low rank, low uncertainty

Data set B

Page 45

Using the posterior distribution of $\delta_g^*$ (standardised difference)

• Compute Probability($|\delta_g^*| > 2 \mid$ data): a Bayesian analogue of a t test!
• Order genes
• Select genes such that Probability($|\delta_g^*| > 2 \mid$ data) > cut-off (in blue)

By comparison, additional genes selected by a standard t test with p value < 5% are in red
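The probability used for this selection can be estimated from the joint MCMC draws of the differential effect and the two gene variances, as in the sketch below. The array layout (one row per sweep, one column per gene), the function names and the example cut-off are assumptions made for illustration.

```python
import numpy as np

def prob_large_standardised_difference(delta_draws, var1_draws, var2_draws,
                                       r1, r2, threshold=2.0):
    """Estimate Prob(|delta*_g| > threshold | data) per gene, where
    delta* = delta / sqrt(var1 / r1 + var2 / r2) is formed sweep by sweep."""
    delta_star = delta_draws / np.sqrt(var1_draws / r1 + var2_draws / r2)
    return (np.abs(delta_star) > threshold).mean(axis=0)

def select_genes(prob, cutoff=0.8):
    """Indices of genes whose probability exceeds the cut-off, most probable first."""
    order = np.argsort(-prob)
    return order[prob[order] > cutoff]
```

Forming the standardised difference inside each sweep propagates the uncertainty in the variances, which a plug-in t statistic with few replicates does not.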

Page 46

Part 4

• Introduction
• A fully Bayesian gene expression index
• Differential expression and array effects
• Mixture models
  – Classification for differential expression
  – Bayesian estimate of False Discovery Rates
  – CGH arrays: models including information on clones' spatial location on chromosome
• Discussion

Page 47

Mixture and Bayesian estimation of false discovery rates

Natalia Bochkina, Philippe Broët, Alex Lewin, SR

Page 48

Multiple Testing Problem

• Gene lists can be built by computing a criterion separately for each gene and ranking
• Thousands of genes are considered simultaneously
• How to assess the performance of such lists?

Statistical Challenge

Select interesting genes without including too many false positives in a gene list

A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set up

Want an evaluation of the expected false discovery rate (FDR)

Page 49

Bayesian Estimate of FDR

• Step 1: Choose a gene specific parameter (e.g. $\delta_g$) or a gene statistic
• Step 2: Model its prior (resp. marginal) distribution using a mixture model
  -- with one component to model the unaffected genes (null hypothesis), e.g. point mass at 0 for $\delta_g$
  -- other components to model (flexibly) the alternative
• Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0} \mid$ data
• Step 4: Evaluate FDR (and FNR) for any list, assuming that all the gene classifications are independent (Broët et al 2004):

$\text{Bayes FDR(list)} \mid \text{data} = \frac{1}{\mathrm{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
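Step 4 amounts to averaging posterior null probabilities over the list. A minimal sketch follows; the FNR function is the natural counterpart (average posterior probability of being modified among the genes left out of the list) and is an assumption here, since the slide gives only the FDR formula. Function names are illustrative.

```python
import numpy as np

def bayes_fdr(p_null, in_list):
    """Average posterior probability of the null over the genes in the list."""
    p_null = np.asarray(p_null, dtype=float)
    in_list = np.asarray(in_list, dtype=bool)
    return float(p_null[in_list].mean()) if in_list.any() else 0.0

def bayes_fnr(p_null, in_list):
    """Average posterior probability of being modified over the genes NOT in the list."""
    p_null = np.asarray(p_null, dtype=float)
    excluded = ~np.asarray(in_list, dtype=bool)
    return float((1.0 - p_null[excluded]).mean()) if excluded.any() else 0.0
```

For example, a list of two genes with posterior null probabilities 0.1 and 0.3 has an estimated FDR of 0.2, that is, 0.4 expected false positives among the two.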

Page 50

Mixture framework for differential expression

• To obtain a gene list, a commonly used method (cf. Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $\delta_g$:
  • H0: $\delta_g = 0$, point mass at 0 with probability $p_0$
  • H1: $\delta_g \sim$ flexible 2-sided distribution to model pattern of differential expression

Classify each gene following its posterior probability of not being in the null: $1 - p_{g0}$

Use Bayes rule or fix the FDR to get a cutoff
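The two ways of choosing a cutoff can be sketched as follows, given the posterior null probabilities of all genes. The Bayes rule is read here as "declare a gene modified when its posterior probability of not being null exceeds 0.5" (equal misclassification costs, an assumption), and the FDR-based rule as "take the longest list, ordered by increasing null probability, whose estimated FDR stays at or below a target". Function names are hypothetical.

```python
import numpy as np

def select_by_bayes_rule(p_null):
    """Gene is declared modified when 1 - p_null > 0.5."""
    return np.asarray(p_null, dtype=float) < 0.5

def select_by_target_fdr(p_null, target):
    """Largest list, ordered by increasing p_null, with estimated FDR <= target."""
    p_null = np.asarray(p_null, dtype=float)
    order = np.argsort(p_null)
    # Running mean of the sorted null probabilities = estimated FDR of each prefix.
    running_fdr = np.cumsum(p_null[order]) / np.arange(1, p_null.size + 1)
    n_keep = int(np.count_nonzero(running_fdr <= target))
    selected = np.zeros(p_null.size, dtype=bool)
    selected[order[:n_keep]] = True
    return selected
```

Because the null probabilities are sorted, the running mean never decreases, so counting the prefixes at or below the target gives the longest admissible list.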

Page 51

Mixture prior for differential expression

• In full Bayesian framework, introduce latent allocation variable zg to help computations

• Joint estimation of all the mixture parameters (including p0) avoids plugging-in of values (e.g. p0) that are influential on the classification

• Sensitivity to prior settings of the alternative distribution

• Performance has been tested on simulated data sets

Poster by Natalia Bochkina

Page 52

Performance of the mixture prior

$y_{g1r} = \alpha_g - \tfrac{1}{2}\delta_g + \varepsilon_{g1r}$, r = 1, …, $R_1$

$y_{g2r} = \alpha_g + \tfrac{1}{2}\delta_g + \varepsilon_{g2r}$, r = 1, …, $R_2$

(For simplification, we assume that the data have been pre-normalised)

$\mathrm{Var}(\varepsilon_{gsr}) = \sigma_{gs}^2 \sim IG(a_s, b_s)$

$\delta_g \sim p_0 \delta_0 + p_1 G(1.5, \lambda_1) + p_2 G(1.5, \lambda_2)$

(first term: H0; remaining terms: H1)

Dirichlet distribution for $(p_0, p_1, p_2)$

Exponential hyper prior for $\lambda_1$ and $\lambda_2$

Page 53

Estimation

• Estimation of all parameters combines information from biological replicates and between condition contrasts

• $s_{gs}^2 = \frac{1}{R_s} \sum_r (y_{gsr} - \bar{y}_{gs\cdot})^2$, s = 1,2: within condition biological variability
• $\bar{y}_{gs\cdot} = \frac{1}{R_s} \sum_r y_{gsr}$: average expression over replicates
• $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$: average expression over conditions
• $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$: between conditions contrast
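The summary statistics listed above can be computed directly from the two replicate matrices. The sketch below follows the formulas as written on the slide, including the 1/R_s divisor for the within condition variance; the function name and the dictionary keys are illustrative.

```python
import numpy as np

def replicate_summaries(y1, y2):
    """y1: (n_genes, R1) log expression in condition 1; y2: (n_genes, R2) in condition 2."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    mean1, mean2 = y1.mean(axis=1), y2.mean(axis=1)
    return {
        "s2_1": ((y1 - mean1[:, None]) ** 2).mean(axis=1),  # within condition 1 variability
        "s2_2": ((y2 - mean2[:, None]) ** 2).mean(axis=1),  # within condition 2 variability
        "mean_1": mean1,                                     # average over replicates
        "mean_2": mean2,
        "average": 0.5 * (mean1 + mean2),                    # average over conditions
        "contrast": 0.5 * (mean1 - mean2),                   # between conditions contrast
    }
```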

Page 54

DAG for the mixture model (g = 1:G)

Nodes: $(a_1, b_1)$; $(a_2, b_2)$; $(\lambda_1, \lambda_2)$; $p$; $z_g$; $\delta_g$; $\alpha_g$; $\sigma_{g1}^2$; $\sigma_{g2}^2$; $s_{g1}^2$; $s_{g2}^2$; $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$; $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$

Page 55

Simulated data

$y_{gr} \sim N(\delta_g, \sigma_g^2)$ (8 replicates)

$\sigma_g^2 \sim IG(1.5, 0.05)$

$\delta_g \sim (-1)^{\mathrm{Bern}(0.5)} G(2,2)$, g = 1:200

$\delta_g = 0$, g = 201:1000

Choice of simulation parameters inspired by estimates found in analyses of biological data sets

Figure: plot of the true differences
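A data set of this form could be generated as in the sketch below. The parameterisations are assumptions, since the slide does not state them: G(2, 2) is read as a gamma with shape 2 and rate 2, and IG(1.5, 0.05) as an inverse gamma with shape 1.5 and scale 0.05, sampled as the reciprocal of a gamma. The function name and seed are illustrative.

```python
import numpy as np

def simulate_dataset(rng, n_genes=1000, n_de=200, n_rep=8):
    """Simulate replicate data with n_de truly differentially expressed genes."""
    delta = np.zeros(n_genes)
    sign = np.where(rng.random(n_de) < 0.5, -1.0, 1.0)           # (-1)^Bern(0.5)
    delta[:n_de] = sign * rng.gamma(shape=2.0, scale=1.0 / 2.0, size=n_de)
    # Inverse gamma draw: reciprocal of a gamma with shape 1.5 and rate 0.05.
    sigma2 = 1.0 / rng.gamma(shape=1.5, scale=1.0 / 0.05, size=n_genes)
    y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None], size=(n_genes, n_rep))
    return y, delta, sigma2

y, delta, sigma2 = simulate_dataset(np.random.default_rng(42))
```

Keeping the true `delta` alongside the data is what allows observed FDR and FNR to be compared with their model-based estimates.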

Page 56

Post Prob (g ∈ H1) = $1 - p_{g0}$

Figure: FDR (black) and FNR (blue) as a function of $1 - p_{g0}$; the Bayes rule cut-off is marked

Observed and estimated FDR/FNR correspond well

Important feature

Page 57

Comparison of mixture classification and posterior probabilities for $\delta_g^*$ (standardised differences)

Figure axes: Post Prob (g ∈ H1) and Probability($|\delta_g^*| > 2 \mid$ data). In red, 200 genes with $\delta_g \ne 0$

31 = 4% False negative

10 = 6% False positive

Page 58

Wrongly classified by mixture: truly dif. expressed, truly not dif. expressed

Classification errors are on the borderline:

Confusion between size of fold change and biological variability

Page 59

Another simulation

2628 data points

Many points added on borderline: classification errors in red

Can we improve estimation of within condition biological variability?

Page 60

DAG for the mixture model (g = 1:G)

Nodes: $(a_1, b_1)$; $(a_2, b_2)$; $(\lambda_1, \lambda_2)$; $p$; $z_g$; $\delta_g$; $\alpha_g$; $\sigma_{g1}^2$; $\sigma_{g2}^2$; $s_{g1}^2$; $s_{g2}^2$; $\tfrac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$; $\tfrac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$

The variance estimates are influenced by the mixture parameters

Use only partial information from the replicates to estimate $\sigma_{gs}^2$ and feed forward in the mixture?

Page 61

Mixture, full vs partial

Classification altered for 57 points:

– In 46 data points, classification improved when ‘feed back from mixture is cut’
– In 11 data points, classification changed but the new classification is incorrect

Work in progress

Page 62

Mixture models in CGH arrays experiments

• Philippe Broët, SR
• Curie Institute oncology department

CGH = Comparative Genomic Hybridization between fluorescein-labelled normal and pathologic samples to an array containing clones designed to cover certain areas of the genome

Page 63

Aim: study genomic alterations

In oncology, where carcinogenesis is associated with complex chromosomal alterations, CGH arrays can be used for detailed analysis of genomic changes in copy number (gains or losses of genetic information) in the tumor sample.

Amplification of an oncogene or deletion of a tumor suppressor gene is considered an important mechanism for tumorigenesis

Loss: tumor suppressor gene. Gain: oncogene

Page 64

Specificity of CGH array experiment

A priori biological knowledge from conventional CGH:

• Limited number of states for a sequence: presence, deletion, gain(s), corresponding to different intensity ratios on the slide
  -> Mixture model to capture the underlying discrete states
• Clones located contiguously on chromosomes are likely to carry alterations of the same type
  -> Use clone spatial location in the allocation model
• Some CGH custom array experiments target restricted areas of the genome
  -> Large proportion of genomic alterations are expected

Page 65

3 component mixture model with spatial allocation

$y_{gr} \sim N(\theta_g, \sigma_g^2)$, normal versus tumoral change, clone g, replicate measure r

$\theta_g \sim w_{g0} N(\mu_0, \eta_0^2) + w_{g1} N(\mu_1, \eta_1^2) + w_{g2} N(\mu_2, \eta_2^2)$

(components: presence, deletion, gain)

$\mu_0$: known central estimate obtained from reference clones

Introduce centred spatial autoregressive Markov random fields $\{u_g^0\}$, $\{u_g^1\}$, $\{u_g^2\}$ with nearest neighbours along the chromosomes (spatial neighbours of g: g-1 and g+1)

Define mixture proportions to depend on the chromosomal location via a logistic model:

$w_{gk} = \exp(u_g^k) / \sum_m \exp(u_g^m)$

This favours allocation of nearby clones to the same component

Work in progress
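The logistic link from the spatial fields to the clone-specific mixture weights is a softmax over the three components. The sketch below shows only that step, for fields that are already given; how the Markov random fields themselves are sampled is not covered, and the function name and array layout (one row per clone, one column per component) are assumptions.

```python
import numpy as np

def mixture_weights(u):
    """u: (n_clones, n_components) array of field values u_g^k.
    Returns w with w[g, k] = exp(u[g, k]) / sum_m exp(u[g, m])."""
    u = np.asarray(u, dtype=float)
    shifted = u - u.max(axis=1, keepdims=True)   # subtract row maximum for numerical stability
    expu = np.exp(shifted)
    return expu / expu.sum(axis=1, keepdims=True)
```

Because neighbouring clones have similar field values under the autoregressive prior, they receive similar weights, which is how the model encourages contiguous stretches of deletion or gain.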

Page 66

Curie Institute CGH platform

Focus on investigating deletion areas on chromosome 1 (tumour suppressor locus)

Data on 190 clones

Figure annotations: Deletion?; Presence?; reference value $\mu_0 = -0.11$

Page 67

Figure: mixture model posterior probability p of clone being deleted; classification with cut-off at p ≥ 0.8; short arm

Page 68

Summary

Bayesian gene expression measure (BGX)

• Good range of resolution, provides credibility intervals

Differential Expression

• Expression-level-dependent normalisation
• Borrow information across genes for variance estimation
• Gene lists based on posterior probabilities or mixture classification

False Discovery Rate

• Mixture gives good estimate of FDR and classifies
• Flexibility to incorporate a priori biological features, e.g. dependence on chromosomal location

Future work

• Mixture prior on BGX index, with uncertainty propagated to mixture parameters
• Comparison of marginal and prior mixture approaches
• Clustering of profiles for more general experimental set-ups

Page 69

Papers and technical reports:

Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (to appear in Biostatistics)

Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (under revision for Biometrics)

Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 22, 2562-2571.

Broët P., Richardson S. and Radvanyi F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683.

Available at http://www.bgx.org.uk/

Thanks