:: microarray analysis ::

:: Microarray analysis :: • Data pre-processing • Normalization • Molecular diagnosis • Statistical classification. Florian Markowetz [email protected]


Post on 14-Jan-2016


TRANSCRIPT

Page 1: :: Microarray analysis ::

:: Microarray analysis ::
• Data pre-processing
• Normalization
• Molecular diagnosis
• Statistical classification

Florian Markowetz [email protected]

Page 2: :: Microarray analysis ::

From experiment to data

Page 3: :: Microarray analysis ::

Raw data are not mRNA concentrations

• tissue contamination

• RNA degradation

• amplification efficiency

• reverse transcription efficiency

• hybridization efficiency and specificity

• clone identification and mapping

• PCR yield, contamination

• spotting efficiency

• DNA support binding

• other array manufacturing related issues

• image segmentation

• signal quantification

• “background” correction

Page 4: :: Microarray analysis ::

Quality control: Noise and reliable signal

Arrays 1 ... n

Probe level / Array level / Gene level

Probe level: quality of the expression measurement of one spot on one particular array

Array level: quality of the expression measurement on one particular glass slide

Gene level: quality of the expression measurement of one probe across all arrays

Page 5: :: Microarray analysis ::

Probe-level quality control

• Individual spots printed on the slide
• Sources of poor quality:
– faulty printing, uneven distribution, contamination with debris, magnitude of signal relative to noise, poorly measured spots
• Visual inspection:
– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:
– Brightness: foreground/background ratio
– Uniformity: variation in pixel intensities and ratios of intensities within a spot
– Morphology: area, perimeter, circularity
– Spot size: number of foreground pixels
• Action:
– set measurements to NA (missing values)
– local normalization procedures which account for regional idiosyncrasies
– use weights for measurements to indicate reliability in later analysis
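The "Action" options above can be sketched in a few lines of Python. This is only an illustration: the function name `flag_spots`, the brightness cutoff `min_ratio=1.5`, and the weight formula are assumptions made for the example and do not come from the slides or from any image-analysis package.

```python
import math

def flag_spots(foreground, background, min_ratio=1.5):
    """Return (values, weights) for a list of spots.

    values:  log2 of the background-corrected signal, or NaN (the "NA" of
             the slide) when the spot fails the brightness check.
    weights: a reliability weight in [0, 1] for later analysis.
    The cutoff and the weight formula are illustrative assumptions.
    """
    values, weights = [], []
    for fg, bg in zip(foreground, background):
        ratio = fg / bg if bg > 0 else float("inf")
        if fg <= bg or ratio < min_ratio:
            # Too dim relative to the local background: treat as missing.
            values.append(float("nan"))
            weights.append(0.0)
        else:
            values.append(math.log2(fg - bg))
            # Fraction of the foreground that lies above the background.
            weights.append(1.0 - bg / fg)
    return values, weights

values, weights = flag_spots([1000, 120, 50], [100, 100, 100])
```

Here the second and third spot are set to NA with weight 0, while the first keeps its value with a weight close to 1.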

Page 6: :: Microarray analysis ::

Spot identification

Individual spots are recognized; size and shape might be adjusted per spot (automatically, with fine adjustments by hand).

Additional manual flagging of bad (X) or non-present (NA) spots

Different spot identification methods: fixed circles, circles with variable size, arbitrary spot shape (morphological opening)

Figure labels: good spot quality; poor spot quality; flags X (bad) and NA (non-present)

Page 7: :: Microarray analysis ::

Spot identification

Histogram of pixel intensities of a single spot

• The signal of the spots is quantified.

"Donuts"

Mean / Median / Mode / 75% quantile
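The four summaries named above can be compared on a toy spot. The sketch below uses only Python's `statistics` module; the pixel values and the helper name `summarize_spot` are made up for illustration, with one very bright pixel standing in for a dust speck.

```python
import statistics

def summarize_spot(pixels):
    """Summarize the pixel intensities of one spot in four ways."""
    # quantiles(..., n=4) returns the three quartile cut points.
    _, _, q75 = statistics.quantiles(pixels, n=4, method="inclusive")
    return {
        "mean": statistics.mean(pixels),
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),
        "q75": q75,
    }

# Hypothetical spot: eight ordinary pixels plus one artifact pixel.
pixels = [100, 100, 100, 110, 120, 130, 140, 150, 4000]
summary = summarize_spot(pixels)
```

The single artifact pixel pulls the mean far above the other three summaries, which is one reason robust summaries such as the median are often preferred.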

Page 8: :: Microarray analysis ::

Local background

GenePix

QuantArray

ScanAlyze

Page 9: :: Microarray analysis ::

Array level quality control

• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner
• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: signals have to be changed substantially to make slides comparable.
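Two of these quality measures are easy to compute directly. The sketch below is a rough illustration: the function name `array_quality` is hypothetical, "no signal" is taken to mean foreground not above background in both channels, and the 30% figure from the slide is read as an upper limit. Both readings are assumptions.

```python
from statistics import mean

def array_quality(fg_red, bg_red, fg_green, bg_green,
                  max_no_signal=0.30, min_ratio=3.0):
    """Compute two array-level quality measures and a pass/fail flag."""
    n = len(fg_red)
    # Assumption: a spot has "no signal" if neither channel exceeds background.
    no_signal = sum(
        1 for i in range(n)
        if fg_red[i] <= bg_red[i] and fg_green[i] <= bg_green[i]
    )
    fraction_no_signal = no_signal / n
    ratio_red = mean(fg_red) / mean(bg_red)
    ratio_green = mean(fg_green) / mean(bg_green)
    passed = (fraction_no_signal <= max_no_signal
              and ratio_red > min_ratio
              and ratio_green > min_ratio)
    return {
        "fraction_no_signal": fraction_no_signal,
        "ratio_red": ratio_red,
        "ratio_green": ratio_green,
        "passed": passed,
    }

# Hypothetical four-spot array; the third spot shows no signal.
report = array_quality(
    fg_red=[900, 1200, 90, 1500], bg_red=[100, 100, 100, 100],
    fg_green=[800, 1000, 95, 1100], bg_green=[100, 100, 100, 100],
)
```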

Page 10: :: Microarray analysis ::

Gene-level quality control

Gene g

• Poor hybridization in the reference channel may introduce bias on the fold-change

• Some probes will not hybridize well to the target RNA

• Printing problems: all spots of a given inventory well have poor quality.

• A well may be of bad quality (contamination)

• Genes with a consistently low signal in the reference channel are suspicious

Page 11: :: Microarray analysis ::

Gene expression data

Rows: genes. Columns: mRNA samples. Each entry is the gene-expression level or ratio for gene i in mRNA sample j.

Gene   sample1   sample2   sample3   sample4   sample5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...

M = log2(red intensity / green intensity)

A = average of log2(red intensity) and log2(green intensity)

Or: function (PM, MM) of MAS, dChip or RMA

Page 12: :: Microarray analysis ::

Scatterplots: data (raw scale) and data (log scale)

Message: look at your data on log-scale!

Page 13: :: Microarray analysis ::

MA Plot

x-axis: A = 1/2 log2(RG)

y-axis: M = log2(R/G)
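As a small worked example of the two axes, the sketch below computes M and A from red (R) and green (G) intensities. The function name `ma_values` and the intensity values are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for paired red/green intensities.

    M = log2(R/G) is the log ratio; A = 1/2 log2(R*G) is the average
    log intensity of the two channels.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

# Hypothetical spots: 4-fold up in red, 4-fold up in green, unchanged.
m, a = ma_values(red=[400, 100, 64], green=[100, 400, 64])
```

The first two spots get M = 2 and M = -2 at the same A, so a 4-fold change in either direction is symmetric around 0. On the raw ratio scale the same spots would sit at 4 and 0.25.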

Page 14: :: Microarray analysis ::

Median centering

Axis label: log signal, centered at 0

One of the simplest strategies is to bring all "centers" of the array data to the same level.

Assumption: the majority of genes are unchanged between conditions.

The median is more robust to outliers than the mean.

Divide all expression measurements of each array by the median (on the log scale: subtract the median log signal, so that each array is centered at 0).
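A minimal sketch of median centering on the log scale, assuming each array is given as a plain list of positive intensities (the function name and the numbers are made up):

```python
import math
import statistics

def median_center(log_signals):
    """Shift the log signals of one array so that their median is 0."""
    med = statistics.median(log_signals)
    return [x - med for x in log_signals]

# Hypothetical raw intensities of one array.
raw = [100, 200, 400, 800, 1600]
log_signals = [math.log2(x) for x in raw]
centered = median_center(log_signals)
```

Subtracting the median log signal is the log-scale counterpart of dividing the raw intensities by their median: here every value ends up as log2(x / 400).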

Page 15: :: Microarray analysis ::

Problem of median-centering

Left panel: scatterplot of log-signals after median-centering (x-axis: Log Green, y-axis: Log Red)

Right panel: M-A plot of the same data (x-axis: A = (Log Green + Log Red) / 2, y-axis: M = Log Red - Log Green)

Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.

Page 16: :: Microarray analysis ::

Lowess normalization

M-A plot (x-axis: A = (Log Green + Log Red) / 2, y-axis: M = Log Red - Log Green)

Local estimate: fit the trend of M as a function of A.

Use the estimate to bend the banana straight.
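The idea can be sketched with a simplified local regression: for each spot, fit a weighted straight line to its neighbours along the A axis and subtract the fitted value from M. This is not a full lowess implementation (the robustness iterations are left out), and the function names, the `frac` parameter, and the toy data are assumptions made for the example.

```python
import math

def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 for u >= 1."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_linear_trend(a, m, frac=0.4):
    """Estimate the trend of M at every A by local weighted linear fits."""
    n = len(a)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours per local fit
    fitted = []
    for a0 in a:
        # Bandwidth: distance to the k-th nearest neighbour along A.
        h = sorted(abs(ai - a0) for ai in a)[k - 1]
        w = [tricube(abs(ai - a0) / h) if h > 0 else float(ai == a0)
             for ai in a]
        sw = sum(w)
        mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mean_m = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - mean_a) * (mi - mean_m)
                  for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def normalize(a, m, frac=0.4):
    """Subtract the local trend from M ("bend the banana straight")."""
    trend = local_linear_trend(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical banana-shaped data: M depends on A although no gene changes.
a = [6 + 0.5 * i for i in range(20)]
m = [0.05 * (x - 11) ** 2 - 0.5 for x in a]
m_normalized = normalize(a, m)
```

After normalization the M values scatter around 0 across the whole intensity range, instead of following the curve.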

Page 17: :: Microarray analysis ::

Summary I

• Raw data are not mRNA concentrations
• We need to check data quality on different levels
– Probe level
– Array level (all probes on one array)
– Gene level (one gene on many arrays)
• Always log your data
• Normalize your data to avoid systematic (non-biological) effects
• Lowess normalization straightens banana

Page 18: :: Microarray analysis ::

From data to knowledge

Ok, now we made sure that our data is of high quality and systematic, non-biological effects are removed.

The result is a gene expression matrix (rows: genes, columns: mRNA samples):

Gene   sample1   sample2   sample3   sample4   sample5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...

Is that already a result? No! It’s just data, not knowledge. We need to use this data to answer a scientific question.

Page 19: :: Microarray analysis ::

Supervised analysis

= learning from examples, classification
– We have already seen groups of healthy and sick people. Now let’s diagnose the next person walking into the hospital.
– We know that these genes have function X (and these others don’t). Let’s find more genes with function X.
– We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs.

Known structure in the data needs to be generalized to new data.

Page 20: :: Microarray analysis ::

Un-supervised analysis

= clustering
– Are there groups of genes that behave similarly in all conditions?
– Disease X is very heterogeneous. Can we identify more specific sub-classes for more targeted treatment?

No structure is known. We first need to find it. Exploratory analysis.

Page 21: :: Microarray analysis ::

Supervised analysis

Calvin, I still don’t know the difference between cats and dogs …

Don’t worry! I’ll show you once more:

Class 1: cats. Class 2: dogs.

Oh, now I get it!!

Page 22: :: Microarray analysis ::

Un-supervised analysis

Calvin, I still don’t know the difference between cats and dogs …

I don’t know it either.

Let’s try to figure it out together …

Page 23: :: Microarray analysis ::

Supervised analysis: setup

• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our class of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– E.g. genes without known function
• Goal: generalization ability
– Build a classifier from the training data that is good at predicting the right class for the new data.

Page 24: :: Microarray analysis ::

One microarray, one dot

Axes: expression of gene 1 (x) and expression of gene 2 (y)

Think of a space with #genes dimensions (yes, it’s hard for more than 3).

Each microarray corresponds to a point in this space.

If gene expression is similar under some conditions, the points will be close to each other.

If gene expression overall is very different, the points will be far away.
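"Close" and "far away" can be made concrete with a distance between points. The sketch below uses the Euclidean distance as one possible choice (the slides do not name a specific distance) and takes the first three sample columns of the expression matrix shown earlier as five-dimensional points.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two arrays seen as points in gene space."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Columns of the example expression matrix (5 genes, so 5 dimensions).
sample1 = [0.46, -0.10, 0.15, -0.45, -0.06]
sample2 = [0.30, 0.49, 0.74, -1.03, 1.06]
sample3 = [0.80, 0.24, 0.04, -0.79, 1.35]

d12 = euclidean_distance(sample1, sample2)
d23 = euclidean_distance(sample2, sample3)
```

With these numbers, sample2 and sample3 are closer to each other than sample1 is to sample2. The same function works unchanged for thousands of genes.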

Page 25: :: Microarray analysis ::

Which line separates best?

A B

C D

Page 26: :: Microarray analysis ::

No sharp knife, but a … FAT PLANE

Page 27: :: Microarray analysis ::

Support Vector Machines

Maximal margin separating hyperplane

Data points closest to the separating hyperplane = support vectors
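To make "margin" and "support vectors" concrete, the sketch below compares three hand-picked candidate lines on toy 2-D data by their margin. It does not train a Support Vector Machine: a real SVM searches over all hyperplanes, whereas here the candidates, the data, and the function names are made up for illustration.

```python
import math

def distances(w, b, points, labels):
    """Signed distance of every point to the hyperplane w.x + b = 0.

    Positive values mean the point lies on the side given by its label.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
            for x, y in zip(points, labels)]

def margin(w, b, points, labels):
    """Distance from the hyperplane to the closest point."""
    return min(distances(w, b, points, labels))

def support_vectors(w, b, points, labels, tol=1e-9):
    """Points that lie closest to the hyperplane (within tol)."""
    d = distances(w, b, points, labels)
    smallest = min(d)
    return [p for p, di in zip(points, d) if di - smallest <= tol]

# Toy data: two genes, three arrays per class.
points = [(3, 3), (4, 3), (4, 5), (1, 1), (0, 2), (1, 0)]
labels = [1, 1, 1, -1, -1, -1]

# Hypothetical candidate lines, each given as (w, b).
candidates = {
    "A": ((1, 1), -4),    # x + y = 4
    "B": ((1, 0), -2),    # x = 2
    "C": ((0, 1), -2.5),  # y = 2.5
}
best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
closest = support_vectors(*candidates[best], points, labels)
```

All three candidates separate the two classes, but line A leaves the widest gap and is therefore the "fattest" of the three.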

Page 28: :: Microarray analysis ::

How well did we do?

Training error: how well do we do on the data we trained the classifier on?

But how well will we do in the future, on new data?

Test error: how well does the classifier generalize?

Same classifier (= line), new data from the same classes.

The classifier will usually perform worse than before: test error > training error

Page 29: :: Microarray analysis ::

Cross-validation

Training error: train the classifier and test it on the same data.

Test error: train on one part of the data, test on the other part.

K-fold cross-validation (here for K=3):

Step 1. Train | Test | Train

Step 2. Test | Train | Train

Step 3. Train | Train | Test
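The K-fold scheme can be written down in a few lines. The sketch below uses a nearest-centroid rule as a stand-in classifier because it fits in a few lines; the classifier, the function names, and the toy data are assumptions for illustration and not the method of the slides.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_centroids(X, y):
    """'Training': compute the mean expression profile of each class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: sum(
        (xi - ci) ** 2 for xi, ci in zip(x, centroids[label])))

def error_rate(centroids, X, y):
    wrong = sum(1 for x, label in zip(X, y) if predict(centroids, x) != label)
    return wrong / len(y)

def cross_validate(X, y, k=3):
    """Average test error over k folds; each fold is the test set once."""
    errors = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        model = train_centroids(train_X, train_y)
        errors.append(error_rate(model, [X[i] for i in test_idx],
                                 [y[i] for i in test_idx]))
    return sum(errors) / k

# Toy data: two genes per array, classes interleaved so every fold sees both.
# With real data, shuffle the samples before splitting.
X = [[0, 0], [5, 5], [1, 0], [5, 6], [0, 1], [6, 5]]
y = ["healthy", "sick", "healthy", "sick", "healthy", "sick"]
cv_error = cross_validate(X, y, k=3)
```

The key point is that the classifier is retrained from scratch in every step and never sees the fold it is tested on.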

Page 30: :: Microarray analysis ::

Summary II

• Supervised and un-supervised learning … are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = lines (hyperplanes) in these spaces
• Support Vector Machines use maximal margin hyperplanes as classifiers
• Classifier performance: test error > training error
• Cross-validation is the right way to evaluate classifier performance

Page 31: :: Microarray analysis ::

Experimental cycle

Stages shown in the diagram:
• Biological question (hypothesis-driven or explorative)
• Experimental design
• Microarray experiment
• Image analysis
• Quality measurement (pass / failed)
• Pre-processing: normalization
• Analysis: estimation, testing, clustering, discrimination
• Biological verification and interpretation

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination:

He may be able to say what the experiment died of.

Ronald Fisher

Page 32: :: Microarray analysis ::

Books

Gentleman, Carey, Huber, "Bioinformatics and Computational Biology Solutions Using R and Bioconductor", Springer

David W. Mount, "Bioinformatics", Cold Spring Harbor

Terry Speed, "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC

Pierre Baldi & G. Wesley Hatfield, "DNA Microarrays and Gene Expression", Cambridge

Giovanni Parmigiani et al., "The Analysis of Gene Expression Data", Springer

Page 33: :: Microarray analysis ::

And how do I analyze my own data?

www.r-project.org
www.bioconductor.org

• Open source
• Free
• Easy installation
• Helpful community
• High quality standards
• Regularly maintained and updated
• Tons of documentation
• Every package comes with example vignettes to walk you through standard tasks.

Page 34: :: Microarray analysis ::

Acknowledgements

• I ‘borrowed’ slides from: Tim Beissbarth, Achim Tresch, Wolfgang Huber, Ulrich Mansmann, Terry Speed, Jean Yang, Benedikt Brors, Anja von Heydebreck, Rainer König

• More info on microarray analysis, lectures, tutorials:

http://compdiag.molgen.mpg.de/ngfn/