Searching for Differentially Expressed Genes

INSTITUTE OF BIOSTATISTICS AND ANALYSES. UK SCIENCE & INNOVATION NETWORK, BRITISH EMBASSY. Searching for Differentially Expressed Genes. Eva Budinská. Bioinformatics Conference on Genomics and Proteomics Data Analysis, 25-27.11.2009, Brno, Czech Republic.

Page 1: Searching for Differentially Expressed Genes

INSTITUTE OF BIOSTATISTICS AND ANALYSES

UK SCIENCE & INNOVATION NETWORK, BRITISH EMBASSY

Searching for Differentially Expressed Genes

Eva Budinská

Bioinformatics Conference on Genomics and Proteomics Data Analysis

25-27.11.2009, Brno, Czech Republic

Page 2: Searching for Differentially Expressed Genes

Biological/medical researcher: "We need you to find the differentially expressed genes in our dataset..."

Data analyst: "Euh???"

[Picture label: External HDD]

Page 7: Searching for Differentially Expressed Genes

Searching for differentially expressed genes

1. WHAT is a differentially expressed gene?

2. WHY should we search for it?

3. WHERE can we search for it?

4. HOW to find it?

5. (WHO should do that? WHY me?!...)

Page 9: Searching for Differentially Expressed Genes

WHAT is an expressed gene?

Gene expression

A gene is expressed when it is being transcribed into mRNA.

If a gene is expressed, we say it is active.

We can measure gene expression by the amount of mRNA.

DNA → mRNA: transcription (~ expression)

mRNA → Protein: translation

Page 10: Searching for Differentially Expressed Genes

WHAT is a differentially expressed gene?

Differentially expressed gene: compare two samples. The gene is expressed MORE in one sample than in the other.

Healthy colon tissue: DNA → mRNA

Colon cancer tissue: DNA → mRNA

Page 11: Searching for Differentially Expressed Genes

Searching for differentially expressed genes

1. WHAT is a differentially expressed gene?

2. WHY should we search for it?

3. WHERE can we search for it?

4. HOW to find it?

5. (WHO should do that? WHY me?!...)

Page 12: Searching for Differentially Expressed Genes

WHY should we search for it?

In MEDICINE

To understand:

- the mechanism of diseases: DISEASE / HEALTHY TISSUE

- why some patients respond to the therapy and some do not: RESPONDERS / NON-RESPONDERS

New therapeutic targets, optimized therapy and prevention

In BIOLOGY

To study mechanisms of adaptation (bacteria in extreme conditions, parasites in a host organism, ...)

Page 13: Searching for Differentially Expressed Genes

Searching for differentially expressed genes

1. WHAT is a differentially expressed gene?

2. WHY should we search for it?

3. WHERE can we search for it?

4. HOW to find it?

5. (WHO should do that? WHY me?!...)

Page 14: Searching for Differentially Expressed Genes

WHERE can we search for it?

We can measure gene expression by the amount of mRNA.

mRNA can be extracted from the cells of any living organism.

In medicine:

- tissues/organs

- extracted tumors, nodes

- blood

- bone marrow

In biology, mRNA is extracted from:

- bacteria

- tissues of plants

- tissues/organs

Page 15: Searching for Differentially Expressed Genes

Searching for differentially expressed genes

1. WHAT is a differentially expressed gene?

2. WHY should we search for it?

3. WHERE can we search for it?

4. HOW to find it?

5. (WHO should do that? WHY me?!...)

Page 16: Searching for Differentially Expressed Genes

HOW can we find differentially expressed genes?

Exploring gene by gene:

- RT-PCR

- FISH

OR

Thousands of genes in one experiment:

- Microarrays

Page 17: Searching for Differentially Expressed Genes

Microarrays

Quantify the pixel intensity of each gene in each channel (green, red)

~ numbers equivalent to the amount of mRNA

~ gene activity

~ gene expression

Page 18: Searching for Differentially Expressed Genes

HOW can we find a differentially expressed gene?

A. Healthy colon tissue: DNA → mRNA

B. Colon cancer tissue: DNA → mRNA

FOLD CHANGE: 9/3 = 3

The gene is 3 times more expressed in colon cancer than in healthy colon tissue.

Page 19: Searching for Differentially Expressed Genes

HOW can we find differentially expressed genes?

METHODS

Fold change rules

Hypothesis testing

Regression strategies

Page 21: Searching for Differentially Expressed Genes

Fold change rules

All genes that show at least a 2-fold change in expression (in either direction) are considered differentially expressed between the two samples.

Why do it:

EASY

Why NOT:

- Smaller changes can be biologically significant! (Small effects can add up inside a group of genes from the same pathway.)

- The data come with biological and technical variability: what about a fold change of 1.9?

- The fold changes can be biased towards no change, i.e. log fold changes towards zero (mix of tumor and normal cells).

- No assessment of statistical significance

=> Statistical testing
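The rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and apart from the 9/3 pair taken from the earlier slide the expression values are hypothetical. Working on the log2 scale makes up- and down-regulation symmetric, and the 1.9 case shows how a hard cut-off drops a gene that sits just below the threshold.

```python
import math

def fold_change(expr_b, expr_a):
    """Ratio of the expression in sample B to the expression in sample A."""
    return expr_b / expr_a

def passes_fold_change_rule(expr_b, expr_a, min_fold=2.0):
    """True when expression changes at least `min_fold` times in either direction."""
    log_fc = math.log2(fold_change(expr_b, expr_a))   # symmetric around 0
    return abs(log_fc) >= math.log2(min_fold)

# (sample A, sample B); only the first pair comes from the earlier slide,
# the other two are hypothetical values
genes = {
    "slide_example": (3.0, 9.0),     # fold change 3
    "down_regulated": (8.0, 2.0),    # fold change 0.25, i.e. 4-fold down
    "borderline": (10.0, 19.0),      # fold change 1.9
}
for name, (a, b) in genes.items():
    print(name, fold_change(b, a), passes_fold_change_rule(b, a))
```

The borderline gene is rejected although its change is nearly as large as the threshold, which is the weakness the slide points out.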

Page 22: Searching for Differentially Expressed Genes

HOW can we find differentially expressed genes?

METHODS

Fold change rules

Hypothesis testing

Regression strategies

Page 23: Searching for Differentially Expressed Genes

Hypothesis testing I.

Is the mean expression of a gene in group A different from the mean expression in group B?

Conduct a statistical test for each gene g = 1, . . . , m, giving test statistics Tg and corresponding p-values.

Choosing a statistical test

Number of groups to compare: 2

- Data have Gaussian distribution, YES: T-test

- Data have Gaussian distribution, NO: Mann-Whitney test

Number of groups to compare: >2

- Data have Gaussian distribution, YES: ANOVA

- Data have Gaussian distribution, NO: Kruskal-Wallis test
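The decision chart above can be written as a small lookup function. This is a sketch only: the function name and arguments are invented for illustration, and whether the data are Gaussian is assumed to have been judged beforehand.

```python
def choose_test(n_groups: int, gaussian: bool) -> str:
    """Return the test suggested by the decision chart above."""
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups == 2:
        return "T-test" if gaussian else "Mann-Whitney test"
    return "ANOVA" if gaussian else "Kruskal-Wallis test"

print(choose_test(2, gaussian=True))
print(choose_test(3, gaussian=False))
```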

Page 24: Searching for Differentially Expressed Genes

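Hypothesis testing II.

The two-sample T-test can be used to test the equality of the group means μ1, μ2:

$$ T_g = \frac{\bar{x}_{g1} - \bar{x}_{g2}}{s_g \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where s_g measures the variability of gene g and n1, n2 are the group sizes.

The p-value pg is the probability that the test statistic under the null hypothesis (here: μ1 = μ2) is at least as extreme as the observed value Tg. For a two-sided test, pg = Pr(|T| ≥ |Tg|), where T follows the null distribution of the test statistic.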
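A minimal sketch of the test statistic for one gene, assuming the equal-variance (pooled) form of the two-sample T-test; the expression values are hypothetical. Turning Tg into a p-value additionally needs the t distribution with n1 + n2 - 2 degrees of freedom, which is left out here to keep the example within the standard library.

```python
import math
from statistics import mean, variance

def t_statistic(group1, group2):
    """Two-sample T statistic with a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # pooled variance: within-group variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var)
    return (mean(group1) - mean(group2)) / (s * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical expression values of one gene in two groups
healthy = [1.0, 2.0, 3.0]
cancer = [4.0, 5.0, 6.0]
print(t_statistic(healthy, cancer))
```

In a microarray analysis the same function would be applied to every gene g = 1, ..., m in turn.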

Page 25: Searching for Differentially Expressed Genes

Multiple hypothesis testing problem

Thousands of genes on microarray slide

Thousands of hypotheses are tested simultaneously

Increased chance of false positives

Example: 10 000 genes on a chip, none of them differentially expressed => we still expect 0.05 x 10 000 = 500 genes with p-value < 0.05.

p–values <0.05 do not correspond to significant findings anymore

We need to ADJUST for this multiple testing problem
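The example above can be checked with a small simulation. It relies on the fact that, when the null hypothesis is true, a p-value is uniformly distributed between 0 and 1; the seed is arbitrary and the variable names are illustrative.

```python
import random

random.seed(2009)  # arbitrary seed, only for reproducibility

m = 10_000      # genes on the chip, none differentially expressed
alpha = 0.05

# Under the null hypothesis a p-value is uniformly distributed on [0, 1]
p_values = [random.random() for _ in range(m)]
false_positives = sum(p < alpha for p in p_values)

print(false_positives)   # close to alpha * m = 500
```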

Page 26: Searching for Differentially Expressed Genes

Adjustment for multiple hypothesis testing problem

| | # non-rejected | # rejected |
|---|---|---|
| # non-diff. expressed genes | True Negatives (TN) | False Positives (FP): Type I error |
| # diff. expressed genes | False Negatives (FN): Type II error | True Positives (TP) |

Type I error rates

1. Family-wise error rate (FWER): the probability of at least one Type I error (false positive): FWER = Pr(FP > 0)

2. False discovery rate (FDR) (Benjamini & Hochberg, 1995): the expected proportion of false positives among all positives (rejected hypotheses).
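To see how quickly the family-wise error rate grows, consider the special case of m independent tests, each carried out at level α, with every null hypothesis true; then FWER = 1 - (1 - α)^m. The sketch below only covers that special case, and the function name is illustrative.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent
    true-null tests, each carried out at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 10_000):
    print(m, round(fwer_independent(0.05, m), 4))
```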

Page 27: Searching for Differentially Expressed Genes

Adjusting p-values

Controlling the Family-Wise Error Rate (FWER)

Bonferroni correction (for independent tests): reject when p < α / m (e.g. p < 0.05/10 000)

Controlling the False Discovery Rate (FDR)

Benjamini/Hochberg procedure:

Ordered unadjusted p-values: P(1) ≤ ... ≤ P(m)

To control the FDR at level α, find the largest k such that P(k) ≤ (k / m) α

Reject the hypotheses Hj for j = 1, . . . , k.

FDR = 10% (from 100 rejected hypotheses we can expect 10 false positives)
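Both corrections can be sketched in plain Python. The function names and the five p-values are hypothetical; the Benjamini/Hochberg part uses the standard step-up rule, i.e. the largest k with P(k) ≤ (k / m) α.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Bonferroni rule p < alpha / m."""
    m = len(p_values)
    return {i for i, p in enumerate(p_values) if p < alpha / m}

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini/Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank                                      # keep the largest rank that passes
    return set(order[:k])

# Hypothetical unadjusted p-values for five genes
p = [0.012, 0.039, 0.029, 0.005, 0.5]
print(sorted(bonferroni_reject(p)))
print(sorted(benjamini_hochberg_reject(p)))
```

On these values Bonferroni keeps a single gene while Benjamini/Hochberg keeps four, which illustrates the trade-off between the two error rates.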

Page 28: Searching for Differentially Expressed Genes

Adjustment for multiple hypothesis testing problem

Control the FWER if we want ALL selected genes to be truly significant. However, many differentially expressed genes may then not appear significant.

Control the FDR if we prefer to pick up the majority of differentially expressed genes and can tolerate some false positives.

Page 29: Searching for Differentially Expressed Genes

Significance analysis of microarrays (SAM)

- Introduced in 2001 by Tusher, Tibshirani and Chu

- Permutation algorithm for false discovery rate (FDR) estimation.

- Based on a modified t-statistic:

$$ d_i = \frac{\bar{x}_{i1} - \bar{x}_{i2}}{s_i + s_0} $$

  where s_i is the standard error of the difference for gene i and s_0 is a tuning constant (adjustment for variability in data)

- The statistical significance of the observed score di is then assessed by permuting the original data and calculating the expected score de (d score distribution).

- A gene is considered statistically significant when |di - de| > Δ.

- Advantage: easy to use, methodologically simple

- Disadvantage: computationally intensive, high memory requirements
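The idea can be sketched as follows. This is a simplified illustration and not the published SAM implementation: the tuning constant s0 and the threshold Δ are fixed by hand here (SAM estimates s0 from the data and derives cut-offs and an FDR estimate from Δ), the expected scores are taken as the average order statistics over random label permutations, and the rule |di - de| > Δ is applied directly. All names and data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def d_score(values, labels, s0):
    """Modified t-statistic: difference of group means over (standard error + s0)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    n1, n0 = len(g1), len(g0)
    pooled_var = ((n1 - 1) * variance(g1) + (n0 - 1) * variance(g0)) / (n1 + n0 - 2)
    s = math.sqrt(pooled_var * (1 / n1 + 1 / n0))
    return (mean(g1) - mean(g0)) / (s + s0)

def sam_sketch(data, labels, s0=0.1, delta=5.0, n_perm=200, seed=0):
    """Return indices of genes with |d_i - d_e| > delta (simplified rule)."""
    rng = random.Random(seed)
    observed = [d_score(row, labels, s0) for row in data]
    rank_sums = [0.0] * len(data)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)                                  # permute the sample labels
        scores = sorted(d_score(row, perm, s0) for row in data)
        rank_sums = [a + b for a, b in zip(rank_sums, scores)]
    expected = [total / n_perm for total in rank_sums]     # expected order statistics
    by_rank = sorted(range(len(data)), key=lambda i: observed[i])
    return {gene for rank, gene in enumerate(by_rank)
            if abs(observed[gene] - expected[rank]) > delta}

# Hypothetical data: 20 noise genes plus one clearly shifted gene (index 20)
rng = random.Random(1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [[rng.gauss(0, 1) for _ in labels] for _ in range(20)]
data.append([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])
print(sam_sketch(data, labels))
```

Every permutation recomputes the score of every gene, which is where the computational cost mentioned on the slide comes from.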

Page 30: Searching for Differentially Expressed Genes

Significance analysis of microarrays

- A gene is considered statistically significant when satisfying |di - de| > Δ.

[Plot of the observed score di and the expected score de]

Page 31: Searching for Differentially Expressed Genes

Volcano plots I.

Page 32: Searching for Differentially Expressed Genes

Volcano plots II.

-log10(q-value) cut-off: -log10(0.1) = 1 (the value 2.3 corresponds to the natural logarithm, -ln(0.1))
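A volcano plot shows, for every gene, the log fold change on the horizontal axis and the negative base-10 logarithm of its p- or q-value on the vertical axis, so a 10 % FDR cut-off sits at -log10(0.1) = 1. The sketch below only computes the coordinates; the gene names and values are hypothetical, and plotting is left out.

```python
import math

def volcano_point(log_fc: float, q_value: float):
    """Coordinates of one gene on a volcano plot: (log fold change, -log10 q-value)."""
    return log_fc, -math.log10(q_value)

q_cutoff = 0.1                        # FDR = 10 %
y_cutoff = -math.log10(q_cutoff)      # height of the horizontal line on the plot
print(y_cutoff)

# Hypothetical genes: (log fold change, q-value)
genes = {"gene_a": (1.8, 0.001), "gene_b": (-0.2, 0.6), "gene_c": (-1.4, 0.04)}
for name, (log_fc, q) in genes.items():
    x, y = volcano_point(log_fc, q)
    print(name, x, round(y, 2), y > y_cutoff)
```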

Page 33: Searching for Differentially Expressed Genes

HOW can we find differentially expressed genes?

METHODS

Fold change rules

Hypothesis testing

Regression strategies

Page 34: Searching for Differentially Expressed Genes

Regression strategies

When more than one variable can affect the gene expression:

gene expression ~ group + age + gender

Method: linear modelling

When we want to find out how much the gene expression changes when the value of some continuous variable changes:

gene expression ~ overall survival

gene expression ~ age

Methods: linear modelling, Cox proportional hazards model

When we want to find the probability that a sample belongs to a certain group given the expression level of a gene:

Method: logistic regression
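The simplest of these models, gene expression ~ age, can be fitted by ordinary least squares without any library. This is a sketch for a single covariate only, with hypothetical data; models with several covariates such as group + age + gender need a proper linear-modelling package.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y ~ intercept + slope * x for a single covariate."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: patient age and expression of one gene
age = [30, 40, 50, 60, 70]
expression = [5.0, 5.5, 6.0, 6.5, 7.0]
intercept, slope = fit_line(age, expression)
print(intercept, slope)
```

The slope estimates how much the expression changes per year of age; a gene whose slope differs significantly from zero would be reported as associated with age.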

Page 35: Searching for Differentially Expressed Genes

Searching for differentially expressed genes

Number of factors: 1

- Number of groups to compare: 2
  - Data have Gaussian distribution, YES: T-test, Linear models, SAM
  - Data have Gaussian distribution, NO: Mann-Whitney test, SAM

- Number of groups to compare: >2
  - Data have Gaussian distribution, YES: ANOVA, Linear models, SAM
  - Data have Gaussian distribution, NO: Kruskal-Wallis test, SAM

- Continuous response variable: Linear models, Cox proportional hazards models (survival times)

Number of factors: >1

- Linear models

Page 36: Searching for Differentially Expressed Genes

What to do with a list of differentially expressed genes?

Ad-hoc pathway analysis

Clustering genes in order to determine the groups of genes

Clustering samples for control purposes

Compare to other datasets (meta-analysis)

Page 37: Searching for Differentially Expressed Genes

Ad-hoc pathway analysis

Page 38: Searching for Differentially Expressed Genes
Page 39: Searching for Differentially Expressed Genes
Page 40: Searching for Differentially Expressed Genes

EXAMPLE

Page 41: Searching for Differentially Expressed Genes

Microsatellite instability (MSI) in colon cancer

• MSI tumors:

– are observed in ~15 % of sporadic colon cancers

– show high microsatellite instability
  • due to epigenetic silencing of mismatch repair genes (hypermethylation of MLH1, MSH2, MSH6)

– show an increased immune response
  • infiltration of tumor epithelium by T-lymphocytes
  • HLA class increased expression

– have better survival

Page 42: Searching for Differentially Expressed Genes

Differential gene-expression analysis of MSI vs. MSS

• Aim:
  • Find MSI gene expression signatures stable across different datasets

• We have analyzed 3 publicly available datasets
  • Affymetrix HG-U133_Plus_2 (54675 probesets)

• Analysis:
  • Significance analysis of microarrays SEPARATELY on each of the datasets
  • Compared lists of differentially expressed genes at FDR = 10%
  • Genes significant in ALL THREE DATASETS were entered into KEGG pathway analysis

| Dataset | MSI/MSS |
|---|---|
| GSE4554 | 33/51 (39.3% / 60.7%) |
| GSE13067 | 11/62 (15.1% / 84.9%) |
| GSE13294 | 78/77 (50.3% / 49.7%) |
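The comparison step of this analysis amounts to intersecting the per-dataset lists of significant genes. A minimal sketch, assuming each dataset's analysis yields a q-value per gene; the gene names and q-values below are placeholders, not the real results.

```python
def significant(q_values: dict, fdr: float = 0.10) -> set:
    """Genes whose q-value is at or below the FDR threshold."""
    return {gene for gene, q in q_values.items() if q <= fdr}

# Hypothetical per-dataset q-values (placeholders, not the real SAM output)
results = {
    "GSE4554":  {"gene_a": 0.01, "gene_b": 0.30, "gene_c": 0.04},
    "GSE13067": {"gene_a": 0.02, "gene_b": 0.05, "gene_c": 0.09},
    "GSE13294": {"gene_a": 0.00, "gene_b": 0.01, "gene_d": 0.03},
}

# Keep only the genes that are significant in every dataset
shared = set.intersection(*(significant(q) for q in results.values()))
print(sorted(shared))
```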

Page 43: Searching for Differentially Expressed Genes

SAM results

[Figure: SAM results for GSE4554, GSE13067 and GSE13294]

685 differentially expressed genes in all three datasets at FDR <= 10%

Page 44: Searching for Differentially Expressed Genes

TOP 15 DOWN-regulated genes in MSI (significant in all 3 datasets)

| Gene Symbol | Gene Title | GSE4554 adj.p.val | GSE4554 logFCH | GSE13067 adj.p.val | GSE13067 logFCH | GSE13294 adj.p.val | GSE13294 logFCH |
|---|---|---|---|---|---|---|---|
| TNNC2 | troponin C type 2 (fast) | 0.003 | -1.01 | 0.003 | -0.49 | 0.000 | -1.12 |
| 7A5 | metastasis associated in colon cancer 1 | 0.000 | -1.33 | 0.019 | -1.32 | 0.000 | -1.54 |
| ZMYND8 | zinc finger, MYND-type containing 8 | 0.003 | -0.88 | 0.000 | -1.04 | 0.000 | -0.98 |
| RNF43 | ring finger protein 43 | 0.007 | -0.91 | 0.000 | -1.60 | 0.000 | -1.30 |
| SYT7 | synaptotagmin VII | 0.306 | -0.49 | 0.065 | -0.40 | 0.000 | -0.86 |
| TSPAN6 | tetraspanin 6 | 0.007 | -0.93 | 0.000 | -1.50 | 0.000 | -1.01 |
| ASCL2 | achaete-scute complex homolog 2 | 0.002 | -1.30 | 0.001 | -1.90 | 0.000 | -1.87 |
| TDGF1 | teratocarcinoma-derived growth factor 1 | 0.000 | -1.87 | 0.000 | -2.61 | 0.000 | -2.32 |
| ATP9A | ATPase, class II, type 9A | 0.000 | -1.20 | 0.000 | -1.40 | 0.000 | -1.19 |
| GABRE | GABA A receptor, epsilon | 0.062 | -0.70 | 0.022 | -0.89 | 0.000 | -1.25 |
| PROX1 | prospero homeobox 1 | 0.004 | -1.11 | 0.021 | -1.15 | 0.000 | -1.23 |
| VIL1 | villin 1 | 0.034 | -0.76 | 0.057 | -0.55 | 0.000 | -1.09 |
| NOX1 | NADPH oxidase 1 | 0.014 | -1.17 | 0.001 | -2.44 | 0.000 | -2.03 |
| PLAGL2 | pleiomorphic adenoma gene-like 2 | 0.007 | -0.91 | 0.000 | -1.49 | 0.000 | -0.88 |
| A1CF | APOBEC1 complementation factor | 0.160 | -0.52 | 0.002 | -1.18 | 0.000 | -1.27 |

Page 45: Searching for Differentially Expressed Genes

TOP 15 UP-regulated genes in MSI (significant in all 3 datasets)

| Gene Symbol | Gene Title | GSE4554 adj.p.val | GSE4554 logFCH | GSE13067 adj.p.val | GSE13067 logFCH | GSE13294 adj.p.val | GSE13294 logFCH |
|---|---|---|---|---|---|---|---|
| KDELR3 | Homo sapiens KDEL endoplasmic reticulum protein | 0.002 | 1.01 | 0.000 | 0.90 | 0.000 | 0.90 |
| TRIB2 | tribbles homolog 2 (Drosophila) | 0.000 | 1.33 | 0.000 | 1.41 | 0.000 | 1.69 |
| TFAP2A | Homo sapiens AP-2 gene for transcription factor AP-2 | 0.000 | 1.52 | 0.000 | 2.35 | 0.000 | 1.78 |
| TRIM7 | tripartite motif-containing 7 | 0.000 | 2.46 | 0.000 | 1.55 | 0.000 | 2.01 |
| KCNK1 | potassium channel, subfamily K, member 1 | 0.041 | 0.79 | 0.002 | 1.33 | 0.000 | 0.86 |
| CTSE | cathepsin E | 0.077 | 0.96 | 0.043 | 1.53 | 0.000 | 1.18 |
| CATSPERB | cation channel, sperm-associated, beta | 0.151 | 0.55 | 0.109 | 0.27 | 0.000 | 1.03 |
| DUSP4 | dual specificity phosphatase 4 | 0.000 | 1.90 | 0.000 | 2.73 | 0.000 | 1.83 |
| CCDC68 | coiled-coil domain containing 68 | 0.030 | 0.76 | 0.002 | 1.17 | 0.000 | 1.29 |
| LSMD1 | LSM domain containing 1 | 0.001 | 1.03 | 0.000 | 1.00 | 0.000 | 0.86 |
| SECTM1 | secreted and transmembrane 1 | 0.006 | 0.94 | 0.105 | 0.59 | 0.000 | 1.30 |
| LMO4 | LIM domain only 4 | 0.018 | 0.75 | 0.001 | 1.03 | 0.000 | 0.79 |
| CD55 | Homo sapiens decay accelerating factor for complement (CD55, Cromer bloodgroup system) (DAF) gene, complete cds. | 0.024 | 0.85 | 0.029 | 0.94 | 0.000 | 1.28 |
| SPATA18 | spermatogenesis associated 18 homolog (rat) | 0.022 | 0.92 | 0.007 | 1.22 | 0.000 | 0.96 |
| RPL22L1 | ribosomal protein L22-like 1 | 0.000 | 1.32 | 0.000 | 1.80 | 0.000 | 1.68 |

Page 46: Searching for Differentially Expressed Genes

Immune response in MSI

– MSI-H is associated with a higher frequency of activated tumour-infiltrating lymphocytes

Page 47: Searching for Differentially Expressed Genes

TGF-β signaling pathway

– In MSI tumors, inhibition of TGF-β growth suppression results from the frequent frameshift mutation of TGFBR2

– In MSS tumors, from mutation/loss of SMAD4

ROCK2