mining for low-abundance transcripts in microarray data

Miningfor Low-abundance Transcripts

in Microarray Data

Yi Lin1, Samuel T. Nadler2,

Alan D. Attie2, Brian S. Yandell1,3

1Statistics, 2Biochemistry, 3Horticulture,University of Wisconsin-Madison

September 2000

Basic Idea

• DNA microarrays present tremendous opportunities to understand complex processes

• Transcription factors and receptors expressed at low levels, missed by most other methods

• Analyze low expression genes with normal scores• Robustly adapt to changing variability across

average expression levels• Basis for clustering and other exploratory methods• Data-driven p-value sensitive to low abundance &

changes in variability with gene expression

DNA Microarrays

• 100-100,000 items per “chip”– cDNA, oligonucleotides, proteins

– dozens of technologies, changing rapidly

• expose live tissue from organism– mRNA cDNA hybridize with chip

– read expression “signal” as intensity (fluorescence)

• design considerations– compare conditions across or within chips

– worry about image capture of signal

Low Abundance Genes• background adjustment

– remove local “geography”– comparing within and between chips

• negative values after adjustment– low abundance genes

• virtually absent in one condition• could be important genes: transcription factors, receptors

– large measurement variability• early technology (bleeding edge)

• prevalence across genes on a chip– 0-20% per chip– 10-50% across multiple conditions

Log Transformation?• tremendous scale range in mean intensities

– 100-1000 fold common

• concentrations of chemicals (pH)– fold changes have intuitive appeal

• looks pretty good in practice• want transformed data to be roughly normal

– easy to test if no difference across conditions

– looking for genes that are “outliers”• beyond edge of bell shaped curve

• provide formal or informal thresholds

Exploratory Methods• Clustering methods (Eisen 1998, Golub 1999)• Self-organizing maps (Tamayo 1999)

– search for genes with similar changes across conditions

– do not determine significance of changes in expression

– require extensive pre-filtering to eliminate• low intensity

• modest fold changes

– may detect patterns unrelated to fold change

• comparison of discrimination methods (Dudoit 2000)

Confirmatory Methods• ratio-based decisions for 2 conditions (Chen 1997)

– constant variance of ratio on log scale, use normality

• anova (Kerr 2000, Dudoit 2000)– handles multiple conditions in anova model– constant variance on log scale, use normality

• Bayesian inference (Newton 2000, Tsodikov 2000)– Gamma-Gamma model– variance proportional to squared intensity

• error model (Roberts 2000, Hughes 2000)– variance proportional to squared intensity– transform to log scale, use normality

2. rank order genesR=rank(A)/(n+1)

1. adjust forbackground

A=Q – B

3. normal scoresN=qnorm(R)

4. contrastconditionsY=N1 – N2

0. acquire dataQ, B

X = meanY

= c

ontr

ast 6. center & spread

5. mean intensityX=mean(N)

7. standardize

Z=Y – center

spread

Normal Scores Procedure

adjusted expression A = Q – B rank order R = rank(A) / (n+1)normal scores N = qnorm( R )

average intensity X = (N1+N2)/2

difference Y = N1 – N2

variance Var(Y | X) 2(X)standardization S = [Y – (X)]/(X)

Motivate Normal Scores• natural transformation to normality is log(A)• background intensity B = bd +B

– measured with error – attenuation d may depend on condition

• gene measurement Q = [exp(g+h+)+b]d+Q – gene signal g– degree of hybridization h:– intrinsic noise (variance may depend on g)– attenuation (depends thickness of sample, etc.)

• subtract background: A = Q – B– adjusted measurements A = dexp(g+h+)+– symmetric measurement error =B – Q

Motivate Normal Scores (cont.)• adjusted measurements: A = Q – B = dG+• log expression level: log(G) = g+h+• gene signal g confounded with hybridization h

– unless hybridization h independent of condition

• G is observed if – no measurement error – no dye or array effect (no attenuation d)

– no background intensity

• natural under this model to consider N=log(G)– normal scores almost as good

Robust Center & Spread• genes sorted based on X • partitioned into many (about 400) slices

– containing roughly the same number of genes

• slices summarized by median and MAD– MAD = median absolute deviation

– robust to outliers (e.g. changing genes)

• MAD ~ same distribution across X up to scale– MADi = i Zi, Zi ~ Z, i = 1,…,400

– log(MADi ) = log(i) + log( Zi)

• median ~ same idea

Robust Center & Spread

• MAD ~ same distribution across X up to scale– log(MADi ) = log(i) + log( Zi), I = 1,…,400

• regress log(MADi) on Xi with smoothing splines– smoothing parameter tuned automatically

• generalized cross validation (Wahba 1990)

• globally rescale anti-log of smooth curve– Var(Y|X) 2(X)

• can force 2(X) to be decreasing• similar idea for median

– E(Y|X) (X)

Motivation for Spread• log expression level: log(G) = g+h+

– hybridization h negligible or same across conditions

– intrinsic noise may depend on gene signal g

• compare two conditions 1 and 2– Y = N1 – N2 log(G1) – log(G2) = g1 – g2+ 1 -2

• no differential expression: g1=g2=g– Var(Y|g) = 2(g)

– g X suggests condition on X instead, but:• Var(Y | X) not exactly 2(X)

• cannot be determined without further assumptions

Simulation of Spread Recovery

• 10,000 genes: log expression Normal(4,2)

• 5% altered genes: add Normal(0,2)

• no measurement error, attenuation

• estimate robust spread

Robust Spread Estimation

-4 -2 0 2 4

0.2

0.4

0.6

0.8

Estimate of Spread

mean expression

spre

ad

-4 -2 0 2 4

-50

5

Theoretical Quantiles

Sam

ple

Qua

ntile

s

Q-Q Plot

Bonferroni-corrected p-values• standardized normal scores

– S = [Y – (X)]/(X) ~ Normal(0,1) ?

– genes with differential expression more dispersed

• Zidak version of Bonferroni correction– p = 1 – (1 – p1)n

– 13,000 genes with an overall level p = 0.05• each gene should be tested at level 1.95*10-6

• differential expression if S > 4.62

– differential expression if |Y – (X)| > 4.62(X)

• too conservative? weight by X?– Dudoit (2000)

Simulation Study• simulations with two conditions

– 10,000 genes

– g1 ,g2 ~ Normal(4,2) for nonchanging genes

• 5% with differential expression– gc ~ Normal(4,2) + Normal(3-rank(X)/(n+1),1/2)

– up- or down-regulated with probability 1/2

• intrinsic noise ~ Normal(0,0.5)• attenuation = 1

• measurement error variance = 0, 1, 2, 5, 10, 20

Success in Capture

0 100 200 300 400 500

010

020

030

040

050

0

(a) Rank by Normal Scores

Num

ber

of C

hang

ed G

enes

Cap

ture

d 0 1 2 5 10 20

Comparison of Methods• differential expression for two conditions• Newton (2000) J Comp Biol

– Gamma-Gamma-Bernouli model– Bayesian odds of differential expression

• Chen (1997)– constant ratio of expressions– underlying log-normality

• normal scores (Lin 2000)– some (unknown) transformation to normality– robust, smooth estimate of spread & center– Bonferroni-style p-values

Capturing Changed Genes: No Noise

0.02 0.10 0.50 2.00 10.00 50.00

0.1

0.2

0.5

1.0

2.0

5.0

0

0

Fol

d C

hang

e

No Noise: Average Intensity

Capturing Changed Genes: Noise

0.05 0.20 1.00 5.00 20.00

0.05

0.20

1.00

5.00

0

0

Fol

d C

hang

e

Noise 10: Average Intensity

Comparison with E. coli Data

• 4,000+ genes (whole genome)

• Newton (2000) J Comp Biol– Bayesian odds of differential expression

• IPTG-b known to affect only a few genes– ~150 genes at low abundance– including key genes

E. coli with IPTG-b

1 e-02 1 e+00 1 e+02

1 e

-03

1 e

-01

1 e

+03

Raw IPTG-b

0

0

Cy3

/Cy5

rat

io

Mean Intensity0.05 0.20 1.00 5.00 20.00

0.1

0.5

5.0

50.0

500.

0

Normal Scores IPTG-b

0

0

Cy3

/Cy5

rat

io

Mean Intensity

Diabetes & Obesity Study• 13,000+ gene fragments (11,000+ genes)

– oligonuleotides, Affymetrix gene chips

– mean(PM) - mean(NM) adjusted expression levels

• six conditions in 2x3 factorial– lean vs. obese

– B6, F1, BTBR mouse genotype

• adipose tissue– influence whole-body fuel partitioning

– might be aberrant in obese and/or diabetic subjects

• Nadler (2000) PNAS

Low Abundance Obesity Genes• low mean expression on at least 1 of 6 conditions

– negative adjusted values

– ignored by clustering routines

• transcription factors– I-B modulates transcription - inflammatory processes

– RXR nuclear hormone receptor - forms heterodimers with several nuclear hormone receptors

• regulation proteins– protein kinase A

– glycogen synthase kinase-3

• roughly 100 genes– 90 new since Nadler (2000) PNAS

Table 1. Genes Differentially Expressed with Obesity as Identified bythe Normal Scores Procedure

Accesion # Description Obese/Lean Lean/Obese Average p-value

aa608251 Mus musculus Dp1l1 mRNA for polyposis locus protein 1-like 1 (TB2 protein-like1)

0 940.01 0.535 0.0000

x93999 M.musculus Gal beta-1,3-GalNAc-specific GalNAc alpha-2,6-sialyltransferasegene

0.01 74.78 0.513 0.0000

w45955 No significant homologies 0.01 80.52 0.356 0.0001U16297 Mus musculus cytochrome B561 (mcyt) 0.01 95.95 0.506 0.0000X72862* M.musculus gene for beta-3-adrenergic receptor 0.01 104.54 1.575 0.0000X12521 Mouse mRNA for transition protein 1 TP1 0.01 105.3 0.751 0.0000aa544897 No significant homologies 0.01 138.69 0.346 0.0000AA062269 Mus musculus protamine 0.02 41.58 0.877 0.0009AA144629 Mus musculus transition protein 1 (Tnp1) 0.02 42.71 0.71 0.0008X63349 M.musculus tyrosinase-related protein-2 0.02 43.42 1.599 0.0000AA116710 Mus musculus serum response factor (Srf) 0.02 57.02 0.847 0.0001AA105229 Mus musculus hepsin (Hpn) 0.02 59.34 0.318 0.0047aa267014 No significant homologies 0.02 62.72 1.154 0.0000aa408365 Mus musculus mRNA for SRp25 nuclear protein 0.02 65.53 0.38 0.0001j05663 Mouse vas deferens androgen related protein (MVDP) 0.03 30.75 0.682 0.0105w49331 No significant homologies 0.03 31.1 0.849 0.0093J02935 Mouse cAMP-dependent protein kinase type II regulatory subunit 0.03 32.02 0.399 0.0148aa190005 No significant homologies 0.03 34.86 1.056 0.0015m31810 Mouse 2',3'-cyclic-nucleotide 3'-phosphodiesterase 0.03 35.21 0.512 0.0037AA033381 No significant homologies 0.03 36.63 0.506 0.0028aa616759 No significant homologies 0.03 36.67 0.522 0.0027u02883 Mus musculus Balb/c mammary-derived growth inhibitor (MDGI) 0.03 38.1 1.002 0.0010AA198324 No significant homologies 0.03 39.31 0.428 0.0016D14883 Mouse C33/R2/IA4 0.04 25.01 1.134 0.0119W36455 Mus musculus adipsin (Adn) 0.04 25.63 12.089 0.0000m27501 Mus musculus protamine 2 0.04 27.47 0.99 0.0143

X87096 M.musculus brevican 51.31 0.02 0.34 0.0038AA048974 No significant homologies 51.32 0.02 0.565 0.0001ET63455 Mus musculus serum amyloid A-4 protein (Saa4) 51.57 0.02 0.31 0.0108X65026 M.musculus mRNA for GTP-binding protein 54.05 0.02 0.81 0.0001AA138292 Similar to Rat S6 kinase 54.25 0.02 0.419 0.0001ET62360 CALCIUM-DEPENDENT PHOSPHOLIPASE A2 PRECURSOR 56.31 0.02 0.417 0.0001w64688 No significant homologies 57.06 0.02 0.943 0.0000AA035870 No significant homologies 57.27 0.02 1.091 0.0000AA111277 Mus musculus hippocalcin-like 1 (Hpcal1) 61.46 0.02 0.348 0.0008w19022 No significant homologies 62.54 0.02 0.694 0.0000w48392 similar X75018 Id4 helix-loop-helix protein 66.48 0.02 0.573 0.0000AA168767 Sequence withdrawn 70.1 0.01 0.09 0.0072X99143 M.musculus hair keratin, mHb6 71.31 0.01 0.203 0.0064AA137962 similar to Human ras-related protein rab-14 81.78 0.01 1.076 0.0000W10926 Mus musculus mRNA for ubiquitin like protein 82.62 0.01 0.762 0.0000AA103862 Similar to Human hqp0256 84.63 0.01 0.268 0.0013J04179 Mouse chromatin nonhistone high mobility group protein (HGM-I(Y)) 88.29 0.01 1.138 0.0000X03505 Mouse gene exon 2 for serum amyloid A (SAA) 3 protein 88.76 0.01 0.693 0.0000U09507 Mus musculus p21 (Waf1) 92.52 0.01 0.724 0.0000AA023099 Mus musculus dUTPase 122.37 0.01 0.221 0.0002u77460 Mus musculus anaphylatoxin C3a receptor 130.36 0.01 1.191 0.0000x66224 M.musculus retinoid X receptor-beta 197.77 0.01 0.256 0.0000AA154294 Mus musculus protein tyrosine phosphatase (Ptpn16) 233.62 0 0.306 0.0000W85163 Mus musculus migration inhibitory factor 272.49 0 0.168 0.0000

Low Abundance Genes for Obesity

0.02 0.05 0.20 0.50 2.00 5.00 20.00

0.05

0.20

0.50

2.00

5.00

Average Intensity for Obesity

Obe

se v

s. L

ean

Microarray ANOVAs• Kerr (2000)• gene by condition interaction

– Nijk = genei + conditionj + gene*conditionij + rep errorijk

• conditions organized in factorial design– experimental units may be whole or part of array

• genes are random effects– focus on outliers (BLUPs), not variance components

– gene*conditionij = differential expression

– allow variance to depend on genei main effect

• replication to improve precision, catch gross errors

Microarray Random Effects• variance component for non-changing genes

– robust estimate of MS(G*C) using smoothed MAD

– rescale normal score response N by spread (X)

– look for differential expression• or use clustering methods

• variance component for replication– robust estimate of MSE using smoothed MAD

– look for outliers = gross errors

Obesity Genotype Main Effects

-10 -5 0 5

-50

510

Obesity Additive

Obe

sity

Dom

inan

ce

Microarray QTLs• condition may be genotype

– whole organism or pattern of genes

– genotype may be inferred rather than known exactly

• Nijk = genei + QTLj + gene*QTLij + individualijk

• QTL genotype depends on flanking markers– mixture model across possible QTL genotypes

– single vs. multiple QTL

• single QTL may influence numerous genes– epistasis = inter-genic interaction

– modification of biochemical pathway(s)

mining for low-abundance transcripts in microarray data

Documents

low abundance changes

log scale

important genes

adjustmentlow abundance

normal scoresrobustly

significance of changes

similar changes

low levels