linear and nonlinear modeling class web site: statistics for microarrays
Post on 22-Dec-2015
218 views
TRANSCRIPT
Linear and Nonlinear Modeling
Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/
Statistics for Microarrays
cDNA gene expression data
Data on G genes for n samples
Genes
mRNA samples
Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
sample1 sample2 sample3 sample4 sample5 …
1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
Identifying Differentially Expressed Genes
• Goal: Identify genes associated with covariate or response of interest
• Examples:– Qualitative covariates or factors:
treatment, cell type, tumor class– Quantitative covariate: dose, time– Responses: survival, cholesterol level– Any combination of these!
Modeling Introduction
• Want to capture important features of the relationship between a (set of) variable(s) and one or more responses
• Many models are of the form
g(Y) = f(x) + error
• Differences in the form of g, f and distributional assumptions about the error term
Examples of Models• Linear: Y = 0 + 1x +
• Linear: Y = 0 + 1x + 2x2 +
• (Intrinsically) Nonlinear:
Y = x1x2
x3 +
• Generalized Linear Model (e.g. Binomial):
ln(p/[1-p]) = 0 + 1x1 + 2x2
• Proportional Hazards (in Survival Analysis):
h(t) = h0(t) exp(x)
Linear Modeling• A simple linear model:
E(Y) = 0 + 1x
• Gaussian measurement model:Y = 0 + 1x + ,
where ~ N(0, 2)
• More generally:Y = X + ,
where Y is n x 1, X is n x G, is G x 1, is n x 1, often assumed N(0, 2Inxn)
Analysis of Designed Experiments
• An important use of linear models
• Here, the response (Y) is the gene expression level
• Define a (design) matrix X so that E(Y) = X,
where is a vector of contrasts
• Many ways to define design matrix/contrasts
Contrasts (I)• Example: One-way layout, k classesYij = + j + ij, i = 1,…,nj; j = 1,…,k
X = 1 1 0 … 0 = [1 Xa]
1 0 1 … 0 … 1 0 0 … 1
• Problem: this specification is over-parametrized
Contrasts (II)
• Could resolve by removing column of 1’s:Yij = j + ij, i = 1,…,nj; j = 1,…,k
X = 1 0 … 0 = [Xa]
0 1 … 0 … 0 0 … 1
• Here, parameters are class means
Contrasts (III)• Define design matrix
X* = [1 XaCa],
with Ca (k x (k-1)) chosen so that X* has rank k (the number of columns)
• Parameters may become difficult to interpret
• In the balanced case (equal number of observations in each class), can choose orthogonal contrasts (Helmert)
Model Fitting
• For the standard (fixed effects) linear model, estimation is usually by least squares
• Can be more complicated with random effects or when x-variables subject to measurement error as well (so that estimates are not biased)
Model Checking
• Examination of residuals– Normality– Time effects– Nonconstant variance– Curvature
• Detection of influential observations
Do we need robust methods?
• Tukey (1962):“A tacit hope in ignoring deviations from
ideal models was that they would not matter; that statistical procedures which were optimal under the strict model would still be approximately optimal under the approximate model. Unfortunately it turned out that this hope was often drastically wrong; even mild deviations often have much larger effects than were anticipated by most statisticians.”
Robust Regression
• Idea: downweight observations that produce large residuals
• More computationally intensive than least squares regression (which gives equal weight to each observation)
• Use maximum likelihood if can assume specific error distribution
• When not, use M-estimators
M-estimators• ‘Maximum likelihood type’ estimators
• Assume independent errors with distribution f()
• Robust estimator minimizes
i(ei/s) = i{(Yi – xi’)/s},
where (.) is some function and s is an estimate of scale
(u) = u2 corresponds to minimizing the sum of squares
M-estimation Procedure
• To minimize i{(Yi – xi’)/s} wrt the ’s, take derivatives and equate to 0
• Resulting equations do not have an explicit solution in general
• Solve by iteratively reweighted least squares
Generalized Linear Models (GLM/GLIM)
• Response Y assumed to have exponential family distribution:
f(y) = exp[a(y)b() + c() + d(y)]
• Parameters and explanatory variables X; linear predictor = 1x1 + 2x2 + … pxp
• Mean response , link function l () =
• Allows unified treatment of statistical methods for several important classes of models
Some Examples
Link Binomial Gamma
Normal
Poisson
logit Default
probit X
cloglog X
identity X X X
inverse Default
Log X Default
Sqrt X
Survival Modeling
• Response T is a (nonnegative) lifetime• Cumulative distribution function (cdf)
F(t), density f(t)• More usual to work with the survivor
function S(t) = 1 – F(t) = P(T > t)
and the instantaneous failure rate, or hazard function h(t) = limt->0P(t T< t+t | T t)/ t
Relations Between Functions
• Cumulative hazard function
H(t) = 0t h(s) ds
• h(t) = f(t)/S(t)
• H(t) = -log S(t)
Censoring
• Incomplete information on the lifetime
• A censored observation is one whose value is incomplete due to random factors for each individual
• Most commonly, observation begins at time t = 0 and ends before the outcome of interest is observed (right-censoring)
Estimation of Survivor Function
• Most commonly used estimate is Kaplan-Meier (also called product limit) estimator
• Risk set r(t) = number of cases alive just before time t
• S(t) = tit [r(ti) – di]/r(ti) ^
Cox Proportional Hazards Model
• Baseline hazard function h0(t)
• Modified multiplicatively by covariates• Hazard function for individual case ish(t) = h0(t) exp(1x1 + 2x2 + … + pxp)
• If nonproportionality: – 1. Does it matter – 2. Is it real
Strategies for Gene Expression-based
Modeling• The biggest problem is the large
number of variables (genes)
• One possibility is to first reduce the number of genes under consideration (e.g. consider variability across samples, or coefficient of variation)
• Screening/Prioritizing: One gene at a time approach
• Two at a time, perhaps plus interaction
Example: Survival analysis with expression
data
• Bittner et al. dataset:
– 15 of the 31 melanomas had associated survival times
– 3613 ‘strongly detected’ genes
Association of Variables• Variables tested for association with
cluster: – Sex (p = .68, n = 16 + 11 = 27)Age (p = .14, n = 15 + 10 = 25) Mutation status (p = .17, n = 12 + 7 = 19)– Biopsy site (p = .88, n = 14 + 10 = 24)– Pigment (p = .26, n = 13 + 9 = 22)– Breslow thickness (p = .26, n = 6 + 3 = 9)– Clark level (p = .44, n = 6 + 5 = 11)Specimen type (p = .11, n = 11 + 12 = 23)
Survival analysis: Bittner et al.
• Bittner et al. also looked at differences in survival between the two groups (the ‘cluster’ and the ‘unclustered’ samples)
• ‘Cluster’ seemed associated with longer survival
Identification of Genes Associated with Survival
For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model:
h(t) = h0(t) exp(jxij)
and look for genes with both: • large effect size j • large standardized effect size j/SE(j)
^
^ ^
Sites Potentially Influencing Survival
Image Clone ID
UniGene Cluster
UniGene Cluster Title
137209 Hs.126076
Glutamate receptor interacting protein
240367 Hs.57419 Transcriptional repressor
838568 Hs.74649 Cytochrome c oxidase subunit Vlc
825470 Hs.247165
ESTs, Highly similar to topoisomerase
841501 Hs.77665 KIAA0102 gene product
Findings
• Top 5 genes by this method not in Bittner et al. ‘weighted gene list’ - Why?
• weighted gene list based on entire sample; our method only used half
• weighting relies on Bittner et al. cluster assignment
• other possibilities?
Advantages of Modeling
• Can address questions of interest directly– Contrast with what has become the
‘usual’ (and indirect) approach with microarrays: clustering, followed by tests of association between cluster group and variables of interest
• Great deal of existing machinery
• Quantitatively assess strength of evidence
Limitations of Single Gene Tests
• May be too noisy in general to show much
• Do not reveal coordinated effects of positively correlated genes
• Hard to relate to pathways
Not Covered…
• Careful followup– Assessment of proportionality – Inclusion of combinations of genes,
interactions– Consideration of alternative models
• Power assessment– Not worth it here, there can’t be
much!
Some ideas for further work
• Expand models to include more genes, possibly two-way interactions – Issue of automation– Still very small scale compared to
probable pathway size, number of genes involved, etc.
• Nonparametric tree-based modeling– Will require much larger sample sizes