Statistical Methods for Analyzing Ordered Gene Expression Microarray Data
Shyamal D. Peddada, Biostatistics Branch, National Inst. Environmental Health Sciences (NIH), Research Triangle Park, NC.
![Page 1: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/1.jpg)
Statistical Methods for Analyzing Ordered Gene Expression Microarray Data
Shyamal D. Peddada
Biostatistics Branch
National Inst. Environmental Health Sciences (NIH)
Research Triangle Park, NC
![Page 2: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/2.jpg)
An outline
Ordered gene expression data
Common experimental designs
A review of some statistical methods
An example
Demonstration of ORIOGEN, a software package for ordered gene expression data
![Page 3: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/3.jpg)
Some examples of ordered gene expression data
Comparison of gene expression by:
– various stages of cancer: Normal – Hyperplasia – Adenoma – Carcinoma
– tumor size: New tumor – Middle size – Large tumor (with necrosis)
– dose of a chemical (dose-response study)
– duration of exposure to a chemical (time-course experiments)
– dose & duration
![Page 4: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/4.jpg)
Some commonly used experimental designs
Experimental unit: Tissues/cells/animals
Single chemical/treatment
– Dose response study
– Time course study
  single dose, with responses obtained at multiple time points after treatment
  experimental units treated at multiple time points using the same dose
– Dose response x Time course study: multiple doses at multiple time points
Multiple chemicals/treatments
![Page 5: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/5.jpg)
Possible objectives
– Investigate changes in gene expression across certain biologically relevant categories.
  E.g. Hyperplasia to Adenoma to Carcinoma
  E.g. “early time point” to “late time point” since the exposure to a chemical
– Identify/cluster genes with similar expression profiles over time/dose.
![Page 6: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/6.jpg)
Correlation coefficient based methods
Correlation coefficient based methods match genes with similar observed patterns of expression across dose/time points.
[Figure: expression patterns of Gene 1 and Gene 2]
![Page 7: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/7.jpg)
Correlation coefficient based methods
A number of variations to this general principle exist in the literature. Here we outline some prominent ones.
A. Chu et al. (Science, 1998):
– Pre-select a set of biologically relevant patterns of gene expression over time.
– Identify a sample of about 3 to 8 genes for each pattern.
– Compute the correlation coefficient of each candidate gene in the microarray data with the above pre-selected genes.
– Assign each candidate gene to the cluster with the highest correlation coefficient.
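The assignment step in A can be sketched in a few lines of Python. This is only an illustration of the general idea, not Chu et al.'s implementation: the function names, the toy seed profiles, and the choice to average the seed genes of each pattern into one representative profile are all assumptions made here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def assign_to_patterns(candidates, seed_profiles):
    """Assign each candidate gene to the pattern it correlates with best.

    candidates:    list of expression profiles, one per gene
    seed_profiles: dict mapping a pattern name to a list of seed-gene profiles
    """
    # Assumption: summarize each pattern by the mean of its seed genes.
    reps = {
        name: [sum(col) / len(col) for col in zip(*seeds)]
        for name, seeds in seed_profiles.items()
    }
    assignments = []
    for expr in candidates:
        scores = {name: pearson(expr, rep) for name, rep in reps.items()}
        best = max(scores, key=scores.get)
        assignments.append((best, scores[best]))
    return assignments

# Hypothetical toy data: two patterns with two seed genes each, five time points.
seeds = {
    "early": [[1.0, 3.0, 2.0, 1.0, 1.0], [1.2, 3.5, 2.2, 1.1, 0.9]],
    "late": [[1.0, 1.0, 1.5, 3.0, 4.0], [0.8, 1.1, 1.4, 2.7, 4.2]],
}
genes = [[0.5, 2.0, 1.2, 0.6, 0.5], [2.0, 2.1, 2.4, 3.9, 5.0]]
result = assign_to_patterns(genes, seeds)
```

Every gene is assigned to some cluster here, even if its best correlation is weak, which is part of the uncertainty that later authors addressed.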
![Page 8: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/8.jpg)
Correlation coefficient based methods …
B. Kerr and Churchill (PNAS, 2001):
They correctly recognized the uncertainty associated with Chu et al.’s clustering algorithm. Hence they proposed a bootstrap methodology to evaluate Chu et al.’s clusters.
C. Heyer et al. (Genome Research, 1999):
Rather than using the standard correlation coefficient between genes, they employ a jackknife version, which is more robust to outliers.
Unlike Chu et al.’s strategy, they classify genes on the basis of pairwise correlation coefficients.
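To illustrate why a jackknife version of the correlation is less sensitive to a single outlying point, here is a minimal Python sketch. It assumes the jackknife correlation is the minimum of the full-data correlation and all leave-one-out correlations; the function names and toy profiles are hypothetical, and profiles with zero variance are not handled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def jackknife_correlation(x, y):
    """Minimum of the full correlation and every leave-one-out correlation."""
    values = [pearson(x, y)]
    for i in range(len(x)):
        # Drop time point i from both profiles and recompute the correlation.
        values.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(values)

# Two genes that only look alike because of one shared spike at the last time point.
x = [1.0, 1.1, 0.9, 1.0, 8.0]
y = [2.0, 1.9, 2.1, 2.0, 9.0]
full = pearson(x, y)
jack = jackknife_correlation(x, y)
```

In this toy case the ordinary correlation is close to 1, while the jackknife value is negative because removing the spike reveals opposite movement in the remaining points.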
![Page 9: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/9.jpg)
Correlation coefficient based methods …
Strengths
– Familiarity among biologists
– Easy to compute and interpret (although it is often misinterpreted too!)
Weaknesses
– Non-linearity in the data can lead to misinterpretation.
– Outliers and influential observations can affect the numerical value of the correlation coefficient.
– Heterogeneity between genes can also affect the numerical value of the correlation coefficient.
– It is also important to note that the correlation coefficient is typically estimated on the basis of a very small number of points.
![Page 10: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/10.jpg)
![Page 11: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/11.jpg)
Regression based procedures
Basic assumption among these methods:
The “conditions” are numerical, e.g. dose or time
![Page 12: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/12.jpg)
Polynomial regression
Liu et al. (BMC Bioinformatics, 2005)
For each gene $g$, Liu et al. fitted a quadratic regression model in time (or dose) $t$:
$$Y_{g,t} = \beta_{0,g} + \beta_{1,g}\, t + \beta_{2,g}\, t^2 + \varepsilon_{g,t}$$
They assign each gene to a particular cluster depending upon the sign and statistical significance of the regression parameters.
If none of the regression coefficients is significant for a gene, then that gene is declared unimportant.
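A quadratic fit for a single gene can be sketched as follows. This is an illustrative least-squares fit only: the function name and toy data are hypothetical, and the significance tests that Liu et al. use to assign genes to clusters are not reproduced here.

```python
def fit_quadratic(t, y):
    """Least-squares fit of y = b0 + b1*t + b2*t^2; returns [b0, b1, b2]."""
    # Design columns 1, t, t^2.
    cols = [[1.0] * len(t), [float(v) for v in t], [float(v) ** 2 for v in t]]
    # Augmented normal equations (X^T X | X^T y), a 3 x 4 matrix.
    a = [
        [sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(3)]
        + [sum(ci * yi for ci, yi in zip(cols[i], y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(3):
        p = max(range(k, 3), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        for r in range(3):
            if r != k:
                f = a[r][k] / a[k][k]
                a[r] = [vr - f * vk for vr, vk in zip(a[r], a[k])]
    return [a[i][3] / a[i][i] for i in range(3)]

# Hypothetical gene whose mean expression rises and then falls over five time points.
times = [0, 1, 2, 3, 4]
expression = [1.0, 2.5, 3.0, 2.5, 1.0]  # exactly 1 + 2t - 0.5t^2
b0, b1, b2 = fit_quadratic(times, expression)
```

A negative quadratic coefficient together with a positive linear coefficient corresponds to an up-then-down shape; in practice the signs would only be read off for coefficients judged statistically significant.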
![Page 13: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/13.jpg)
Polynomial regression
Liu et al. (BMC Bioinformatics, 2005)
Strengths:
Biologists are reasonably familiar with quadratic regression analysis.
Regression coefficients are easy to interpret.
For a small number of doses or time points and for evenly spaced doses, a quadratic model may be a reasonable approximation.
Easy-to-use Excel-based software is available.
![Page 14: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/14.jpg)
Polynomial regression
Liu et al. (BMC Bioinformatics, 2005)
Two major limitations arise because the method is fully parametric:
1. Departure from the quadratic model is common. In such cases the quadratic model may not be correct.
2. The normality assumption need not be valid.
[Figure; horizontal axis: Time]
![Page 15: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/15.jpg)
“Semi-parametric” regression methods
Several authors have applied semi-parametric regression approaches to gene expression data.
E.g.
– deHoon et al. (Bioinformatics, 2002)
– Bar-Joseph et al. (PNAS, 2003; Bioinformatics, 2004)
– Luan and Li (Bioinformatics, 2003)
– Storey et al. (PNAS, 2005)
![Page 16: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/16.jpg)
Storey et al. (2005)
Basic idea:
For each gene, they fit a mixed-effects model with a B-spline basis. This methodology is largely based on Brumback and Rice (JASA, 1998).
Statistical significance of each gene is evaluated using an F-like test statistic, with the P-value (q-value) determined by bootstrap.
![Page 17: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/17.jpg)
Storey et al. (2005)
Strengths:
– It is semi-parametric.
– A user-friendly software package called EDGE is available.
Limitations:
– It does not perform well for “threshold” patterns of gene expression.
– The “conditions” should be numerical.
– Unequal dose or time spacing can have an impact on the performance of the procedure.
![Page 18: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/18.jpg)
Order Restricted Inference for Ordered Gene ExpressioN (ORIOGEN)
Peddada et al. (Bioinformatics, 2003, 2005)
Simmons and Peddada (Bioinformation, 2007)
![Page 19: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/19.jpg)
Temporal Profile / Dose Response
The pattern of the (unknown) mean expression ($\mu$) of a gene over time (dose) is known as the temporal profile (dose response) of the gene.
– ORIOGEN uses mathematical (in)equalities to describe a profile.
![Page 20: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/20.jpg)
Some Examples
Null profile:
$$\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$$
![Page 21: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/21.jpg)
Examples Continued …
Up-down profile with maximum at 3 hours:
$$\mu_1 \le \mu_2 \le \mu_3 \ge \mu_4 \ge \mu_5 \ge \mu_6$$
![Page 22: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/22.jpg)
Examples Continued …
Non-increasing profile:
$$\mu_1 \ge \mu_2 \ge \mu_3 \ge \mu_4 \ge \mu_5 \ge \mu_6$$
Cyclical profile:
$$\mu_1 \le \mu_2 \le \mu_3 \ge \mu_4 \ge \mu_5 \le \mu_6$$
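One simple way to encode such profiles in software is as a sequence of relations between consecutive means. The sketch below is a hypothetical encoding for illustration, not ORIOGEN's internal representation; the names `PROFILES`, `satisfies` and `matching_profiles` are invented here, and in practice the true means are unknown and must be estimated.

```python
# Each profile lists the relation between mu_i and mu_(i+1) for six time points.
PROFILES = {
    "null": ["="] * 5,
    "up-down, max at 3": ["<=", "<=", ">=", ">=", ">="],
    "non-increasing": [">="] * 5,
    "cyclical": ["<=", "<=", ">=", ">=", "<="],
}

def satisfies(means, relations):
    """Check whether a vector of means obeys every consecutive relation."""
    for a, b, rel in zip(means, means[1:], relations):
        if rel == "<=" and not a <= b:
            return False
        if rel == ">=" and not a >= b:
            return False
        if rel == "=" and a != b:
            return False
    return True

def matching_profiles(means):
    """Names of all profiles whose inequalities the given means satisfy."""
    return [name for name, rel in PROFILES.items() if satisfies(means, rel)]

matches = matching_profiles([1, 2, 3, 2, 1, 1])
flat = matching_profiles([3, 3, 3, 3, 3, 3])
```

Because the inequalities are not strict, a flat vector satisfies every profile, and a single vector can satisfy more than one profile; this is one reason a separate goodness-of-fit measure is needed to choose among candidates.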
![Page 23: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/23.jpg)
ORIOGEN
Step 1 (Profile specification):
Pre-specify the shapes of profiles of interest.
![Page 24: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/24.jpg)
Some Examples Of Pre-specified Profiles
![Page 25: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/25.jpg)
ORIOGEN …
Step 2 (profile fitting): Fit each pre-specified profile to each gene using the estimation procedure described in:
Hwang and Peddada (1994, Ann. of Stat.)
![Page 26: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/26.jpg)
A Brief Description Of The Estimation Procedure …
![Page 27: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/27.jpg)
Definitions
Linked parameters: Two parameters are said to be linked if the inequality between them is known a priori.
Nodal parameter: A parameter is said to be nodal if it is linked to all parameters in the graph.
For any given profile, the estimation always starts at the nodal parameter.
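These definitions can be checked mechanically. The sketch below assumes that a profile is given as a list of a priori inequalities $\mu_i \le \mu_j$ and that inequalities implied by transitivity also count as known; the function names and the example profile are hypothetical.

```python
def known_inequalities(edges):
    """Transitive closure of the a priori inequalities; (i, j) means mu_i <= mu_j."""
    le = set(edges)
    changed = True
    while changed:
        changed = False
        for i, j in list(le):
            for k, m in list(le):
                if j == k and (i, m) not in le:
                    le.add((i, m))
                    changed = True
    return le

def nodal_parameters(n, edges):
    """Parameters (numbered 1..n) that are linked to every other parameter."""
    le = known_inequalities(edges)

    def linked(i, j):
        return (i, j) in le or (j, i) in le

    return [
        i for i in range(1, n + 1)
        if all(linked(i, j) for j in range(1, n + 1) if j != i)
    ]

# Up-down profile on five parameters: mu1 <= mu2 <= mu3 >= mu4 >= mu5.
umbrella = [(1, 2), (2, 3), (4, 3), (5, 4)]
nodes = nodal_parameters(5, umbrella)

# Simple order mu1 <= mu2 <= mu3: every parameter is linked to every other.
chain_nodes = nodal_parameters(3, [(1, 2), (2, 3)])
```

For the up-down profile only the peak is nodal, because the order between a parameter on the rising arm and one on the falling arm is not known a priori.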
![Page 28: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/28.jpg)
Pool the Adjacent Violator Algorithm (PAVA)
Hypothesis:
Observed data
Isotonized data (PAVA)
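A minimal Python sketch of PAVA follows, assuming the hypothesis is a simple non-decreasing order $\mu_1 \le \mu_2 \le \dots \le \mu_k$ (the slide's own hypothesis is not recoverable from the transcript). The function name and the optional weights argument are choices made for this illustration.

```python
def pava(values, weights=None):
    """Non-decreasing isotonic fit obtained by pooling adjacent violators."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block holds [pooled mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            pooled = (m1 * w1 + m2 * w2) / (w1 + w2)
            blocks.append([pooled, w1 + w2, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

observed = [1, 3, 2, 4]
isotonized = pava(observed)  # the violating pair (3, 2) is pooled to 2.5
```

Points that already respect the order are left unchanged; only runs that violate it are replaced by their (weighted) average.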
![Page 29: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/29.jpg)
Estimation: The General Idea
[Figure: order graph on parameters 1 to 5]
3 is the only nodal parameter
![Page 30: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/30.jpg)
Estimation Continued …
From this sub-graph we estimate 1 and 2.
[Figure: sub-graph on parameters 1, 2 and 3]
![Page 31: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/31.jpg)
A Measure of “Goodness-of-fit”: the $l_\infty$ Norm
Step 3: Determine the $l_\infty$ norm of a gene corresponding to each temporal profile.
This is defined as the maximum (studentized) difference between estimates corresponding to linked parameters.
Peddada et al. (2001, Biometrics).
![Page 32: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/32.jpg)
An Example
Observed data:
1, 1.5, 2, 2.5, 1.5, 2.25
Two pre-specified temporal profiles:
[Figure: profiles (a) and (b)]
![Page 33: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/33.jpg)
Example Continued …
Fit under profile (a)
1, 1.5, 2.25, 2.25, 1.875, 1.875
Fit under profile (b)
1, 1.5, 2, 2.5, 1.875, 1.875
![Page 34: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/34.jpg)
Example Continued …
$l_\infty$ norm for profile (a) is:
2.25 - 1 = 1.25
$l_\infty$ norm for profile (b) is:
2.5 - 1 = 1.5
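The arithmetic of this example can be reproduced with a short Python sketch. Several things are assumptions made for the illustration: profiles (a) and (b) are taken to be up-down profiles peaking at the third and fourth time points (inferred from the fitted values, since the figure is not available), each arm is fitted with PAVA and the peak is set to the larger of the two arm estimates, and the norm is left unstudentized as in the slide's arithmetic. This is a simplification and should not be read as the Hwang and Peddada estimator itself.

```python
def pava(values):
    """Non-decreasing isotonic fit with equal weights."""
    blocks = []  # each block holds [pooled mean, number of points]
    for v in values:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return [m for m, c in blocks for _ in range(c)]

def fit_up_down(values, peak):
    """Fit an up-down profile whose maximum is at 0-based index `peak`."""
    left = pava(values[:peak + 1])  # non-decreasing arm, including the peak
    # Non-increasing arm: negate, fit non-decreasing, negate back.
    right = [-v for v in pava([-v for v in values[peak:]])]
    top = max(left[-1], right[0])  # assumed rule for the nodal (peak) estimate
    return left[:-1] + [top] + right[1:]

def linf_norm(fit, peak):
    """Largest unstudentized difference between linked estimates."""
    return max(fit[peak] - min(fit[:peak + 1]), fit[peak] - min(fit[peak:]))

observed = [1, 1.5, 2, 2.5, 1.5, 2.25]
fit_a = fit_up_down(observed, peak=2)
fit_b = fit_up_down(observed, peak=3)
norm_a = linf_norm(fit_a, peak=2)
norm_b = linf_norm(fit_b, peak=3)
best = "b" if norm_b > norm_a else "a"
```

Under these assumptions the fitted values and norms agree with the numbers on the slides, and the profile with the larger norm is (b).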
![Page 35: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/35.jpg)
“Best Fitting” Profile
Step 4: Identify the profile with the largest norm.
In the example, profile (b) has larger norm than profile (a) .
Hence profile (b) is a better fit than (a).
![Page 36: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/36.jpg)
Statistical Significance
Step 5: Statistical significance:
P-value for statistical significance is obtained using the bootstrap methodology:
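The details of the bootstrap are not given in this transcript, so the following Python sketch shows only the generic idea of a bootstrap P-value under a null of equal means. The resampling scheme (pooling all observations of a gene and redrawing groups of the original sizes), the placeholder statistic, the function names and the toy data are all assumptions made here; ORIOGEN's actual statistic and resampling procedure may differ.

```python
import random

def bootstrap_p_value(groups, statistic, n_boot=1000, seed=0):
    """Proportion of null resamples whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(groups)
    # Under the null profile all time points share one mean, so pool the data.
    pooled = [v for g in groups for v in g]
    exceed = 0
    for _ in range(n_boot):
        resampled = [[rng.choice(pooled) for _ in g] for g in groups]
        if statistic(resampled) >= observed:
            exceed += 1
    return exceed / n_boot

def range_of_means(groups):
    """Placeholder statistic: largest difference between group means."""
    means = [sum(g) / len(g) for g in groups]
    return max(means) - min(means)

# Hypothetical gene with four replicates at each of three time points.
gene = [
    [0.0, 0.1, -0.1, 0.05],
    [5.0, 5.1, 4.9, 5.05],
    [10.0, 10.1, 9.9, 10.05],
]
p = bootstrap_p_value(gene, range_of_means, n_boot=500, seed=1)
```

In an ORIOGEN-style analysis the placeholder statistic would be replaced by the largest norm over the pre-specified profiles, recomputed for every resample.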
![Page 37: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/37.jpg)
Illustration …
![Page 38: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/38.jpg)
MCF-7 breast cancer cells treated with 17β-estradiol (Lobenhofer et al., 2002, Mol. Endocrin.).
Gene expressions were measured after: 1hr, 4hrs, 12hrs, 24hrs, 36hrs and 48hrs of treatment.
# of genes on each chip = 1900.
# of samples at each time point = 8
![Page 39: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/39.jpg)
Available software
– Linear Regression Method (Liu et al., 2005)
– EDGE (Storey et al., 2005)
– EPIG (Chao et al., 2008)
– ORIOGEN (Peddada et al., 2006)
![Page 40: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/40.jpg)
Concluding remarks

| Methodology | Freely available software | Applicable to ordinal “conditions” | Repeated measures and correlated data | Model assumptions |
| --- | --- | --- | --- | --- |
| Linear Regression | Yes | No | No | Linear regression |
| EPIG | Yes | No | ? | No |
| EDGE | Yes | No | Yes | No |
| ORIOGEN | Yes | Yes | Yes | No |
![Page 41: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/41.jpg)
Some open problems
ORIOGEN is potentially subject to Type III error. How do we control FDR and Type III error?
How to deal with
– Dependent samples?
– Covariates?
Order restricted inference in the context of mixed effects linear models.
![Page 42: Statistical Methods for Analyzing Ordered Gene Expression Microarray Data](https://reader036.vdocuments.net/reader036/viewer/2022070410/56814680550346895db3a1b0/html5/thumbnails/42.jpg)
Acknowledgments
– Leping Li
– David Umbach
– Clare Weinberg
– Ed Lobenhofer
– Cynthia Afshari
Software developers at Constella Group
– (late) John Zajd
– Shawn Harris