![Page 1: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/1.jpg)
Statistics for Microarray Data Analysis with R
Session 8: Discrimination
Class web site: http://ludwig-sun2.unil.ch/~darlene/
![Page 2: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/2.jpg)
Today’s Outline
• Discrimination (classification)
• Some classification rules
• Performance assessment
• Discrimination (classification) in R
• Exercises
![Page 3: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/3.jpg)
cDNA gene expression data
Data on G genes for n samples
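Rows: genes; columns: mRNA samples
Entry (i, j): gene expression level of gene i in mRNA sample j = (normalized) log(Red intensity / Green intensity)

| Gene | sample1 | sample2 | sample3 | sample4 | sample5 | … |
|---|---|---|---|---|---|---|
| 1 | 0.46 | 0.30 | 0.80 | 1.51 | 0.90 | ... |
| 2 | -0.10 | 0.49 | 0.24 | 0.06 | 0.46 | ... |
| 3 | 0.15 | 0.74 | 0.04 | 0.10 | 0.20 | ... |
| 4 | -0.45 | -1.03 | -0.79 | -0.56 | -0.32 | ... |
| 5 | -0.06 | 1.06 | 1.35 | 1.09 | -1.09 | ... |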
![Page 4: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/4.jpg)
Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
![Page 5: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/5.jpg)
Discrimination
• Objects (e.g. arrays) are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
• Aim: predict Y from X
![Page 6: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/6.jpg)
Example: Tumor Classification
• Reliable and precise classification essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
![Page 7: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/7.jpg)
Tumor Classification Using Gene Expression Data
Three main types of statistical problems associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection)
![Page 8: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/8.jpg)
Classifiers
• A predictor or classifier partitions (divides) the space of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile X = (X1, ..., XG) in Ak the predicted class is k
• Classifiers are built from a learning set (LS) L = {(X1, Y1), ..., (Xn, Yn)}
• Classifier C built from a learning set L: C( . , L): X → {1, 2, ..., K}
• Predicted class for observation X: C(X, L) = k if X is in Ak
![Page 9: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/9.jpg)
Maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest
• For example, if I toss a coin 10 times and I see 7 heads, the MLE for p = probability of heads is 0.7
• If the histograms for the individual classes are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation X to be the one with the highest bar on the histogram (density scale)
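A minimal R sketch of both ideas. The first part recovers the coin-toss MLE numerically; the second applies the ML discriminant rule to two classes whose densities are assumed known. The class names, means and standard deviation are hypothetical values chosen for illustration.

```r
# MLE for the coin example: maximize the binomial likelihood of 7 heads in 10 tosses
lik <- function(p) dbinom(7, size = 10, prob = p)
p_hat <- optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum
p_hat  # numerically close to 7/10

# ML discriminant rule with known class densities (hypothetical values)
class_means <- c(A = 0, B = 2)   # assumed known class means
class_sd    <- 1                 # assumed common standard deviation

ml_rule <- function(x) {
  # Density of each observation under each class: rows = observations, columns = classes
  dens <- sapply(class_means, function(m) dnorm(x, mean = m, sd = class_sd))
  dens <- matrix(dens, nrow = length(x), dimnames = list(NULL, names(class_means)))
  # Predict the class whose density is highest at x
  colnames(dens)[max.col(dens, ties.method = "first")]
}

pred_A <- ml_rule(c(-0.5, 0.9, 1.1, 3))
pred_A
```

With equal spreads, the rule reduces to picking the class whose mean is nearer, so the decision boundary here sits midway between the two means.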
![Page 10: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/10.jpg)
Fisher Linear Discriminant Analysis
First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA):
1. finds linear combinations of the gene expression profiles X = (X1, ..., XG) with large ratios of between-groups to within-groups sums of squares (the discriminant variables);
2. predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables
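The two steps can be tried with `lda()` from the MASS package. This is a sketch, not the analysis from the lecture: the built-in `iris` data (150 samples, 4 features, 3 classes) stands in for an expression matrix, and because its classes are equally frequent, the default prediction agrees with the closest-mean rule in discriminant space.

```r
library(MASS)  # provides lda()

# Step 1: estimate the discriminant variables (linear combinations of the features)
fit <- lda(Species ~ ., data = iris)
fit$scaling    # one column of coefficients per discriminant variable

# Step 2: predict the class of each observation
pred <- predict(fit, iris)$class

# Resubstitution error rate (computed on the learning set, so optimistic)
lda_err <- mean(pred != iris$Species)
lda_err
```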
![Page 11: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/11.jpg)
Gaussian ML Discriminant Rules
• For multivariate Gaussian (normal) class densities $X \mid Y = k \sim N(\mu_k, \Sigma_k)$, the ML classifier is
$C(X) = \arg\min_k \{ (X - \mu_k) \Sigma_k^{-1} (X - \mu_k)' + \log|\Sigma_k| \}$
• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)
• In practice, population mean vectors $\mu_k$ and covariance matrices $\Sigma_k$ are estimated by corresponding sample quantities
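A base-R sketch of the quadratic rule with sample estimates plugged in. The helper names `qda_fit` and `qda_predict` are hypothetical, and `iris` again stands in for expression data. With G far larger than n, as for microarrays, the sample covariance matrices are singular, so this only works after reducing to a few features.

```r
# Estimate a mean vector and covariance matrix for each class
qda_fit <- function(X, y) {
  lapply(split(as.data.frame(X), y), function(d) {
    list(mu = colMeans(d), Sigma = cov(d))
  })
}

# Predict by minimizing (x - mu_k) Sigma_k^{-1} (x - mu_k)' + log|Sigma_k| over k
qda_predict <- function(fit, Xnew) {
  Xnew <- as.matrix(Xnew)
  scores <- sapply(fit, function(p) {
    mahalanobis(Xnew, center = p$mu, cov = p$Sigma) +
      as.numeric(determinant(p$Sigma, logarithm = TRUE)$modulus)
  })
  scores <- matrix(scores, nrow = nrow(Xnew), dimnames = list(NULL, names(fit)))
  colnames(scores)[max.col(-scores, ties.method = "first")]  # argmin over classes
}

qfit <- qda_fit(iris[, 1:4], iris$Species)
qda_err <- mean(qda_predict(qfit, iris[, 1:4]) != iris$Species)
qda_err  # resubstitution error rate
```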
![Page 12: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/12.jpg)
Gaussian ML Discriminant Rules
• When all class densities have the same covariance matrix, $\Sigma_k = \Sigma$, the discriminant rule is linear (Linear discriminant analysis, or LDA; FLDA for K = 2)
• When all class densities have the same diagonal covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_G^2)$, the discriminant rule is again linear (Diagonal linear discriminant analysis, or DLDA)
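DLDA is simple enough to write directly in base R: each class is scored by a sum over genes of squared distances to the class mean, each scaled by that gene's pooled within-class variance. The function names `dlda_fit` and `dlda_predict` are hypothetical and this is a sketch, not the `dlda` routine from the sma package.

```r
dlda_fit <- function(X, y) {
  X <- as.matrix(X)
  y <- factor(y)
  # K x G matrix of class means (rows = classes, columns = genes)
  means <- apply(X, 2, function(col) tapply(col, y, mean))
  # Pooled within-class variance of each gene
  resid <- X - means[as.integer(y), , drop = FALSE]
  vars  <- colSums(resid^2) / (nrow(X) - nlevels(y))
  list(means = means, vars = vars)
}

dlda_predict <- function(fit, Xnew) {
  Xnew <- as.matrix(Xnew)
  # For each class k: sum over genes g of (x_g - mean_kg)^2 / var_g
  d <- apply(fit$means, 1, function(m) colSums((t(Xnew) - m)^2 / fit$vars))
  d <- matrix(d, nrow = nrow(Xnew), dimnames = list(NULL, rownames(fit$means)))
  rownames(fit$means)[max.col(-d, ties.method = "first")]  # smallest score wins
}

dfit <- dlda_fit(iris[, 1:4], iris$Species)
dlda_err <- mean(dlda_predict(dfit, iris[, 1:4]) != iris$Species)
dlda_err  # resubstitution error rate
```

Because no covariance matrix is inverted, this rule still works when there are far more genes than samples.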
![Page 13: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/13.jpg)
Nearest Neighbor Classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)
• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:
– find the k observations in the learning set closest to X
– predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later)
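A short sketch of choosing k by cross-validation with the class package. `knn.cv()` does leave-one-out cross-validation with Euclidean distance; the candidate values of k and the use of `iris` as stand-in data are assumptions for illustration.

```r
library(class)  # provides knn() and knn.cv()

set.seed(1)     # ties in the vote are broken at random
X <- scale(iris[, 1:4])   # standardize features before computing distances
y <- iris$Species

# Leave-one-out cross-validated error rate for several candidate k
ks <- c(1, 3, 5, 7, 9)
cv_err <- sapply(ks, function(k) mean(knn.cv(X, y, k = k) != y))
names(cv_err) <- ks
cv_err

best_k <- ks[which.min(cv_err)]
best_k
```

`knn()` has no option for one-minus-correlation distance. One known workaround is to center and scale each sample (row) to unit norm first, since squared Euclidean distance between such rows is proportional to one minus their correlation.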
![Page 14: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/14.jpg)
Classification Trees
• Partition the feature space into a set of rectangles, then fit a simple model in each one
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier
![Page 15: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/15.jpg)
Classification Tree
![Page 16: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/16.jpg)
Three Aspects of Tree Construction
• Split Selection Rule
• Split-stopping Rule
• Class assignment Rule
Different approaches to these three issues (e.g. CART: Classification And Regression Trees, Breiman et al. (1984); C4.5 and C5.0, Quinlan (1993))
![Page 17: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/17.jpg)
Three Rules (CART)
• Splitting: At each node, choose split maximizing decrease in impurity (e.g. misclassification error)
• Split-stopping: Grow large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with lowest misclassification rate
• Class assignment: For each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node
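The grow-then-prune strategy can be sketched with the rpart package. Two caveats: rpart's default impurity for classification is the Gini index rather than misclassification error, and the control settings and the `iris` stand-in data are assumptions for illustration.

```r
library(rpart)

set.seed(1)  # the cross-validation done inside rpart is random

# Grow a deliberately large tree (no complexity penalty, tiny nodes allowed)
big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

# cptable has one row per subtree in the pruning sequence;
# the "xerror" column is the cross-validated error of that subtree
cp_tab  <- big$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned <- prune(big, cp = best_cp)

# Terminal nodes are labeled with their majority class
tree_err <- mean(predict(pruned, iris, type = "class") != iris$Species)
tree_err  # resubstitution error of the pruned tree
```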
![Page 18: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/18.jpg)
![Page 19: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/19.jpg)
![Page 20: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/20.jpg)
![Page 21: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/21.jpg)
![Page 22: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/22.jpg)
Other Classifiers Include…
• Support vector machines (SVMs)
• Neural networks
• Bayesian regression methods
![Page 23: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/23.jpg)
Features
• Feature selection
– Automatic with trees
– For DA, NN need preliminary selection
– Need to account for selection when assessing performance
• Missing data
– Automatic imputation with trees
– Otherwise, impute (or ignore)
![Page 24: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/24.jpg)
Performance assessment (I)
• Resubstitution estimation: error rate on the learning set
– Problem: downward bias
• Test set estimation: divide cases in learning set into two sets, L1 and L2; classifier built using L1, error rate computed for L2. L1 and L2 must be iid.
– Problem: reduced effective sample size
![Page 25: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/25.jpg)
Performance assessment (II)
• V-fold cross-validation (CV) estimation: Cases in learning set randomly divided into V subsets of (nearly) equal size (for example, 5 or 10 subsets). Build classifiers leaving one set out; test set error rates computed on left out set and averaged to get overall error
– Bias-variance tradeoff: smaller V can give larger bias but smaller variance
• Out-of-bag estimation: covered below
• ROC curves: not covered at all, I’m still learning about this!
![Page 26: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/26.jpg)
Cross-validation
Fold 1: learning | learning | test → error
Fold 2: learning | test | learning → error
Fold 3: test | learning | learning → error
Average of the three errors → average error
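The scheme in the diagram can be written as a small generic function in base R. Everything named here (`cv_error`, `centroid_fit`, `centroid_predict`) is a hypothetical helper. A nearest-centroid classifier is used only because it is short; any fit/predict pair with the same interface would do.

```r
# V-fold cross-validation: returns the average of the V test-set error rates
cv_error <- function(X, y, V, fit_fun, predict_fun) {
  fold <- sample(rep(seq_len(V), length.out = nrow(X)))  # random fold labels
  errs <- sapply(seq_len(V), function(v) {
    test  <- fold == v
    model <- fit_fun(X[!test, , drop = FALSE], y[!test])   # build on the other folds
    mean(predict_fun(model, X[test, , drop = FALSE]) != y[test])
  })
  mean(errs)
}

# A simple classifier to plug in: assign each case to the nearest class mean
centroid_fit <- function(X, y) apply(X, 2, function(col) tapply(col, y, mean))
centroid_predict <- function(means, Xnew) {
  d <- apply(means, 1, function(m) colSums((t(Xnew) - m)^2))
  d <- matrix(d, nrow = nrow(Xnew))
  rownames(means)[max.col(-d, ties.method = "first")]
}

set.seed(1)
X <- as.matrix(iris[, 1:4])
cv5 <- cv_error(X, iris$Species, V = 5, centroid_fit, centroid_predict)
cv5
```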
![Page 27: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/27.jpg)
Performance assessment (III)
• Common to do feature selection using all of the data, then CV only for model building and classification
• However, usually features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to be downward biased
• Features should be selected only from the learning set used to build the model (and not the entire set)
![Page 28: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/28.jpg)
How NOT to estimate error
• ** DON’T DO THIS ** DON’T DO THIS **
• Use the whole data set to choose which variables (features) to use in the classifier
• Divide the data into (10, say) subsets for CV
• Leave out a subset and build a classifier with features chosen from the whole data set
• Use the classifier to predict the left out subset
• Average over left out subsets to estimate error
• ** DON’T DO THIS ** DON’T DO THIS **
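A small simulation makes the bias visible. The data are pure noise, so no classifier can truly do better than 50% error. The sketch compares the procedure above with the correct one, in which genes are re-selected inside each fold using only the cases not left out. All sizes, the t-like ranking statistic and the function names are assumptions for illustration.

```r
set.seed(8)
n <- 40; G <- 2000
X <- matrix(rnorm(n * G), n, G)      # pure noise: no gene is informative
y <- rep(c(0, 1), each = n / 2)

# Rank genes by |difference in class means| / sd and return the top m
top_genes <- function(X, y, m = 10) {
  d <- colMeans(X[y == 1, , drop = FALSE]) - colMeans(X[y == 0, , drop = FALSE])
  order(abs(d) / apply(X, 2, sd), decreasing = TRUE)[seq_len(m)]
}

# Two-class nearest-centroid prediction
nc_predict <- function(Xtr, ytr, Xte) {
  m0 <- colMeans(Xtr[ytr == 0, , drop = FALSE])
  m1 <- colMeans(Xtr[ytr == 1, , drop = FALSE])
  as.numeric(rowSums(sweep(Xte, 2, m1)^2) < rowSums(sweep(Xte, 2, m0)^2))
}

cv_err <- function(X, y, V = 10, select_inside = TRUE) {
  fold <- sample(rep(seq_len(V), length.out = nrow(X)))
  genes_all <- top_genes(X, y)       # selection using ALL cases (the wrong way)
  wrong <- 0
  for (v in seq_len(V)) {
    te <- fold == v
    # Right way: select genes from the learning part of this fold only
    genes <- if (select_inside) top_genes(X[!te, ], y[!te]) else genes_all
    pred  <- nc_predict(X[!te, genes, drop = FALSE], y[!te],
                        X[te, genes, drop = FALSE])
    wrong <- wrong + sum(pred != y[te])
  }
  wrong / nrow(X)
}

err_outside <- cv_err(X, y, select_inside = FALSE)  # optimistic
err_inside  <- cv_err(X, y, select_inside = TRUE)   # honest
c(outside = err_outside, inside = err_inside)
```

Selecting on the whole data set lets the left-out cases influence which genes are used, which is why the first estimate looks far better than chance on data that contain no signal.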
![Page 29: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/29.jpg)
Aggregating classifiers
• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set
• Many possibilities for perturbing
• The multiple versions of the predictor are aggregated by (weighted) voting
![Page 30: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/30.jpg)
Bagging
• Bagging = Bootstrap aggregating
• Nonparametric Bootstrap (standard bagging): perturbed learning sets drawn at random with replacement from the learning set; predictors built for each perturbed dataset and aggregated by plurality voting (weights w_b = 1)
• Parametric Bootstrap: perturbed learning sets are generated from some known distribution (e.g. multivariate Gaussian)
• Convex pseudo-data (Breiman 1996)
![Page 31: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/31.jpg)
Aggregation By-products: Out-of-bag estimation of error rate
• Out-of-bag error rate estimate: approximately unbiased
• Expect about 1/3 of cases to be left out of each bootstrap sample
• Use these left out cases from each bootstrap sample as a test set
• Classify these test set cases, and compare to the true class labels to get the out-of-bag estimate of the error rate
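A sketch of standard bagging with an out-of-bag error estimate, using rpart trees as the base classifier. The "about 1/3" figure comes from the chance that a given case is missed by all n draws, (1 - 1/n)^n, which approaches exp(-1) ≈ 0.37. The number of bootstrap samples B and the `iris` stand-in data are assumptions.

```r
library(rpart)

set.seed(2)
n <- nrow(iris); B <- 50
classes <- levels(iris$Species)

# votes[i, k]: number of trees for which case i was out of bag and predicted as class k
votes <- matrix(0, n, length(classes), dimnames = list(NULL, classes))
oob_frac <- numeric(B)

for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)       # bootstrap sample of the learning set
  oob <- setdiff(seq_len(n), idx)        # cases left out of this sample
  oob_frac[b] <- length(oob) / n
  tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  pred <- as.character(predict(tree, iris[oob, ], type = "class"))
  ij <- cbind(oob, match(pred, classes))
  votes[ij] <- votes[ij] + 1             # record one vote per out-of-bag case
}

mean(oob_frac)                           # close to exp(-1)

# Plurality vote over the trees that did not see each case
seen <- rowSums(votes) > 0
oob_pred <- classes[max.col(votes[seen, , drop = FALSE], ties.method = "first")]
oob_err <- mean(oob_pred != iris$Species[seen])
oob_err                                  # out-of-bag estimate of the error rate
```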
![Page 32: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/32.jpg)
Boosting
• Freund and Schapire (1997), Breiman (1998)
• Data resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified
• Predictor aggregation done by weighted voting
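A sketch of boosting by adaptive resampling for a two-class problem, with one-split rpart trees (stumps) as the base classifier. The weight update and vote weights follow the AdaBoost recipe of Freund and Schapire; that choice, the number of rounds M and the two-class subset of `iris` are assumptions for illustration, not the settings of the comparison study.

```r
library(rpart)

set.seed(3)
d <- droplevels(iris[iris$Species != "setosa", ])   # two-class problem
n <- nrow(d)
y <- ifelse(d$Species == "virginica", 1, -1)        # class labels coded as +1 / -1

w <- rep(1 / n, n)     # resampling weights, initially uniform
M <- 25                # number of boosting rounds (assumed)
score <- numeric(n)    # running weighted vote for each case

for (m in seq_len(M)) {
  # Resample cases with probability proportional to their current weight
  idx <- sample(n, replace = TRUE, prob = w)
  stump <- rpart(Species ~ ., data = d[idx, ], method = "class",
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  h <- ifelse(predict(stump, d, type = "class") == "virginica", 1, -1)

  err <- sum(w * (h != y))          # weighted error on the full learning set
  if (err >= 0.5) next              # no better than chance: skip and redraw
  err <- max(err, 1e-10)            # guard against division by zero
  alpha <- 0.5 * log((1 - err) / err)   # vote weight of this stump

  score <- score + alpha * h
  # Increase weights of misclassified cases, decrease the rest, renormalize
  w <- w * exp(-alpha * y * h)
  w <- w / sum(w)
}

boost_err <- mean(sign(score) != y)
boost_err   # learning-set error of the weighted vote
```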
![Page 33: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/33.jpg)
Comparison of classifiers
• Dudoit, Fridlyand, Speed (JASA, 2002)
• FLDA
• DLDA
• DQDA
• NN
• CART
• Bagging and boosting
![Page 34: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/34.jpg)
Comparison study datasets
• Leukemia – Golub et al. (1999): n = 72 samples, G = 3,571 genes, 3 classes (B-cell ALL, T-cell ALL, AML)
• Lymphoma – Alizadeh et al. (2000): n = 81 samples, G = 4,682 genes, 3 classes (B-CLL, FL, DLBCL)
• NCI 60 – Ross et al. (2000): n = 64 samples, G = 5,244 genes, 8 classes
![Page 35: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/35.jpg)
Leukemia data, 2 classes: Test set error rates; 150 LS/TS runs
![Page 36: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/36.jpg)
Leukemia data, 3 classes: Test set error rates; 150 LS/TS runs
![Page 37: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/37.jpg)
Lymphoma data, 3 classes: Test set error rates; 150 LS/TS runs
![Page 38: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/38.jpg)
NCI 60 data: Test set error rates; 150 LS/TS runs
![Page 39: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/39.jpg)
Results
• In the main comparison, NN and DLDA had the smallest error rates, FLDA had the highest
• Aggregation improved the performance of CART classifiers, the largest gains being with boosting and bagging with convex pseudo-data
• Dettling and Bühlmann improved the performance of boosting (LogitBoost)
![Page 40: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/40.jpg)
Results, cont.
• For the lymphoma and leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers; there was an improvement for the NCI 60 dataset.
• More careful selection of a small number of genes (10) improved the performance of FLDA dramatically
![Page 41: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/41.jpg)
Comparison study – Discussion (I)
• “Diagonal” LDA: ignoring correlation between genes helped here
• Unlike classification trees and nearest neighbors, LDA is unable to take into account gene interactions
• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions
![Page 42: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/42.jpg)
Comparison study – Discussion (II)
• Classification trees are capable of handling and revealing interactions between variables
• Useful by-product of aggregated classifiers: prediction votes, variable importance statistics
• Variable selection: A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and may not reveal interactions between genes
• With larger training sets, expect improvement in performance of aggregated classifiers
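For reference, the BSS/WSS criterion mentioned above is the ratio, computed gene by gene, of the between-class to the within-class sum of squares. A base-R sketch follows; the function name `bss_wss` is hypothetical and `iris` stands in for an expression matrix with samples in rows.

```r
# BSS/WSS ratio for each column (gene) of X, given class labels y
bss_wss <- function(X, y) {
  X <- as.matrix(X)
  y <- factor(y)
  overall     <- colMeans(X)                                       # overall mean per gene
  class_means <- apply(X, 2, function(col) tapply(col, y, mean))   # K x G class means
  n_k <- as.numeric(table(y))                                      # class sizes
  # Between-class SS: sum over classes of n_k * (class mean - overall mean)^2
  bss <- colSums(n_k * sweep(class_means, 2, overall)^2)
  # Within-class SS: sum over cases of (value - own class mean)^2
  wss <- colSums((X - class_means[as.integer(y), , drop = FALSE])^2)
  bss / wss
}

ratio <- bss_wss(iris[, 1:4], iris$Species)
sort(ratio, decreasing = TRUE)        # features ranked from most to least discriminating
top_feature <- names(ratio)[which.max(ratio)]
top_feature
```

Each gene is scored on its own, which is why the criterion cannot reveal genes that discriminate only in combination with others.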
![Page 43: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/43.jpg)
Acknowledgements
• Sandrine Dudoit
• Jane Fridlyand
• Yee Hwa (Jean) Yang
• Terry Speed
• www.bioconductor.org
![Page 44: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/44.jpg)
R: discrimination
• A number of R packages (libraries) contain functions to carry out discrimination, including:
– MASS: lda, qda
– sma: dlda
– class: knn
– rpart: classification and regression trees (recursive partitioning)
– ipred: bagging
– e1071: svm
– LogitBoost: boosting
![Page 45: Statistics for Microarray Data Analysis with R](https://reader035.vdocuments.net/reader035/viewer/2022062519/568150ee550346895dbf07fa/html5/thumbnails/45.jpg)
Exercise: discrimination
• These ideas have been hard!
• The handout gives a very brief guide to using some of the classification routines in R, including a little bit of cross-validation
• There are many others as well, feel free to explore...