![Page 1: Supervised Classification. Selection bias in gene extraction on the basis of microarray gene-expression data Ambroise and McLachlan Proceedings of the](https://reader036.vdocuments.net/reader036/viewer/2022062314/56649ea75503460f94ba980c/html5/thumbnails/1.jpg)
Supervised Classification
Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan
Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, 6562-6566, May 14, 2002
http://www.pnas.org/cgi/content/full/99/10/6562
Supervised Classification of Tissue Samples
AIM: TO CONSTRUCT A CLASSIFIER c(y) FOR PREDICTING THE UNKNOWN CLASS LABEL z OF A TISSUE SAMPLE y.
e.g. g = 2 classes: C1 - DISEASE-FREE, C2 - METASTASES
We OBSERVE the CLASS LABELS z1, …, zn where zj = i if the jth tissue sample comes from the ith class (i=1,…,g).
[Figure: gene-expression data matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M; annotations: Expression Profile, Expression Signature]
Supervised Classification (Two Classes)
[Figure: data matrix with rows Gene 1, ..., Gene p and columns Sample 1, ..., Sample n; the samples are grouped into Class 1 (good prognosis) and Class 2 (poor prognosis)]
Microarray to be used as routine clinical screen
by C. M. Schubert
Nature Medicine 9, 9, 2003.
The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
Selection Bias
Bias that occurs when a subset of the variables is selected (dimension reduction) in some “optimal” way, and then the predictive capability of this subset is assessed in the usual way; i.e. using an ordinary measure for a set of variables.
Selection Bias
Discriminant Analysis: McLachlan (1992 & 2004, Wiley, Chapter 12)
Regression: Breiman (1992, JASA)
“This usage (i.e. use of residual of SS’s etc.) has long been a quiet scandal in the statistical community.”
Nature Reviews Cancer, Feb. 2005
LINEAR CLASSIFIER
FORM
$$c(y) = \beta_0 + \beta^T y = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p$$
for the prediction of the group label z of a future entity with feature vector y.
FISHER’S LINEAR DISCRIMINANT FUNCTION
$$z = \operatorname{sign}\, c(y)$$
where
$$\beta = S^{-1}(\bar{y}_1 - \bar{y}_2), \qquad \beta_0 = -\tfrac{1}{2}(\bar{y}_1 + \bar{y}_2)^T S^{-1}(\bar{y}_1 - \bar{y}_2),$$
and $\bar{y}_1$, $\bar{y}_2$, and $S$ are the sample means and pooled sample covariance matrix found from the training data.
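Fisher's rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `fisher_ldf` and `classify` and the toy data are assumptions, and it presumes more samples than genes so that S is invertible (with p >> n it is not, which is one reason genes are selected first).

```python
import numpy as np

def fisher_ldf(Y1, Y2):
    """Return (beta0, beta) of Fisher's rule; rows of Y1, Y2 are training samples."""
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    n1, n2 = len(Y1), len(Y2)
    # Pooled sample covariance matrix S
    S = ((Y1 - m1).T @ (Y1 - m1) + (Y2 - m2).T @ (Y2 - m2)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)      # S^{-1} (ybar1 - ybar2)
    beta0 = -0.5 * (m1 + m2) @ beta         # -(1/2) (ybar1 + ybar2)^T S^{-1} (ybar1 - ybar2)
    return beta0, beta

def classify(y, beta0, beta):
    """sign c(y): +1 allocates y to class 1, -1 to class 2."""
    return np.sign(beta0 + y @ beta)

# Hypothetical toy data: 4 samples per class, 2 features
Y1 = np.array([[2.0, 2.0], [3.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
Y2 = np.array([[-2.0, -1.0], [-3.0, 0.0], [-2.0, 1.0], [-3.0, -2.0]])
beta0, beta = fisher_ldf(Y1, Y2)
```

By construction c(y) is zero at the midpoint of the two sample means, positive on the class 1 side and negative on the class 2 side.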
Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei
Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)
$$C(y) = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p$$
where β0 and β are obtained as follows:
$$\min_{\beta,\,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$$
subject to
$$\xi_j \ge 0, \qquad z_j\, C(y_j) \ge 1 - \xi_j \qquad (j = 1, \ldots, n).$$
$\xi_1, \ldots, \xi_n$ relate to the slack variables
$\gamma = \infty$: separable case
$$\hat{\beta} = \sum_{j=1}^{n} \hat{\alpha}_j z_j y_j$$
with non-zero $\hat{\alpha}_j$ only for those observations j for which the constraints are exactly met (the support vectors).
$$\hat{C}(y) = \sum_{j=1}^{n} \hat{\alpha}_j z_j\, y_j^T y + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j z_j \langle y_j, y \rangle + \hat{\beta}_0$$
Support Vector Machine (SVM)
REPLACE $y$ by $h(y)$
$$\hat{C}(y) = \sum_{j=1}^{n} \hat{\alpha}_j z_j \langle h(y_j), h(y) \rangle + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j z_j K(y_j, y) + \hat{\beta}_0$$
where the kernel function $K(y_j, y) = \langle h(y_j), h(y) \rangle$ is the inner product in the transformed feature space.
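The role of the kernel can be checked numerically. The sketch below assumes one particular (illustrative) choice, the homogeneous quadratic kernel K(a, b) = (aᵀb)², whose feature map h(y) lists all products y_u y_v; the multipliers, labels and offset passed to `decision` are made-up values, not fitted ones.

```python
import numpy as np

def h(y):
    """Explicit feature map: all pairwise products y_u * y_v."""
    return np.outer(y, y).ravel()

def K(a, b):
    """Quadratic kernel, equal to the inner product <h(a), h(b)>."""
    return float(a @ b) ** 2

def decision(y, alphas, z, Y, beta0, inner):
    """C(y) = sum_j alpha_j z_j inner(y_j, y) + beta0."""
    return sum(a * zj * inner(yj, y) for a, zj, yj in zip(alphas, z, Y)) + beta0

# Hypothetical values for illustration only
Y = np.array([[1.0, 2.0, 3.0], [0.0, -1.0, 2.0]])
z = np.array([1, -1])
alphas = np.array([0.5, 0.5])
y_new = np.array([4.0, 5.0, 6.0])

via_kernel = decision(y_new, alphas, z, Y, 0.1, K)
via_map = decision(y_new, alphas, z, Y, 0.1, lambda a, b: float(h(a) @ h(b)))
```

The two routes give the same value of the decision function; the kernel route never forms the p² transformed features, which matters when p is in the thousands.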
HASTIE et al. (2001, Chapter 12)
The Lagrange (primal) function is
$$L_P = \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j - \sum_{j=1}^{n} \alpha_j \{ z_j C(y_j) - (1 - \xi_j) \} - \sum_{j=1}^{n} \mu_j \xi_j, \tag{1}$$
which we minimize w.r.t. β, β0, and ξj.
Setting the respective derivatives to zero, we get
$$\beta = \sum_{j=1}^{n} \alpha_j z_j y_j, \tag{2}$$
$$0 = \sum_{j=1}^{n} \alpha_j z_j, \tag{3}$$
$$\alpha_j = \gamma - \mu_j \quad (j = 1, \ldots, n), \tag{4}$$
with $\alpha_j \ge 0$, $\mu_j \ge 0$, and $\xi_j \ge 0$ $(j = 1, \ldots, n)$.
By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
$$L_D = \sum_{j=1}^{n} \alpha_j - \tfrac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k z_j z_k\, y_j^T y_k. \tag{5}$$
We maximize (5) subject to
$$0 \le \alpha_j \le \gamma \quad \text{and} \quad \sum_{j=1}^{n} \alpha_j z_j = 0.$$
In addition to (2) to (4), the constraints include
$$\alpha_j \{ z_j C(y_j) - (1 - \xi_j) \} = 0, \tag{6}$$
$$\mu_j \xi_j = 0, \tag{7}$$
$$z_j C(y_j) - (1 - \xi_j) \ge 0, \tag{8}$$
for $j = 1, \ldots, n$.
Together these equations (2) to (8) uniquely characterize the solution to the primal and dual problem.
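The substitution of (2) to (4) into (1) can be written out term by term. This is a sketch that assumes the notation of Hastie et al. (γ for the cost constant, α_j and μ_j for the Lagrange multipliers) and C(y_j) = β0 + βᵀy_j. Collecting the terms of the primal function in β, β0 and ξ_j gives

$$
\begin{aligned}
L_P &= \tfrac{1}{2}\|\beta\|^2 - \beta^T \sum_{j} \alpha_j z_j y_j - \beta_0 \sum_{j} \alpha_j z_j + \sum_{j} (\gamma - \alpha_j - \mu_j)\,\xi_j + \sum_{j} \alpha_j \\
&= \tfrac{1}{2}\|\beta\|^2 - \|\beta\|^2 - 0 + 0 + \sum_{j} \alpha_j \\
&= \sum_{j} \alpha_j - \tfrac{1}{2} \sum_{j} \sum_{k} \alpha_j \alpha_k z_j z_k\, y_j^T y_k = L_D .
\end{aligned}
$$

The second line uses (2) for the term in β, (3) for the term in β0 and (4) for the term in ξ_j; the last line expands ‖β‖² with (2).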
Leo Breiman (2001)
Statistical modeling: the two cultures (with discussion).
Statistical Science 16, 199-231.
Discussants include Brad Efron and David Cox
GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA:
Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
COLON DATA:
Using only 4 genes, CVE is 2%
Since p >> n, consideration is given to the selection of suitable genes
SVM: FORWARD or BACKWARD (in terms of magnitude of weight βi)
RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD ONLY (in terms of CVE)
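Backward selection by weight magnitude can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Guyon et al.: the weights come from a ridge least-squares fit (a stand-in for the SVM weight vector β), half of the remaining genes are dropped at each pass, and the names `linear_weights`, `rfe` and the simulated data are hypothetical.

```python
import numpy as np

def linear_weights(Y, z, ridge=1.0):
    """Ridge least-squares weights (stand-in for SVM weights); dual form, cheap when p >> n."""
    Yc = Y - Y.mean(axis=0)
    G = Yc @ Yc.T + ridge * np.eye(len(Y))
    return Yc.T @ np.linalg.solve(G, z - z.mean())

def rfe(Y, z, n_keep, drop_frac=0.5):
    """Refit, then discard the genes with the smallest |beta_i|, until n_keep remain."""
    active = np.arange(Y.shape[1])
    while len(active) > n_keep:
        beta = linear_weights(Y[:, active], z)
        n_next = max(n_keep, int(len(active) * (1 - drop_frac)))
        order = np.argsort(-np.abs(beta))          # largest weights first
        active = np.sort(active[order[:n_next]])
    return active

# Simulated data: 40 samples, 200 genes, only genes 3 and 7 carry class information
rng = np.random.default_rng(0)
z = np.repeat([-1.0, 1.0], 20)
Y = rng.normal(size=(40, 200))
Y[:, 3] += 3.0 * z
Y[:, 7] -= 3.0 * z
selected = rfe(Y, z, n_keep=2)
```

Any error rate quoted for the selected subset has to be estimated with `rfe` rerun inside each training fold; otherwise it is subject to the selection bias discussed in these slides.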
GUYON et al. (2002)
“The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”
Example: Microarray Data
Colon Data of Alon et al. (1999)
n = 62 (40 tumours; 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.
Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples
Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data
Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure using all the leukemia data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues
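The effect of selection bias on noninformative data can be reproduced in a small simulation. The sketch below makes several assumptions that are not in the slides: pure Gaussian noise instead of permuted colon data, a nearest-centroid rule instead of the SVM, the top 10 genes chosen by a two-sample t-like statistic, and 10-fold cross-validation. It compares cross-validation with the genes selected once from all samples against cross-validation with the selection repeated inside every training fold.

```python
import numpy as np

def top_genes(Y, z, k):
    """Indices of the k genes with the largest absolute two-sample t-like statistic."""
    a, b = Y[z == 0], Y[z == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-np.abs(t))[:k]

def nearest_centroid(Ytr, ztr, Yte):
    """Allocate each test sample to the class with the closer training centroid."""
    c0, c1 = Ytr[ztr == 0].mean(axis=0), Ytr[ztr == 1].mean(axis=0)
    d0 = ((Yte - c0) ** 2).sum(axis=1)
    d1 = ((Yte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(Y, z, k, folds, select_inside):
    genes_all = top_genes(Y, z, k)  # selection that has seen every sample
    errors = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(z)), test)
        genes = top_genes(Y[train], z[train], k) if select_inside else genes_all
        pred = nearest_centroid(Y[train][:, genes], z[train], Y[test][:, genes])
        errors += int((pred != z[test]).sum())
    return errors / len(z)

rng = np.random.default_rng(0)
n, p, k = 40, 2000, 10
Y = rng.normal(size=(n, p))                       # no gene carries class information
z = rng.permutation(np.repeat([0, 1], n // 2))
folds = np.array_split(rng.permutation(n), 10)
biased = cv_error(Y, z, k, folds, select_inside=False)
unbiased = cv_error(Y, z, k, folds, select_inside=True)
```

Because the class labels are unrelated to the data, no rule can do better than chance on new samples, so a low value of `biased` is purely an artefact of having selected the genes with the test samples in view.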
ADDITIONAL REFERENCES
Aware of selection bias:
SPANG et al. (2001, In Silico Biology)
WEST et al. (2001, PNAS)
NGUYEN and ROCKE (2002)
Selection bias ignored:
XIONG et al. (2001, Molecular Genetics and Metabolism)
XIONG et al. (2001, Genome Research)
ZHANG et al. (2001, PNAS)
Error Rate Estimation
Suppose there are two groups G1 and G2.
c(y) is a classifier formed from the data set
(y1, y2, y3, …, yn)
The apparent error is the proportion of the data set misallocated by c(y).
Cross-Validation
From the original data set, remove y1 to give the reduced set
(y2, y3, …, yn)
Then form the classifier c(1)(y) from this reduced set.
Use c(1)(y1) to allocate y1 to either G1 or G2.
Repeat this process for the second data point, y2, so that this point is assigned to either G1 or G2 on the basis of the classifier c(2)(y2).
And so on up to yn.
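The apparent error and the leave-one-out procedure just described can be sketched as below. The nearest-centroid rule, the function names and the six one-dimensional data points are assumptions chosen so that the arithmetic can be followed by hand.

```python
import numpy as np

def nearest_centroid_rule(Y, z):
    """Form a classifier c(y) from the data set (Y, z)."""
    centroids = {g: Y[z == g].mean(axis=0) for g in np.unique(z)}
    def c(y):
        return min(centroids, key=lambda g: np.sum((y - centroids[g]) ** 2))
    return c

def apparent_error(Y, z):
    """Proportion of the data set misallocated by the rule formed from all of it."""
    c = nearest_centroid_rule(Y, z)
    return np.mean([c(y) != zj for y, zj in zip(Y, z)])

def loo_cv_error(Y, z):
    """Leave-one-out: y_j is allocated by c_(j), the rule formed without y_j."""
    n = len(z)
    errors = 0
    for j in range(n):
        keep = np.arange(n) != j
        c_j = nearest_centroid_rule(Y[keep], z[keep])
        errors += int(c_j(Y[j]) != z[j])
    return errors / n

# Hypothetical data: three points from G1, three from G2
Y = np.array([[0.0], [1.0], [3.0], [3.4], [5.0], [6.0]])
z = np.array([1, 1, 1, 2, 2, 2])
ae = apparent_error(Y, z)
cv = loo_cv_error(Y, z)
```

On these points the apparent error is 0, while the leave-one-out error is 2/6: the points 3.0 and 3.4 are each misallocated once they no longer contribute to their own class centroid.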
Ten-Fold Cross Validation
[Figure: the data divided into ten blocks numbered 1 to 10, labelled Training and Test]
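BOOTSTRAP APPROACH
Efron’s (1983, JASA) .632 estimator
$$B_{.632} = .368 \times \mathrm{AE} + .632 \times B_1$$
where $B_1$ is the bootstrap error when the rule $R_k^*$ is applied to a point not in the training sample.
A Monte Carlo estimate of $B_1$ is
$$B_1 = \sum_{j=1}^{n} E_j / n$$
where
$$E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} \Big/ \sum_{k=1}^{K} I_{jk},$$
with
$$I_{jk} = \begin{cases} 1 & \text{if } x_j \notin k\text{th bootstrap sample} \\ 0 & \text{otherwise} \end{cases}$$
and
$$Q_{jk} = \begin{cases} 1 & \text{if } R_k^* \text{ misallocates } x_j \\ 0 & \text{otherwise.} \end{cases}$$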
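A Monte Carlo computation of B1 and of the .632 estimate might look as follows. It is a sketch under assumptions: the rule is a nearest-centroid classifier for classes coded 0 and 1, bootstrap samples that happen to contain a single class are skipped, and E_j is averaged over the points left out at least once (which equals the division by n whenever every point is left out of some bootstrap sample). The names `predict` and `b632` and the simulated data are hypothetical.

```python
import numpy as np

def predict(Xtr, ztr, Xte):
    """Nearest-centroid rule (stand-in classifier) for classes 0 and 1."""
    c0 = Xtr[ztr == 0].mean(axis=0)
    c1 = Xtr[ztr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def b632(X, z, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(z)
    I = np.zeros((n, K))   # I[j, k] = 1 if x_j is not in the kth bootstrap sample
    Q = np.zeros((n, K))   # Q[j, k] = 1 if the kth bootstrap rule misallocates x_j
    for k in range(K):
        idx = rng.integers(0, n, size=n)           # sample n points with replacement
        if len(np.unique(z[idx])) < 2:
            continue                               # guard: rule needs both classes
        out = np.setdiff1d(np.arange(n), idx)
        I[out, k] = 1
        Q[out, k] = predict(X[idx], z[idx], X[out]) != z[out]
    left_out = I.sum(axis=1) > 0
    E = (I * Q).sum(axis=1)[left_out] / I.sum(axis=1)[left_out]
    B1 = E.mean()
    AE = np.mean(predict(X, z, X) != z)            # apparent error
    return AE, B1, 0.368 * AE + 0.632 * B1

# Simulated data: two well-separated groups of 15 points in 2 dimensions
rng = np.random.default_rng(1)
z = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 2)) + 4.0 * z[:, None]
AE, B1, B632 = b632(X, z)
```

Each bootstrap rule is tested only on the points it was not trained on, which is what removes the optimism of the apparent error from B1.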
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR
$$A(w) = (1 - w)\,\mathrm{AE} + w\,\mathrm{CV2E}$$
where $w = 0.5$.
McLachlan (1977) proposed $w = w_0$, where $w_0$ is chosen to minimize the asymptotic bias of $A(w)$ in the case of two homoscedastic normal groups.
The value of $w_0$ was found to range between 0.6 and 0.7, depending on the values of quantities that include $p$, $n_1$, and $n_2$.
.632+ estimate of Efron & Tibshirani (1997, JASA)
$$B_{.632+} = (1 - w)\,\mathrm{AE} + w\,B_1$$
where
$$w = \frac{.632}{1 - .368\,r},$$
$$r = \frac{B_1 - \mathrm{AE}}{\hat{\gamma} - \mathrm{AE}} \quad \text{(relative overfitting rate)},$$
$$\hat{\gamma} = \sum_{i=1}^{g} p_i (1 - q_i) \quad \text{(estimate of no information error rate)}.$$
If r = 0, w = .632, and so B.632+ = B.632
If r = 1, w = 1, and so B.632+ = B1
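The arithmetic of the .632+ weight can be checked directly. In this sketch the function names are hypothetical; p_i is taken as the observed proportion of class i and q_i as the proportion of samples the rule allocates to class i, and r is clipped to [0, 1] as a guard added here, not something stated on the slide.

```python
import numpy as np

def no_information_rate(z, pred):
    """gamma = sum_i p_i (1 - q_i), from observed labels z and allocations pred."""
    classes = np.unique(z)
    p = np.array([(z == c).mean() for c in classes])
    q = np.array([(pred == c).mean() for c in classes])
    return float(np.sum(p * (1.0 - q)))

def weight(r):
    """w = .632 / (1 - .368 r)."""
    return 0.632 / (1.0 - 0.368 * r)

def b632_plus(AE, B1, gamma):
    """(1 - w) AE + w B1, with r the relative overfitting rate."""
    r = 0.0 if gamma <= AE else (B1 - AE) / (gamma - AE)
    r = min(max(r, 0.0), 1.0)   # guard of this sketch: keep r in [0, 1]
    w = weight(r)
    return (1.0 - w) * AE + w * B1

# Hypothetical numbers: AE = 0.1, B1 = 0.3, gamma = 0.5 give r = 0.5
estimate = b632_plus(0.1, 0.3, 0.5)
plain_632 = 0.368 * 0.1 + 0.632 * 0.3
```

With these inputs the .632+ estimate lies between the plain .632 estimate and B1, moving towards B1 as the relative overfitting rate grows.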
MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

| No. of genes | Times selected |
|---|---|
| 55 | 1 |
| 18 | 2 |
| 11 | 3 |
| 7 | 4 |
| 8 | 5 |
| 6 | 6 |
| 10 | 7 |
| 8 | 8 |
| 12 | 9 |
| 17 | 10 |
MARKER GENES FOR HARVARD DATA
tubulin, alpha, ubiquitous
Cluster Incl N90862
cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
DEK oncogene (DNA binding)
Cluster Incl AF035316
transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
benzodiazapine receptor (peripheral)
Cluster Incl D21063
galactosidase, beta 1
high-mobility group (nonhistone chromosomal) protein 2
cold inducible RNA-binding protein
Cluster Incl U79287
BAF53
tubulin, beta polypeptide
thromboxane A2 receptor
H1 histone family, member X
Fc fragment of IgG, receptor, transporter, alpha
sine oculis homeobox (Drosophila) homolog 3
transcriptional intermediary factor 1 gamma
transcription elongation factor A (SII)-like 1
like mouse brain protein E46
minichromosome maintenance deficient (mis5, S. pombe) 6
transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
guanine nucleotide binding protein (G protein), gamma 3, linked
dihydropyrimidinase-like 2
Cluster Incl AI951946
transforming growth factor, beta receptor II (70-80kD)
protein kinase C-like 1
Breast cancer data set of van ’t Veer et al. (2002, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415)
These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours.
The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.
Breast tumours have a genetic signature. The expression pattern of a set of 70 genes can predict whether a tumour is going to prove lethal, despite treatment, or not.
“This gene expression profile will outperform all currently used clinical parameters in predicting disease outcome.”
van ’t Veer et al. (2002), van de Vijver et al. (2002)
| Number of Genes | Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70) | Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70) | Error Rate for 5422 Genes (with correction for Selection Bias) |
|---|---|---|---|
| 1 | 0.50 | 0.53 | 0.56 |
| 2 | 0.32 | 0.41 | 0.44 |
| 4 | 0.26 | 0.40 | 0.41 |
| 8 | 0.27 | 0.32 | 0.43 |
| 16 | 0.28 | 0.31 | 0.35 |
| 32 | 0.22 | 0.35 | 0.34 |
| 64 | 0.20 | 0.34 | 0.35 |
| 70 | 0.19 | 0.33 | - |
| 128 | - | - | 0.39 |
| 256 | - | - | 0.33 |
| 512 | - | - | 0.34 |
| 1024 | - | - | 0.33 |
| 2048 | - | - | 0.37 |
| 4096 | - | - | 0.40 |
| 5422 | - | - | 0.44 |
van de Vijver et al. (2002) considered a further 234 breast cancer tumours but have only made available the data for the top 70 genes based on the previous study of van ’t Veer et al. (2002).
| Number of Genes | From 70 genes | From original 24481 genes (set missing values to 0) | From original 24481 genes (using KNN for missing values, k=10) |
|---|---|---|---|
| 1 | 0.29491525 | 0.4023327 | 0.4199797 |
| 2 | 0.17288136 | 0.3850913 | 0.3825558 |
| 4 | 0.20000000 | 0.3747465 | 0.3756592 |
| 8 | 0.13220339 | 0.3033469 | 0.3061866 |
| 16 | 0.10508475 | 0.2314402 | 0.2319473 |
| 32 | 0.08474576 | 0.2038540 | 0.2240365 |
| 64 | 0.09491525 | 0.2038540 | 0.1915822 |
| 70 | 0.09491525 | - | - |
| 128 | - | 0.1634888 | 0.1600406 |
| 256 | - | 0.1462475 | 0.1507099 |
| 512 | - | 0.1359026 | 0.1438134 |
| 1024 | - | 0.1324544 | 0.1496957 |
| 2048 | - | 0.1521298 | 0.1364097 |
| 4096 | - | 0.1481744 | 0.1403651 |
| 8192 | - | 0.1550710 | 0.1605477 |
| 16384 | - | 0.1683570 | 0.1738337 |
| 24481 | - | 0.1683570 | 0.1772819 |
[Figure: CV error rates plotted against log2(Number of Genes); curves labelled "Unbiased (kNN)" and "biased"]
Nearest-Shrunken Centroids
(Tibshirani et al., 2002)
The usual estimates of the class means $\bar{y}_i$ are shrunk toward the overall mean $\bar{y}$ of the data, where
$$\bar{y}_i = \sum_{j=1}^{n} z_{ij}\, y_j / n_i \quad \text{and} \quad \bar{y} = \sum_{j=1}^{n} y_j / n.$$
The nearest-centroid rule is given by
where $y_v$ is the $v$th element of the feature vector $y$ and $\bar{y}_{iv} = (\bar{y}_i)_v$.
In the previous definition, we replace the sample mean $\bar{y}_{iv}$ of the $v$th gene by its shrunken estimate
where
2
111 )( nnm ii
Comparison of Nearest-Shrunken Centroids with SVM
Apply (i) nearest-shrunken centroids and (ii) the SVM with RFE to the colon data set of Alon et al. (1999), with N = 2000 genes and M = 62 tissues (40 tumours, 22 normals)
Nearest-Shrunken Centroids applied to Alon data
[Figure: (a) Overall Error Rates; (b) Class-specific Error Rates]
SVM with RFE applied to Alon data
[Figure: (a) Overall Error Rates; (b) Class-specific Error Rates]